Execution Engine

Summary

Given a total order (from Consensus) of transactions (from the Mempool), the execution engine updates and stores the "current" state of the virtual machine, using as much concurrency as possible. Proofs from the execution engine allow light clients to read the current state. When the execution engine has finished with a transaction, it communicates to the Mempool that the transaction can be garbage-collected from storage.

Vocabulary

  • Shards are processes that store and update state.

    • Different shards may be on different machines. Redistributing state between shards is called Re-Sharding.
    • Each Shard is specific to 1 learner. However, as an optimization, an implementation could conceivably use 1 process to do the work of 2 shards with different learners so long as those shards are identical, and fork that process if / when the learners diverge.
  • Executors are processes that actually run the VM and compute updates. Executors should probably be co-located with shards.

Either:

  • We assume the Mempool is using the Heterogeneous Narwhal setup?
    • In which case Consensus picks leaders in the DAG
  • Or the Mempool is treated as some kind of black-box set of processes that can each transmit transactions to Shards.
    • In which case Consensus produces more detailed ordering information. Perhaps we should have some general notion of Timestamp on transactions?

The VM is largely a black box: we assume we can give it a set of input key-value pairs and a transaction, and get output key-value pairs.
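
As a rough illustration (not a specified API; the names VM, execute, and the byte-string types are placeholders), the black-box contract could look like this in Python:

```python
from typing import Mapping, Protocol


class VM(Protocol):
    """Hypothetical black-box VM interface: input key-value pairs in, output key-value pairs out."""

    def execute(
        self,
        transaction: bytes,            # opaque transaction payload
        reads: Mapping[bytes, bytes],  # key -> current value, for every key the transaction reads
    ) -> Mapping[bytes, bytes]:        # key -> new value, for every key the transaction writes
        ...
```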

State

State is stored as mutable Values (binary blobs of unlimited size), each of which is identified by an immutable Key. If you want to move a Value from one Key to another, that is equivalent to deleting the Value at the old Key and writing it to the new Key. Keys that have never had a Value written to them are mapped to an empty value.
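
A minimal sketch of these semantics, assuming keys and values are both byte strings (the class and method names are illustrative only):

```python
class State:
    """Toy single-process state: immutable Keys mapping to mutable binary Values."""

    def __init__(self) -> None:
        self._values: dict[bytes, bytes] = {}

    def read(self, key: bytes) -> bytes:
        # Keys that have never been written map to the empty value.
        return self._values.get(key, b"")

    def write(self, key: bytes, value: bytes) -> None:
        self._values[key] = value

    def move(self, old_key: bytes, new_key: bytes) -> None:
        # "Changing a Value's Key" = delete the Value at the old Key, write it at the new Key.
        value = self.read(old_key)
        self.write(new_key, value)
        self.write(old_key, b"")
```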

For each Learner, all processes can map Transactions to a set of Shards whose state they read, and a set of shards whose state they write. This makes Re-Sharding challenging.

One way to implement this is to partition the space of Keys across Shards, and label each Transaction with a Sub-Space of Keys it touches. One possible Key-space structure would be to arrange Keys in some kind of tree configuration.
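
For example, one hypothetical scheme (nothing here prescribes it) partitions a path-like Key space by prefix, so any process can map a Key, or a Transaction's Sub-Space label, to Shards:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixShardMap:
    """Hypothetical shard assignment: shard i owns the sub-tree of Keys under prefixes[i]."""

    prefixes: tuple[bytes, ...]

    def shard_for_key(self, key: bytes) -> int:
        # Longest matching prefix wins.
        matches = [(len(p), i) for i, p in enumerate(self.prefixes) if key.startswith(p)]
        if not matches:
            raise KeyError(f"no shard owns key {key!r}")
        return max(matches)[1]

    def shards_for_subspace(self, label: bytes) -> set[int]:
        # A Transaction labeled with a Sub-Space (a prefix) may touch every shard
        # whose owned prefix overlaps that Sub-Space.
        return {i for i, p in enumerate(self.prefixes) if p.startswith(label) or label.startswith(p)}


# Example: two shards; a transaction labeled b"/accounts/alice" only touches shard 0.
shards = PrefixShardMap((b"/accounts/", b"/governance/"))
assert shards.shard_for_key(b"/accounts/alice/balance") == 0
assert shards.shards_for_subspace(b"/accounts/alice") == {0}
```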

Mempool Interface

We assume, for each Learner $l$, that each transaction $t$ has a unique Executor, $executor_l(t)$. It would be more efficient if $executor_l(t)$ is co-located with one of the shards holding keys in $reads_l(t)$ or $writes_l(t)$ (the keys $t$ reads and writes, respectively, for learner $l$). As an optimization, we can have one process do the work of multiple learners' executors, so long as those learners are identical.

We assume that each transaction $t$ carries a timestamp, $timestamp(t)$. We assume that these timestamps have an unknown total order, and that Consensus and the Mempool can update Shards' knowledge of this total order (see the Consensus Interface below).

The Mempool transmits each transaction to its executor as soon as possible, using network primitives. Then, for each transaction $t$ that reads or writes state on shard $s$, the Mempool also transmits to shard $s$ a Shard Summary of $t$, containing at least:

  • $timestamp(t)$
  • the keys in $reads_l(t)$ stored on $s$
  • the keys in $writes_l(t)$ stored on $s$
  • the identity of $executor_l(t)$

We assume that each Shard maintains Timestamp bounds below which it will no longer receive new transactions: a timestamp $T_R$ below which it will no longer receive new transactions that read from its state, and a timestamp $T_W$ below which it will no longer receive new transactions that write to its state. Generally, we expect that $T_R \le T_W$, but I don't know that we require this to be true. The Shard should update these bounds based on information from the Mempool. For example, it could maintain partial bounds from each mempool worker (updated whenever that mempool worker sends the Shard a message), and implement $T_R$ and $T_W$ as the greatest lower bound of the corresponding partial bounds.
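
A sketch of this shard-side bookkeeping, with two simplifying assumptions that are not part of the design: timestamps are comparable integers, and each mempool worker's messages carry per-shard partial bounds below which that worker will send no further read or write summaries. $T_R$ and $T_W$ are then the greatest lower bounds (here, minima) over all workers:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ShardSummary:
    """What the Mempool sends shard s about transaction t, for one learner."""

    timestamp: int                 # sketch only: timestamps as comparable integers
    read_keys: frozenset[bytes]    # keys in reads_l(t) stored on this shard
    write_keys: frozenset[bytes]   # keys in writes_l(t) stored on this shard
    executor: str                  # where to send read values and receive written values


@dataclass
class ShardBounds:
    """Tracks T_R and T_W from per-worker partial bounds."""

    worker_read_bounds: dict[str, int] = field(default_factory=dict)
    worker_write_bounds: dict[str, int] = field(default_factory=dict)

    def update(self, worker: str, read_bound: int, write_bound: int) -> None:
        # Called whenever a mempool worker sends this shard a message.
        self.worker_read_bounds[worker] = read_bound
        self.worker_write_bounds[worker] = write_bound

    @property
    def t_r(self) -> int:
        # No new transactions reading this shard's state will arrive below T_R.
        return min(self.worker_read_bounds.values(), default=0)

    @property
    def t_w(self) -> int:
        # No new transactions writing this shard's state will arrive below T_W.
        return min(self.worker_write_bounds.values(), default=0)
```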

Consensus Interface

Consensus needs to update each Shard's knowledge of the total order of timestamps. In particular, we assume that Consensus informs shards of an ever-growing prefix of this total order.

  • One way to accomplish this is simply to have each timestamp be the hash of the transaction, and have consensus stream a totally ordered list of all hashes included in the chain to all Shards. This may not be very efficient.
  • We could instead consider one timestamp to be definitely after another if it is a descendent in the Narwhal DAG. Narwhal workers could transmit DAG information to Shards, and shards would learn some partial ordering information before hearing from Consensus. Consensus could then transmit only the Narwhal blocks it decides on, and shards could determine a total ordering from there.
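
A sketch of the second option, assuming (purely for illustration) that blocks or transactions are identified by hashes and that shards learn the parent hashes of each Narwhal block: one timestamp is definitely before another whenever the first is an ancestor of the second in the DAG, and any other pair stays unordered until Consensus extends the total-order prefix.

```python
from collections import deque


def definitely_before(parents: dict[bytes, set[bytes]], a: bytes, b: bytes) -> bool:
    """True if a is an ancestor of b in the shard's local view of the Narwhal DAG.

    parents maps a block (or transaction) hash to the hashes it references.
    This yields only a partial order; unrelated pairs remain unordered until
    Consensus announces more of the total order.
    """
    frontier = deque([b])
    seen: set[bytes] = set()
    while frontier:
        block = frontier.popleft()
        for parent in parents.get(block, set()):
            if parent == a:
                return True
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return False
```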

Execution

For each learner $l$ and each Transaction $t$, $executor_l(t)$ waits to receive values for all keys in $reads_l(t)$, then computes the transaction, and transmits to each shard $s$ any value in $writes_l(t)$ that is stored on $s$.
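
A sketch of one executor handling one transaction, assuming hypothetical messaging helpers (receive_value blocks until some shard delivers the value for a key; send_write forwards a written key-value pair to the shard storing that key) and the black-box VM interface sketched earlier:

```python
from typing import Callable, Mapping


def run_transaction(
    transaction: bytes,
    read_keys: frozenset[bytes],
    vm_execute: Callable[[bytes, Mapping[bytes, bytes]], Mapping[bytes, bytes]],
    receive_value: Callable[[bytes], bytes],
    send_write: Callable[[bytes, bytes], None],
) -> None:
    """Hypothetical executor loop for one transaction of one learner."""
    # 1. Wait until every key in reads_l(t) has arrived from the shards storing it.
    reads = {key: receive_value(key) for key in read_keys}
    # 2. Run the black-box VM on the transaction and its inputs.
    writes = vm_execute(transaction, reads)
    # 3. Send each written value to the shard storing that key.
    for key, value in writes.items():
        send_write(key, value)
```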

Generally, transactions do not have side effects outside of state writes. However, we could in principle encode client reads as read-only transactions whose side-effect is sending a message, or allow for VMs with other side effects.

Executors can save themselves some communication if they're co-located with Shards. As an optimization, we can save on communication by combining messages for multiple learners if their content is identical and their shards are co-located. Likewise, we can save on computation by using one process to execute for multiple learners so long as they are identical.

State Updates

For each key in its state, each shard needs to establish a total order of all writes between $T_R$ and $T_W$. Reads of each key need to be ordered with respect to writes to that key.

To accomplish this, each Shard maintains a Dependency Multi-Graph of all Shard Summaries it has received, where Summary $x$ depends on Summary $y$ if the Shard does not know that $timestamp(x) < timestamp(y)$, and $x$ can read from a key to which $y$ can write. Specifically, if the Shard does not know that $timestamp(x) < timestamp(y)$, then for each key $k$ that $y$ can write to and $x$ can read or write, create an edge from $x$ to $y$ labeled with $k$. There can be cycles in the dependency multi-graph, but these will resolve as the Shard learns more about the total order from consensus.

Concurrently, for any transaction $t$ whose Summary no longer depends on any other Summary, if $timestamp(t) \le T_W$ (so the Shard already knows of every transaction that writes before $timestamp(t)$; see the sketch below):

  • transmit to $executor_l(t)$ the values written most recently before $timestamp(t)$ for each key on $s$ in $reads_l(t)$
  • upon receiving from $executor_l(t)$ the value for any key $k$ on $s$ in $writes_l(t)$:
    • record that that value is written to key $k$ at $timestamp(t)$.
    • delete the edges labeled $k$ that point to $t$'s Summary from the dependency graph.
    • As an optimization, we may want a compact "don't change this value" message.
  • When every value in $writes_l(t)$ on $s$ has been updated, delete $t$'s Summary from the dependency graph.

Note that read-only transactions can arrive with timestamps before $T_W$. These need to be added to the dependency graph and processed just like all other transactions.
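
A sketch of this bookkeeping for a single shard and learner, repeating the ShardSummary sketch above so the block stands alone, and assuming integer timestamps plus a known_before predicate that returns True only when the shard already knows the first timestamp precedes the second (all names here are illustrative):

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class ShardSummary:
    timestamp: int
    read_keys: frozenset[bytes]
    write_keys: frozenset[bytes]
    executor: str


@dataclass
class DependencyGraph:
    """Toy dependency multi-graph: an edge (x -> y, labeled k) means x must wait for y's write to k."""

    known_before: Callable[[int, int], bool]   # known_before(a, b): shard knows timestamp a precedes b
    summaries: dict[int, ShardSummary] = field(default_factory=dict)
    edges: dict[int, set[tuple[int, bytes]]] = field(default_factory=dict)  # x -> {(y, k), ...}

    def add(self, new: ShardSummary) -> None:
        """Insert a Summary, creating key-labeled edges in both directions where order is unknown."""
        self.edges.setdefault(new.timestamp, set())
        for other in self.summaries.values():
            if not self.known_before(new.timestamp, other.timestamp):
                for k in other.write_keys & (new.read_keys | new.write_keys):
                    self.edges[new.timestamp].add((other.timestamp, k))
            if not self.known_before(other.timestamp, new.timestamp):
                for k in new.write_keys & (other.read_keys | other.write_keys):
                    self.edges[other.timestamp].add((new.timestamp, k))
        self.summaries[new.timestamp] = new

    def prune_ordered(self) -> None:
        """As Consensus reveals more order, drop edges x -> y once x < y becomes known."""
        for ts, deps in self.edges.items():
            deps.difference_update({(dep, k) for (dep, k) in deps if self.known_before(ts, dep)})

    def ready(self, t_w: int) -> list[ShardSummary]:
        """Summaries with no remaining dependencies whose timestamp is at most T_W."""
        return [s for ts, s in self.summaries.items() if not self.edges[ts] and ts <= t_w]

    def record_write(self, writer_timestamp: int, key: bytes) -> None:
        """The writer's value for `key` is now stored: delete edges labeled k pointing at it."""
        for deps in self.edges.values():
            deps.discard((writer_timestamp, key))

    def retire(self, writer_timestamp: int) -> None:
        """Every value the writer touches has been updated: drop its Summary."""
        del self.summaries[writer_timestamp]
        del self.edges[writer_timestamp]
```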

Garbage Collection

Each Shard can delete all but the most recent value written to each key before $T_R$.
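
A sketch of this rule, assuming each key's history is a list of (timestamp, value) writes with comparable integer timestamps:

```python
def garbage_collect(
    history: dict[bytes, list[tuple[int, bytes]]],  # key -> [(timestamp, value), ...]
    t_r: int,                                       # T_R: no new reads will arrive below this
) -> None:
    """Keep, per key, only the most recent write before T_R plus every write at or after T_R."""
    for key, writes in history.items():
        writes.sort()  # order by timestamp
        before = [w for w in writes if w[0] < t_r]
        after = [w for w in writes if w[0] >= t_r]
        history[key] = before[-1:] + after  # at most one pre-T_R write survives
```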

Once all of a transaction's Executors (for all learners) have executed the transaction, we can garbage collect it. We no longer need to store that transaction anywhere.

Client Reads

Read-only transactions can, in principle, bypass Mempool and Consensus altogether: they only need to arrive at each of the relevant shards (the shards holding keys in $reads_l(t)$), and have a timestamp greater than each such shard's $T_R$. They could also be executed with a side effect, like sending a message to a client.

We can use these read-only transactions to construct checkpoints: Merkle roots of portions of state, building up to a Merkle root of the entire state.
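
For instance, a hypothetical checkpoint of one shard's portion of state could be a Merkle root over its key-value pairs sorted by key (a sketch only; no checkpoint format is specified here):

```python
import hashlib


def merkle_root(state: dict[bytes, bytes]) -> bytes:
    """Toy Merkle root over a set of key-value pairs, sorted by key."""
    leaves = [
        hashlib.sha256(key + b"\x00" + value).digest()
        for key, value in sorted(state.items())
    ]
    if not leaves:
        return hashlib.sha256(b"empty").digest()
    while len(leaves) > 1:
        if len(leaves) % 2:  # duplicate the last node on odd-sized levels
            leaves.append(leaves[-1])
        leaves = [
            hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
            for i in range(0, len(leaves), 2)
        ]
    return leaves[0]


# Shard-level roots could then be hashed together again to build up to a
# Merkle root of the entire state.
```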

Light client reads only need some kind of signed message produced by an executor from each of a weak quorum of validators. They do not, technically, need a Merkle root of the entire state at all. However, it may be more efficient to get a single signed message with a Merkle root of state, and then only one Validator needs to do the read-only transaction. To support this kind of thing, we may want $T_R$ to lag well behind $T_W$, so we can do reads on recent checkpoints.