Shard¶
Together, the Shards store and update the state of the replicated state machine; collectively they form a component of the Execution Engines. They provide Executors with input data and update the state according to the results of the Executors' computations.
Each Shard is responsible for a set of KVSKeys, and these sets are disjoint across Shards. For each key it is responsible for, the Shard maintains a (partially ordered) timeline of timestamps of transaction candidates that may read or write that key. For V1, these timestamps are TxFingerprints and they are totally ordered, because there is only a single Worker Engine. Shards also keep a history of the data written to each key by each TransactionCandidate; this is multi-version concurrent storage.
State (of the shard)¶
For each Worker Engine, the Shard maintains:
- A timestamp such that all write lock requests¹ for transaction candidates with earlier timestamps that this worker curates have already been received. Together, these timestamps represent heardAllWrites.
For each key (assigned to this Shard):
- A set of timestamps of known transaction candidates that read and/or write that key, and for each, some subset of:
    - A value written to that key at that timestamp by that TransactionCandidate using a KVSWrite message
    - A marker indicating that this TransactionCandidate may (or will) write to this key, but this Shard has not yet received a corresponding KVSWrite message.
    - A marker indicating that this TransactionCandidate will read this value, along with an ExternalIdentity identifying the relevant Executor. This marker is only stored as long as the Shard does not know the value; once the value is determined, the Shard must remove the marker and send a KVSRead message to the Executor.
    - A marker indicating that this TransactionCandidate may read this value, along with an ExternalIdentity identifying the relevant Executor. If the Executor sends a KVSReadRequest for this key, the Shard upgrades this marker to a "will read" marker.
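As an illustration, the following is a minimal Python sketch of how this per-key state could be represented. All names here (Marker, Entry, ShardState) are hypothetical and not part of the specification.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Marker(Enum):
    MAY_WRITE = auto()   # may (or will) write; no KVSWrite received yet
    WILL_READ = auto()   # will read; send a KVSRead once the value is known
    MAY_READ = auto()    # may read; upgraded to WILL_READ by a KVSReadRequest


@dataclass
class Entry:
    """What the Shard knows about one key at one timestamp (TxFingerprint)."""
    value_written: Optional[bytes] = None   # set once a KVSWrite arrives
    marker: Optional[Marker] = None
    executor: Optional[str] = None          # ExternalIdentity of the Executor


@dataclass
class ShardState:
    # One lower bound per Worker Engine; in V1 there is only one worker.
    heard_all_writes: dict[str, int] = field(default_factory=dict)
    # Per-key timeline: key -> (timestamp -> Entry), i.e. multi-version storage.
    timelines: dict[str, dict[int, Entry]] = field(default_factory=dict)
```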
Shard Optimizations¶
We want to execute each TransactionCandidate (evaluate the executor function in order to compute the data written) while preserving serializability: each TransactionCandidate's reads and writes should behave as if the candidates were executed in the total order determined by the mempool (and consensus, from V2 onward). In fact, the simplest correct implementation amounts to executing all transaction candidates sequentially, repeatedly applying the executor function in a loop. However, we want to compute as concurrently as possible, for minimum latency. We do this using a set of optimizations.
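For intuition, here is a minimal Python sketch of that sequential baseline, assuming a hypothetical executor_fn that maps the values read to the values written.

```python
def execute_sequentially(state: dict, candidates: list) -> dict:
    """Baseline: apply every TransactionCandidate in the given total order.

    `candidates` is assumed to be sorted by timestamp; each candidate is a
    hypothetical (read_keys, write_keys, executor_fn) triple, where
    executor_fn maps the values read to the values to write.
    """
    for read_keys, write_keys, executor_fn in candidates:
        reads = {k: state.get(k) for k in read_keys}
        writes = executor_fn(reads)          # compute the data written
        for k in write_keys:
            if k in writes:                  # a may_write key can be left unchanged
                state[k] = writes[k]
    return state
```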
Optimization: Per-Key Ordering¶
Mempool (and after V1, consensus) provides ordering information for the timestamps. Thus, relative to each key, transaction candidates can be totally ordered by the Happens Before relationship. Since the only timestamps in V1 are TxFingerprints, and V1 has only one Worker Engine, the set of timestamps is totally ordered, so the Happens Before relationship is a total order. With a total ordering of transaction candidates, Shards can send read information (KVSReads) to Executors as soon as the previous TransactionCandidate is complete. However, transaction candidates that access disjoint sets of keys can be run in parallel.
In the diagram above, for example, transaction candidates c and d can run concurrently, as can transaction candidates e and f, and transaction candidates h and j.
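A minimal sketch of the disjointness check behind this parallelism, assuming a hypothetical accessed_keys helper that returns the set of keys named in a candidate's TransactionLabel:

```python
def can_run_concurrently(candidate_a, candidate_b, accessed_keys) -> bool:
    """Two candidates may execute in parallel if their key sets are disjoint.

    `accessed_keys(c)` is assumed to return the set of KVSKeys that
    candidate `c` may read or write, taken from its TransactionLabel.
    """
    return accessed_keys(candidate_a).isdisjoint(accessed_keys(candidate_b))
```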
Optimization: Order With Respect To Writes¶
In fact, Shards can send read information to an Executor as soon as the previous write's TransactionCandidate has completed (sent a KVSWrite). All Shards really need to keep track of is a total order of writes, and how each read is ordered with respect to writes (which write it precedes and which write precedes it). As soon as the preceding write is complete (the Shard has received a KVSWrite), the reads that depend on it can run concurrently. There are no "read/read" conflicts. In the diagram above, for example, transaction candidates a and b can run concurrently.
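One way a Shard might locate the write a given read depends on is sketched below; the helper name preceding_write and the use of a sorted list of integer write timestamps are assumptions, not part of the specification.

```python
from bisect import bisect_left
from typing import Optional


def preceding_write(write_timestamps: list[int], read_timestamp: int) -> Optional[int]:
    """Return the latest write timestamp strictly before `read_timestamp`.

    `write_timestamps` is the sorted total order of writes to one key.
    A read only has to wait for this one write; reads that fall between
    the same pair of writes never conflict, so they can run concurrently.
    """
    i = bisect_left(write_timestamps, read_timestamp)
    return write_timestamps[i - 1] if i > 0 else None
```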
Optimization: Only Wait to Read¶
Because we store each version written (multi-version concurrent storage), we do not have to execute writes in order. A Shard does not have to wait to write a later data version to a key just because previous reads have not finished executing yet. In the diagram above, for example, only green happens-before arrows require waiting. Transaction candidates a, b, c, and j can all be executed concurrently, as can transaction candidates d, e, and i.
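A minimal sketch of this behavior over a per-key version map, assuming timestamps are simple integers:

```python
from typing import Optional


def apply_write(versions: dict[int, bytes], timestamp: int, datum: bytes) -> None:
    """Record a new version without waiting for earlier reads to finish.

    `versions` maps timestamp -> value for one key; earlier versions remain
    available, so reads pending at earlier timestamps are unaffected.
    """
    versions[timestamp] = datum


def read_version(versions: dict[int, bytes], timestamp: int) -> Optional[bytes]:
    """Return the value of the latest write strictly before `timestamp`, if any."""
    earlier = [t for t in versions if t < timestamp]
    return versions[max(earlier)] if earlier else None
```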
heardAllWrites¶
In order to know which write happens most recently before a given read, the Shard must know that no further writes will be added to the timeline before the read. Mempool and consensus should communicate a lower bound on timestamps to the Shards, called heardAllWrites. The Shard is guaranteed to never receive another KVSAcquireLock with a write operation and a timestamp before heardAllWrites. In general, a Shard cannot send a KVSRead for a timestamp unless that timestamp is before heardAllWrites.

For V1, heardAllWrites consists of a TxFingerprint from the sole Worker Engine such that the Worker Engine is certain (based on KVSLockAcquireds) that the Shard has already seen all the KVSAcquireLocks it will ever send at or before that TxFingerprint.
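A minimal sketch of the resulting guard, assuming the single-worker V1 setting where heardAllWrites is one TxFingerprint represented as an integer:

```python
def may_serve_read(heard_all_writes: int, read_timestamp: int) -> bool:
    """A KVSRead may only be sent once no earlier write lock can still arrive.

    In V1, heard_all_writes is the single TxFingerprint bound from the sole
    Worker Engine. Whether the bound is inclusive is an assumption here; the
    sketch treats "at or before the bound" as safe to serve.
    """
    return read_timestamp <= heard_all_writes
```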
Shard Incoming Messages¶
Shards receive and react to the following messages:
KVSAcquireLock¶
- from Mempool Worker
Purpose¶
Informs the Shard about the keys that a transaction may or will read and/or write at a given transaction fingerprint.
Structure¶
| Field | Type | Description |
|---|---|---|
| lazy_read_keys | KVSKey set | Keys this transaction may read (only send values read in response to KVSReadRequests) |
| eager_read_keys | KVSKey set | Keys this transaction will read (send values read as soon as possible) |
| will_write_keys | KVSKey set | Keys this transaction will write. Future reads are dependent only on the KVSWrite for this TxFingerprint. |
| may_write_keys | KVSKey set | Keys this transaction may write. Future reads are dependent on the KVSWrite for this TxFingerprint or, if that has a None, the previous value. |
| curator | ExternalIdentity | the Worker Engine in charge of the corresponding transactions |
| executor | ExternalIdentity | the Executor for this TransactionCandidate |
| timestamp | TxFingerprint | specifies the transaction affiliated with these locks |
The lazy_read_keys and eager_read_keys may not overlap. In the same way, will_write_keys and may_write_keys must be disjoint.

There must be one KVSAcquireLock per Shard per TxFingerprint: for a given Shard and TxFingerprint, all information is to be provided in totality or not at all.¹

Note that future versions may use some kind of structured Keys to encode "Sets" containing infinitely many Keys. For V1, however, simple HashSets or similar are fine.
Effects¶
- The Shard stores the respective "locks" for all keys in its timeline.
    - These are the "markers" described in Shard State.
- The eager_read_keys will be served as soon as possible (by sending KVSRead messages to the Executor).
- The Shard immediately informs the curator that the locks are acquired, using a KVSLockAcquired message, so the curator can prepare UpdateSeenAll messages.
Triggers¶
- to Worker Engine: KVSLockAcquired. Send a KVSLockAcquired message to the curator, signaling that the locks of this message will be accounted for.
- to Executor: KVSRead. For each recorded eager_read_key in this Shard's timeline for which the most recent written value is established, send a KVSRead message to the Executor.
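The handler below is a minimal Python sketch of these effects and triggers; the message class and the shard methods (mark_write, mark_read, send, try_send_read) are hypothetical stand-ins, not a prescribed interface.

```python
from dataclasses import dataclass


@dataclass
class KVSAcquireLock:
    lazy_read_keys: set
    eager_read_keys: set
    will_write_keys: set
    may_write_keys: set
    curator: str      # ExternalIdentity of the Worker Engine
    executor: str     # ExternalIdentity of the Executor
    timestamp: int    # TxFingerprint


def handle_acquire_lock(shard, msg: KVSAcquireLock) -> None:
    """Record the markers, acknowledge the curator, then serve eager reads."""
    for key in msg.will_write_keys | msg.may_write_keys:
        shard.mark_write(key, msg.timestamp)                       # may/will write markers
    for key in msg.eager_read_keys:
        shard.mark_read(key, msg.timestamp, msg.executor, eager=True)
    for key in msg.lazy_read_keys:
        shard.mark_read(key, msg.timestamp, msg.executor, eager=False)
    # Acknowledge immediately so the curator can prepare UpdateSeenAll messages.
    shard.send(msg.curator, "KVSLockAcquired", timestamp=msg.timestamp)
    # Eager reads whose most recent prior write is already established
    # can be answered with a KVSRead right away.
    for key in msg.eager_read_keys:
        shard.try_send_read(key, msg.timestamp)
```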
KVSReadRequest¶
- from Executor
Purpose¶
Informs the Shard about a new read request, which happens in either of the following cases:
- An Executor has determined that it actually needs the value at some key for which it has a lazy read (a may_read in the TransactionLabel of the TransactionCandidate). Now the Executor is requesting that value from the Shard that stores it.
- An Executor has finished and does not need the value for some key for which it has a lazy read (a may_read in the TransactionLabel).
Structure¶
| Field | Type | Description |
|---|---|---|
| timestamp | TxFingerprint | we need the value at this logical timestamp |
| key | KVSKey | the value corresponds to this key |
| actual | bool | true iff we actually want a response |
If actual is false, this just means that there is no dependency on this key in the current execution.
Effects¶
A Shard should delay processing a KVSReadRequest until it has completed processing KVSAcquireLock for the same timestamp.
Then, if actual is false, the Shard is done with this read and can remove the "may read" marker from state. If actual is true, the Shard replaces the "may read" marker with a "will read" marker. If the Shard knows the unique previous value written before this timestamp, it sends that value in a KVSRead to the Executor and removes the "will read" marker from state. Otherwise, future KVSWrites and/or UpdateSeenAlls will identify this unique previous value written and trigger the KVSRead.
Triggers¶
- to Executor: KVSRead. If the Shard has determined the unique value written prior to this "lock", send a KVSRead message to the relevant Executor to inform it of the value.
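A minimal sketch of this logic, with hypothetical shard methods (remove_marker, promote_to_will_read, try_send_read) standing in for the state updates described above:

```python
def handle_read_request(shard, timestamp: int, key: str, actual: bool) -> None:
    """Process a KVSReadRequest (after the matching KVSAcquireLock is processed)."""
    if not actual:
        # The Executor no longer needs this value: drop the "may read" marker.
        shard.remove_marker(key, timestamp)
        return
    # Upgrade "may read" to "will read"; if the unique preceding write is
    # already known, try_send_read sends the KVSRead and clears the marker.
    shard.promote_to_will_read(key, timestamp)
    shard.try_send_read(key, timestamp)
```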
KVSWrite¶
- from Executor
Purpose¶
Informs the Shard about a new write request, which happens in either of the following two cases:
- A TransactionExecutable has determined that it actually will write the value at some key for which it has a write (in its TransactionLabel). Now the Executor is sending that value to the Shard that stores the key.
- A TransactionExecutable has finished, and does not actually need to write a value for some key for which it has a lazy write (a may_write in the TransactionLabel).
Structure¶
| Field | Type | Description |
|---|---|---|
| timestamp | TxFingerprint | the logical time at which we are writing this data |
| key | KVSKey | the key used. With fancy hierarchical keys or suchlike, we could even specify a range of keys |
| datum | KVSDatum option | the new data to be associated with the key. A missing datum (None) should only be used in a "may_write" and means "don't change this value" |
Effects¶
A Shard should delay processing a KVSWrite until it has completed processing KVSAcquireLock for the same timestamp.
If the datum is None, then remove the "may write" marker from this timestamp in state. Any reads waiting to read what is written here must instead read from the previous write.

- One way to accomplish this is to copy the previous write as a "value written" at this timestamp in state.

If datum is occupied, then remove the "may write" or "will write" marker from this timestamp in state, and record the value written at this timestamp in state. This may trigger a KVSRead if there are any "will read" markers for which this timestamp is the unique previous write.
Triggers¶
- to Executor: KVSRead. For each "will read" lock dependent on this write, send a KVSRead to the relevant Executor with the value written.
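A minimal sketch of these effects, again with hypothetical shard methods (remove_marker, previous_value, record_value, notify_dependent_reads):

```python
def handle_write(shard, timestamp: int, key: str, datum) -> None:
    """Process a KVSWrite (after the matching KVSAcquireLock is processed)."""
    shard.remove_marker(key, timestamp)          # drop the may/will write marker
    if datum is None:
        # "Don't change this value": copy the previous write forward so that
        # reads depending on this timestamp resolve to the previous value.
        datum = shard.previous_value(key, timestamp)
    shard.record_value(key, timestamp, datum)
    # Any "will read" markers whose unique preceding write is this timestamp
    # can now be answered with a KVSRead.
    shard.notify_dependent_reads(key, timestamp)
```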
UpdateSeenAll¶
- from Mempool Engines
Purpose¶
In order to actually serve read requests, the Shard needs to know that it will not receive more write requests before a certain timestamp. These UpdateSeenAll messages are, in general, broadcast to all Shards.
It is important that the Worker Engine has received KVSLockAcquired messages for all KVSAcquireLocks it has sent (or will ever send) at or before the timestamp. In other words, the Shards already know about every read and write request of any TransactionCandidate that this worker curates and that could come earlier in the order.
Each Worker Engine only needs to send each Shard Engine UpdateSeenAll messages concerning worker-specific ordering (batch number and sequence number within the batch). This means that each Shard Engine needs to hear from every Worker Engine periodically to be sure it is not waiting for any transactions. From there, the Shard uses TimestampOrderingInformation about the Narwhal DAG and Consensus to fill in a total order.
Structure¶
| Field | Type | Description |
|---|---|---|
| timestamp | TxFingerprint | represents the position in the total order (in V1) |
| write | bool | "seen all read" and "seen all write" can (and should) be separate |
For V1, we only care about write = true, because we do not garbage collect and assume multi-version storage. From V2 onward, the Shard keeps additional ordering information and we do have garbage-collection protocols.
Effects¶
Shards can now identify the unique previous write prior to each read at or before this timestamp.
If that unique previous write has a value written and the read is marked "will read", the Shard can send a KVSRead with that value to the relevant Executor.
Triggers¶
- to Executor: KVSRead. For each "will read" for which we have established a unique previous write value, send a KVSRead message to the relevant Executor.
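A minimal sketch of the V1 behavior, assuming a hypothetical serve_ready_reads method that sends the newly resolvable KVSReads:

```python
def handle_update_seen_all(shard, timestamp: int, write: bool) -> None:
    """Advance the heardAllWrites bound and serve newly resolvable reads."""
    if write:  # in V1 only the write bound matters
        shard.heard_all_writes = max(shard.heard_all_writes, timestamp)
        # For every "will read" at or before the bound whose unique preceding
        # write now has a value, send a KVSRead to its Executor.
        shard.serve_ready_reads(up_to=shard.heard_all_writes)
```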
¹ For the purpose of this discussion, we call a write lock request a KVSAcquireLock message for a key for which a write request will or may be issued.