Shard¶
The Shards together store and update the state of the replicated state machine and together are a component of the Execution Engines. They provide Executors with input data and update the state according to the results of Executors' computations.
Different shards may be on different physical machines.
Each shard is responsible for a set of KVSKeys and these sets are disjoint for different shards. For each of the keys that a shard is responsible for, the shard maintains a (partially-ordered) timeline of Timestamps of transaction candidates that may read or write to keys. Shards also keep a history of data written by each TransactionCandidate to each key. This is multi-version concurrent storage.
State (of the shard)¶
For each Worker Engine, the Shard maintains:
- A Timestamp, such that all
write lock requests1 for
transaction candidates with earlier timestamps that this worker curates
have already been received.
Together, these timestamps represent
heardAllWrites
. - Another Timestamp, before which
the Shard will receive no further read requests from this
Worker Engine.
For WorkerEngine, this cannot be after the corresponding
write Timestamps.
We will also maintain these from each Read Backend worker.
Together, these represent
heardAllReads
.
For each key (assigned to this Shard):
- A set of timestamps of known
transaction candidates that read and/or write that key, and for
each, some subset of:
- A value written to that key at that timestamps by that TransactionCandidate using a KVSWrite message
- A marker indicating that this TransactionCandidate may (or will) write to this key, but this Shard has not yet received a corresponding KVSWrite message.
- A marker indicating that this TransactionCandidate will read this value, and an ExternalIdentity corresponding to the relevant Executor. This marker is only stored so long as the Shard doesn't know the value. When this value is determined, this Shard must remove this marker and send a KVSRead message to the Executor.
- A marker indicating that this TransactionCandidate may read this value, and an ExternalIdentity corresponding to the relevant Executor. If the Executor sends a KVSReadRequest for this key, the Shard updates this marker to a "will read" marker.
- If a Timestamp has no corresponding markers or values written, we don't have to store it.
- If a value written is before
heardAllReads
, and there are no pending reads or writes before it, then we can remove all earlier values written.
Additionally, the Shard maintains:
- A complete copy of the DAG structure produced by the
Mempool Engines.
This includes a set of all NarwhalBlockHeaders.
For Timestamps before
SeenAllRead
, if there are no keys with a pending read or write before that Timestamp, we can delete old DAG structure. - A complete copy of the sequence of Anchors chosen
by Consensus Engine.
This is a sequence of consensus decisions.
For Timestamps before
heardAllReads
, if there are no keys with a pending read or write before that Timestamp, we can delete old anchors.
Shard Optimizations¶
We want to execute each TransactionCandidate (evaluate the executor function in order to compute the data written) using the idea of serializability: each TransactionCandidate's reads and writes should be as if they were executed in the total order determined by the mempool (and [[Consensus
However, we want to compute concurrently as possible, for minimum latency. We do this using a set of optimizations.
Optimization: Per-Key Ordering¶
Mempool
and consensus provides ordering
information for the timestamps.
Thus, relative to each key,
transaction candidates can be totally ordered by the
Happens Before
relationship.
With a total ordering of transaction candidates, Shards can send
read information (KVSReads) to Executors as soon as the
previous TransactionCandidate is complete.
However, transaction candidates that access on disjoint sets of
keys can be run in parallel.
In the diagram above, for example, transaction candidates c
and
d
can run concurrently, as can transaction candidates e
and
f
, and transaction candidates h
and j
.
Optimization: Order With Respect To Writes¶
In fact, Shards can send read information to an Executor as soon
as the previous write's TransactionCandidate has completed
(sent a KVSWrite).
All Shards really need to keep track of is a total order of writes,
and how each read is ordered with respect to writes (which write it
precedes and which write preceded it).
As soon as the preceding write is complete (the Shard has received a
KVSWrite), the reads that depend on it can run concurrently.
There are no "read/read" conflicts.
In the diagram above,
for example, transaction candidates a
and b
can run
concurrently.
Optimization: Only Wait to Read¶
Because we store each version written
(multi-version concurrent storage),
we do not have to execute writes in order.
A Shard does not have to wait to write a later data version to a key
just because previous reads have not finished executing yet.
In the diagram above, for example, only green happens-before arrows
require waiting.
transaction candidates a
, b
, c
, and j
can all be executed
concurrently, as can transaction candidates d
, e
, and i
.
Optimization: Execute With Partial Order¶
Some mempools, including Narwhal, can provide partial order information on transactions even before consensus has determined a total order. This allows the Ordering Machine to execute some transactions before a total ordering is known. In general, for a given key, a shard can send read information to an executor when it knows precisely which write happens most recently before the read, and that write has executed.
heardAllWrites¶
In order to know which write happens most recently before a given
read, the Shard must know that no further writes will be added to
the timeline before the read.
Mempool and consensus should
communicate a lower bound on timestamps to the Shards, called
heardAllWrites
.
The Shard is guaranteed to never receive another KVSAcquireLock
with a write operation and
Timestamp before heardAllWrites
.
In general, a Shard cannot send a KVSRead for
a Timestamp unless
the Timestamp is before heardAllWrites
.
heardAllWrites
consists of a TxFingerprint from each
worker engine such that the worker engine is certain
(based on KVSLockAcquireds) that the Shard has already seen all
the KVSAcquireLocks it will ever send at or before that
TxFingerprint.
This can be on a per-key basis or simply a global lower bound.
Occasionally,
heardAllWrites
should be updated with later timestamps.
Each round of consensus should produce a lower bound for heardAllWrites
,
but the mempool may already have sent better bounds.
Each Shard must keep track of heardAllWrites
on
each key's multi-version timeline.
Transactions
(like transaction j
in the diagram below)
containing only write operations
can execute with a timestamp after heardAllWrites
,
but this simply means calculating the data they will write.
Since that does not depend on state,
this can of course be done at any time.
heardAllReads¶
We want to allow Typhon to eventually garbage-collect old state.
mempool and consensus should
communicate a lower bound timestamp to the execution engine,
called heardAllReads
,
before which there will be
no more read transactions send to the execution engine.
Occasionally, heardAllReads
should be updated with later timestamps.
Each Shard must keep track of heardAllReads
on
each key's multi-version timeline, so it can garbage-collect old values.
In the example above, our happens-before arrows have been replaced with may-happen-before arrows, representing partial ordering information from the mempool. Note that not all transactions can be executed with this partial order information.
Conflicts¶
There are three types of conflicts that can prevent a transaction from being executable without more ordering information.
- Write/Write Conflicts
occur when a shard cannot identify the most recent write before a given read.
In the diagram above,
transaction
e
cannot execute because it is not clear whether transactionb
or transactionc
wrote most recently to the yellow key.
- Read/Write Conflicts
occur when shard cannot identify whether a read operation occurs before or
after a write,
so it is not clear if it should read the value from that write or
from a previous write.
In the diagram above,
transaction
g
cannot execute because it is not clear whether it would read the data written to the blue key by transactiond
or transactioni
.
- Transitive Conflicts
occur when a shard cannot get the data for a read because
the relevant write is conflicted.
In the diagram above,
transaction
h
cannot execute because it cannot read the data written to the yellow key by transactiong
, since transactiong
is conflicted.
As the mempool and consensus provide
the execution engine with more and more ordering information, and
the partial order of timestamps is refined,
all conflicts eventually resolve.
In the diagram above,
suppose consensus orders transaction g
before transaction i
.
The Read/Write conflict is resolved:
transaction g
reads the data transaction d
writes to the blue key.
Then the transitive conflict is also resolved:
transaction h
will be able to execute.
-->
Optimization: Client Reads as Read-Only Transactions¶
With the above optimizations, transactions containing only read operations do not affect other transactions (or scheduling) at all.
Therefore, they can bypass mempool and consensus altogether.
Clients can simply send read-only transactions directly to the execution engine (with a label and a timestamp), and if the timestamp is after heardAllReads
, the execution engine can simply place the transaction in the timeline of the relevant shards and execute it when possible.
In the diagram above, transaction f
is read-only.
If client reads produce signed responses, then signed responses from a weak quorum of validators would form a light client proof.
Shard Incoming Messages¶
Shards receive and react to the following messages:
KVSAcquireLock¶
- from Mempool Worker
Purpose¶
Inform the shard about keys that a transaction may/will read and/or write, at a transaction fingerprint.
Structure¶
Field | Type | Description |
---|---|---|
lazy_read_keys |
KVSKey set | Keys this transaction may read (only send values read in response to KVSReadRequests) |
eager_read_keys |
KVSKey set | Keys this transaction will read (send values read as soon as possible) |
will_write_keys |
KVSKey set | Keys this transaction will write. Future reads are dependent only on the KVSWrite for this TxFingerprint. |
may_write_keys |
KVSKey set | Keys this transaction may write. Future reads are dependent on the KVSWrite for this TxFingerprint, or, if that has a None , the previous value. |
curator |
ExternalIdentity | the Worker Engine in charge of the corresponding transactions |
executor |
ExternalIdentity | the Executor for this TransactionCandidate |
timestamp |
TxFingerprint | specifies the transaction affiliated with these locks. |
The lazy_read_keys
and eager_read_keys
may not overlap.
In the same way, will_write_keys
and may_write_keys
must be
disjoint.
There must be one KVSAcquireLock
per Shard
per TxFingerprint: for a given Shard and TxFingerprint,
all information is to be provided in totality or not at all.1
Note that future versions may use some kind of structured Key
s to
encode "Sets" containing infinitely many Key
s.
For V1, however, simple HashSets or similar are fine.
Effects¶
- The Shard stores the respective "locks" for all keys in its timeline.
- these are the "markers" described in Shard State.
- The
eager_read_keys
will be served as soon as possible (by sendingKVSRead
-messages to the executor). - The Shard immediately informs the curator that the locks are acquired, using a KVSLockAcquired message, so the curator can prepare UpdateSeenAll messages.
Triggers¶
- to Worker Engine: KVSLockAcquired send a KVSLockAcquired message to the curator, signaling that the locks of this message will be accounted for.
- to executor: KVSRead
for each
recordedeager_read_key
in this shard's timeline for which the most recent written value is established: send a KVSRead message to the Executor.
KVSReadRequest¶
- from Executor
Purpose¶
Informs the Shard about a new read request, which happens in either of the following cases:
- An Executor has determined that it actually needs the value at some key for which it has a lazy read (a may_read in the TransactionLabel of the TransactionCandidate). Now the executor is requesting that value from the Shard that stores it.
- A Executor has finished and does not need the value for some key for which it has a lazy read (a may_read in the TransactionLabel).
Structure¶
Field | Type | Description |
---|---|---|
timestamp |
TxFingerprint | we need the value at this logical timestamp |
key |
KVSKey | the value corresponds to this key |
actual |
bool |
true iff we actually want a response |
If actual
is false
, this just means that there is no dependency on
this key in the current execution.
Effects¶
A Shard should delay processing a KVSReadRequest until it has completed processing KVSAcquireLock for the same timestamp.
Then, if actual
is false, the Shard is done reading the value, and
can remove the may read marker from state.
If actual
is true, the Shard replaces the may read marker with a
will read marker.
If the Shard knows the unique previous value written before
this timestamp, it sends that value in a KVSRead to
the Executor and removes the will read marker from state.
Otherwise, future KVSWrites and/or UpdateSeenAlls will
identify this unique previous value written, and trigger the
KVSRead.
Triggers¶
- to Executor: KVSRead
if
the Shard has determined the unique value written prior to this "lock"then
send a KVSRead-message to the relevant Executor to inform them of the value
KVSWrite¶
- from Executor
Purpose¶
Informs the Shard about a new write request, which happens in either of the following two cases:
- A TransactionExecutable has determined that it actually will write the value at some key for which it has a write (in its TransactionLabel). Now the Executor is requesting that value from the Shard that stores it.
- A TransactionExecutable has finished, and does not actually need to write a value for some key for which it has a lazy write (a may_write in the TransactionLabel).
Structure¶
Field | Type | Description |
---|---|---|
timestamp |
TxFingerprint | the logical time at which we are writing this data. |
key |
KVSKey | the key used. With fancy hierarchical keys or suchlike, we could even specify a range of keys |
datum |
KVSDatum option | the new data to be associated with the key. No datum should only be used in a "may_write," and means don't change this value |
Effects¶
A Shard should delay processing a KVSWrite until it has completed processing KVSAcquireLock for the same timestamp.
If the datum
is None
, then remove the may write marker from
this timestamp in state.
Any reads waiting to read what is written here must instead read from
the previous write.
- One way to accomplish this is to copy the previous write as a
"value written" at this timestamp in state.
If datum
is occupied, then remove the may write or will write
marker from this timestamp in state, and record the
value written at this timestamp in state.
This may trigger a KVSRead if there are any will read markers for which this timestamp is the unique previous write.
Triggers¶
- to Executor: KVSRead
for each
will read lock dependent on this write: send a KVSRead to the relevant Executor with the value written.
UpdateSeenAll¶
- from Mempool Engines
Purpose¶
In order to actually serve read requests, the Shard needs to know that it will not receive more write requests before a certain timestamp. These are in general broadcast to all Shards.
It is important that the Worker Engine has received KVSLockAcquired-messages for all KVSAcquireLocks it has sent (or will ever send) at or before the timestamp. In other words, shards know about all possible read and write requests of TransactionCandidates for which the worker is curator and may come earlier.
Todo
rephrase the above paragraph
Each worker engine only needs to send the Shard Engine UpdateSeenAll messages concerning worker-specific ordering (batch number and sequence number within the batch). This means that each Shard Engine needs to hear from every Worker Engine periodically to be sure it is not waiting for any transactions. From there, the Shard uses TimestampOrderingInformation about the Narwhal DAG and Consensus to fill in a total order.
Structure¶
Field | Type | Description |
---|---|---|
timestamp |
TxFingerprint | represents a the position in the total order (in V1) |
write |
bool |
seen all read and seen all write can (and should) be separate. |
For V1, we only care about write = true
because we don't garbage collect and assume multi-version storage.
From V2 onward,
the Shard is keeping additional ordering information
and we do have garbage collection protocols.
Effects¶
Shards can now identify the unique previous write prior to each read at or before this timestamp.
If that unique previous write has a value written, and the read is marked will read, they can send a KVSRead with that value to the relevant Executor.
Triggers¶
- to Executor: KVSRead
for each
will read for which we have established a unique previous write value send aKVSRead
message to the relevant Executor
TimestampOrderingInformation¶
- from Mempool
Purpose¶
While each transaction comes with a Timestamp
, the shards do not actually know the order of those timestamps until the DAG is built, and consensus decisions are made. This message represents the mempool communicating (partial) timestamp ordering. These are broadcast to all shards.
Structure¶
Todo
One way to convey this is to include the entire DAG structure (albeit without the transaction contents of each worker batch). For now, I do not know what the internal structure of this message looks like.
Todo
check whether, ① worker-timestamp (= tx fingerprint), ② primary-timestamp (= pure DAG structure based on blocks/headers), ③ consensus-timestamp (= total order) are sufficiently many cases or we need yet another intermediate gradual step of ordering. E.g., does is make sense to also take into account local headers that (without integrity votes).
Effects¶
As shards learn more ordering information, they can finally complete reads (since they learn which writes most recently occurred).
Triggers¶
- to Executor:
KVSRead
for each
locked key for which we have established a unique write value, send aKVSRead
message to the appropriate Executor
AnchorChosen¶
- from Consensus
Purpose¶
Inform shards about the most recently decided value by the consensus.
Structure¶
Field | Type | Description |
---|---|---|
chain_id |
ChainId |
the chain Id |
learner |
Learner |
learner (in V2 this is always \(\red\alpha\)) |
height |
Height |
height of the block |
anchor |
NarwhalBlock |
the value that consensus decided upon |
Effects¶
The shard learns more ordering information. In particular, with this and enough TimestampOrderingInformation
messages, it should be able to order all transactions before the new anchor
.
Once we have enough ordering information to establish the unique write preceding a key on which there is a read lock, and we have a value for that write, we can send that value to the relevant Executor.
Triggers¶
- to Executor:
KVSRead
for each
locked key for which we have established a unique write value, send aKVSRead
message to the appropriate Executor.
(Wiki) links on this page
- Executor
- Executor
- KVSKey
- TransactionCandidate
- TransactionCandidate
- Worker Engine
- KVSAcquireLock
- Worker Engine
- TxFingerprint
- TransactionCandidate
- TxFingerprint
- TransactionCandidate
- KVSWrite
- TransactionCandidate
- KVSWrite
- TransactionCandidate
- ExternalIdentity
- Executor
- KVSRead
- Executor
- TransactionCandidate
- ExternalIdentity
- Executor
- Executor
- KVSReadRequest
- Consensus Engine
- TransactionCandidate
- TransactionCandidate
- TransactionCandidate
- Consensus Engine
- TxFingerprint
- TransactionCandidate
- TransactionCandidate
- KVSRead
- Executor
- TransactionCandidate
- TransactionCandidate
- TransactionCandidate
- TransactionCandidate
- TransactionCandidate
- Executor
- TransactionCandidate
- KVSWrite
- KVSWrite
- TransactionCandidate
- TransactionCandidate
- TransactionCandidate
- Consensus Engine
- KVSAcquireLock
- KVSRead
- TxFingerprint
- Worker Engine
- Worker Engine
- KVSLockAcquired
- KVSAcquireLock
- TxFingerprint
- Consensus Engine
- Consensus Engine
- Consensus Engine
- Worker Engine
- Executor
- KVSKey
- KVSReadRequest
- KVSKey
- KVSKey
- KVSWrite
- TxFingerprint
- KVSKey
- KVSWrite
- TxFingerprint
- ExternalIdentity
- Worker Engine
- ExternalIdentity
- Executor
- TransactionCandidate
- TxFingerprint
- Shard
- TxFingerprint
- Shard
- TxFingerprint
- Shard
- Shard
- Executor
- Shard
- Worker Engine
- KVSLockAcquired
- Worker Engine
- UpdateSeenAll
- Worker Engine
- KVSLockAcquired
- KVSLockAcquired
- Worker Engine
- Executor
- KVSRead
- KVSRead
- Executor
- Executor
- Executor
- TransactionLabel
- TransactionCandidate
- Executor
- TransactionLabel
- TxFingerprint
- KVSKey
- Shard
- KVSReadRequest
- KVSAcquireLock
- TxFingerprint
- TxFingerprint
- KVSRead
- Executor
- KVSWrite
- UpdateSeenAll
- KVSRead
- Executor
- KVSRead
- KVSRead
- Executor
- Executor
- TransactionExecutable
- KVSKey
- TransactionLabel
- Executor
- Shard
- TransactionExecutable
- KVSKey
- TransactionLabel
- TxFingerprint
- KVSKey
- KVSDatum
- Shard
- KVSWrite
- KVSAcquireLock
- TxFingerprint
- TxFingerprint
- TxFingerprint
- TxFingerprint
- TxFingerprint
- KVSRead
- TxFingerprint
- Executor
- KVSRead
- KVSRead
- Executor
- TxFingerprint
- Shard
- Worker Engine
- KVSLockAcquired
- KVSAcquireLock
- TxFingerprint
- TransactionCandidate
- Worker Engine
- UpdateSeenAll
- Worker Engine
- TxFingerprint
- TxFingerprint
- KVSRead
- Executor
- Executor
- KVSRead
- Executor
- KVSAcquireLock
- KVSWrite
-
For the purpose of this discussion, we call a write lock request a KVSAcquireLock message for a key for which a write request will or may be issued. ↩↩