---
title: Merkle-patricia Dac
# katex
---

# Definition

## Merkle-patricia Trees

A Merkle-patricia tree is a way of hashing a map, an associative array,
such that entries can be added to the map, or removed from it, without
having to rehash the entire map, and such that one can prove that a
subset of the map, such as a single mapping, is part of the whole map,
without needing to have the whole map present to construct the hash.

Its practical application is constructing a global consensus on what
public keys have the right to control what digital assets (such as
crypto currencies, and globally defined human readable names) and
proving that everyone who matters agrees on ownership.

If a large group of peers, each peer acting on behalf of a large group
of clients each of whom have rights to a large number of digital assets,
agree on what public keys are entitled to control what digital assets,
then presumably their clients also agree, or they would not use that
peer.

Thus, for example, we don’t want the Certificate Authority to be able
to tell Bob that his public key is a public key whose corresponding
secret key is on his server, while at the same time telling Carol that
Bob’s public key is a public key whose corresponding secret key is in
fact controlled by the secret police.

The Merkle-patricia tree not only allows peers to form a consensus on an
enormous body of data, it allows clients to efficiently verify that any
quite small piece of data, any datum, is in accord with that consensus.

## Patricia trees

A patricia tree is a way of structuring a potentially very large list of
bitstrings sorted by bitstring, with a binary point implied somewhere in
the bitstring, such that bitstrings can be added or deleted without
re-sorting or shifting the whole list of bitstrings.

A patricia tree defines a set of keys, the keys of its leaf nodes. Or
perhaps the keys _are_ its leaf nodes, all the information may well be
in the strings defined. The keys may be bounded on the left, but
unbounded on the right, for example strings; unbounded on the left but
bounded on the right, for example arbitrary precision integers; bounded
both left and right, for example sixty four bit integers; or unbounded
on either side, for example binary arbitrary precision floating point.

If unbounded left, then the edge pointing at the root node has to give
its height above the binary point. The bitstring of the root node is
always the empty string, but where is that empty string positioned in
height with respect to the binary point?

We cannot reference nodes by bitstring in the canonical form, because
the number of leading zeroes in the bitstring will change over time as
the tree gets deeper – we have to represent nodes by their height plus
the bitstring starting at the first non zero bit, or by the key, which
is the bitstring with a one bit and several zero bits appended, to align
the significance of bits in different bitstrings, in which case again
only the first non zero bit matters.

When we give a chain of vertexes, starting at the root vertex, the
compact representation of the location of a vertex in the tree is its
vertical position in the tree, plus the bitstring starting at the first
nonzero bit. If it is a tree of items with fixed right bound, items
identified by their integer sequence, then we give the height above the
leaves, since this will not change as the tree grows. If fixed left
bound, for example names as utf8 strings, the depth from the root. If
neither bound is fixed, a case we are unlikely to have to deal with, the
signed height or depth from some arbitrary starting point which will
never change as the tree grows. Thus we always need to implicitly or
explicitly define the bit alignment in the bitstring.

If a leaf is in a patricia tree representing values with a fixed right
bound, for example oids, the usual case, then the bitstring of a leaf is
its oid, or its oid minus one, which does not need leading zeroes.

We need to be able to represent a bitstring containing all zeroes, thus
if a bitstring contains any ones, we have to represent that one. The
height, therefore, is the alignment of the right edge of the bitstring,
hence we can leave out leading zeroes in the bitstring, and indeed must
leave them out of the canonical form of a tree representing values with
no fixed left bound, such as indefinite precision integers, so that the
canonical form does not change when a parent node is placed on top of a
former root node.

The key to a node, whether a vertex or a leaf, is the bitstring aligned
by padding it with a one bit, followed by as many zero bits as needed.
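
A minimal sketch of this key convention, assuming the bitstring is held
in the low bits of a 64-bit word (the names and the fixed 64-bit width
are illustrative, not part of the canonical form):

```cpp
#include <cassert>
#include <cstdint>

// Pack a patricia bitstring of `len` significant bits (held in the low bits
// of `bits`, most significant bit of the bitstring first) into a fixed
// 64-bit key by aligning it to the top of the word and right padding it
// with a one bit followed by zero bits.
uint64_t make_key(uint64_t bits, unsigned len) {
    assert(len < 64);                              // leave room for the padding one bit
    uint64_t key = (len == 0) ? 0 : bits << (64 - len);
    key |= uint64_t(1) << (63 - len);              // the padding one bit marks the end
    return key;
}

// Recover the length of the bitstring: the padding one bit is the lowest
// set bit of the key, so the length is 63 minus its position.
unsigned key_length(uint64_t key) {
    unsigned trailing = 0;
    while (((key >> trailing) & 1) == 0) ++trailing;
    return 63 - trailing;
}
```

With this convention the empty bitstring of the root becomes the key
0x8000000000000000, and keys of different lengths still have their
significant bits aligned.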

The total number of vertexes equals twice the number of leaves minus
one. Each parent node has as its key the bit string, a sequence of bits
not necessarily aligned on bit boundaries, that both its children have
in common. This creates the substring problem for patricia trees mapping
keys that have no right bound, mapping variable length keys. We cannot
permit one key of the map to be the prefix of another key. If, however,
the key is self delimiting, as with null terminated strings, no key can
be the prefix of another key, and this tends to be the usual way that
variable length values are used as map keys. There are a wide variety of
too clever by half ways of dealing with prefix keys, but they all
involve messing up a rather elegant algorithm with considerable
complexity and surprising code paths in special cases and fencepost
cases. It is better just to not allow prefix keys, as for example by
having strings null terminated. If we really wanted to define arbitrary
bit strings as leaf keys of a patricia tree, which I doubt we will, it
would be better to encode them in self delimiting format. No string in
self delimiting format can be the prefix of another string.

A Merkle-patricia dac is a patricia tree with binary radix (which is the
usual way patricia trees are implemented) where the hash of each node
depends on the hash and the skip of its two children, which means that
each node contains proof of the entire state of all its descendant
nodes.

The skip of a branch is the bit string that differentiates its bit
string from its parent, with the first such bit excluded as it is
implied by being a left or right branch. This is often the empty
bitstring, which when mapped to a byte string for hashing purposes, maps
to the empty byte string.
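
A sketch of that rule, assuming a 32 byte digest and using a toy
stand-in for the real cryptographic hash (the digest type, the
serialization of the skip, and the helper names are illustrative only):

```cpp
#include <array>
#include <cstdint>
#include <vector>

using hash_t = std::array<uint8_t, 32>;

// Toy FNV-1a fold standing in for the real cryptographic hash (for example
// SHA-256), used here only so the sketch compiles and runs.
hash_t H(const std::vector<uint8_t>& bytes) {
    hash_t out{};
    uint64_t h = 1469598103934665603ull;
    for (uint8_t b : bytes) { h ^= b; h *= 1099511628211ull; }
    for (int i = 0; i < 32; ++i) out[i] = uint8_t(h >> ((i % 8) * 8));
    return out;
}

// The hash of an interior vertex depends on the hash and the skip (the
// child's bitstring relative to the parent, leading bit dropped) of each of
// its two children, so a vertex commits to the whole subtree below it.
struct child_ref {
    hash_t hash;                  // hash of the child vertex or leaf
    std::vector<uint8_t> skip;    // skip bitstring serialized to bytes, often empty
};

hash_t hash_vertex(const child_ref& left, const child_ref& right) {
    std::vector<uint8_t> buf;
    auto append = [&](const child_ref& c) {
        buf.push_back(static_cast<uint8_t>(c.skip.size()));   // length prefix
        buf.insert(buf.end(), c.skip.begin(), c.skip.end());
        buf.insert(buf.end(), c.hash.begin(), c.hash.end());
    };
    append(left);                 // left child first, then right child
    append(right);
    return H(buf);
}
```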

It would often be considerably faster and more efficient to hash the
full bitstring, rather than the skip, and that may sometimes be not
merely OK, but required, but often we want the hash to depend only on
the data, and be independent of the metadata, as when the leaf index is
an arbitrary precision integer representing the global order of a
transaction, that is going to be constructed at some later time and
determined by a different authority.

Most of the time we will be using the tree to synchronize two sets of
pending transactions, so though a count of the number of children of a
vertex or an edge is not logically part of a Merkle-patricia tree, it
will make synchronization considerably more efficient, since the peer
that has the node with fewer children wants information from the peer
that has the node with more children.

# Representation

The canonical form will not directly reflect the disk organization.

The canonical form of a sparse tree is that each vertex is represented
by the hash of its two children, and the bitstring of the offset of each
child from its parent, minus the leading bit of that bitstring. The root
node, of course, has an empty bitstring.

Often it will be more compact to transmit the child itself rather than
the hash, in reverse polish notation order, from which the hash can be
generated.

To form the hash of a node, we need the hashes and relative bitstrings
of its children, but if we already have the children, identified by
reverse polish position in the stream or by their bitstrings relative to
a common ancestor, we don’t need and should not represent the hashes,
giving them implicitly rather than explicitly.

Since we cannot hash a bitstring, only a bytestring, the bitstring will
be hashed in its representation as a bytecount represented by a variable
precision integer, followed by that many bytes, with the bitstring being
padded if needed to an integer number of bytes by adding a one bit
followed by as many zero bits as needed. Thus a two hundred and fifty
six bit bitstring requires a count of thirty three, plus thirty three
bytes, the last byte being 0x80.
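
A sketch of that byte serialization, assuming the bitstring arrives as a
vector of bools, most significant bit first, and assuming for simplicity
that the byte count fits in a single byte rather than the stream's
variable precision integer format:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Serialize a bitstring to its canonical hashable form: a byte count,
// followed by the bits padded to a whole number of bytes with a one bit
// and then as many zero bits as needed.
std::vector<uint8_t> canonical_bytes(const std::vector<bool>& bits) {
    std::vector<bool> padded = bits;
    padded.push_back(true);                          // the padding one bit
    while (padded.size() % 8 != 0) padded.push_back(false);

    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(padded.size() / 8));   // byte count
    for (std::size_t i = 0; i < padded.size(); i += 8) {
        uint8_t byte = 0;
        for (int j = 0; j < 8; ++j) byte = (byte << 1) | (padded[i + j] ? 1 : 0);
        out.push_back(byte);
    }
    return out;
}
// A two hundred and fifty six bit bitstring thus serializes to a count of
// thirty three followed by thirty three bytes, the last byte being 0x80.
```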

In memory as a C++ object, the bitstring may conveniently be represented
by an integer of at least sixty four bits, with the bitstring bit
aligned so that the significant bits in one bitstring line up with bits
of the same significance in another bitstring, and padded right with a
one bit followed by as many zero bits as needed. In the canonical form,
however, the left edge of the bitstring vertex identifier is the left
edge of the bitstring, and the length of the bitstring is the depth of
the vertex from the root. The left edge of a relative bitstring,
identifying a child, is one bit to the right of the bitstring identifier
of its parent. The child’s bitstring vertex identifier is the parent
bitstring vertex identifier, plus a zero bit for the left child and a
one bit for the right child, plus the bits of the relative bitstring.
The canonical hash of the parent is the hash of its left child, plus the
relative bitstring of its left child, and similarly for its right child.

Conceptually and canonically, it is equivalent to a patricia tree where
the children of a node are identified by hashes rather than pointers.
The hashes are taken over the canonical form, and are unaffected by
location in memory and the representation, which is not necessarily
canonical. If the actual representation is in a database, it is likely
to be represented in a way that makes recursive sql statements work.

Since we in practice cannot find the thing referred to by its hash, any
actual representation of the canonical form must contain additional
information telling us where to find the data referred to, but this
additional information is likely to vary from one situation to the next,
and is not canonical.

For this canonical form to work as a direct representation, we would
need a universal way of finding the pre-image of any hash, which would
be costly, and would deny us some useful cryptographic capabilities
where a party reveals a pre-image and a hidden part of the
Merkle-patricia tree – typically when a transaction goes bad, he would
then make public the conversation leading to it. Also, the tree will
eventually grow enormous, and have numerous side chains attached to it,
in which case only the party or parties operating the side chain can
reverse their pre-images.

But the forms actually used should be a representation of a Merkle
patricia tree with hashes and skip fields in place of pointers.

## Balanced binary trees of fixed height.

We will represent an immutable and ever growing data structure as a
collection of balanced binary trees, and a balanced binary tree of fixed
height makes much of the information in this representation redundant,
which suggests that it may be desirable to use a more efficient and
direct canonical form – to ensure that the immutable append only data
structure is canonically immutable and append only.

The schere pointing at a balanced binary tree will say that it is a
balanced binary tree whose leaves are objects of a certain type (have a
certain schema) and give the height of the root, assuming that they all
have the same schema. If they have different schemas, then the leaves
will be of type schere.

The patricia bit string for each vertex of the balanced binary tree is
implicitly given by its position within the tree, so we do not represent
it in the canonical form, though we may well represent it in the actual
representation.

## Hashing

Hashing depends on the schema – to hash the bitstream, one has to parse
it into fields and records by the schema, and distinguish between index
nodes and record nodes, which are hashed and represented differently, a
record node being self contained, and an index node depending on its
relationships.

In a dac, rather than a tree, an index node might be referenced by
multiple different entities, so in that case we want the hash to only
depend on the part of the key field that it governs, independent of the
parent part of its key field.

Further, a transaction is a group of records, and we want to represent a
transaction locally, so that its records are physically close together
in storage.

Which implies a transformation, that the canonical form, which knows
nothing about storage location, can have portions represented in a
position relative form, in which a group of records is kept in depth
first tree order, with the boundaries of the group having hashes linking
them to the outside, but internally, when converting back into canonical
form, we recalculate the hashes. For a given schema we might do this one
way in one context, or another way in another context, or have
subschemas.

Of course, tree order assumes we have a tree. In general, we have a dac,
not a tree, the most important case here being the tree of names, where
we are continually issuing new roots for the tree, but we don’t want to
continually issue new leaves.

In the canonical form of the Merkle-patricia tree we act as if hashes
were reversible. Of course they are not, nor do we have a general
universal look up table for reversing them. Rather, you have to hit up a
server that can reverse the hashes you care about, which it may do by
looking up a ginormous hash table, or more likely do by having the
Merkle-patricia tree on disk or in memory in the ordinary patricia form
of links pointing at file relative or absolute locations on disk or in
memory.

# Blockchain

Of course we want more than this – a Merkle-patricia block chain,
meaning *an immutable past and a constantly changing present*.

Which represents an immutable and ever growing sequence of transactions,
and also a large and mutable present state of the present database that
is the result of those transactions, the database of unspent transaction
outputs.

When we are assembling a new block, the records live in memory as native
format C++ objects. Upon a new block being finalized, they get written
to disk in key order, with implementation dependent offsets between
records and implementation dependent compression, which compression
likely reflects canonical form. Once written to disk, they are accessed
by native format records in memory, which access by bringing disk
records into memory in native format, but the least recently loaded
entry, or least recently used entry, gets discarded. Even when we are
operating at larger scale than visa, a block representing five minutes
of transactions fits easily in memory.

Further, a patricia tree is a tree. But we want, when we have the Merkle
patricia tree representing registered names organized by name, or the
Merkle-patricia tree representing as yet unspent transaction outputs,
its Merkle characteristic to represent a directed acyclic graph. If two
branches have the same hash, despite being at different positions and
depths in the tree, all their children will be identical. And we want to
take advantage of this in that the block chain will be a directed
acyclic graph, each block being a tree representing the state of the
system at that block commitment, but that tree points back into previous
block commitments for those parts of the state of the system that have
not changed. So the hash of the node in such a tree will identify,
probably through an OID, a record of the block it was originally
constructed for, and its index in that tree.

A Merkle-patricia directed acyclic graph, Merkle-patricia dac, is a
Merkle dac, like a git repository or the block chain, with the patricia
key representing the path of hashes, and acting as index through that
chain of hashes to find the data that you want.

The key will thread through different computers under the control of
different people, thus providing a system of witness that the current
global consensus hash accurately reflects past global consensus hashes,
and that each entity’s version of the past agrees with the version it
previously espoused.

This introduces some complications when a portion of the tree represents
a database table with more than one index.

[Ethereum has a discussion and
definition](https://github.com/ethereum/wiki/wiki/Patricia-Tree) of this
data structure.

Suppose, when the system is at scale, we have a thousand trillion
entries in the public, readily accessible, and massively replicated part
of the blockchain. (I intend that every man and his dog will also have a
sidechain, every individual, every business. The individual will
normally not have his side chain publicly available, but in the event of
a dispute, may make a portion of it visible, so that certain of his
payments, and the invoices they were payments for, become visible to
others.)

In that case, a new transaction output is typically going to require
forty thirty two byte hashes, taking up about two kilobytes in total on
any one peer. And a single person to person payment is typically going
to take ten transaction outputs or so, taking twenty kilobytes in total
on any one peer. And this is going to be massively replicated by a few
hundred peers, taking about four megabytes in total.

(A single transaction will typically be much larger than this, because
it will mingle several person to person payments.)

Right now you can get a system with sixty four terabytes of hard disk
and thirty two gigabytes of ram for under six thousand, for south of a
hundred dollars per terabyte, so storing everything forever is going to
cost about a twentieth of a cent per person to person payment. And a
single such machine will be good to hold the whole blockchain for the
first few trillion person to person payments, good enough to handle
paypal volumes for a year.

“OK”, I hear you say. “And after the first few trillion transactions?”

Well then, if we have a few trillion transactions a year, and only a few
hundred peers, then the clients of any one peer will be doing about ten
billion transactions a year. If he profits half a cent per transaction,
he is making about fifty million a year. He can buy a few more sixty
four terabyte computers every year.

The target peer machine we will write for will have thirty two gigabytes
of ram and sixty four terabytes of hard disk, but our software should
run fine on a small peer machine, four gigabytes of ram and two
terabytes of hard disk, until the crypto currency surpasses bitcoin.

Because we will employ fixed size transaction units – larger currency
amounts will be broken into tens, twenties, fifties, hundreds, two
hundreds, five hundreds, thousands, two thousands and so forth – and
because we will be using a blockchain in the form of a Merkle-patricia
dac, our transactions will take up several times as much space as a
similar bitcoin transaction, and currently bitcoin transactions take up
several hundred megabytes. But this is OK, because the Merkle-patricia
dac gives client wallets far more power than on the bitcoin system, so
we can get by with far fewer peer wallets and far more client wallets.

------------------------------------------------------------------------

# Packed Form

Assume we have an ordered sequence of records, as if the result of a
database query with the index fields first.

Suppose we have a potentially very large sequence of records, and we
want to generate the Merkle-patricia hash, so that we can generate an
efficient proof that one and only one record with a certain prefix
appears in this pile.

We want to convert from sequence of records form to patricia form.

If we look at the difference between each pair of records, we get an
index node, which is the position of the first bit at which they differ.
Which is a patricia tree expressed in infix order. We want to convert
infix order to postfix order, a sequence of records interleaved with a
sequence of bit positions at which two records differ.

Or equivalently, hash them as if we already had them in postfix order.

Suppose we want to output self delimiting records, interleaved with
postfix indicators of the difference position. We want the least index
output last, so that we can do a second sequential pass, hashing each
record when we encounter a record and putting that hash on the stack,
and when we encounter an index, hashing two hashes from the stack with
the index and putting the resulting hash on the stack.

So, we find the difference position between the current record and the
next, then we output the current record, and make the next record the
current record. If the current difference position is less than the
difference position on top of the stack, we output difference positions
from the stack until the difference position on top of the stack is less
than the current difference position, since while it is greater, the
node represented by the difference position on the stack is a child node
of the current node, a completed subtree that has to be output first. We
then put the current difference position on the stack, and repeat for
the next record.

A similar algorithm simply generates the hash as if we had the literal
nodes.
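
A sketch of that second pass, assuming the records arrive sorted by key
and using a toy hash in place of the real canonical hashing rules (the
helper names `hash_leaf`, `hash_node` and `first_difference` are
illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using hash_t = uint64_t;   // toy digest, stands in for a real cryptographic hash

hash_t hash_leaf(const std::string& record) {
    hash_t h = 1469598103934665603ull;
    for (unsigned char c : record) { h ^= c; h *= 1099511628211ull; }
    return h;
}
hash_t hash_node(hash_t left, hash_t right, int diff_pos) {
    return ((left * 1099511628211ull) ^ right) * 1099511628211ull
           ^ static_cast<hash_t>(diff_pos);
}

// Position of the first bit at which two keys differ, treating each record
// as a big endian bit string of its bytes.
int first_difference(const std::string& a, const std::string& b) {
    std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        unsigned char x = static_cast<unsigned char>(a[i] ^ b[i]);
        if (x) {
            int bit = 0;
            while (!(x & 0x80)) { x <<= 1; ++bit; }
            return static_cast<int>(i) * 8 + bit;
        }
    }
    return static_cast<int>(n) * 8;   // prefix keys are assumed not to occur
}

// Fold the sorted records into the Merkle-patricia root hash, emitting the
// tree implicitly in postfix order: a stack entry is a pending subtree hash
// tagged with the difference position of the index node it is waiting for.
hash_t merkle_patricia_root(const std::vector<std::string>& sorted_records) {
    std::vector<std::pair<int, hash_t>> stack;
    for (std::size_t i = 0; i < sorted_records.size(); ++i) {
        hash_t h = hash_leaf(sorted_records[i]);
        int diff = (i + 1 < sorted_records.size())
                       ? first_difference(sorted_records[i], sorted_records[i + 1])
                       : -1;        // sentinel: after the last record, flush everything
        // Pop completed subtrees: their difference positions are deeper
        // (greater) than the index node we are about to push.
        while (!stack.empty() && stack.back().first > diff) {
            h = hash_node(stack.back().second, h, stack.back().first);
            stack.pop_back();
        }
        if (diff < 0) return h;     // last record: h is now the root hash
        stack.emplace_back(diff, h);
    }
    return 0;                       // empty input
}
```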

A full path to a leaf node, proving that the leaf node is represented in
the tree, contains not the hashes of the things in the path, but the off
path hashes and off path keys.

An incomplete tree has the same data structure as a full tree, but with
missing nodes. A full path is a more compact representation of an
incomplete tree, and is treated as a compressed form of an incomplete
tree with a single leaf node present. You can regenerate an incomplete
tree from a full path, and a full path can be generated for any non
missing leaf node in an incomplete tree.

Rather than a chain of blocks, we have a Merkle-patricia dac of blocks,
where the index is the block number. This means that the state of any
block can be proved to be part of the global consensus with a proof of
length logarithmic in the total block number. Thus peers can provide
clients with short proofs, so that clients do not have to take
assertions by peers on trust.

We also have a small number of Merkle-patricia dacs representing the
state of the system at any given block time, for example a Merkle
patricia dac of unspent transaction outputs, and a trie linking globally
unique human readable names to probabilistically unique public keys.
These trees change each new block, though their state in past blocks is
immutable. Each new block contains not the entire new Merkle-patricia
dac, which is apt to be enormous, but only those nodes that have
changed. A new block contains the new roots of the new Merkle-patricia
dacs and their new descendants, which link to unchanged descendants in
past blocks.

Peers synchronize their state by sharing new information to form a new
block. They efficiently discover what they have in common, and what is
new, by sharing the root of the Merkle-patricia dac describing the new
block, and then give each other the new information, after the fashion
of usenet news.

A new block is not just a list of new events, but it is generated from a
list of new events, which are themselves listed in a Merkle-patricia
dac. To compare nodes in the tree of new events, a peer sends its
neighbours an offset into the hash, and then, for the node and each of
its descendants, the leading part of that node’s key and a one bit flag
to indicate if it has children, together with:

- a hundred and twenty eight bits of the hash of the node itself,
- sixty bits of hash for each of its two children,
- twenty eight bits of hash for each of the four grandchildren,
- thirteen bits of hash for each of the eight great grandchildren,
- six bits of hash for each of the sixteen great great grandchildren,
- three bits of hash for each of the thirty two great great great
  grandchildren,
- and two bits of hash for each of the sixty four great great great
  great grandchildren.
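
A hypothetical shape for one entry of that comparison message (the
struct, its field names, and the exact widths are illustrative, not a
defined wire format):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One node summary in the comparison message: deeper entries carry fewer
// bits of hash, roughly halving at each level below the node being compared.
struct node_summary {
    std::vector<bool> key_prefix;        // leading part of the node's key
    bool has_children;                   // the one bit flag
    std::array<uint8_t, 16> hash_bits;   // up to a hundred and twenty eight bits of hash
    uint8_t hash_bit_count;              // 128, 60, 28, 13, 6, 3 or 2 depending on depth
};
```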

This tells it which subtrees are definitely different, and which are
definitely new, and for each subtree definitely different, it sends more
comparison data for that node, and for each subtree definitely missing,
it sends that subtree. Once there are no more subtrees to be sent, it
repeats the process starting at the root once again.

When it has some information about which subtrees are definitely
missing, which are probably the same, and which are definitely
different, it then sends much the same, but for each one missing, sends
the full subtree, for each probably the same, skips, and for each
definitely different, doubles the level of detail – repeats with a
different part of the hash, and twice as many bits.

So now we need more than a one bit flag. We need to distinguish between
the cases:

1. no children
2. sending a full item – a leaf node
3. skipping a child subtree because we are likely in agreement
4. dropping down to a lower level of detail, for example from thirteen
   bits of hash to six bits of hash. If down to two bits of hash, we are
   always going to leave out the children.
5. not dropping down to a lower level of detail, keeping the same
   number of bits in the hash.

The intent is to discover what parts of the tree we have agreement on,
and send an image of the tree with those parts skipped over. Rinse and
repeat. We annotate our model of the tree with the probability that a
subtree is identical. If the other guy sent us a leaf node, we know the
node is identical, and every hash fragment that agrees creates
exponentially increasing probability that a subtree is identical. If we
recently got a leaf node from source, not from sharing, we know for sure
that the other guy does not have it, and send it to him.

Ok, that covers information gathering, but what about the final stages,
when we are going to throw out data that was late in coming, for the
sake of consensus?

A proposed final hash of all items to be in a block for a certain period
is announced. And now, the job is to get those items that are missing,
and tag those items that are not in the proposed final hash, and exclude
them from a version of the tree. So instead of “definitely not present
in the other guy’s hash” meaning you send the guy your item, it now
means you exclude it, and see if that gets your root hash to agree.

A peer in good standing endorses the proposed final hash, the root of
the block, if it can get its trees to agree.

When building a block, the peers share these new events. When coming to
a consensus, the peers attempt to get agreement on the new events in the
block. But the block will also contain diffs on the Merkle-patricia dac
of unspent transaction outputs, and the Merkle-patricia dac of spent
transaction inputs. The peers need to maintain these trees so that
clients can see proof of consensus on the tries, so that a peer cannot
mislead a client, and a peer should only vote for a consensus if it can
generate the same root hash for the block, and thus has the same tries
describing the entire block chain and providing access to the block
chain for clients.

After agreement on the Merkle-patricia dac of all new items for the new
block, a peer generates the revisions of the other Merkle-patricia dacs,
and adds them to the block as the generated items that go into the
block, and the peers should get final agreement. If a peer gets final
agreement – all block items are valid – then it votes for the consensus.

The block contains the root of a Merkle-patricia dac that contains all
previous blocks (but not the current block) – thus not so much a block
chain, as a block trie, which means that the proof of any fact about the
state of the block trie is reasonably short, of order log N hashes,
where N is the number of items in the block trie.

# Node identifiers

We will call the thing that identifies a node a node infix order, and
the member of the subset that the patricia tree identifies a key -
because we are generally using it as a map key. Part of the map key is
part of the node infix order.

But the canonical form of the map key is a bitstring, which is what we
are going to hash.

The node infix order is the representation of the bitstring with
significant bits in the different bitstrings aligned, padded on the
right with a one bit and as many zero bits as needed, and for a tree of
quantities unbounded on the left, padded with a one bit and as many zero
bits as needed on the left.

In practice all integers have some finite bound, but one does not want
the computer word length to affect the data on the wire or the result of
the hash, so it is usually preferable to structure the data and
algorithms so that the actual left bound has no effect provided it is
sufficiently large, as if the integers had unbounded precision. We will
not actually need numbers larger than 2^64^ until Anno Domini ten
thousand or so, but when we do, it will have no effect on blockchain
format or previous hash results, hence for integers we will use a tree
with right bound but no left bound.

But a patricia tree is bit string oriented. So for integer indexes, in
order that we can ascertain which nodes correspond to the leaf nodes, we
need to have, associated with the root node, the length of the bit
string for the leaf node. The length needs to be a run time value,
rather than a compile time value. But for hashes, it can be a compile
time value.

To map from bit strings to byte strings, and to have a bit string index
for leaf nodes and tree nodes, we either append 10000\... or 011111\...
to the bitstring, to get a one to one map from byte strings to bit
strings, and from bit strings to integers.

Compiler intrinsics are generally ffs ([find first
set](https://infogalactic.com/info/Find_first_set)) or ctz (count
trailing zeros). For Microsoft compilers use `_BitScanForward` &
`_BitScanReverse`. For GCC use `__builtin_ffs`, `__builtin_clz`,
`__builtin_ctz`. So it is probably faster to represent a bit string as a
word string by appending 100000\... than 0111111\.... See [Microsoft
`__lzcnt()`](https://docs.microsoft.com/en-us/cpp/intrinsics/lzcnt16-lzcnt-lzcnt64?view=vs-2019)
and [gcc
`__builtin_clz()`](https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html).
`_BitScanReverse` is portable between processors, and lzcnt is not, so
you need a runtime check at the start of the program to see if your code
can run.
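
A minimal sketch of the sort of portable wrapper this implies, using
only intrinsics that actually exist on each compiler and falling back to
a plain loop elsewhere (the function name is illustrative):

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Count leading zeros of a nonzero 64-bit word, preferring a compiler
// intrinsic where one is available.
inline unsigned leading_zeros(uint64_t x) {
    // caller must ensure x != 0
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_clzll(x));
#elif defined(_MSC_VER) && defined(_M_X64)
    unsigned long index;
    _BitScanReverse64(&index, x);
    return 63u - static_cast<unsigned>(index);
#else
    unsigned n = 0;
    while (!(x & (uint64_t(1) << 63))) { x <<= 1; ++n; }
    return n;
#endif
}
```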

If the leaf nodes correspond to the integers 1 to N, where N is at most
m bits long, then the bit strings of the leaf nodes are m+1 bits,
because we have 2N-1 vertexes in the tree, and 2N edges.

# Sparse or sequential, complete or partial, Merkle-patricia trees.

## Sparse and complete

As, for example, a Merkle-patricia tree of a map that maps globally
unique human readable and writeable names to public keys.

A patricia tree with no right bound on its keys is necessarily sparse.
Thus, for example, in a patricia tree of strings terminated by a null
byte, the null byte ensures a gap between the address spaces governed by
two successive keys.

If a node in the tree has direct children (no skips) or the leaf nodes
are sequential and contiguous, then its hash is the hash of the hashes
of its children. If the tree is sparse, part of what we know is the
gaps, hence the parent node has to hash the skip links. If the tree is
sequential and contiguous, then the root node has to directly hash its
direct descendants, and also the size of the tree.

A block is a sparse tree, while a blockchain is a sequential tree of
sparse trees. The sparse tree in a block will itself contain sequential
trees, and the root hashes of numerous sequential trees.

In a sequential Merkle-patricia tree, the hash of the parent node is
simply the hash of the hashes of its two child nodes, but in a sparse
tree, we want two maps from keys to objects with the same sequence of
the same objects, but different key values, to have different hashes, so
the hash of a parent node has to be the hash of the hashes of its two
child nodes, plus the hash of the portions of each child’s key that
govern the skip links to those two children. The portion of the child’s
key that governs the skip link is (level difference - 1) bits long.

But computers do not handle bit fields easily, and databases do not
handle them at all, plus, how do you hash a bit string, such as the leaf
indicator?

Any variable length field creates ambiguity in the hash, so that two
values could be hashed as the same stream of bytes. To avoid this
outcome, the canonical format for a bit string will be an integer in
stream format specifying the number of bytes, followed by the bit
string, followed by a zero bit, followed by enough one bits to fill to
the next byte boundary. If the representation of a bit string as an
integral number of bytes has some 0x00 bytes at the end, it is not in
canonical format and gets truncated before hashing till it no longer has
any 0xff trailing bytes.

If a bit string has a start position that is not necessarily byte
aligned, and is known from context, we left pad it with zeroes. If its
start position is not known from context, we provide the starting bit
position as an integer in stream format. At this point in the code we
are deep in the bitbashing weeds, and are no longer worried about
passing the bit string around as a regular byte string.

Assuming no prefix problem, one way or another, then the index of a node
can be two fields, the bit string, and the number of bits in the bit
string. But since we already have to represent the number of bytes in
the field representing the bit string, we might use the 01111..
canonical format trick, so that we can use the more familiar, standard,
and convenient infix order of byte string, in which case we will have to
pad the map key to form the node infix order.

However we do this, it is an implementation detail that should not
affect the canonical form or the root hash, and the appended 0111\...
form simplifies fencepost problems on interpreting what is on the wire.
Otherwise we are always going to be bothered by distinguishing the bit
string 0101000 from the bit string 0101. One less `if` to screw up.

Since a sequential Merkle-patricia tree always maps the integers from
zero to n-1 to the objects, hashing link information is redundant, we do
not actually need any link information. It is sufficient that the root
hash defines the objects and the sequence.

## Sequential and complete

The tree bears a simple and natural relationship to a linear vector of
leaves and a linear vector, one smaller, of nodes.

Each node appears with a position in the linear vector one less than the
position of the leaf that required it to be added to the tree.

Size eight Merkle-patricia tree, each vertex labelled by:

- patricia bitstring
- right padded patricia bitstring (key)

::: {style="text-align: center;"}
“”\
1000\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
“0”\
0100\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
“00”\
0010\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
"000"\
0001\
:::

::: {style="float: right; text-align: center; width: 45%;"}
"001"\
0011\
:::
:::

::: {style="float: right; text-align: center; width: 45%;"}
"01"\
0110\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
"010"\
0101\
:::

::: {style="float: right; text-align: center; width: 45%;"}
"011"\
0111\
:::
:::
:::

::: {style="float: right; text-align: center; width: 45%;"}
"1"\
1100\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
"10"\
1010\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
"100"\
1001\
:::

::: {style="float: right; text-align: center; width: 45%;"}
"101"\
1011\
:::
:::

::: {style="float: right; text-align: center; width: 45%;"}
"11"\
1110\

::: {.clear}
:::

::: {style="float: left; text-align: center; width: 45%;"}
"110"\
1101\
:::

::: {style="float: right; text-align: center; width: 45%;"}
"111"\
1111\
:::
:::
:::
:::

::: {.clear}
:::

\
\
\
\
\

\
\
\
\
\

We want to collapse the tree into a linear list, so that we can find the
correct node without walking one bit at a time through a binary tree,
both in order to represent relatively small blocks of hashes, and also
to represent the top of very large trees.

We extend each bit string to a fixed and uniform size by appending a one
bit, followed by as many zero bits as are necessary to fill to the
standard size, so that we can use uniform length bit strings, instead of
variable sized bit strings, so that we can use them directly to access
an array in memory, or OIDs in a database.

The resulting padded bit strings (keys) are in infix order. But for an
immutable append only file or database, we want postfix order, which is
harder. And we don't want to be restricted to only having a power of two
objects. We want to be able to have an arbitrary number of objects, and
add an arbitrary number of objects without changing the existing tree.

A binary postfix tree, power of two, is going to look like this:

For an append only structure, the position of a leaf node is the number
of prior leaf vertexes, plus the number of prior internal vertexes, and
for an sql append only database, the oid of a leaf node is the number of
prior leaf nodes, plus one.

So a leaf oid is:
$$\displaystyle\frac{key}{2}+1$$
where $key$ is the patricia bitstring right padded to a fixed size by a
one bit, followed by as many zero bits as needed.

Let $\displaystyle{C_{key}}$ be the count of bits in the key (which in
C++ is `bitset.count()`, but C++ provides no access to cool intrinsic
assembly instructions).

The number of internal vertices prior to a leaf is
$$\displaystyle{\frac{key}{2}+1-C_{key}}$$

So the leaf position in postfix order is:
$$\displaystyle{size_{leaf}*\frac{key}{2}+size_{vertex}*({\frac{key}{2}+1-C_{key}})}$$

Let the height of a vertex be $h_{key}$, the number of trailing zeroes
in the key.

So the postfix vertex position, supposing vertexes are in a separate
data structure, is
$\displaystyle{size_{vertex}*\frac{key}{2}+2^{h_{key}}-C_{key}}$\
Is that right?\
Need to check it.

I think we could also get the postfix vertex oid by
$\displaystyle{\frac{{(key-1)} | {key}}{2}+1-C_{key}}$, but again, needs
checking.

Given the vertex oid and the leaf oid, the absolute position in the
direct immutable append only file is easy to calculate.
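
A sketch of the parts of that arithmetic that are straightforward,
assuming the key is the right padded patricia bitstring held in a 64-bit
word, with fixed record sizes for leaves and internal vertexes (the
function names are illustrative):

```cpp
#include <cstdint>

// Count of set bits in the key, C_key in the formulas above.
inline unsigned popcount64(uint64_t x) {
    unsigned n = 0;
    while (x) { x &= x - 1; ++n; }   // clear the lowest set bit each iteration
    return n;
}

// OID of a leaf in an append only database: one plus the number of prior leaves.
inline uint64_t leaf_oid(uint64_t key) { return key / 2 + 1; }

// Number of internal vertexes preceding the leaf in postfix order.
inline uint64_t prior_vertexes(uint64_t key) { return key / 2 + 1 - popcount64(key); }

// Byte offset of a leaf in the postfix ordered append only file.
inline uint64_t leaf_position(uint64_t key, uint64_t size_leaf, uint64_t size_vertex) {
    return size_leaf * (key / 2) + size_vertex * prior_vertexes(key);
}
```

For the size eight tree above, the leaf "010" has key 0101, so its leaf
oid is three and one internal vertex precedes it in postfix order.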

Everyone seems to wind up using regular [C bit twiddling hacks], because
hardware intrinsics are erratically available, and because the
efficiency improvement of hardware intrinsics is seldom worth the
thought.

[C bit twiddling hacks]:
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
"Bit Twiddling Hacks"

[uint64_t bitcount(uint64_t c)](../bit_hacks.h)
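
A sketch of the kind of routine that header presumably provides: the
classic parallel (SWAR) population count from the bit twiddling hacks
page, needing no intrinsics:

```cpp
#include <cstdint>

// Count the set bits of a 64-bit word without hardware intrinsics.
uint64_t bitcount(uint64_t c) {
    c = c - ((c >> 1) & 0x5555555555555555ull);                            // pairs
    c = (c & 0x3333333333333333ull) + ((c >> 2) & 0x3333333333333333ull);  // nibbles
    c = (c + (c >> 4)) & 0x0f0f0f0f0f0f0f0full;                            // bytes
    return (c * 0x0101010101010101ull) >> 56;                              // sum bytes
}
```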

To do the reverse operation, finding the key (the padded patricia index)
from the postfix position, make the starting guess that the $C_{key}$
adjustment was zero, find the corresponding patricia key, and then walk
the tree from where you guessed you were to where you should be. You
find the predicted postfix position of your guess, find the order of the
highest order bit where they differ, and walk the postfix position and
padded patricia key (infix position) in parallel.

## Adding to a sequential and complete Merkle-patricia tree

Well that solves the problem of a postfix tree, but how do we apply this
to solve the problem of the number of items not being a power of two?

### A sequential append only collection of postfix binary trees

<svg
xmlns="http://www.w3.org/2000/svg"
xmlns:xlink="http://www.w3.org/1999/xlink"
width="29em" height="17em"
viewBox="8 209 164 102"
style="background-color:ivory" stroke-width=".6" stroke-linecap="round">
<g font-family="'Times New Roman'" font-size="6" font-weight="400" fill-rule="evenodd">
<g id="height_3_tree" fill="none" >
<path stroke="#b00000"
d="
M71.36 234.686s2.145-.873 3.102 0c1.426 1.303 14.645 21.829 16.933 23.136 1.302.745 4.496.45 5-2.3
M145.916 220c0-.93.124-.992.992-1.364.869-.373 2.42-.373 3.04.558.62.93-2.852-4.94 18.607 38.394.715 1.443 2.348 1.186 4-2
M147.218 218.5c1.303-.124 1.675.062 2.11.93.434.868.558 3.846.558 3.846-.25 2.496.31 3.597-1.365 19.166-1.675 15.568-1.54 21.825-.744 24.872.744 3.853 3.0 2.853 5.2 .295
M71.36 234.686c2.42-.434 2.916-.93 6.079-.186 3.163.745 4.466 1.551 12.715 5.52 8.25 3.97 37.774 3.66 41.31 2.606C134.999 241.57 136 240 137 239
M71.36 234.686s1.551-.558 2.171.186c.62.745 2.481 4.528 1.8 10.545-.683 6.016-2.854 20.719-2.854 22.577 0 2.171 1.116 2.482 2.543 1.8C76.447 269.11 76 268 77 264"/>
<path stroke="#80e080"
d="
M10 253c-3.536 5.335-4.29 7.765-4.466 12.095-.139 3.405.677 6 3.3 2.094
M18 245c1.799-1.55 20.903-7.319 35.603-10.855 14.7-3.535 18.918-7.69 52.35-8.621 33.432-.93 36.037-.869 38-3
M21 248c2.17-.372 6.744-1.343 10.792-1.736 5.858-.57 18.925-2.228 29-4.094 3.36-.728 7.673-1.492 7.618-2.812
M16.963 253.232c-.0 0 3 4 5.2 4.2 1.7 .7 4.093 .6 7-2.3"/>
<path stroke="#000000"
d="
M70.077 236c-.162-6.288 74.008-3.04 76-12
m-7.417 12c.161-6.127 9.306-6.425 8.5-12"/>
<g id="height_2_tree">
<path stroke="#b00000"
d="
M31.3 250.83c1.054-.372 2.046-.744 2.357.31.31 1.055-.571 11.044.682 15.569C36.013 272.766 38 269.675 39 266"/>
<path stroke="#000000"
d="
M29.767 252c1.774-4 38.858-5.904 39.18-12
M60 252c0-4.875 10.41-6.871 11.205-12"/>
<g id="height_1_tree">
<path
style="stroke:#B00000"
d="m 10,264.1 c 1,-2.2 3.2,-2 3.85,-0.4 C 15,267 20,273 21,266"
id="prev_leaf_link"/>
<path stroke="#000"
d="M10.09 264c0-4 18.868-.062 19.174-8.062M21.866 264c0-3.008 8.893-1.544 9.513-8"/>
<g id="leaf_vertex" >
<g style="stroke:#000000;">
<path
d="M 11.7,265 8,271
M 11.7,265 9.5,271
M 11.7,265 11,271"
id="path1024" />
</g>
<rect id="merkle_vertex" width="4" height="4" x="8" y="264" fill="#00f"/>
</g>
<use width="100%" height="100%" transform="translate(12)" xlink:href="#leaf_vertex"/>
<use width="100%" height="100%" transform="translate(20 -12)" xlink:href="#merkle_vertex"/>
</g>
<use width="100%" height="100%" transform="translate(30)" xlink:href="#height_1_tree"/>
<use width="100%" height="100%" transform="translate(60 -28)" xlink:href="#merkle_vertex"/>
</g>
<g width="100%" height="100%" >
<use transform="translate(68)" xlink:href="#height_2_tree"/>
<use transform="translate(136 -44)" xlink:href="#merkle_vertex"/>
<use transform="translate(144)" xlink:href="#height_1_tree"/>
</g>
</g>
<g id="blockchain_id" >
<ellipse cx="14" cy="249" fill="#80e080" rx="8" ry="5"/>
<text>
<tspan x="11.08" y="251.265">id</tspan>
</text>
</g>
<rect width="168" height=".4" x="8" y="276" fill="#000"/>
<text y="278">
<tspan dy="8" x="6" >Immutable append only file as a Merkle chain</tspan>
</text>
<use transform="translate(0,50)" xlink:href="#blockchain_id"/>
<path
style="fill:none;stroke:#80e080;"
d="m 18,297 c 4,-6 4,-6 5.6,3 C 25,305 28,304 28.5,300"/>
<g id="4_leaf_links">
<g id="2_leaf_links">
<g id="leaf_link">
<path
style="fill:none;stroke:#000000;"
d="m 29,299 c 4,-6 4,-6 5.6,3 C 35,305 38,304 38.5,300"/>
<use transform="translate(20,33)" xlink:href="#leaf_vertex"/>
</g>
<use transform="translate(10,0)" xlink:href="#leaf_link"/>
</g>
<use transform="translate(20,0)" xlink:href="#2_leaf_links"/>
</g>
<use transform="translate(40,0)" xlink:href="#4_leaf_links"/>
<use transform="translate(80,0)" xlink:href="#4_leaf_links"/>
<use transform="translate(140,33)" xlink:href="#leaf_vertex"/>
<text y="208">
<tspan dy="8" x="6" >Immutable append only file as a collection of</tspan>
<tspan dy="8" x="6" >balanced binary Merkle trees</tspan>
<tspan dy="8" x="6" >in postfix order</tspan>
</text>
</g>
</svg>

The superstructure of balanced binary Merkle trees allows us to verify
any part of it with only $O(\log n)$ hashes, and thus to verify that one
version of this data structure that one party is using is a later
version of the same data structure that another party is using.

This reduces the amount of trust that clients have to place in peers.
When the blockchain gets very large there will be rather few peers and a
great many clients, thus there will be a risk that the peers will plot
together to bugger the clients. This structure enables a client to
verify that any part of the blockchain is what his peer says it is, and
thus avoids the risk that a peer may tell different clients different
accounts of the consensus. Two clients can quickly verify that they are
on the same total order and total set of transactions.

Edges of the graph are represented by hashes, and thus can only be
travelled from right to left. Vertices are represented by their hash,
and in their canonical form contain child hashes and their full padded
Merkle patricia key within a tree as an arbitrary precision integer, and
thus their implicit postfix position within a tree, which identities
provide implicit edges that can be traversed in any direction with
respect to a given total order, while hash edges are agnostic of total
order, and are what we construct the consensus about total order from.

The bottom most part of the structure consists of data structures that
do not have a canonical order. But when we are figuring out how to order
them, we have to construct vertices on top of them that do have a
canonical order, where each vertex contains a hash commitment to a total
past in total order and a patricia key representing its position in the
total order.

When the chain becomes very big, sectors and disks will be failing all
the time, and we don't want such failures to bring everything to a
screaming halt.

And, when the chain becomes very big, most people will be operating
clients, not peers, and they need to be able to ensure that the peers
are not lying to them.

If our initial tree has a size of zero, this is the same as creating a
sequential and complete Merkle-patricia tree.

Since the whole point of a Merkle tree is immutable entities, we seldom
want to insert, delete, or update anything but the right hand edge of a
sequential Merkle-patricia tree, and normally only want to insert on the
right hand edge.

Thus a sequential Merkle-patricia tree is not exactly a block chain,
since each block does not contain the hash of the previous block. If it
did, you would potentially have to receive and calculate a lot of hashes
of hashes to ascertain that block one thousand did indeed chain to block
five hundred. This structure means that clients can calculate the
validity of the block chain for those parts of it that contain
transactions that concern them, and know that everyone else doing
similar calculations is getting results that show the same consensus as
they are getting, thus calculating the validity of the block chain is
distributed to all clients without all clients needing to deal with the
entire block chain.

If all the peers get together to screw over one client or a few clients,
that client is going to have cryptographic proof of misconduct. On
publishing that proof, large numbers of people are likely to blacklist
those peers, resulting in a fork in the blockchain. We could automate
this process, with everyone automatically disregarding the signatures of
peers for which a proof existed that they had changed the rules in a way
inconsistent with the rules implemented in a client, so that if nine
tenths of the peers change their software, and nine tenths of the
clients do not, we automatically get a fork with nine tenths of the
clients on one block chain with a tenth of the peers, and one tenth of
the clients on the blockchain with nine tenths of the peers.

This architecture allows a client peer arrangement where to pervert the
blockchain, you have to synchronously pervert everyone’s software
everywhere, or most people’s software, whereas with bitcoin, a few big
miners can pervert the blockchain.

Bitcoin will fail because power over the blockchain lies with a few big
miners, and governments will eventually twist their arms, or themselves
get into the business of mining. This is inherent in the scaling
problems of bitcoin. Back when everyone was a miner, and everyone had
the whole blockchain on their machine, power was distributed in a way
that ensured that the blockchain was conducted according to consensus
rules, but as the blockchain gets bigger and bigger, fewer and fewer
people host the complete blockchain, and fewer and fewer people mine. So
we are back to the situation we wanted to escape, where a rather small
number of people have power over other people’s money.

A client has partial trees for all his transactions in the blockchain,
and if all clients check their own particular part of the block chain,
the entire blockchain is checked.

Well this is a great layout if we have a data structure that fits
entirely in memory, and it is a great layout if we have an enormous
mutable file and are appending some bits to it, and have log n chunks
mapped into memory leading to the part where we are appending stuff.

But if we are frequently constructing partial hash trees here and there,
the trouble is we have log n non local things to look up. If we are
putting a big Merkle-patricia tree in a database, it is better to index
in infix order by the size of the bit string, then by the content of the
bit string filled out to the long word size with infix order bits, or,
if it does not fit in a long word, as a blob. (sqlite sorts nulls first,
then integers, then strings, then blobs, and if you write something as
an integer and read it as a blob, it gets converted to a string
representing a decimal.)

Suppose we want a literally immutable append only file, representing a
sequential patricia Merkle tree, where we append each completed binary
tree?

Well in that case the high nodes have to follow each completed subtree,
and our postfix order would be:

|     |     |     |     |     |     |     |     |
|----:|----:|----:|----:|----:|----:|----:|----:|
| 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|     | 00  |     | 01  |     | 10  |     | 11  |
|     |     |     | 0   |     |     |     | 1   |
|     |     |     |     |     |     |     | “”  |

Thus, to construct the postfix order for the leaf position in the tree
using right side nodes, leaf `n` has postfix order
`(n<<1)-std::bitset<64>(n).count()`. Which is considerably more
complicated than the infix order for leaves with nodes uniformly
intermixed with them.

Further, the not a power of two case is more complicated. To figure out
which nodes at the right side are skipped, we have to have the infix
order of the node. Then we pack all the non skipped right hand nodes
following the last leaf node. There is no simple way to identify a
skipped node from the postfix order.
|
||
|
||
But the postfix order has the huge advantage for enormous data
|
||
structures that when you are constructing, access is sequential and
|
||
forwards only, and if you are making an ever larger file, the additions
|
||
are append only.
|
||
|
||
But this is essentially a plan to build our own database, presumably
|
||
with a second file containing a copy of only higher nodes, and a third
|
||
file containing a copy of higher nodes still, until one gets down to
|
||
copy of the high nodes that fits entirely within memory. X64 has 64 byte
|
||
blocks in its top level cache, and can virtual seventy terabyte files
|
||
into its copious address space. 4kilobyte blocks for disk access tend to
|
||
be fastest but 64 kilobyte blocks are only marginally slower, though in
|
||
some algorithms they waste more memory. Microsoft recommends 64 kilobyte
|
||
disk blocks for servers. This suggests a structure in which every time
|
||
we have a new two byte word in the bitstring, we have a different offset
|
||
corresponding to a different area on disk and memory, a hierarchy of
|
||
files each 256 or 65536 times smaller than the other.
|
||
|
||
A radix 256 Merkle-patricia tree on top of a radix 2 Merkle-patricia
|
||
tree would be referencing sixteen kilobyte blocks, which sounds like it
|
||
is near the sweet spot, in which case the cache file for the equivalent
|
||
of the bitcoin blockchain will fit into a gigabyte of ram.
|
||
|
||
It is premature to think of designing this. After two years at bitcoin
|
||
volumes, our blockchain will be two hundred gigabytes, at which point we
|
||
might want to think of a custom format for sequential patricia trees as
|
||
a collection of immutable files, append only files which grow into
|
||
immutable files, and wal files that get idempotently accumulated into
|
||
the append only files.
|
||
|
||
For the first level nodes, the ones directly above the leaves, the
postfix order is 000🡒00010, 001🡒00101, 010🡒01001, 011🡒01100,
100🡒10001, 101🡒10100, 110🡒11000, 111🡒11011. We append two zeroes,
subtract the count of the bits of the bitstring, and add two.

Or equivalently, we take the infix order of the first level node,
subtract the count of the bits of the infix order, and add two.

For the second level nodes 00🡒00110, 01🡒01101, 10🡒10101, 11🡒11100, we
append three zeroes, subtract the count of the bits, and add six to get
the postfix order.

For the third level nodes 0🡒01110, 1🡒11101, we append four zeroes, add
fourteen, and subtract the count of the bits to get the postfix order.

For the fourth level node, null🡒11110, we append five zeroes, add
thirty, and presumably subtract the number of bits to get the postfix
order.

The general formula appears to be: append level+1 zeroes to the
bitstring, add 2^(level+1)−2, and subtract the count of the bits.

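A sketch of that general formula, under the convention that leaves are
level zero and the level one nodes sit directly above them; the checks
in the trailing comment reproduce the worked examples above:

```cpp
#include <bitset>
#include <cstdint>

// Postfix order of the node labelled `bits` at `level`:
// append level+1 zeroes, add 2^(level+1) - 2, subtract the popcount.
std::uint64_t node_postfix(std::uint64_t bits, unsigned level) {
    return (bits << (level + 1))
         + ((std::uint64_t(1) << (level + 1)) - 2)
         - std::bitset<64>(bits).count();
}

// node_postfix(0b000, 1) == 0b00010, node_postfix(0b11, 2) == 0b11100,
// node_postfix(0, 4) == 0b11110, and for level 0 it reduces to the
// leaf formula (n<<1) - popcount(n).
```
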
Reversing the postfix order to get the bitstring and the level seems
rather hard. You have to truncate, then find `x: (a+x).count()=x`. And
since there is no clean and elegant way of finding `count()`, it is not
likely that there is a clean and elegant way of finding `x`. But it is
very easy, given the level, to find the parent, the children, and the
sibling, since these are at fixed offsets. Looks like when iterating
through structures in postfix order, you have to keep the level and the
bitstring implicitly or explicitly around, whereas with infix order
there is a clean and simple relationship between the infix order and
level and bitstring.

To reconstruct the bitstring from the postfix order, the fastest way is
probably to construct it one bit at a time by conceptually walking the
tree from the root until we match the postfix order, not recalculating
count every time, but incrementing it every time we move right in the
tree, and representing the level by the power of two that we will add
to the bitstring when we move right.

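A sketch of that walk, assuming a complete tree whose root level is
known; as a variant it updates the postfix index of the current node
directly on each descent, so the count never needs recomputing at all,
and it shifts the bitstring left rather than adding powers of two:

```cpp
#include <cstdint>
#include <utility>

// Recover (bitstring, level) from postfix index p, for a complete tree
// whose root is at root_level (leaves are level 0).
std::pair<std::uint64_t, unsigned>
postfix_to_node(std::uint64_t p, unsigned root_level) {
    std::uint64_t bits = 0;
    unsigned level = root_level;
    // postfix index of the current node, the root to start with
    std::uint64_t here = (std::uint64_t(1) << (level + 1)) - 2;
    while (p != here) {
        // each child subtree holds 2^level - 1 nodes
        std::uint64_t child_subtree = (std::uint64_t(1) << level) - 1;
        std::uint64_t left_child = here - 1 - child_subtree;
        --level;
        if (p <= left_child) {          // target lies in the left subtree
            bits <<= 1;
            here = left_child;
        } else {                        // target lies in the right subtree
            bits = (bits << 1) | 1;
            here = here - 1;            // right child immediately precedes parent
        }
    }
    return {bits, level};
}
// e.g. postfix_to_node(9, 3) yields the node 10 at level 1.
```
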
## Incomplete.

We take a list of hashes and their node infix orders (the offset for
sequential trees, and the bit strings for left trees), and stuff them
into a map mapping node ids to hashes, using a map that allows random
and sequential access. The representation of a sparse and incomplete
tree is similar to the representation of a sequential and incomplete
vector. We provide the nodes that allow the construction of the chain
of hashes from the object to be authenticated, but we do not provide
their children.

## Sparse.

In a sparse Merkle-patricia tree, we are not going to do bit bashing on
the key, because it is likely inconveniently large, and because we are
likely to be counting the height from the left. But we might do it for
code re-use reasons, where the keys are integers of moderate size.

It is customary to define the least significant bit as bit zero, which
is the convention I have followed in the description of sequential
patricia Merkle trees. So in a patricia tree with node height measured
from the right hand side, nodes have a height, which is zero for leaf
nodes, and variable for the root node. In a sparse tree, one is apt to
measure the depth from the left, so nodes have a depth, not a height.
The depth of the root node is always zero, and leaves have a variable
depth. To make the code templatable, a patricia tree type will need to
have a member `static constexpr bool left_edge = true;` for trees with
node height measured from the left edge, and `false` for trees measured
from the right edge. The depth of bit three of a byte in position seven
in an array is `8*7 − 3`. The node depth is equal to the position of
the bit that selected that node. If a node is the parent of all keys
with the most significant bit set in the start byte of the key, it has
depth one.

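A small sketch of how that might look as code; the tag struct names are
invented for illustration, and only the `left_edge` member and the
`8*7 − 3` depth example come from the text:

```cpp
#include <cstddef>

// Hypothetical tree-type tags carrying the member described above.
struct sparse_tree_tag     { static constexpr bool left_edge = true;  };
struct sequential_tree_tag { static constexpr bool left_edge = false; };

// Depth of bit `bit` (least significant bit is bit zero) of the byte at
// array position `pos`, per the example in the text: 8*pos - bit.
constexpr std::size_t bit_depth(std::size_t pos, unsigned bit) {
    return 8 * pos - bit;
}
static_assert(bit_depth(7, 3) == 8 * 7 - 3, "example from the text");
```
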
The standard algorithm for entering a new item assumes you are using
the patricia tree as the infix order, but databases and map files do
not do this. When we are big and successful, we will write a version of
sqlite that supports sharding and Merkle-patricia infix orders.

And because the tree is sparse, it is probably coming in somewhat non
sequentially, with updates in random locations, though block updates
are likely to be sequential.

Nonetheless, the analysis of sequential blocks suggests we will get
more locality of reference on disk and memory if we put all the
patricia nodes, leaf and vertex, in one big table, with the key of a
vertex node locating it close to the leaf node that necessitated its
creation, though we will need the bit offset to be part of this table.
The height means we don’t need keys of different levels to be
guaranteed unique, because when we are after a particular level, we use
the height, likely measured from the right hand side, to select the
correct candidate in the improbable event of a collision.

The bitstring of a node infix order is, in practice, going to be sorted
on whole bytes or whole words, rather than its exact bit length,
resulting in collisions, therefore we have to sort on the bitstring and
height, or, as in a sequential patricia tree, pad the bitstring with a
representation of the height. Which for long keys is going to be a bit
verbose, though hardly a show stopper. We could pad the bitstring with
the difference between the height and the height rounded to the nearest
number of whole words, but this would have the irritating result that
256 bit hashes become 257 bit hashes, unless we truncated to 255 bits,
which is harmless but nonstandard and apt to result in complications
all over the place. And we would still be looking at the height, if
only to know the number of bytes or words.

The analysis of sequential hash trees suggests that it is going to be
more efficient to have all nodes of all levels in the same table,
possibly including leaf nodes. We could store the complete keys of a
parent and its children in a node, but this is rather redundant. If we
take the bit field offset from the left, rather than from the right as
in sequential hash trees, then we truncate the key, conceptually, at
the bit indicated by the bit offset, and only have the parts of the
keys of the children that are beyond the bit offset in the node.

In a sparse tree, we are just looking up the key to find the thing
referred to, which is probably referenced by an oid, so it may well be
a pointless indirection to have leaf nodes in the same table as their
parent node, which already has both the key to the child and an oid to
the child. Maybe we just have a leaf flag that says “this oid is not an
oid to yet another internal node, but an oid to something else,
somewhere else”.

Two maps, one of them being the map that you are making a Merkle
patricia hash of, and one containing the nodes. The nodes point to
other nodes, or they themselves contain the leaf value, the leaf value
being the hash of the map key and object being hashed.

On the wire, we don’t have a representation of the sparse tree. The
nodes get generated on the receiving end when we send the objects
themselves. But we do have representations of incomplete sparse trees,
which are just a list of nodes with their hash values and keys, but not
their child links, that you will need to construct the root node from a
subset of the objects.

In rare and exceptional cases, the key of the root node may not be the
empty bit string, so when specifying the root node of a sparse tree, we
have to specify its bit string. But when we send a partial tree, we
assume the recipient does not have, and does not intend to build, the
complete tree, and that he receives the root node by some special path
that authenticates it.

A node with children will be inherently different to a node without,
because it is going to need the data to construct the infix orders of
its children, which is a good argument for putting them in distinct
tables. If we put leaf hashes and vertex hashes in the same table, we
are massively violating normal form. On the other hand, Sqlite3
supports this by having variant typing. For sparse trees, it is not
usually a good idea to put the children in the same table. For
sequential trees, it is usually the way to go.

## Merkle-patricia Tree of 256 bit values.

Typically these are going to be a tree of public keys and or hashes.

Of course a database cannot handle infix order values that are not a
multiple of eight bits, and is likely to prefer a multiple of sixteen
bits, so we will make the key field of the infix order the prefix bits
that all the leaves of a node have in common, with the bit string 0
appended, rounded up to a whole number of bytes with additional one
bits.

The depth of a node may be determined by taking the number of bytes
times eight, and subtracting the number of trailing bits up to and
including the final zero bit. Leaf nodes will be special cased.

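A sketch of that encoding and of recovering the depth under the stated
padding rule; a 64 bit prefix stands in for the real 256 bit keys, the
function names are placeholders, and well formed keys (ending in the
appended zero plus one-bit padding) are assumed:

```cpp
#include <cstdint>
#include <vector>

// Encode a node prefix of `depth` bits, held in the top bits of a 64
// bit word: append a 0 bit, then 1 bits up to a whole number of bytes.
std::vector<std::uint8_t> encode_key(std::uint64_t prefix, unsigned depth) {
    unsigned nbytes = (depth + 1 + 7) / 8;      // room for the prefix plus the 0 bit
    std::vector<std::uint8_t> key(nbytes, 0);
    for (unsigned i = 0; i < depth; ++i)        // prefix bits, most significant first
        if (prefix & (std::uint64_t(1) << (63 - i)))
            key[i / 8] |= std::uint8_t(0x80) >> (i % 8);
    // the appended 0 bit is already 0; set the trailing 1 bits
    for (unsigned i = depth + 1; i < nbytes * 8; ++i)
        key[i / 8] |= std::uint8_t(0x80) >> (i % 8);
    return key;
}

// Depth = bytes times eight, minus the trailing bits up to and
// including the final zero bit.
unsigned decode_depth(const std::vector<std::uint8_t>& key) {
    unsigned bits = static_cast<unsigned>(key.size()) * 8;
    unsigned i = bits;
    while (i > 0 && (key[(i - 1) / 8] >> (7 - (i - 1) % 8)) & 1)
        --i;                                    // skip the trailing 1 bits
    return i - 1;                               // drop the final zero bit too
}
```
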
If leaf nodes are part of the node map, we need 257 bits – but we are
also going to need special case handling of leaf nodes, which is
effectively the 257th bit.

Since our infix order has two fields, it is efficient to reference
nodes by OID, rather than by key, though you frequently have to find a
node by its key field and its depth field. So each internal node will
have its depth (or rather a value reversibly constructed from its
depth), its key (or rather a value reversibly constructed from its key
and its depth), and the OIDs of its two children. For a leaf node, the
child OID fields have a different meaning.

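A sketch of what such a node record might look like; all names here are
invented, and the encodings are left abstract:

```cpp
#include <cstdint>
#include <vector>

struct node_record {
    std::uint32_t depth_code;            // value reversibly constructed from the depth
    std::vector<std::uint8_t> key_code;  // value reversibly constructed from key and depth
    std::uint64_t child_oid[2];          // OIDs of the two children
    bool leaf;                           // if set, the child fields refer to
                                         // something other than internal nodes
};
```
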
To reconstruct the hash of a node, we look up the two children by OID,
and hash their hashes and the bit strings that differentiate them from
their parent and each other, or rather the bytes containing the bit
strings that differentiate them from their parent and each other. We
hash the hash and bit string of one node first, then the hash and bit
string of the next node, to avoid the possibility that one concatenated
pair of bit strings might equal another pair of concatenated bit
strings.

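One reading of that scheme, as a sketch: hash each child’s (hash, bit
string) pair on its own, then hash the two fixed width results
together, so that no concatenation of bit strings can be mistaken for
another. The 64 bit FNV-1a here is only a stand-in so the sketch is
self contained; a real implementation would use a cryptographic hash
such as SHA-256, and all names are invented:

```cpp
#include <cstdint>
#include <vector>

// Stand-in 64 bit FNV-1a, not a cryptographic hash.
std::uint64_t fnv1a(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t h = 1469598103934665603ull;
    for (std::uint8_t b : bytes) { h ^= b; h *= 1099511628211ull; }
    return h;
}

struct child_ref {
    std::uint64_t hash;              // the child's hash (stand-in width)
    std::vector<std::uint8_t> bits;  // bytes containing the differentiating bit string
};

std::uint64_t parent_hash(const child_ref& left, const child_ref& right) {
    auto pair_digest = [](const child_ref& c) {
        std::vector<std::uint8_t> buf;
        for (int i = 0; i < 8; ++i)              // the child's hash first
            buf.push_back(std::uint8_t(c.hash >> (8 * i)));
        buf.insert(buf.end(), c.bits.begin(), c.bits.end());  // then its bit string
        return fnv1a(buf);
    };
    std::vector<std::uint8_t> both;
    for (std::uint64_t d : {pair_digest(left), pair_digest(right)})
        for (int i = 0; i < 8; ++i)
            both.push_back(std::uint8_t(d >> (8 * i)));
    return fnv1a(both);
}
```
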
If we are inserting a bunch of new map entries with the intent of
recalculating the new value of the root node once the insertions are
done, we sort them into a temporary table in order of their map key,
and after each insertion calculate the new values of the parent nodes
up to but excluding the parent node that is the parent of both the item
recently inserted and the next item to be inserted. If there is no next
item to be inserted, up to and including the top.

Pretty sure that this guarantees all nodes get recalculated as
necessary and only as necessary, but we will need a debug check to make
sure we have not missed a fence post problem. Have a forced
re-evaluation of the entire tree to check for internal consistency.

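A sketch of that loop, assuming 64 bit keys in place of the real 256
bit ones, an invented `Tree` interface with `insert_leaf` and
`recompute`, and a plain binary trie without path compression, just to
show where the recomputation stops:

```cpp
#include <cstdint>
#include <vector>

// Length of the common prefix of two 64 bit keys, most significant bit first.
unsigned common_prefix_len(std::uint64_t a, std::uint64_t b) {
    unsigned n = 0;
    for (std::uint64_t diff = a ^ b; n < 64 && !(diff & (1ull << 63)); diff <<= 1)
        ++n;
    return n;
}

// `keys` must already be sorted; Tree::recompute(key, depth) refreshes
// the ancestor of `key` at the given depth (root is depth 0).
template <class Tree>
void batch_insert(Tree& tree, const std::vector<std::uint64_t>& keys) {
    for (std::size_t i = 0; i < keys.size(); ++i) {
        tree.insert_leaf(keys[i]);
        // Recompute ancestors upward, stopping before the node that is
        // also an ancestor of the next key; for the last key, go all the
        // way up to and including the root.
        int stop_depth = (i + 1 < keys.size())
            ? static_cast<int>(common_prefix_len(keys[i], keys[i + 1]))
            : -1;
        for (int depth = 63; depth > stop_depth; --depth)
            tree.recompute(keys[i], depth);
    }
}
```
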
When we are generating blocks on the block chain on an ordinary
computer with an ordinary internet connection (16 mbps down, 3 mbps
up), and blocks typically take 300 seconds each, that implies that
blocks are smaller than a gigabyte, so we can prepare blocks entirely
in memory, either as an in memory sqlite database, or in a custom
format using links, rather than OIDs, and key values of the infix order
as sixty four bit integers, rather than bytes. Once a block is
committed to the global blockchain, however, it has to acquire global
oids, hence goes in a different database, probably in a slightly
different database format, since all the oids for a given kind of item
are sequential. But the hash of a block is calculated in a way entirely
independent of the oids that will be assigned to it. Perhaps when it is
committed, it gets capped by additional data telling us what oids were
assigned, but including the oids when calculating the root hash of a
block under construction, and therefore suffering frequent inserts in
random positions, would make it too costly to insert new data.

Such information about the type and number of objects in the block, and
the block’s position on the block chain, should not, on the DRY
principle, be part of the block or the block root hash, though it might
well be useful to include such information in the authentication block
(authentication blocks alternate with blocks containing actual updates
on the block chain). When combining two proposed blocks, it is good to
start by merging the larger block into the smaller block, and as a
check that everything is working as it should, make sure that the
counts are consistent, this being a low cost check against glitches in
calculating the hash. We don’t need to incorporate such a consistency
check into the blockchain itself until the authentication block, which
attests to the previous root of the block chain, which attests to the
previous root and previous sequence number of the block chain, thus
necessarily contains its own sequence number, unlike a data block.

The hash of a block should be independent of its oids, but we do not
want a glitch to result in the oids of a peer getting out of sync with
its peers and that peer attempting to sail on unawares that it is out
of consensus, therefore the hash of an authentication block should
contain a hash that depends on the hash of the final authoritative oids
of all blocks in the preceding block chain, or all oids affected by the
final block.

On the DRY principle, objects inside a block should not contain
information as to what block they are in, and a block should not
contain information as to what blockchain it is in nor what block
number it is. But a block should be accompanied by context information
as to what block number it is and what block chain it is in, and the
root of the blockchain should be accompanied by context information
saying what block chain it is and how many blocks are in it.

An authentication block attests to the recent previous block of the
block chain in isolation from context, and therefore links to its own
context, but in general, the meaning of a hash is given by the object
that hashes to it. And we somehow have to find or construct the object
that hashes to it. The object that hashes to it therefore needs to
contain information that implies its size and how it is to be hashed,
what it means. With a hash, we ad hoc provide context information to
hint where to find the data that generates the hash. But the chain to
more and more global meanings is on the inside of the information being
hashed, not on the outside. Every transaction provides the oids of the
unspent transaction outputs that it refers to and the signatures
authenticated by their public keys. When an object contains an unhashed
reference to data outside itself, it must hash the broadest context in
which that data occurs. Thus a transaction must contain or imply the
oid of a previous root hash of the blockchain containing all these
transaction outputs, and the hash of the transaction must hash that
root, thus each transaction authenticates and is authenticated by the
previous blockchain.

Repeating. Where an object contains a reference to an object outside
itself which reference is not itself a hash, which is to say, when the
reference is an oid, then its hash has to chain to that object,
preferably by chaining to the largest object in which that object
occurs. The oid is merely a hint to finding data that chains into the
hash of the object, and the derivation of the correct hash depends on
correctly identifying the thing that the oid refers to, the data
structure into which the oid is a pointer. The oid could merely be the
eleventh hash in a sequential map of seventeen hashes, and to know what
the object means, you have to know, or correctly guess from context,
what it is. So for the hash of the object to uniquely determine the
meaning of the object, it would have to hash the root of the Merkle
patricia tree of that map that defines the oid. Objects containing oids
have to chain to the context that gives meaning to their oids, which
context chains to the larger context, the largest context of them all
being a recent past root of the global blockchain, which provides
unique oids global to the blockchain, for each type of oid that is
directly in the blockchain. And to the extent that the context
providing oids provides separate oid sequences for each type of object
within it, and thus an oid can refer to multiple things of multiple
types within it, then the type of the object or other data within the
object has to imply the type of the oid.

In short, if the object mapped by the map of a Merkle-patricia tree
references another object, it should either reference it by hash, by
the hash of a patricia tree containing that object and the key within
the referenced Merkle-patricia tree containing the referenced object,
or the hash of the object should depend on the hash of the objects that
it references, or the hash of the Merkle-patricia tree containing that
object, in order that the hash of the referencing object, and thus the
hash of the Merkle-patricia tree containing the object, uniquely
identifies everything that the object references.

# Implementation of the specific Merkle-patricia trees we will need.

We have a sequential tree of attestation blocks containing proof of
consensus, each of which chains to the previous block, and each of
which contains the root hash of the tree as it was before this block
was added.

We have a sequential tree of transactions, with each attestation block
attesting to the state of the sequential tree of transactions as it was
at that time, so that each participant knows that they agree with the
consensus on transactions, and they agree on the specific sequence
number of each transaction. They also agree on the current state after
all transactions were applied: they agree on the sparse tree of unspent
transaction outputs.

In order to generate that order, each full peer has to construct a
sparse tree ordered by hash code. When they reach agreement on the root
of that hash tree, they then proceed to agreement on the sequential
list of transactions, and the current state of unspent transaction
outputs.

## Sparse tree of hashes

Because the bit string is going to be generally quite short, or else
exactly two hundred and fifty six bits, we treat this as a list of
bitstrings, with each list composed of lists of bitstrings, with each
list and sublist prefixed by a bitstring corresponding to their common
prefix. We don’t try to save space by omitting the (typically short)
common part of the prefix, and we pad the bit string to a whole word
with the padding sequence 1000000….

Each node contains a bit count and a count of its descendants. We
special case bit strings corresponding to whole words, and we special
case that special case for the case that the bitstring is exactly two
hundred and fifty six bits.

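A sketch of an in-memory node for this tree, with every name invented;
the prefix is padded to whole words as described above, `bit_count` is
the number of prefix bits actually in use, and `descendants` counts the
leaf hashes underneath:

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

struct hash_tree_node {
    std::vector<std::uint64_t> prefix_words;  // common prefix, padded to whole words
    std::uint16_t bit_count = 0;              // meaningful bits of the prefix
    std::uint64_t descendants = 1;            // leaf hashes below this node
    std::array<std::uint8_t, 32> hash{};      // leaf: the hash itself;
                                              // interior: hash of the two subtree hashes
    std::unique_ptr<hash_tree_node> child[2]; // empty for a leaf
};
```
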
We don’t have a representation for the empty list, but we do have a
representation for a list consisting of a single item. So we build the
tree by starting with a single item, and then adding items to the tree.
The case where we have no items is handled separately and specially. We
never have an empty tree. A single item is itself a tree, and you can
add a tree to a tree. A tree is, of course, composed of trees.

Because we are not prefix compressing the items in the tree of hashes,
our skip links do not contain the skip data, and the hash of a tree
with more than one item does not hash the common prefix, but simply
hashes the hash of the subtrees. The hash of a leaf node is, of course,
itself.

The root of this tree is recorded in the blockchain, but we do not need
to store the tree itself in the blockchain. It lives only in memory.
The sequential tree of transactions that it gives rise to lives on
disk.