1. It protects clients against malicious peers, since any claim
the peer makes about the total state of the blockchain can
be proven with $O(\log_2 n)$ hashes.
1. If a block gets lost or corrupted, the peer can identify the one specific block that is the problem. At present, peers have to redownload, or at least re-index, the entire blockchain far too often.
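The first point can be sketched as checking an audit path against a trusted root: $\log_2 n$ sibling hashes suffice to recompute the root. A minimal sketch, with a toy hash standing in for a real cryptographic hash, and with `combine`, `PathStep`, and `verify_path` as illustrative names rather than anything from an actual implementation:

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a cryptographic hash combining two child hashes.
// A real implementation would use SHA-256 or similar.
inline std::uint64_t combine(std::uint64_t left, std::uint64_t right) {
    std::uint64_t h = 0x9e3779b97f4a7c15ull;
    h ^= left  + 0x85ebca6bull + (h << 6) + (h >> 2);
    h ^= right + 0xc2b2ae35ull + (h << 6) + (h >> 2);
    return h;
}

// One step of the audit path: the sibling hash and which side it is on.
struct PathStep {
    std::uint64_t sibling;
    bool sibling_on_left;
};

// Recompute the root from a leaf hash and its O(log2 n) audit path.
// The peer's claim checks out iff the recomputed root equals the
// root the client already trusts.
inline bool verify_path(std::uint64_t leaf,
                        const std::vector<PathStep>& path,
                        std::uint64_t trusted_root) {
    std::uint64_t h = leaf;
    for (const auto& step : path)
        h = step.sibling_on_left ? combine(step.sibling, h)
                                 : combine(h, step.sibling);
    return h == trusted_root;
}
```

For a tree of $n$ leaves the path holds one sibling per level, so the client checks a peer's claim with $\log_2 n$ hashes rather than by downloading the chain.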
The superstructure of balanced binary Merkle trees allows us to
verify any part of it with only $O(\log n)$ hashes, and thus to verify that
the version of this data structure that one party is using is a later
version of the same data structure that another party is using.
This reduces the amount of trust that clients have to place in peers.
When the blockchain gets very large there will be rather few peers
and a great many clients, so there will be a risk that the peers will
plot together to bugger the clients. This structure enables a client
to verify that any part of the blockchain is what his peer says it is,
and thus avoids the risk that a peer may tell different clients different
accounts of the consensus. Two clients can quickly verify that they
are on the same total order and total set of transactions, and that
any item that matters to them is part of this same total order and
total set.
When the chain becomes very big, sectors and disks will be failing
all the time, and we don't want such failures to bring everything to a
screaming halt. At present, such failures force you to reindex the
blockchain, and to redownload a large part of it, and this will happen
more and more often as the blockchain becomes enormous.
And, when the chain becomes very big, most people will be
operating clients, not peers, and they need to be able to ensure
that the peers are not lying to them.
### storage
We would like to represent an immutable append only data
structure by append only files, and by sql tables with sequential and
ever growing oids.
When we defined the key for a Merkle patricia tree, the key
definition gave us the parent node with a key field in the middle of
its children, in infix order.
For this dag, we would like to define an oid field so that the oid
field of a parent follows the oid fields of its children.
Let us suppose the leaf nodes of the tree depicted above are fixed size $c$, that the interior vertices are fixed size $d$ ($d$ is probably thirty two or sixty four bytes), and that they are being physically stored in
memory or a file in sequence.
Let us suppose the leaf nodes are stored with the interior vertices
and are sequentially numbered.
Then the location of leaf node $n$ begins at $n\times c+\big(n-$`std::popcount`$(n)\big)\times d$ (which unfortunately lacks a simple
relationship to the bitstring of a vertex corresponding to a complete
field, which is the field that represents the meaning that we actually
care about).
We can calculate the location of an interior vertex from the number
of the largest numbered leaf node that it could be a parent of:\
To find the oid of a vertex accessed as an sql table, pad its bitstring
out to the field width plus one with $1$ bits (equivalent to
subtracting one from the key and oring the result with the key), subtract
the `std::popcount` of the bitstring, and you have the sequential
and always incrementing oid, such that the oid of a parent is always
one greater than the oid of its right child.
If the field is an integer (the block height, the number of blocks in
the blockchain), the oid is one bit larger and approximately twice the
size of that integer, assuming that we are putting vertices and block
roots in the same sql table. (Which we probably won't.)
# Blockchain
A Merkle-patricia block chain represents *an immutable past and a
constantly changing present*: an immutable and ever growing sequence of
transactions, and also a large and mutable present state that is the
result of those transactions, the database of unspent transaction
outputs.
When we are assembling a new block, the records live in memory as native
format C++ objects. Upon a new block being finalized, they get written
to disk in key order, with implementation dependent offsets between
records and implementation dependent compression, which compression
likely reflects canonical form. Once written to disk, they are accessed
through native format records in memory: disk records are brought into
memory in native format, and the least recently loaded, or least
recently used, entry gets discarded. Even when we are operating at
larger scale than Visa, a block representing five minutes of
transactions fits easily in memory.
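A minimal sketch of such a least-recently-used record cache, assuming fixed capacity and a hypothetical `load_from_disk` standing in for reading and decompressing the implementation dependent on-disk form:

```cpp
#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

// Native format record as it lives in memory; the on-disk form would
// be compressed and implementation dependent.
struct Record {
    std::uint64_t oid;
    std::string payload;
};

// Hypothetical stand-in for reading and decompressing a disk record.
Record load_from_disk(std::uint64_t oid) {
    return Record{oid, "record " + std::to_string(oid)};
}

// Fixed-capacity cache of native format records: a hit moves the
// record to the front, a miss loads it from disk and evicts the least
// recently used entry.
class RecordCache {
    std::size_t capacity_;
    std::list<Record> recent_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<Record>::iterator> index_;
public:
    explicit RecordCache(std::size_t capacity) : capacity_(capacity) {}

    const Record& get(std::uint64_t oid) {
        auto it = index_.find(oid);
        if (it != index_.end()) {
            // Hit: move to front without copying the record.
            recent_.splice(recent_.begin(), recent_, it->second);
            return recent_.front();
        }
        recent_.push_front(load_from_disk(oid));
        index_[oid] = recent_.begin();
        if (recent_.size() > capacity_) {
            index_.erase(recent_.back().oid);
            recent_.pop_back();
        }
        return recent_.front();
    }

    bool cached(std::uint64_t oid) const { return index_.count(oid) != 0; }
};
```

Eviction by least recently *loaded* instead would drop the `splice` on a hit; either policy keeps the working set of a five minute block comfortably in memory.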
Further, a patricia tree is a tree. But when we have the Merkle
patricia tree representing registered names organized by name, or the
Merkle-patricia tree representing as yet unspent transaction outputs, we
want its Merkle characteristic to represent a directed acyclic graph. If
two branches have the same hash, despite being at different positions and
depths in the tree, all their children will be identical. And we want to
take advantage of this in that the block chain will be a directed acyclic
graph, each block being a tree representing the state of the system at
that block commitment, but with that tree pointing back into previous
block commitments for those parts of the state of the system that have
not changed. So the hash of a node in such a tree will identify, probably
through an OID, a record of the block it was originally constructed
for, and its index in that tree.
A Merkle-patricia directed acyclic graph, a Merkle-patricia dag, is a
Merkle dag, like a git repository or the block chain, with the patricia
key representing the path of hashes, and acting as an index through that
chain of hashes to find the data that you want.
The key will thread through different computers under the control of
different people, thus providing a system of witness that the current
global consensus hash accurately reflects past global consensus hashes,
and that each entity's version of the past agrees with the version it
previously espoused.
This introduces some complications when a portion of the tree represents
a database table with more than one index.
[Ethereum has a discussion and
definition](https://github.com/ethereum/wiki/wiki/Patricia-Tree) of this
data structure.
Suppose, when the system is at scale, we have a thousand trillion entries
in the public, readily accessible, and massively replicated part of the
blockchain. (I intend that every man and his dog will also have a
sidechain, every individual, every business. The individual will
normally not have his side chain publicly available, but in the event of
a dispute, may make a portion of it visible, so that certain of his
payments, and the invoices they were payments for, become visible to
others.)
In that case, a new transaction output is typically going to require
forty thirty two byte hashes, taking up about two kilobytes in total on
any one peer. And a single person to person payment is typically going to
take ten transaction outputs or so, taking twenty kilobytes in total on
any one peer. And this is going to be massively replicated by a few
hundred peers, taking about four megabytes in total.
(A single transaction will typically be much larger than this, because
it will mingle several person to person payments.)
Right now you can get a system with sixty four terabytes of hard disk
and thirty two gigabytes of ram for under six thousand dollars, south of
a hundred dollars per terabyte, so storing everything forever is going
to cost about a twentieth of a cent per person to person payment. And a
single such machine will be good to hold the whole blockchain for the
first few trillion person to person payments, good enough to handle
paypal volumes for a year.
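A back-of-the-envelope check of that cost, under the figures stated above (twenty kilobytes per payment on any one peer, a couple of hundred peers, about a hundred dollars per terabyte of disk):

```cpp
// Back-of-the-envelope: replicated bytes and storage cost per payment.
// All figures are the rough assumptions from the text, not measurements.
constexpr double kBytesPerPaymentPerPeer = 20e3;  // twenty kilobytes
constexpr double kPeers = 200;                    // a few hundred peers
constexpr double kDollarsPerTerabyte = 100;
constexpr double kBytesPerTerabyte = 1e12;

// Total bytes across all peers: about four megabytes per payment.
constexpr double replicated_bytes = kBytesPerPaymentPerPeer * kPeers;

// Storage cost in cents: about 0.04 cents, roughly a twentieth of a
// cent per person to person payment.
constexpr double cents_per_payment =
    replicated_bytes / kBytesPerTerabyte * kDollarsPerTerabyte * 100;
```

So the "four megabytes in total" and "a twentieth of a cent" figures are consistent with each other.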
“OK”, I hear you say. “And after the first few trillion transactions?”.
Well then, if we have a few trillion transactions a year, and only a few
hundred peers, then the clients of any one peer will be doing about ten
billion transactions a year. If he profits half a cent per transaction,
he is making about fifty million a year. He can buy a few more sixty
four terabyte computers every year.
The target peer machine we will write for will have thirty two gigabytes
of ram and sixty four terabytes of hard disk, but our software should
run fine on a small peer machine, four gigabytes of ram and two
terabytes of hard disk, until the crypto currency surpasses bitcoin.
# vertex identifiers
We need a canonical form for all data structures, the form which is
hashed, even if it is not convenient to use or manipulate the data in
that form on a particular machine with particular hardware and a
particular compiler.
A patricia tree representation of a field and record of fields does
not gracefully represent variable sized records.
If we represented the bitstring that corresponds to the block
number, the block height, as having a large number of leading
zero bits, so that it corresponds to a sixty three bit integer (we need
the additional low order bit for operations translating the bitstring
to its representation as a key field or oid field), a fixed field of sixty
four bits will do us fine for a trillion years or so.
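As a sanity check on that lifetime, assuming one block every five minutes (the block interval mentioned earlier):

```cpp
// How long a sixty three bit block height lasts, assuming one block
// every five minutes. The block interval is the figure used earlier
// in the text; everything else is arithmetic.
constexpr double kSecondsPerBlock = 300.0;               // five minutes
constexpr double kSecondsPerYear = 365.25 * 24 * 3600;   // ~3.16e7
constexpr double kMaxBlocks = 9.22e18;                   // about 2^63

// Roughly ninety trillion years before the height field overflows.
constexpr double years_of_headroom =
    kMaxBlocks * kSecondsPerBlock / kSecondsPerYear;
```

Which comfortably supports "a trillion years or so".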
But I have an aesthetic objection to representing things that are not
fixed sized as fixed sized.
Therefore I am inclined to represent bit strings as a count of bytes, a
byte string containing the zero padded bitstring, the bitstring being
byte aligned with the field boundary, and a count of the distance in
bits between the right edge of the bitstring and the right edge of
the field, that being the height of the interior vertex above the
leaf vertices containing the actual data that we are interested in, in