finally figured out how to represent numbers and variable
length bitfields so that they will sort correctly in a Merkle Patricia
tree.

Have written no end of rubbish on this which needs to be deleted or
modified
reaction.la 2023-10-20 10:30:31 +00:00
parent 06b9fc4017
commit 3c6ec5283d
No known key found for this signature in database
GPG Key ID: 99914792148C8388
6 changed files with 130 additions and 33 deletions

View File

@@ -1320,7 +1320,7 @@ verification. Not sure how long it takes to produce a proof that a large
number of proofs were verified.
What you want is to be able to prove that a final hash is the root of an
enormous merkle tree, some generalization of a Merkle-patricia tree,
representing an immutable append only data structure consisting of a
sequence of piles of transactions, and the state generated by these
transactions, represents a valid branch of a chain of signatures, that the

View File

@@ -80,7 +80,7 @@ The additional bit is a flag indicating a final vertex, a leaf vertex of the
index, false (`0`) for interior vertices, true (`1`) for leaf vertices of
the index -- so we now have a full field, plus a flag.
A bitstring represents the path through the Merkle-patricia tree to a
vertex, and we will, for consistency with sql database terminology,
call the bitstring padded to one bit past the field boundary the key,
the key being the sql field plus the one additional trailing bit, the
@@ -104,7 +104,7 @@ a one bit, plus the bits if any associated with that link.
This enables you, given the bitstring you start with, and the bitstring of
the vertex you want to find, to determine the path through the patricia tree.
And, if it is a Merkle-patricia tree, this enables you to not only
produce a short efficient proof that proves the presence of a
certain datum in an enormous pile of data, but also the absence of a datum.
@@ -209,7 +209,7 @@ the bitstrings of vertices and skip fields as bitstrings. It is likely to
be a good deal more convenient to represent and manipulate keys, and to
represent the skip bits by the key of the target vertex.
Fields have meanings for the application using the Merkle-patricia
dag, bitstrings lack meaning.
But to understand what a patricia tree is, and to manipulate it, our
@@ -434,12 +434,12 @@ one of which corresponds to appending a $0$ bit to the bitstring that
identifies the vertex and the path to the vertex, and one of which
corresponds to adding a $1$ bit to the bitstring.
In an immutable append only Merkle-patricia dag, vertices identified
by bit strings ending in a $0$ bit have a third hash link, that links to a
vertex whose bit string is truncated back by removing the trailing $0$
bits back to the rightmost $1$ bit and zeroing that $1$ bit. Thus, whereas in a
blockchain (Merkle chain) you need $n$ hashes to reach and prove
a vertex $n$ blocks back, in an immutable append only Merkle-patricia
dag, you only need $\bigcirc(\log_2n)$ hashes to reach a vertex $n$ blocks back.
The vertex $0010$ has an extra link back to the vertex $000$, the
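A minimal sketch of that third-link rule, operating on bitstrings held as plain `0`/`1` strings (the function name and representation are illustrative, not from the source):

```python
def back_link_target(bits: str) -> str:
    """Third-link target of a vertex whose bitstring ends in a 0 bit:
    strip the trailing 0 bits back to the rightmost 1 bit, then zero
    that 1 bit, so "0010" links back to "000"."""
    assert bits.endswith("0") and "1" in bits
    trimmed = bits.rstrip("0")   # "0010" -> "001"
    return trimmed[:-1] + "0"    # "001"  -> "000"


assert back_link_target("0010") == "000"
assert back_link_target("101100") == "1010"
```

Each such step clears the lowest set bit of the bitstring (and shortens it to end there), so a walk along these links needs at most one hop per set bit, which is the source of the logarithmic bound claimed above.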
@@ -492,7 +492,7 @@ We would like to represent an immutable append only data
structure by append only files, and by sql tables with sequential and
ever growing oids.
When we defined the key for a Merkle-patricia tree, the key
definition gave us the parent node with a key field in the middle of
its children, in infix order. For the tree depicted above, we want postfix order.
@@ -682,7 +682,7 @@ represent the vertex depth below the start of field, rather than the
vertex height above the end of field.
We always start walking the vertices representing an immutable
append only Merkle-patricia tree knowing the bitstring, so their
preimages do not need to contain a vertex bitstring, nor do their
links need to add bits to the bitstring, because all the bits added
or subtracted are implicit in the choice of branch to take, so those

View File

@@ -9,9 +9,9 @@ in protocols tend to become obsolete. Therefore, for future
upwards compatibility, we want to have variable precision
numbers.
Secondly, to represent integers within a Merkle-patricia tree representing a database index, we want all values to be left field aligned, rather than right field aligned.
## Merkle-patricia dag
We intend to have a vast Merkle dag, and a vast collection of immutable
append only data structures. Each new block in the append only data
@@ -40,7 +40,7 @@ package.
## Compression algorithm preserving sort order
We want to represent integers by byte strings whose lexicographic order reflects their order as integers, which is to say, when sorted as a left aligned field, sort like integers represented as a right aligned field. (Because a Merkle-patricia tree has a hard time with right aligned fields)
To do this we have a field that is a count of the number of bytes, and the size of that field is encoded in unary.
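A simplified sketch of such an order preserving encoding (the helper name is hypothetical, and the count is held in a single byte rather than the unary-sized count field described above):

```python
def encode_ordered_uint(n: int) -> bytes:
    """Order preserving encoding of a non-negative integer: one count
    byte, then the value in big endian bytes.  Shorter encodings have a
    smaller count byte, so lexicographic order on the byte strings
    matches numeric order.  The scheme above additionally encodes the
    size of the count field in unary, which this sketch omits."""
    assert n >= 0
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert len(body) < 256          # single count byte only, for brevity
    return bytes([len(body)]) + body


samples = [0, 1, 2, 255, 256, 65535, 65536, 2**64]
encodings = [encode_ordered_uint(n) for n in samples]
assert encodings == sorted(encodings)   # left aligned sort matches integer order
```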
@@ -96,6 +96,103 @@ We display a value in the range $0\le n \lt 58/2$ as itself,
a value $n$ in the range $58/2\le n \lt \lfloor 58*2^{-2}\rfloor*58 +58/2$ as the base 58 representation of $n+58*(58/2-1)$
## Variable length bit fields
To represent variable length bit fields in the postfix sort order,
such that a shorter bit field sorts after all longer bit fields
with the same leading bits:
We break the bit field into seven bit fields, with a final field representing zero to six bits.
A seven bit field is represented by a byte ending in a zero low order bit.
A variable length $m$ bit field, where $m$ is 0 to 6 (seven possible
values), is represented by a fixed width eight bit field:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-2*c-3$
---------------------------
variable        eight bit
bit field       byte
-----------     -----------
000000          0000 0001
000001          0000 0011
00000           0000 0101
000010          0000 0111
000011          0000 1001
00001           0000 1011
0000            0000 1101
000100          0000 1111
000101          0001 0001
00010           0001 0011
000110          0001 0101
000111          0001 0111
00011           0001 1001
0001            0001 1011
000             0001 1101
...             ...
111101          1110 1011
11110           1110 1101
111110          1110 1111
111111          1111 0001
11111           1111 0011
1111            1111 0101
111             1111 0111
11              1111 1001
1               1111 1011
empty           1111 1101
---------------------------
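A small sketch of that final-field mapping, assuming the closed form given above; the function name is illustrative, and the asserts are spot checks against rows of the table:

```python
def encode_final_field(bits: str) -> int:
    """Encode a final field of 0 to 6 bits as the odd byte tabulated above:
    (j + 1) * 2**(8 - m) - 2*c - 3, where j is the field read as a binary
    number, m its length and c its count of set bits.  The result is
    always odd, so it cannot collide with a full seven bit chunk, which
    is stored as an even byte."""
    m = len(bits)
    assert 0 <= m <= 6
    j = int(bits, 2) if m else 0
    c = bits.count("1")
    return (j + 1) * 2 ** (8 - m) - 2 * c - 3


assert encode_final_field("000000") == 0b00000001
assert encode_final_field("000101") == 0b00010001
assert encode_final_field("") == 0b11111101
# The postfix property: the shorter field sorts after its longer extensions.
assert encode_final_field("0000") > encode_final_field("000011")
```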
### SQL blobs.
In order for blobs in a database representing bitfields to sort
correctly, we do not use seven bit fields, but eight bit bytes,
with a final byte representing zero to seven bits as an eight bit byte.
For this we use the mapping:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-c-2$
The difference is that the blob is preceded by a count field
that is not used in the sort order, which is tricky to
do in a Merkle-patricia tree representing an sql index.
## Use case
QR codes and prefix free number encoding is useful in cases where we want data to be self describing: this bunch of bits is to be interpreted in a certain way, used in a certain action, means one thing, and not another thing. At present there is no standard for self description. QR codes are given meanings by the application, and could carry completely arbitrary data whose meaning and purpose comes from outside, from the context.
@@ -118,7 +215,7 @@ will only be one, and it will be a long time before there are two.
When I say "arbitrarily large" I do not mean arbitrarily large, since this creates the possibility that someone could break something by sending a number bigger than the software can handle. There needs to be an absolute limit, such as sixty four bits, on representable numbers. But the limit should be larger than is ever likely to have a legitimate use.
# Other Solutions
## Zero byte encoding
@@ -128,7 +225,7 @@ When I say "arbitrarily large" I do not mean arbitrarily large, since this creat
QUIC expresses a sixty two bit number as one, two, four, or eight bytes, with the length given by the two high bits of the first byte. This is the fastest to encode and decode.
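For comparison, a sketch of that QUIC style encoding (RFC 9000 variable length integers); the helper name is illustrative:

```python
def encode_quic_varint(n: int) -> bytes:
    """QUIC variable length integer: the two high bits of the first byte
    select a total length of 1, 2, 4 or 8 bytes, and the remaining bits
    hold the value in big endian order."""
    for prefix, length in enumerate((1, 2, 4, 8)):
        if n < 1 << (8 * length - 2):
            body = n.to_bytes(length, "big")
            return bytes([body[0] | (prefix << 6)]) + body[1:]
    raise ValueError("value does not fit in 62 bits")


# Worked examples from RFC 9000: 37 -> 0x25, 15293 -> 0x7bbd.
assert encode_quic_varint(37) == bytes([0x25])
assert encode_quic_varint(15293) == bytes([0x7B, 0xBD])
```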
## VLQ Leading bit as number boundary
But it seems to me that the most efficient reasonably fast and elegant
solution is a variant on utf8 encoding, though not quite as fast as the

View File

@@ -79,7 +79,7 @@ around and are disinclined to make it available. And if they did make it
available, the same peer would appear in far too many different and
unrelated branches of the tree, creating excessive [Kademlia] lookup costs.
## Merkle-patricia tree of signatures
Suppose that every block of the root primary blockchain contains a hash of Merkle-patricia keys of signatures of blobs.
@@ -376,8 +376,8 @@ rather than what our enemies in Ethereum want done.
The key is writing a language that operates on what looks to it like sql
tables, to produce proof that the current state, expressed as a collection of
tables represented as a Merkle-patricia tree, is the result of valid
operations on a collection of transactions, represented as a Merkle-patricia
tree, that acted on the previous current state, that allows generic
transactions, on generic tables, rather than Ethereum transactions on
Ethereum data structures.
@@ -454,7 +454,7 @@ rocket and calling it a space plane.
A blockchain is of course a chain of blocks, and at scale, each block would be far too immense for any one peer to store or process, let alone the entire chain.
Each block would be a Merkle-patricia tree, or a Merkle tree of a number of Merkle-patricia trees, because we want the block to be broad and flat, rather than deep and narrow, so that it can be produced in a massively parallel way, created in parallel by an immense number of peers. Each block would contain a proof that it was validly derived from the previous block, and that the previous block's similar proof was verified. A chain is narrow and deep, but that does not matter, because the proofs are “scalable”. No one has to verify all the proofs from the beginning, they just have to verify the latest proofs.
Each peer would keep around the actual data and actual proofs that it cared about, and the chain of hashes linking the data it cared about to the Merkle root of the latest block.

View File

@@ -53,15 +53,15 @@ You find the expected scale height, the amount that causes the probability of a
But if both sides have vast collections of identical or near identical transactions, as is highly likely because they probably just synchronized with the same people, each item in a filter is going to convey very little information. Further, you can never be sure that you are completely synchronized except by setting a lot of bits for each item.
## Merkle-patricia tree
So, you build a Merkle-patricia tree.
And then you want to transmit a filter that represents the upper portion of the tree where the likelihood of a discrepancy between Bob's tree and Carol's tree is around fifty percent. When you see a discrepancy, you go deeper into that part of the tree on the next sub round. A large part of the time, the discrepancy will be a single transaction. When you have isolated all the discrepancies, rinse and repeat. Eventually the root hashes will agree, so the snapshot that Bob's concurrent process took is now synchronized to Carol, and the snapshot that Carol's concurrent process took is now synchronized to Bob. But new transactions have probably arrived, so time to take the next snapshot.
You discover how deep that is by initially sending the full filter of vertex and leaf hashes for just a portion of the address space covered by the tree. From what shows up, in the next round you will be roughly right for filter depth.
You do want to use a cryptographically strong hash for the identifier of each transaction, because that is global public information, and we do not want people to be able to cook up transactions that will force hash collisions, because that would enable them to engage in Byzantine defection. But you want to use Murmur for the vertices of the tree that represents transactions that Bob does not yet know whether Carol already has, since that is bilateral information maintained by the concurrent process that is managing Bob's connection with Carol, so Byzantine defection is impossible. When, however, Bob's concurrent process managing the connection with Carol whips up a Merkle-patricia tree, it should use Murmur3, because there will be a lot of such processes generating a lot of Merkle-patricia trees, but only one cryptographic hash representing each transaction. Lots of such trees are generated, and lots discarded.
[SMhasher]:https://github.com/aappleby/smhasher
@@ -80,7 +80,7 @@ where $g=11400714819323198485$, the odd number nearest to $2^{64}$ divided by th
Which would be a disastrously weak hash if our starting values were highly ordered, but is likely to suffice because our starting values are strongly random. Needless to say, it has absolutely no resistance to cryptographic attack, but such an attack is pointless, because our starting values are cryptographically strong, our resulting values don't involve any public commitments and we intend to reveal the preimage in due course.
Come to think of it, we can get away with 64 bit hashes, provided we subsample the underlying cryptographically strong 256 bit hashes differently each time, since we do not need to get absolutely perfect synchronization in any one synchronization event. We can live with the occasional rare Merkle-patricia tree that gives the same hash for two different sets of transactions. The error will be cleaned up in the next synchronization event.
Thus the hash of two 64 bit hashes, $U$ and $V$, is $(Ug+V)\%2^{64}$.
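A one line sketch of that combiner, using the constant given above (names are illustrative):

```python
G = 11400714819323198485   # the odd constant g given above

def combine(u: int, v: int) -> int:
    """Hash of two 64 bit child hashes U and V: (U*g + V) mod 2**64."""
    return (u * G + v) % 2**64


# The combiner is order dependent, as a tree hash must be.
a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
assert combine(a, b) != combine(b, a)
```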

View File

@@ -4,7 +4,7 @@ title: Variable Length Quantity
I originally implemented variable length quantities following the standard.
And then I realized that an sql index represented as a Merkle-patricia tree inherently sorts in byte string order.
Which is fine if we represent integers as fixed length integers in big endian format,
but does not correctly sort variable length quantities if we follow the standard:
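A quick illustration of that failure, assuming the usual big endian VLQ (as in MIDI); the helper name is illustrative:

```python
def vlq_standard(n: int) -> bytes:
    """Standard big endian VLQ: seven value bits per byte, high bit set
    on every byte except the last."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))


# 300 < 16384 as integers, but their standard VLQ encodings compare the
# other way round as byte strings, so a byte-string-ordered index would
# put them in the wrong order.
assert vlq_standard(300) == bytes([0x82, 0x2C])
assert vlq_standard(16384) == bytes([0x81, 0x80, 0x00])
assert not vlq_standard(300) < vlq_standard(16384)
```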
@@ -106,13 +106,13 @@ So no longer using these complicated offsets for the number itself,
but are using them for the byte count.
We use the negative of the count, in order to get the correct
sort order on the underlying byte strings, so that they can be
represented in a Merkle-patricia tree representing an index.
And so on and so forth in the same pattern for negative signed numbers of unlimited size.
# bitstrings
Bitstrings in a Merkle-patricia tree representing an sql index
are typically very short, so should be represented by a
variable length quantity, except for the leaf edge,
which is fixed size and large, so should not be