finally figured out how to represent numbers and variable
length bitfields so that they will sort correctly in a Merkle Patricia
tree.

Have written no end of rubbish on this which needs to be deleted or
modified
reaction.la 2023-10-20 10:30:31 +00:00
parent 06b9fc4017
commit 3c6ec5283d
No known key found for this signature in database
GPG Key ID: 99914792148C8388
6 changed files with 130 additions and 33 deletions

View File

@@ -1320,7 +1320,7 @@ verification. Not sure how long it takes to produce a proof that a large
number of proofs were verified.
What you want is to be able to prove that a final hash is the root of an
enormous merkle tree, some generalization of a Merkle-patricia tree,
representing an immutable append only data structure consisting of a
sequence of piles of transactions, and the state generated by these
transactions, represents a valid branch of a chain of signatures, that the

View File

@@ -80,7 +80,7 @@ The additional bit is a flag indicating a final vertex, a leaf vertex of the
index, false (`0`) for interior vertices, true (`1`) for leaf vertices of
the index -- so we now have a full field, plus a flag.
A bitstring represents the path through the Merkle-patricia tree to a
vertex, and we will, for consistency with sql database terminology,
call the bitstring padded to one bit past the field boundary the key,
the key being the sql field plus the one additional trailing bit, the
@@ -104,7 +104,7 @@ a one bit, plus the bits if any associated with that link.
This enables you, given the bitstring you start with, and the bitstring of
the vertex you want to find, to determine the path through the patricia tree.
And, if it is a Merkle-patricia tree, this enables you to not only
produce a short efficient proof that proves the presence of a
certain datum in an enormous pile of data, but also the absence of a datum.
@@ -209,7 +209,7 @@ the bitstrings of vertices and skip fields as bitstrings. It is likely to
be a good deal more convenient to represent and manipulate keys, and to
represent the skip bits by the key of the target vertex.
Fields have meanings for the application using the Merkle-patricia
dag, bitstrings lack meaning.
But to understand what a patricia tree is, and to manipulate it, our
@@ -434,12 +434,12 @@ one of which corresponds to appending a $0$ bit to the bitstring that
identifies the vertex and the path to the vertex, and one of which
corresponds to adding a $1$ bit to the bitstring.
In an immutable append only Merkle-patricia dag, vertices identified
by bit strings ending in a $0$ bit have a third hash link, that links to a
vertex whose bit string is truncated back by removing the trailing $0$
bits back to the rightmost $1$ bit and zeroing that $1$ bit. Thus, whereas in a
blockchain (Merkle chain) you need $n$ hashes to reach and prove
a vertex $n$ blocks back, in an immutable append only Merkle-patricia
dag, you only need $\bigcirc(\log_2n)$ hashes to reach a vertex $n$ blocks back.
The vertex $0010$ has an extra link back to the vertex $000$, the
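A minimal sketch of that third-link rule, operating on bitstrings held as plain `0`/`1` strings (the function name and representation are illustrative, not from the source):

```python
def back_link_target(bits: str) -> str:
    """Third-link target of a vertex whose bitstring ends in a 0 bit:
    strip the trailing 0 bits back to the rightmost 1 bit, then zero
    that 1 bit, so "0010" links back to "000"."""
    assert bits.endswith("0") and "1" in bits
    trimmed = bits.rstrip("0")   # "0010" -> "001"
    return trimmed[:-1] + "0"    # "001"  -> "000"


assert back_link_target("0010") == "000"
assert back_link_target("101100") == "1010"
```

Each such step clears the lowest set bit of the bitstring (and shortens it to end there), so a walk along these links needs at most one hop per set bit, which is the source of the logarithmic bound claimed above.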
@@ -492,7 +492,7 @@ We would like to represent an immutable append only data
structure by append only files, and by sql tables with sequential and
ever growing oids.
When we defined the key for a Merkle-patricia tree, the key
definition gave us the parent node with a key field in the middle of
its children, in infix order. For the tree depicted above, we want postfix order.
@@ -682,7 +682,7 @@ represent the vertex depth below the start of field, rather than the
vertex height above the end of field.
We always start walking the vertices representing an immutable
append only Merkle-patricia tree knowing the bitstring, so their
preimages do not need to contain a vertex bitstring, nor do their
links need to add bits to the bitstring, because all the bits added
or subtracted are implicit in the choice of branch to take, so those

View File

@@ -9,9 +9,9 @@ in protocols tend to become obsolete. Therefore, for future
upwards compatibility, we want to have variable precision
numbers.
Secondly, to represent integers within a Merkle-patricia tree representing a database index, we want all values to be left field aligned, rather than right field aligned.
## Merkle-patricia dag
We intend to have a vast Merkle dag, and a vast collection of immutable
append only data structures. Each new block in the append only data
@@ -40,7 +40,7 @@ package.
## Compression algorithm preserving sort order
We want to represent integers by byte strings whose lexicographic order reflects their order as integers, which is to say, when sorted as a left aligned field, sort like integers represented as a right aligned field. (Because a Merkle-patricia tree has a hard time with right aligned fields)
To do this we have a field that is a count of the number of bytes, and the size of that field is encoded in unary.
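A simplified sketch of such an order preserving encoding (the helper name is hypothetical, and the count is held in a single byte rather than the unary-sized count field described above):

```python
def encode_ordered_uint(n: int) -> bytes:
    """Order preserving encoding of a non-negative integer: one count
    byte, then the value in big endian bytes.  Shorter encodings have a
    smaller count byte, so lexicographic order on the byte strings
    matches numeric order.  The scheme above additionally encodes the
    size of the count field in unary, which this sketch omits."""
    assert n >= 0
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    assert len(body) < 256          # single count byte only, for brevity
    return bytes([len(body)]) + body


samples = [0, 1, 2, 255, 256, 65535, 65536, 2**64]
encodings = [encode_ordered_uint(n) for n in samples]
assert encodings == sorted(encodings)   # left aligned sort matches integer order
```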
@@ -96,6 +96,103 @@ We display a value in the range $0\le n \lt 58/2$ as itself,
a value $n$ in the range $58/2\le n \lt \lfloor 58*2^{-2}\rfloor*58 +58/2$ as the base 58 representation of $n+58*(58/2-1)$
## Variable length bit fields
To represent variable length bit fields in the postfix sort order,
such that a shorter bit field sorts after all longer bit fields
with the same leading bits:
We break the bit field into seven bit fields, with a final field representing zero to six bits.
A seven bit field is represented by a byte ending in a zero low order bit.
A variable length $m$ bit field, where $m$ is 0 to 6 (seven possible
values), is represented by a fixed width eight bit field:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-2*c-3$
---------------------------
variable        eight bit
bit field       byte
-----------     -----------
000000          0000 0001
000001          0000 0011
00000           0000 0101
000010          0000 0111
000011          0000 1001
00001           0000 1011
0000            0000 1101
000100          0000 1111
000101          0001 0001
00010           0001 0011
000110          0001 0101
000111          0001 0111
00011           0001 1001
0001            0001 1011
000             0001 1101
...             ...
111101          1110 1011
11110           1110 1101
111110          1110 1111
111111          1111 0001
11111           1111 0011
1111            1111 0101
111             1111 0111
11              1111 1001
1               1111 1011
empty           1111 1101
---------------------------
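A small sketch of that final-field mapping, assuming the closed form given above; the function name is illustrative, and the asserts are spot checks against rows of the table:

```python
def encode_final_field(bits: str) -> int:
    """Encode a final field of 0 to 6 bits as the odd byte tabulated above:
    (j + 1) * 2**(8 - m) - 2*c - 3, where j is the field read as a binary
    number, m its length and c its count of set bits.  The result is
    always odd, so it cannot collide with a full seven bit chunk, which
    is stored as an even byte."""
    m = len(bits)
    assert 0 <= m <= 6
    j = int(bits, 2) if m else 0
    c = bits.count("1")
    return (j + 1) * 2 ** (8 - m) - 2 * c - 3


assert encode_final_field("000000") == 0b00000001
assert encode_final_field("000101") == 0b00010001
assert encode_final_field("") == 0b11111101
# The postfix property: the shorter field sorts after its longer extensions.
assert encode_final_field("0000") > encode_final_field("000011")
```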
### SQL blobs.
In order for blobs in a database representing bitfields to sort
correctly, we do not use seven bit fields, but eight bit bytes,
with a final byte representing zero to seven bits as an eight bit byte.
For this we use the mapping:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-c-2$
The difference is that the blob is preceded by a count field
that is not used in the sort order, which is tricky to
do in a Merkle-patricia tree representing an sql index.
## Use case
QR codes and prefix free number encoding is useful in cases where we want data to be self describing: this bunch of bits is to be interpreted in a certain way, used in a certain action, means one thing, and not another thing. At present there is no standard for self description. QR codes are given meanings by the application, and could carry completely arbitrary data whose meaning and purpose comes from outside, from the context.
@@ -118,7 +215,7 @@ will only be one, and it will be a long time before there are two.
When I say "arbitrarily large" I do not mean arbitrarily large, since this creates the possibility that someone could break something by sending a number bigger than the software can handle. There needs to be an absolute limit, such as sixty four bits, on representable numbers. But the limit should be larger than is ever likely to have a legitimate use.
# Other Solutions
## Zero byte encoding
@@ -128,7 +225,7 @@ When I say "arbitrarily large" I do not mean arbitrarily large, since this creat
QUIC expresses a sixty two bit number as one, two, four, or eight bytes, with the length given by the two high bits of the first byte. This is the fastest to encode and decode.
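For comparison, a sketch of that QUIC style encoding (RFC 9000 variable length integers); the helper name is illustrative:

```python
def encode_quic_varint(n: int) -> bytes:
    """QUIC variable length integer: the two high bits of the first byte
    select a total length of 1, 2, 4 or 8 bytes, and the remaining bits
    hold the value in big endian order."""
    for prefix, length in enumerate((1, 2, 4, 8)):
        if n < 1 << (8 * length - 2):
            body = n.to_bytes(length, "big")
            return bytes([body[0] | (prefix << 6)]) + body[1:]
    raise ValueError("value does not fit in 62 bits")


# Worked examples from RFC 9000: 37 -> 0x25, 15293 -> 0x7bbd.
assert encode_quic_varint(37) == bytes([0x25])
assert encode_quic_varint(15293) == bytes([0x7B, 0xBD])
```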
## VLQ Leading bit as number boundary
But it seems to me that the most efficient reasonably fast and elegant
solution is a variant on utf8 encoding, though not quite as fast as the

View File

@@ -79,7 +79,7 @@ around and are disinclined to make it available. And if they did make it
available, the same peer would appear in far too many different and
unrelated branches of the tree, creating excessive [Kademlia] lookup costs.
## Merkle-patricia tree of signatures
Suppose that every block of the root primary blockchain contains a hash of Merkle-patricia keys of signatures of blobs.
@@ -376,8 +376,8 @@ rather than what our enemies in Ethereum want done.
The key is writing a language that operates on what looks to it like sql
tables, to produce proof that the current state, expressed as a collection of
tables represented as a Merkle-patricia tree, is the result of valid
operations on a collection of transactions, represented as a Merkle-patricia
tree, that acted on the previous current state, that allows generic
transactions, on generic tables, rather than Ethereum transactions on
Ethereum data structures.
@@ -454,7 +454,7 @@ rocket and calling it a space plane.
A blockchain is of course a chain of blocks, and at scale, each block would be far too immense for any one peer to store or process, let alone the entire chain.
Each block would be a Merkle-patricia tree, or a Merkle tree of a number of Merkle-patricia trees, because we want the block to be broad and flat, rather than deep and narrow, so that it can be produced in a massively parallel way, created in parallel by an immense number of peers. Each block would contain a proof that it was validly derived from the previous block, and that the previous block's similar proof was verified. A chain is narrow and deep, but that does not matter, because the proofs are “scalable”. No one has to verify all the proofs from the beginning, they just have to verify the latest proofs.
Each peer would keep around the actual data and actual proofs that it cared about, and the chain of hashes linking the data it cared about to the Merkle root of the latest block.

View File

@@ -53,15 +53,15 @@ You find the expected scale height, the amount that causes the probability of a
But if both sides have vast collections of identical or near identical transactions, as is highly likely because they probably just synchronized with the same people, each item in a filter is going to convey very little information. Further, you can never be sure that you are completely synchronized except by setting a lot of bits for each item.
## Merkle-patricia tree
So, you build a Merkle-patricia tree.
And then you want to transmit a filter that represents the upper portion of the tree where the likelihood of a discrepancy between Bob's tree and Carol's tree is around fifty percent. When you see a discrepancy, you go deeper into that part of the tree on the next sub round. A large part of the time, the discrepancy will be a single transaction. When you have isolated all the discrepancies, rinse and repeat. Eventually the root hashes will agree, so the snapshot that Bob's concurrent process took is now synchronized to Carol, and the snapshot that Carol's concurrent process took is now synchronized to Bob. But new transactions have probably arrived, so time to take the next snapshot.
You discover how deep that is by initially sending the full filter of vertex and leaf hashes for just a portion of the address space covered by the tree. From what shows up, in the next round you will be roughly right for filter depth.
You do want to use a cryptographically strong hash for the identifier of each transaction, because that is global public information, and we do not want people to be able to cook up transactions that will force hash collisions, because that would enable them to engage in Byzantine defection. But you want to use Murmur for the vertices of the tree that represents transactions that Bob does not yet know whether Carol already has, since that is bilateral information maintained by the concurrent process that is managing Bob's connection with Carol, so Byzantine defection is impossible. When, however, Bob's concurrent process managing the connection with Carol whips up a Merkle-patricia tree, it should use Murmur3, because there will be a lot of such processes generating a lot of Merkle-patricia trees, but only one cryptographic hash representing each transaction. Lots of such trees are generated, and lots discarded.
[SMhasher]:https://github.com/aappleby/smhasher
@@ -80,7 +80,7 @@ where $g=11400714819323198485$, the odd number nearest to $2^{64}$ divided by th
Which would be a disastrously weak hash if our starting values were highly ordered, but is likely to suffice because our starting values are strongly random. Needless to say, it has absolutely no resistance to cryptographic attack, but such an attack is pointless, because our starting values are cryptographically strong, our resulting values don't involve any public commitments and we intend to reveal the preimage in due course.
Come to think of it, we can get away with 64 bit hashes, provided we subsample the underlying cryptographically strong 256 bit hashes differently each time, since we do not need to get absolutely perfect synchronization in any one synchronization event. We can live with the occasional rare Merkle-patricia tree that gives the same hash for two different sets of transactions. The error will be cleaned up in the next synchronization event.
Thus the hash of two 64 bit hashes, $U$ and $V$, is $(Ug+V)\%2^{64}$.
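A one line sketch of that combiner, using the constant given above (names are illustrative):

```python
G = 11400714819323198485   # the odd constant g given above

def combine(u: int, v: int) -> int:
    """Hash of two 64 bit child hashes U and V: (U*g + V) mod 2**64."""
    return (u * G + v) % 2**64


# The combiner is order dependent, as a tree hash must be.
a, b = 0x0123456789ABCDEF, 0xFEDCBA9876543210
assert combine(a, b) != combine(b, a)
```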

View File

@@ -4,7 +4,7 @@ title: Variable Length Quantity
I originally implemented variable length quantities following the standard.
And then I realized that an sql index represented as a Merkle-patricia tree inherently sorts in byte string order.
Which is fine if we represent integers as fixed length integers in big endian format,
but does not correctly sort variable length quantities if we follow the standard:
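A quick illustration of that failure, assuming the usual big endian VLQ (as in MIDI); the helper name is illustrative:

```python
def vlq_standard(n: int) -> bytes:
    """Standard big endian VLQ: seven value bits per byte, high bit set
    on every byte except the last."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))


# 300 < 16384 as integers, but their standard VLQ encodings compare the
# other way round as byte strings, so a byte-string-ordered index would
# put them in the wrong order.
assert vlq_standard(300) == bytes([0x82, 0x2C])
assert vlq_standard(16384) == bytes([0x81, 0x80, 0x00])
assert not vlq_standard(300) < vlq_standard(16384)
```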
@@ -106,13 +106,13 @@ So no longer using these complicated offsets for the number itself,
but are using them for the byte count.
We use the negative of the count, in order to get the correct
sort order on the underlying byte strings, so that they can be
represented in a Merkle-patricia tree representing an index.
And so on and so forth in the same pattern for negative signed numbers of unlimited size.
# bitstrings
Bitstrings in a Merkle-patricia tree representing an sql index
are typically very short, so should be represented by a
variable length quantity, except for the leaf edge,
which is fixed size and large, so should not be