figured out, at least in outline, how to make

a distributed hash table byzantine fault tolerant

Slight clarification on scalability

Figured out how to make variable length integers that
will be represented in correct order in a patricia
merkle tree.
This commit is contained in:
reaction.la 2023-10-13 21:14:31 +10:00
parent 776c18a4a6
commit 06b9fc4017
No known key found for this signature in database
GPG Key ID: 99914792148C8388
7 changed files with 489 additions and 44 deletions

View File

@ -5,6 +5,66 @@ title: Estimating frequencies from small samples
...
# The problem to be solved
## distributed hash table
The distributed hash table fails horribly in the face of a
significant likelihood of bad behaviour by the participants,
because you do not actually know the state of the network.
The usual procedure (Bittorrent network) is to treat information as
unconditionally valid for two hours, then throw it away,
which is pretty useless if a participant is behind a NAT,
and a disastrous loss of data if he has a long lived network address.
We would like to accumulate on disk very long lived
and rarely changing data about long lived participants,
the backbone of the distributed hash table.
We also want to have an arrangement with peers behind a NAT,
that each will ping the other at certain times with a keep-alive,
and if the expected keep-alive fails to arrive, the ensuing nacks and acks
will re-open the hole in the firewall, and also give us
information on how often each needs to ping the other.
When either concludes that the timing of the pings could be improved,
they renegotiate the schedule with each other,
so that peers behind a NAT with long lived holes do not need frequent pings.
At present, a random lookup serves the function of a keep-alive, resulting in
excessive churn in the DHT.
If we represent the state of the distributed hash table with
metalogistic distributions, the resulting distributed hash table
should be tolerant of Byzantine faults.
(Because a Byzantine faulting peer eventually winds up being rated
as unreliable, and the backbone of the distributed hash table will
be long lived peers with long lived reputations, the reputation
being represented by a metalogistic distribution giving the likelihood
that the information supplied is correct.)
Each peer is identified by its durable public key. For each peer
there is its current network address, and a metalogistic distribution
of the longevity of that network address,
which no one keeps around for very long or distributes very far
if it does not indicate much longevity.
There is also a metalogistic distribution of the likelihood
that hole punching will be needed, and if likely to be needed,
a list of peers that might provide it,
and the likelihood that hole punching will work.
If the first peer in the list is up but fails, the next is not tried.
But if the first peer cannot be contacted, the next is contacted.
And, if hole punching is needed, a metalogistic distribution of
how long the hole is likely to last after punching.
And, most importantly, for our backbone of very long lived peers,
metalogistic distributions of the likelihood of Byzantine fault,
which will provide us with a Byzantine fault tolerant distributed hash table.
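
A minimal sketch of what such a peer record might look like (the field names,
sizes, and the fixed number of metalog terms are illustrative assumptions,
not a settled wire format):

```c++
#include <array>
#include <cstdint>
#include <vector>

// Sketch only: a metalogistic distribution stored as the coefficients of its
// quantile function; four terms is an arbitrary illustrative choice.
struct Metalog {
    std::array<double, 4> a;
};

// Sketch only: one entry in the distributed hash table, keyed by the peer's
// durable public key. All field names and sizes are illustrative assumptions.
struct PeerRecord {
    std::array<uint8_t, 32> durable_public_key;   // the peer's durable identity
    std::array<uint8_t, 18> network_address;      // current IP and port
    Metalog address_longevity;     // likely remaining life of that network address
    Metalog needs_hole_punching;   // likelihood that hole punching will be needed
    std::vector<std::array<uint8_t, 32>> hole_punch_helpers;  // next entry tried only
                                   // if the previous one cannot be contacted
    Metalog hole_lifetime;         // how long a punched hole is likely to last
    Metalog honesty;               // likelihood that information it supplies is correct,
                                   // the basis of Byzantine fault tolerance
};
```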
## protocol negotiation
We could also apply distributions to protocol negotiation,
though this is likely to be colossal overkill.
Because protocols need to be changed, improved, and fixed from time to
time, it is essential to have a protocol negotiation step at the start of every networked interaction, and protocol requirements at the start of every store
and forward communication.
@ -27,6 +87,48 @@ sends or requests enough bits to reliably identify that protocol.  But this
means it must estimate probabilities from limited data. If one's data is
limited, priors matter, and thus a Bayesian approach is required.
### should not worry about protocol identifier size for a long time.
The above is massive overkill.
A quick solution, far less clever than accurately guessing that
two entities are speaking the same language, is to find an integer such
that both parties have a Dewey decimal protocol identifier that
starts with the same integer, and then go with the smaller of the
two Dewey Decimal protocol identifiers.
Dewey decimal numbers that start with the same integer should be different
versions of the same protocol, and if one party can handle the
higher numbered version,
he has to be able to handle all lower numbered versions of that same protocol.
Dewey decimal numbers that start with different integers
represent unrelated protocols.
So if the client says 7.3.2.2.1, and the server has only been
updated to 7.2.0, he replies 7.2.0, and both parties then go
with 7.2.0,
but if he only knows 6.3.3, 1.6.0 and 219.1.0, he replies
"fail, unknown protocol".
People launching a new protocol pick an integer,
and if they are not sure what integers are in use,
they just pick a fairly large integer.
In time, we will wind up with a whole lot of integers that are "in use",
the vast majority of which are no longer in use,
and no one is sure which ones are no longer in use,
so for a new protocol, they pick a sufficiently large random number.
(Assuming we represent these integers by variable length quantities
so that we can go to unlimitedly large integers, or at least integers
in the range [0 to 283 trillion](./variable_length_quantity.html){target="_blank"},
which should be unlimited enough for anyone.
In the unlikely event that there are eventually ten million protocols
floating around the internet,
a random number in that range is unlikely to lead to a collision.)
If there were ten million protocols floating around,
then the theoretically optimal way of representing
protocols would only be three or four bytes smaller,
so doing it this easy way is not a significant waste of space.
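
A sketch of that negotiation rule, assuming a Dewey decimal identifier is
held as a vector of integers (the function names here are made up for
illustration, not part of any existing protocol code):

```c++
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

using Dewey = std::vector<uint64_t>;   // e.g. {7,3,2,2,1} for protocol 7.3.2.2.1

// Sketch of the rule described above: if client and server have identifiers
// starting with the same integer, both go with the smaller of the two;
// otherwise the server replies "fail, unknown protocol".
std::optional<Dewey> negotiate(const Dewey& client,
                               const std::vector<Dewey>& server_protocols) {
    for (const Dewey& server : server_protocols) {
        if (!client.empty() && !server.empty() && client[0] == server[0]) {
            // Same protocol family: whoever can handle the higher numbered
            // version must handle all lower numbered versions, so the smaller
            // identifier is the one both parties can speak.
            bool client_smaller = std::lexicographical_compare(
                client.begin(), client.end(), server.begin(), server.end());
            return client_smaller ? client : server;
        }
    }
    return std::nullopt;   // "fail, unknown protocol"
}
```

With this sketch, a client offering 7.3.2.2.1 to a server that only knows
7.2.0 settles on 7.2.0, and a client offering 8.1 to a server that only knows
6.3.3, 1.6.0 and 219.1.0 gets the failure reply.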
# Bayesian Prior
The Bayesian prior is the probability of a probability, or, if this recursion
@ -68,7 +170,7 @@ take three samples, one of which is X, and two of which are not X, then
our new distribution is the Beta distribution $α+1,β+2$
If our distribution is the Beta distribution α,β, then the probability
that the next sample will be X is $\frac{α}{α+β}$
that the next sample will be X is $$\frac{α}{α+β}$$
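
For example, a minimal sketch of that update rule and the resulting
predictive probability:

```c++
// Minimal sketch of the update rule above: each X sample adds one to alpha,
// each non-X sample adds one to beta, and the predictive probability that
// the next sample is X is alpha / (alpha + beta).
struct Beta {
    double alpha;
    double beta;

    void observe(bool is_x) { (is_x ? alpha : beta) += 1.0; }

    double next_is_x() const { return alpha / (alpha + beta); }
};
```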
If $α$ and $β$ are large, then the Beta distribution approximates a delta
function
@ -78,7 +180,7 @@ equally likely.
That, of course, is a pretty good prior, which leads us to the conclusion
that if we have seen $n$ samples that are green, and $m$ samples that are not
green, then the probability of the next sample being green is $\frac{n+1}{n+m+2}$
green, then the probability of the next sample being green is $$\frac{n+1}{n+m+2}$$
Realistically, until we have seen diverse results there is a finite probability
that all samples are X, or all not X, but no beta function describes this
@ -135,6 +237,11 @@ than a boolean.
# A more realistic prior
## The beta distribution
The Beta distribution has the interesting property that for each new test,
the Bayesian update of the Beta distribution is also a Beta distribution.
Suppose our prior, before we take any samples from the urn, is that the probability that the proportion of samples in the urn that are X is ρ is
$$\frac{1}{3}P_{11} (ρ) + \frac{1}{3}δ(ρ) + \frac{1}{3}δ(1-ρ)$$
@ -183,3 +290,35 @@ $$\frac{(n+1)}{n+2}$$
Which corresponds to our intuition on the question “all men are mortal”. If we find no immortals in one hundred men, we think it highly improbable that we will encounter any immortals in a billion men.
In contrast, if we assume the beta distribution, this implies that the likelihood of the run continuing forever is zero.
## the metalog (metalogistic) distribution
The metalogistic distribution is like the Beta distribution in that
its Bayesian update is also a metalogistic distribution, but has more terms,
as many terms as are required for the nature of the thing being represented.
The Beta distribution plus two delta functions is a metalogistic distribution
if we stretch the definition of the metalogistic distribution slightly.
The Beta distribution represents the probability of a probability
(since we are using it for its Bayesian update capability).
For example, we have a collection of urns containing red and blue balls,
and from time to time we draw a ball out of an urn and replace it,
whereupon the Beta distribution is our best guess
about the likelihood that it contains a certain ratio of red and blue balls
(also assuming the urns are enormously large,
and also always contain at least some red and at least some blue balls).
Suppose, however, the jars contain gravel, the size of each piece
of gravel in a jar being normally distributed, and we want to
estimate the size and standard deviation of the gravel in an urn,
rather than the ratio of red balls and blue balls.
(Well, the size $s$ cannot be normally distributed, because $s$ is strictly non negative, but perhaps $\ln(s)$, or $s\ln(s)$, or $(s/a -a/s)$ is normally distributed.)
Whereupon our Bayesian updates become more complex,
and our prior has to contain difficult to justify information
(no boulders or dust in the urns), but we are still doing Bayesian updates,
hence the Beta distribution, and its generalization
the metalogistic distribution, still applies.
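
For concreteness, here is a sketch of the four term metalog quantile
function referred to above (four terms is an arbitrary choice; more terms
can be added as the shape of the data demands):

```c++
#include <array>
#include <cmath>

// Sketch: the four term metalog quantile function. Given a cumulative
// probability y in (0,1), returns the value of the quantity at that quantile.
// The coefficients a are fitted to the data being summarised; its Bayesian
// update is again a metalog, which is why we use it here.
double metalog_quantile(const std::array<double, 4>& a, double y) {
    double logit = std::log(y / (1.0 - y));   // ln(y/(1-y))
    double centred = y - 0.5;
    return a[0] + a[1] * logit + a[2] * centred * logit + a[3] * centred;
}
```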

View File

@ -2,10 +2,259 @@
title: Libraries
...
This discussion is way out of date because a rust recursive snark library
is now available, and making it public would impose a huge burden on me
of keeping it current and accurate, when events would render it continually
out of date.
A review of potentially useful libraries and utilities.
The material here is usually way out of date and frequently wrong.
It should be treated as a bunch of hints likely to point the reader
in the correct direction, so that the reader can do his homework
on the appropriate library. It should not be taken as gospel.
# Recursive snarks
A horde of libraries are rapidly appearing on GitHub,
most of which have stupendously slow performance,
can only generate proofs for absolutely trivial things,
and take a very long time to do so.
[Nova]:https://github.com/microsoft/Nova
{target="_blank"}
[Nova] claims to be fast, is being frequently updated, needs no trusted setup, and other people are writing toy programs using [Nova].
[Nova] claims you can plug in other elliptic curves, though it sounds like you
might need alarmingly considerable knowledge of elliptic curves in order to
do so.
Plonky had a special purpose hash, such that it was
easy to produce recursive proofs about Merkle trees.
I don't know if Nova can do hashes with useful speed, or hashes at all,
without which no recursive snark system is useful.
We need a hash that has a relatively small circuit.
And it appears that no such hash is known.
Nova is built out of commitments, which are about 256 times bigger than a hash.
A Nova proof is a proof about a merkle tree of commitments.
If we build our blockchain out of Nova commitments, it will be about a couple of
hundred times larger than one built out of regular hashes,
but will still only occupy about ten or twenty gigabytes of storage.
Bandwidth limits restrict us to about twenty transactions a second,
which is still faster than the bitcoin blockchain.
Plus, when we hit ten or twenty transactions per second,
we can shard the blockchain, which we can do because each shard can prove
it is telling the truth about transactions, whereas with bitcoin,
every peer has to evaluate every transaction,
lest one shard conspire to cheat the others.
[Nova] does not appear to have a language.
Representing a proof system as a Turing machine just seems like a bad idea.
It is not a Turing machine.
You don't calculate $a=b*c$; you instead prove that
$a=b*c$, when you already somehow knew $a$, $b$, and $c$.
A Turing machine is a state machine. A proof system is not.
It is often said, and is in a sense true, that the prover produces a proof
that for a given computation he knows an input such that after a
correct execution of the computation he obtains a certain public output.
But that is not what he is doing. The proof system proves that relationships hold between values.
And because it can only prove certain rather arcane and special things about
relationships between values, you have to compute a very large number
of intermediate values such that the relationship you actually want to prove
between the input and the output corresponds to simple relationships between
these intermediate values. But computing those intermediate values belongs
in another language, such as C++ or rust.
With Nova, we would get an algorithm such that you start out with your real input.
You create a bunch of intermediate values in a standard language like C++ or rust,
then you call the proof system to produce a data structure
that can be used to prove relationships between your input and those
intermediate values.
Then you produce the next set of intermediate values,
call your proof system to produce a data structure
that can be used to prove the next set of relationships,
fold those two proof generating data structures together,
rinse and repeat,
and at the end you generate a proof that the set of relationships
the fold represents is valid.
That is procedural, but expressing the relationships is not.
Since your fold is the size of the largest hamiltonian circuit so far,
you want the steps to be all of similar size.
This suggests a functional language (sql). There are, in reality,
no purely functional languages for Turing machines.
Haskell has its monads, sql has update, insert, and delete.
But the natural implementation for a proof system would be a truly purely functional language, an sql without update, insert, or delete, without any operations that actually wrote anything to memory or disk, that simply defined relationships without a state machine that changes state to write data into memory consistent with those changes.
The proof script has to be intelligible, and the same for prover and verifier,
the difference being that the prover interleaves the proof language with ordinary code
in an ordinary language, to produce the values that are going to be proven. The prover
drives the script along with ordinary language code, and the verifier drives it along
with different ordinary language code, but the proof definition that is common
to both of them has no concept of being sequential and driven along,
no concept that things are done in any particular order.
It is a graph of relationships.
The proof language, as is typical of purely functional languages,
should consist of assertions about relationships between immutable
data structures, without expressing the idea that some of these
data structures were created at one time, and destroyed at another.
Some of these values are defined recursively, which means that what
is actually going to happen in practice is that they are going to be
created by a loop, written in the ordinary procedural language
such as Rust or C++, but the proof language should have no concept of that.
if the proof language asserts that $1 \leq n \land n<20 \implies f(n-1)= g(f(n))$,
the ordinary procedural language will likely need to
generate the values of $f(n)$ for $n=1$ to $19$,
and will need to cause the proof language to generate proofs for each value
of $n$ from 1 to 19, but the resulting proof will be independent
of the order in which these proofs were generated.
Purely functional languages like sql do not prescribe an algorithm, and need
a code generator that has to make guesses about what a good algorithm would
be, as exemplified by sqlite's likelihood, likely, and unlikely no-ops.
And with a proof system we will, at least at first, have the human
choose the algorithm, but if he changes the algorithm,
while leaving the proof system language unchanged, the proofs will still work.
The open source nature of Plonky is ... complicated.
The repository on Github has been frozen for two years, so likely
does not represent the good stuff.
# Peer to Peer
[p2p]:https://github.com/elenaf9/p2p
{target="_blank"}
[libp2p]:ipns://libp2p.io/
{target="_blank"}
The generically named [p2p] does exactly what you want to do.
It is a thin wrapper around [libp2p] to allow participants in the
Kademlia cloud to find each other and send each other private messages.
Unfortunately Kademlia is broken and extremely vulnerable
to hostile action,
and [libp2p] has only encryption operations around a broken name system,
which they are making even more user hostile than it already is
because they don't want anyone using it (because it is broken).
It uses encryption libraries
for which there is strong reason to suspect enemy activity,
and does not support Schnorr keys, nor Ristretto,
which is an obstacle to scriptless scripts and joint signatures.
The reason they do not support Schnorr is that they are using nonprime groups,
and doing a Schnorr signature safely in a non prime group is incomprehensibly hard
and very easy to get subtly wrong.
The great strength of Ristretto is that it is a prime order group,
which makes a whole lot of clever cryptography available to do clever things,
Schnorr signatures, scriptless scripts, lightning locks,
and compact joint signatures among them.
[libp2p] have a small set of name systems and public key systems,
and it should not be too hard to add yet another to that set.
It appears to be extensible,
because it has to support no end of old obsolete stuff.
Obviously new stuff has been added from time to time,
so it should be possible to find the additions in git and follow their example.
[multiple transport schemes]:ipns://docs.libp2p.io/concepts/transports/listen-and-dial/
{target="_blank"}
[libp2p] supports [multiple transport schemes], and can support a set
of peers using heterogeneous transport.
So you just have to add another transport scheme,
and not everyone has to update simultaneously.
They use TCP and web, but there is a plugin point for new transport schemes,
so just plug in UDP under an encryption and reliability layer.
[libp2p] should make it possible to access the IPFS
both to write and read stuff, though that might be far from trivial.
You could perhaps publish stuff on IPFS that looks like a normal html
document, but contains embedded cryptographic data giving it more forms
of interaction when viewed in your browser than when viewed in Brave.
Replacing Kademlia for finding peers in the face of
enemy entryist action is a big project, though libp2p seems to have
taken the essential step of identifying peers by their public key,
rather than IP and port address.
[implementations]:http://libp2p-io.ipns.localhost:48084/implementations
{target="_blank"}
[distributed hash table]:https://github.com/libp2p/specs/blob/master/kad-dht/README.md
{target="_blank"}
Their [distributed hash table] (kad-dht) seems to be very much a work in progress. Checking their [implementations] page for the status of various libp2p components, *everything*
seems to be very much a work in progress. Some things are implemented
in some languages which are not implemented in other languages. Rust has
hole punching, C++ does not. But C++ has a whole lot of stuff that Rust
does not. And their documentation on kad-dht has its page blank. On the other hand,
ipfs works, and polkadot works. So, something is usable. libp2p-peer is not implemented
in C++, but is implemented in rust, and is implemented in browser javascript.
Their rendezvous protocol presupposes a single central and known server
which simply records everyone's ID and network address. Unacceptable.
Anyone using this is fake and an enemy. Should be disabled in our fork.
libp2p is a pile of odds and ends and a framework for gluing them together.
But everything they do is crippled by the fact that you don't know the
likely uptime, downtime, or IP stability of an entity. Which cannot
in fact be known, but you can form probability of a probability estimates.
What is needed is that everyone forms their own probability of a probability.
And they compare what they know
(they have a high probability of a probability)
with the other party's estimates, and rate the other party's reliability accordingly.
If we add to that a probability of a probability estimate of IP and port stability,
and use it to govern ping time and keep around time, that goes a long way
to solving the problems with Kademlia.
We can adapt it to the problem
by having them preferentially keep around the data for peers
that have stable ip and a stable port, and,
somewhat less preferentially, keep around peers that have a
stable nat penetration or tracking relationship with a peer that has a
stable ip and stable port. Selective pinging. You rarely ping peers
that have been around for a very long time with stable IP, and you
ping a peer that has a nat penetration relationship by not pinging it,
and instead asking the gateway peer how the relationship is going,
at infrequent intervals. Thus, a peer with stable IP
or stable relationship becomes very widely known.
Well, becomes widely known assuming shills do not register
one billion addresses that happen to be near him.
libp2p is something between actual code and a set of standards -
which standards you comply with so that you can use other people's code.
Someone writes something ad hoc for his use case, stuffs it into libp2p
somewhere somehow so that he can use other people's code.
Then other people use his code.
It is a pile of standards (many of them irritatingly stupid, incomplete, ad hoc,
or workarounds for using defective tools) that enable a
whole lot of people writing this sort of thing to copy a whole lot of
each other's code.
Their [NAT discovery algorithm](https://github.com/libp2p/specs/tree/master/autonat){target="_blank"}
is particularly idiotic and broken. It is not a NAT discovery algorithm,
but a closed port discovery algorithm, and a ludicrously laborious,
costly, indirect, error prone, and inefficient closed port discovery algorithm.
NAT means the other guy sees your network address different from what you see.
In which case your port is probably closed, but could well be open.
If he sees the same network address as you, your port might be open,
but you don't know that,
and talking to the other guy might well temporarily open your port,
with the result that he might tell you that you are not behind a NAT,
when in fact you are, and your ports are normally closed.
The guys writing this stuff are dumb as posts,
and a whole lot of what they write is garbage.
But, nonetheless, a whole lot of people are using libp2p,
and a whole lot of people are doing a whole lot of work on it --
not all of which is ready for prime time.
# Wireguard, Tailwind, and identity

Binary file not shown.

View File

@ -47,9 +47,28 @@ This is in part malicious, the enemy pouring mud into the tech waters. So I need
A zk-snark or a zk-stark proves that someone knows something,
knows a pile of data that has certain properties, without revealing
that pile of data. Such that he has a preimage of a certain hash
and that this preimage has certain properties
such as the property of being a valid transaction.
that pile of data.
The prover produces a proof that for a given computation he knows
an input such that after a correct execution of the computation
he obtains a certain public output - the public output typically
being a hash of a transaction, and certain facts about
the transaction. The verifier can verify this without knowing
the transaction, and the verification takes roughly constant time
even if the prover is proving something about an enormous computation,
an enormous number of transactions.
To use a transaction output as the input to another transaction we need
a proof that this output was committed on the public broadcast channel
of the blockchain to this transaction and no other, and a proof that this
output was itself an output from a transaction whose inputs were committed
to that transaction and no other, and that the inputs and outputs of that
transaction balanced.
So the proof has to recursively prove that all the transactions
that are ancestors of this transaction output were valid all the
way back to the beginning of the blockchain.
You can prove an arbitrarily large amount of data
with an approximately constant sized recursive snark.
So you can verify in a quite short time that someone proved
@ -266,6 +285,13 @@ every block height whose binary representation ends in a one
followed by $m$ zeroes, we use the information in four level $m$
summary blocks, the blocks $2^{m+1}*n + 2^{m-1}- 4*2^{m}$, $2^{m+1}*n + 2^{m-1}- 3*2^{m}$, $2^{m+1}*n + 2^{m-1}- 2*2^{m}$, and $2^{m+1}*n + 2^{m-1}- 1*2^{m}$ to produce an $m+1$ summary block that allows the two oldest remaining level $m$ summary blocks, the blocks $2^{m+1}*n + 2^{m-1}- 4*2^{m}$ and $2^{m+1}*n + 2^{m-1}- 3*2^{m}$ to be dropped.
It is not sufficient to merely forget about old data.
We need to regenerate new blocks because the patricia merkle tree
presented by the public broadcast channel has to prove
that outputs that once were registered as unspent,
and then registered to a commit, or sequence of commits,
are no longer registered at all.
We summarise the data in the earliest two blocks by discarding
every transaction output that was, at the time those blocks were
created, an unspent transaction output, but is now marked as used
@ -345,6 +371,15 @@ height is currently near a hundred thousand, at which height we will
be keeping about fifty blocks around, instead of a hundred thousand
blocks around.
If we are using Nova commitments, which are eight or nine kilobytes,
in place of regular hashes, which are thirty two bytes,
the blockchain will still only occupy ten or twenty gigabytes,
but bandwidth limits will force us to shard
when we reach bitcoin transaction rates. But with recursive snarks,
you *can* shard, because each shard can produce a concise proof that
it is not cheating the others, while with bitcoin,
everyone has to evaluate every transaction to prove that no one is cheating.
# Bigger than Visa
And when it gets so big that ordinary people cannot handle the

View File

@ -994,6 +994,9 @@ justice. (They now rely on a Taiwanese owned and operated chip fab), and
Disney destroyed the Star Wars franchise, turning it into a lecture on social
justice. Debian broke Gnome3 and cannot fix it because of social justice.
[book]:./triple_entry_accounting.html
"triple entry accounting"
Business needs a currency and [book] keeping system that enables them to
operate a business instead of a social justice crusade.
@ -1080,9 +1083,6 @@ will be traded in a way that gives the developers seigniorage.
[triple entry accounting]:./triple_entry_accounting.html
"triple entry accounting"
[book]:./triple_entry_accounting.html
"triple entry accounting"
Software that enables businesses that can resist political pressure is a
superset of software that enables discussion groups that can resist political
pressure. We start by enabling discussion groups, which will be an
@ -1408,14 +1408,14 @@ supposedly respectable and highly regulated people, which does not help
you much if, as in the Great Minority Mortgage Meltdown, the regulators
are engaged in evil deeds, or if, as with Enron and MF Global, the
accountants are all in the pay of powerful men engaged in evil deeds.
Triple entry [book]keeping with immutable journal entries works in a low
Triple entry [book keeping] with immutable journal entries works in a low
trust world of badly behaved elites, works in the circumstances now
prevailing, and, unlike Sox accounting, it does not require wide sharing of
the books.
## Corporate cohesion
The corporation exists by [book]keeping, which enables the shareholders to
The corporation exists by [book keeping], which enables the shareholders to
keep an eye on the board, and the board to keep an eye on the CEO, and our
current system of bookkeeping is failing for lack of trust and honour.

docs/navbar Normal file
View File

@ -0,0 +1,7 @@
<div class="button-bar">
<a href="vision.html">vision</a>
<a href="scalability.html">scalability</a>
<a href="social_networking.html">social networking</a>
<a href="Revelation.html">revelation</a>
</div>

View File

@ -8,50 +8,61 @@ And then I realized that an sql index represented as a merkle-patricia tree inhe
Which is fine if we represent integers as fixed length integers in big endian format,
but does not correctly sort variable length quantities if we follow the standard:
So: To represent variable signed numbers in byte string sortable order:
So: To represent variable length signed numbers in byte string sortable order, so that a strictly sequential sequence of integers with no gaps corresponds one to one to a strictly sequential sequence of byte strings with no gaps:
# For positive signed integers
If the leading bits are $10$, it represents a number in the range\
$0$ ... $2^6-1$ So only one byte
$0$ ... $2^6-1$ So only one byte (two bits of header, six bits to represent $2^{6}$ different
values as the trailing six bits of an ordinary eight bit
positive integer).
If the leading bits are $110$, it represents a number in the range\
$2^6$ ... $2^6+2^{13}-1$ So two bytes
if the leading bits are $1110$, it represents a number in the range\
$2^6+2^{13}$ ... $2^6+2^{13}+2^{20}-1$ So three bytes long
(four bits of header, twenty bits to represent $2^{20}$ different
values as the trailing twenty bits of an ordinary thirty two bit
positive integer in big endian format).
if the leading bits are $1111\,0$, it represents a number in the range\
$2^6+2^{13}+2^{20}$ ... $2^6+2^{13}+2^{20}+2^{27}-1$ So four bytes long
(five bits of header, twenty seven bits to represent $2^{27}$ different
values as the trailing twenty seven bits of an ordinary thirty two bit
positive integer in big endian format).
if the leading bits are $1111\,0$, it represents a number in the range\
if the leading bits are $1111\,10$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}-1$
So five bytes long.
if the leading bits are $1111\,10$, it represents a number in the range\
if the leading bits are $1111\,110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}-1$
So six bytes long.
if the leading bits are $1111\,110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}$
if the leading bits are $1111\,1110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}-1$
So seven bytes long.
if the leading bits are $1111\,1110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}-1$
So eight bytes long.
The reason for these complicated offsets is to ensure that the byte strings are strictly sequential.
if the leading bits are $1111\,1111\,0$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}-1$
So nine bytes long (ten bits of header, sixty two bits to represent $2^{62}$
different values as the trailing sixty two bits of an ordinary sixty four bit positive integer in big endian format).
if the bits of the first byte are $1111\,1111$, we change representations.
Instead that number is represented by a variable
length quantity that is a count of
bytes in the rest of the byte string, which is the number itself in its
natural binary big endian form, with the leading zero bytes discarded.
So we are no longer using these complicated offsets for the number itself,
but are using them for the byte count.
if the leading bits are $1111\,1111\,10$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}+2^{69}-1$
So ten bytes long.
This change in representation simplifies coding and speeds up the transformation,
but costs an extra byte for numbers larger than $2^{48}$ and less than $2^{55}$.
And so on and so forth in the same pattern for positive signed numbers of unlimited size.
The reason for these complicated offsets is to ensure that the byte strings are strictly sequential.
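
A minimal sketch of an encoder for the one to seven byte forms above (the
length prefixed forms for larger numbers, and the negative forms, are left
out, and the function name is made up for illustration):

```c++
#include <cstdint>
#include <stdexcept>
#include <vector>

// Sketch: encode a non-negative integer into the byte-string-sortable
// variable length form described above. k payload bytes carry 7k-1 payload
// bits behind a header of k one bits followed by a zero bit.
std::vector<uint8_t> encode_positive(uint64_t value) {
    uint64_t offset = 0;                       // smallest value needing k bytes
    for (unsigned k = 1; k <= 7; ++k) {
        uint64_t capacity = uint64_t(1) << (7 * k - 1);   // 2^6, 2^13, 2^20, ...
        if (value < offset + capacity) {
            uint64_t payload = value - offset; // subtract the complicated offset
            std::vector<uint8_t> out(k);
            for (unsigned i = 0; i < k; ++i) { // big endian payload
                out[k - 1 - i] = uint8_t(payload & 0xff);
                payload >>= 8;
            }
            out[0] |= uint8_t(0xff << (8 - k)); // header: k ones then a zero
            return out;
        }
        offset += capacity;
    }
    throw std::overflow_error("longer forms are not covered by this sketch");
}
```

The offset accumulated in the loop is the "complicated offset" above: it
guarantees that increasing integers map to byte strings that sort in the
same order, rolling over to the next length exactly when the shorter form
is exhausted.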
## examples
The bytestring 0xCABC corresponds to the integer 0x0A7C.\
The bytestring 0xEABEEF corresponds to the integer 0x0ABCAF.
# For negative signed integers
@ -86,19 +97,16 @@ if the leading bits are $0000\,0001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}$
So seven bytes long.
if the leading bits are $0000\,0000\,1$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-1$
So eight bytes long.
if the leading bits are $0000\,0000\,01$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-2^{62}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-1$
So nine bytes long (ten bits of header, sixty two bits to represent $2^{62}$
different values as the trailing sixty two bits of an ordinary sixty four bit
negative integer in big endian format).
if the leading bits are $0000\,0000\,001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-2^{62}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-1$ So ten bytes long.
if the bits of the first byte are $0000\,0000$, we change representations.
Instead that number is represented by a variable length quantity that is
*zero minus the count* of bytes in the rest of the byte string,
which is the negative number itself in its natural binary big endian form,
with the leading minus one bytes discarded.
So we are no longer using these complicated offsets for the number itself,
but are using them for the byte count.
We use the negative of the count, in order to get the correct
sort order on the underlying byte strings, so that they can be
represented in a Merkle patricia tree representing an index.
And so on and so forth in the same pattern for negative signed numbers of unlimited size.
@ -118,3 +126,10 @@ and so on and so forth.
In other words, we represent it as the integer obtained
by prepending a leading one bit to the bit string.
# Dewey decimal sequences.
The only thing we ever want to do with Dewey decimal sequences is $<=>$,
and they are always positive numbers less than $10^{14}$, so we represent them as
a sequence of variable length numbers terminated by the number $-1$,
and compare them as bytestrings.