wallet/docs/number_encoding.md
reaction.la 3c6ec5283d
finally figured out how to represent numbers and variable
length bitfields to that they will sort correctly in a Merkle Patricia
tree.

Have written no end of rubbish on this with needs to be deleted or
modified
2023-10-20 20:30:32 +10:00

---
# katex
title: Number encoding
---
# The problem to be solved
As computers and networks grow, any fixed length fields
in protocols tend to become obsolete. Therefore, for future
upwards compatibility, we want to have variable precision
numbers.
Secondly, to represent integers within a Merkle-patricia tree representing a database index, we want all values to be left field aligned, rather than right field aligned.
## Merkle-patricia dag
We intend to have a vast Merkle dag, and a vast collection of immutable
append only data structures. Each new block in the append only data
structure is represented by a hash, whose preimage is a Merkle vertex. A
path through the Merkle dag is represented by a sequence of integers, which
is zero terminated if the path is intended to end at that preimage. The
same path in different versions should lead to an object that is in some
sense equivalent to the entity that the path led to in a previous version of
that Merkle vertex. Equivalence and relationship of objects between
different Merkle vertices in the same immutable append only sequence is
represented by constant paths leading to different objects.
If the sequence corresponds to valid unicode characters that are non control, non whitespace, and not #, @, (), or {}, it is so represented to the user. If one of the integers in the sequence is not such a character, then the user will generally see a # followed by a sequence of base 58 characters representing a sequence of integers, until there is an integer corresponding to a character that is non control, non whitespace, non numeric, non ascii alpha, and not #, @, (), or {}, at which point the human representation switches back to unicode.
If an ascii blank, character 32, occurs within a sequence being represented
to the human as unicode, it is displayed as a blank and all subsequent
values are displayed as dot separated decimal numbers, the displayed
decimal number being the underlying number minus one. If the underlying
value is zero, this terminates the sequence and is therefore never displayed.
A space introduces a sequence of version numbers intended to be
intelligible to both humans and computers as a version number. The
intended usage is package authentication and source control - the package
manager will know this is an updated package, rather than an unrelated
package.
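
A minimal sketch of the version number display rule, in Python. The name `render_path` is hypothetical, and the sketch assumes every value before the blank is a displayable unicode character, so it does not attempt the `#` base 58 escape.

```python
def render_path(path):
    """Render a zero terminated path of integers for a human reader.

    Assumption: all values before the blank (32) are displayable unicode.
    """
    text = []
    values = iter(path)
    for v in values:
        if v == 0:                 # zero terminates the path, never displayed
            break
        if v == 32:                # blank: the rest are version numbers
            numbers = []
            for w in values:
                if w == 0:
                    break
                numbers.append(str(w - 1))   # displayed value is value minus one
            return "".join(text) + " " + ".".join(numbers)
        text.append(chr(v))
    return "".join(text)
```

With this sketch, `render_path([ord(c) for c in "wallet"] + [32, 2, 1, 4, 0])` gives `wallet 1.0.3`.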
## Compression algorithm preserving sort order
We want to represent integers by byte strings whose lexicographic order reflects their order as integers, which is to say, when sorted as a left aligned field, sort like integers represented as a right aligned field. (Because a Merkle-patricia tree has a hard time with right aligned fields)
To do this we have a field that is a count of the number of bytes, and the size of that field is encoded in unary.
Thus a single byte value, representing integers in the range $0\le n \lt 2^7$, starts with a leading zero bit.
A two byte value, representing integers in the range $2^7\le n \lt 2^{13}+2^7$, starts with the bits 10 0
A three byte value, representing integers in the range $2^{13}+2^7 \le n \lt 2^{21}+2^{13}+2^7$, starts with the bits 10 1
A four byte value, representing integers in the range $2^{21}+2^{13}+2^7 \le n \lt 2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 110 00
A five byte value, representing integers in the range $2^{27}+2^{21}+2^{13}+2^7 \le n \lt 2^{35}+2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 110 01
A six byte value, representing integers in the range $2^{35}+2^{27}+2^{21}+2^{13}+2^7 \le n \lt 2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 110 10
A seven byte value, representing integers in the range $2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7 \le n \lt 2^{51}+2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 110 11
An eight byte value, representing integers in the range $2^{51}+2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7 \le n \lt 2^{57}+2^{51}+2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 1110 000
A nine byte value, representing integers in the range $2^{57}+2^{51}+2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7 \le n \lt 2^{65}+2^{57}+2^{51}+2^{43}+2^{35}+2^{27}+2^{21}+2^{13}+2^7$, starts with the bits 1110 001
Similarly the bits 1110 111 indicate a fifteen byte value representing 113 bit integers.
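
A sketch of this encoding in Python, under some assumptions of mine: the names `payload_bits`, `encode_sorted` and `decode_sorted` are hypothetical, the sketch stops at fifteen bytes (the largest size whose header fits in the first byte), and each size starts where the previous size ran out, so the lower bound of each range is the sum of the sizes of all shorter ranges.

```python
def payload_bits(k):
    """Data bits in a k byte value: 8k minus the header.

    The header is u one bits, a zero bit, and a u bit byte count field,
    where u = floor(log2(k)).
    """
    u = k.bit_length() - 1
    return 8 * k - (2 * u + 1)


def encode_sorted(n, max_bytes=15):
    """Encode unsigned n so that byte strings sort like the integers."""
    if n < 0:
        raise ValueError("unsigned values only")
    base = 0                                   # smallest value of this size
    for k in range(1, max_bytes + 1):
        d = payload_bits(k)
        if n < base + (1 << d):
            u = k.bit_length() - 1
            header = (((1 << u) - 1) << (u + 1)) | (k - (1 << u))
            return ((header << d) | (n - base)).to_bytes(k, "big")
        base += 1 << d
    raise OverflowError("value exceeds the implementation limit")


def decode_sorted(buf):
    """Return (value, bytes consumed) for a value at the start of buf."""
    first = buf[0]
    u = 0
    while u < 4 and first & (0x80 >> u):       # count the leading one bits
        u += 1
    if u > 3:
        raise ValueError("more than fifteen bytes is not handled here")
    k = (1 << u) + ((first >> (7 - 2 * u)) & ((1 << u) - 1))
    d = payload_bits(k)
    base = sum(1 << payload_bits(i) for i in range(1, k))
    word = int.from_bytes(buf[:k], "big")
    return base + (word & ((1 << d) - 1)), k
```

Because Python compares `bytes` lexicographically, sorting the encoded values and sorting the integers should give the same order.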
To represent signed integers so that signed integers sort correctly with each other (but not with unsigned integers), the leading bit indicates the sign, a one bit for positive signed integers and a zero bit for negative integers, and if the signed integer is negative, we invert the bits of the byte count. Thus signed integers in the range $-2^6\le n \lt 2^6$ are represented by the corresponding eight bit value with its leading bit inverted.
This is perhaps a little too much cleverness except for the uncommon case
where we actually need a representation of signed integers that sorts
correctly.
## base 58 representation of a sequence of unsigned integers
Values $n$ in the range $0\le n \lt 58/2$ are represented by a single base 58 character.
Values $n$ in the range $58/2\le n \lt \lfloor 58*2^{-2}\rfloor*58 +58/2$ are represented by two base 58 characters starting with a base 58 character $c$ in the range $58/2\le c\lt 58/2+\lfloor 58*2^{-2}\rfloor$
Values $n$ in the range:
$$\lfloor 58*2^{-2}\rfloor*58 +58/2\le n \lt \lfloor 58*2^{-3}\rfloor*58^2 +\lfloor 58*2^{-2}\rfloor*58 +58/2$$
are similarly represented by three base 58 characters starting with a base 58 character $c$ in the range $58/2+\lfloor 58*2^{-2}\rfloor\le c \lt 58/2+\lfloor 58*2^{-2}\rfloor+\lfloor 58*2^{-3}\rfloor$.
Values $n$ in the range:
$$\lfloor 58*2^{-3}\rfloor*58^2 +\lfloor 58*2^{-2}\rfloor*58 +58/2\le n \lt \lfloor 58*2^{-4}\rfloor*58^3 +\lfloor 58*2^{-3}\rfloor*58^2 +\lfloor 58*2^{-2}\rfloor*58 +58/2$$
are similarly represented by four base 58 characters.
And so on, for arbitrarily large values. A truly enormous number is going to start with `zzzz....`, `z` being the representation of $58-1$ in base 58.
This amounts to shifting the underlying value to the appropriate range, then displaying it as the shifted base 58 value.
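We display a value in the range $0\le n \lt 58/2$ as itself, and
a value $n$ in the range $58/2\le n \lt \lfloor 58*2^{-2}\rfloor*58 +58/2$ as the two character base 58 representation of $n+(58-1)*58/2$, so that $n=58/2$ becomes the leading character $58/2$ followed by the character for zero.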
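
A sketch of the shifting rule in Python. The function name `b58_number` is hypothetical, and the alphabet is an assumption: I use the usual Bitcoin style alphabet, in which `z` stands for $58-1$. The sketch stops at five characters, because $\lfloor 58*2^{-6}\rfloor$ is zero, and it does not attempt whatever continuation produces the `zzzz....` numbers.

```python
# Assumed alphabet: 58 characters in ascending ASCII order, 'z' is 57.
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def b58_number(n):
    """Encode one unsigned integer as one to five base 58 characters."""
    if n < 0:
        raise ValueError("unsigned values only")
    base = 0        # smallest value that has this many characters
    first = 0       # smallest leading character for this many characters
    length = 1
    while 58 >> length:                      # 29, 14, 7, 3, 1 leading characters
        lead = 58 >> length
        span = lead * 58 ** (length - 1)     # how many values have this length
        if n < base + span:
            # shift into the range owned by this length, then write base 58
            v = n - base + first * 58 ** (length - 1)
            chars = []
            for _ in range(length):
                v, digit = divmod(v, 58)
                chars.append(ALPHABET[digit])
            return "".join(reversed(chars))
        base, first, length = base + span, first + lead, length + 1
    raise OverflowError("beyond five characters is not handled in this sketch")
```

Since the assumed alphabet is in ASCII order, the resulting strings should sort in the same order as the numbers they represent.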
## Variable length bit fields
To represent variable length bit fields in the postfix sort order,
such that a shorter bit field sorts after all longer bit fields
with same leading bits:
We break it into seven bit fields, with a final field representing zero to six bits.
A seven bit field is represented by a byte ending in a zero low order bit.
A variable length $m$ bit field, where $m$ is 0 to 6 (seven possible
values), is represented by a fixed width eight bit field ending in a one low order bit:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-2*c-3$\
This is $2*i+1$, where $i$ is the number of bitfields of zero to six bits that sort before this one.

----------------------------
variable     eight bit
bit field    field
-----------  ---------------
000000       0000 0001
000001       0000 0011
00000        0000 0101
000010       0000 0111
000011       0000 1001
00001        0000 1011
0000         0000 1101
000100       0000 1111
000101       0001 0001
00010        0001 0011
000110       0001 0101
000111       0001 0111
00011        0001 1001
0001         0001 1011
000          0001 1101
...          ...
111101       1110 1011
11110        1110 1101
111110       1110 1111
111111       1111 0001
11111        1111 0011
1111         1111 0101
111          1111 0111
11           1111 1001
1            1111 1011
empty        1111 1101
----------------------------
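
A sketch in Python of the mapping for the final field only. The name `tail_byte` is hypothetical, and the closed form used is the one that reproduces the rows of the table: twice the postfix position of the bitfield, plus one. The sketch makes no claim about how these bytes interleave with the bytes for full seven bit fields.

```python
def tail_byte(bits):
    """Map a final bitfield of zero to six bits, given as a string of
    '0' and '1' characters, to its odd eight bit value."""
    m = len(bits)
    if m > 6 or any(b not in "01" for b in bits):
        raise ValueError("expected zero to six bits")
    j = int(bits, 2) if bits else 0      # the bitfield read as a number
    c = bits.count("1")                  # number of set bits
    return (j + 1) * 2 ** (8 - m) - 2 * c - 3
```

For example `tail_byte("00000")` is `0b00000101` and `tail_byte("")` is `0b11111101`, and a shorter bitfield always gets a larger value than any longer bitfield that starts with the same bits.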
### SQL blobs.
In order for blobs in a database representing bitfields to sort
correctly, we do not use seven bit nibbles, but eight bit bytes,
with a final byte representing zero to seven bits as an eight bit byte.
For this we use the mapping:
Where if\
$j$ is the bitfield interpreted as a number\
$m$ is the length of the bitfield\
$c$ is a count of the set bits in the bitfield
The value of the eight bit field is:\
$(j+1)*2^{(8-m)}-c-2$
The difference is that blob is preceded by a count field
that is not used in the sort order, which is tricky to
do in a Merkle-patricia tree representing an sql index.
## Use case
QR codes and prefix free number encoding are useful in cases where we want data to be self describing: this bunch of bits is to be interpreted in a certain way, used in a certain action, means one thing, and not another thing. At present there is no standard for self description. QR codes are given meanings by the application, and could carry completely arbitrary data whose meaning and purpose comes from outside, from the context.
Ideally, it should make a connection, and that connection should then launch an interactive environment, as in the url case, where the url downloads a javascript app to address a particular database entry on a particular host.
A fixed length field is always in danger of
running out, so one needs a committee to allocate numbers.
With an arbitrary length field there is always plenty of
headroom, we can just let people use what numbers seem good
to them, and if there is a collision, well, one or both of
the colliders can move to another number.
For example, the hash of a public key structure has to contain an algorithm
identifier as to the hashing algorithm, to accommodate the possibility that
in future the existing algorithm becomes too weak, and we must introduce
new algorithms while retaining compatibility with the old. But there could
potentially be quite a lot of algorithms, though in practice initially there
will only be one, and it will be a long time before there are two.
When I say "arbitrarily large" I do not mean arbitrarily large, since this creates the possibility that someone could break something by sending a number bigger than the software can handle. There needs to be an absolute limit, such as sixty four bits, on representable numbers. But the limit should be larger than is ever likely to have a legitimate use.
# Other Solutions
## Zero byte encoding
Cap'n Proto compresses out zero bytes, and uses an encoding such that uninformative and predictable fields are zero.
## 62 bit compressed numbers
QUIC expresses a sixty two bit number in one, two, four, or eight bytes, with the two high order bits of the first byte giving the length. This is the fastest to encode and decode.
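
For comparison, a sketch in Python of the variable length integer from RFC 9000 section 16, where the two high order bits of the first byte select a length of one, two, four, or eight bytes. The function names are my own. The encoder picks the shortest form, which the RFC permits but does not require.

```python
def quic_varint_encode(n):
    """Encode n < 2**62 in the shortest of the four QUIC forms."""
    if n < 0:
        raise ValueError("unsigned values only")
    for prefix, length in ((0, 1), (1, 2), (2, 4), (3, 8)):
        bits = 8 * length - 2                # usable bits: 6, 14, 30, 62
        if n < (1 << bits):
            return ((prefix << bits) | n).to_bytes(length, "big")
    raise OverflowError("QUIC varints hold at most 62 bits")


def quic_varint_decode(buf):
    """Return (value, bytes consumed) for a varint at the start of buf."""
    length = 1 << (buf[0] >> 6)              # 1, 2, 4 or 8 bytes
    word = int.from_bytes(buf[:length], "big")
    return word & ((1 << (8 * length - 2)) - 1), length
```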
## VLQ Leading bit as number boundary
But it seems to me that the most efficient reasonably fast and elegant
solution is a variant on utf8 encoding, though not quite as fast as the
encoding used by QUIC:
Split the number into seven bit fields. For the leading fields, a one bit is
prepended making an eight bit byte. For the last field, a zero bit is prepended.
This has the capability to represent very large values, which is potentially
dangerous. The implementation has to impose a limit, but the limit can
be very large, and can be increased without breaking compatibility, and
without all implementations needing to change their limit in the same
way at the same time.
The problem with this representation is that the sort order as a bitstring
differs from the sort order of the underlying integers, which is going to
result in problems if these are used to define paths in a Merkle-patricia dag.
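
A small Python sketch of this encoding, with a hypothetical function name, that also shows the sort order problem: 16384 encodes to a byte string that sorts before the encoding of 16383.

```python
def vlq_encode(n):
    """Seven bits per byte, high bit set on every byte except the last."""
    if n < 0:
        raise ValueError("unsigned values only")
    groups = [n & 0x7F]                      # last field: zero bit prepended
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)     # leading fields: one bit prepended
        n >>= 7
    return bytes(reversed(groups))
```

Here `vlq_encode(16383)` is `ff 7f` and `vlq_encode(16384)` is `81 80 00`, so the larger number sorts first as a byte string.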
## Prefix Free Number Encoding
In this class of solutions, numbers are embedded as variable sized groups of bits within a bitstream, in a way that makes it possible to find the boundary between one number and the next. It is used in data compression, but seldom used in compressed data transmission, because it is far too slow.
This class of problem is that of a
[universal code for integers](http://en.wikipedia.org/wiki/Universal_code_%28data_compression%29).
The particular coding I propose here is a variation on
Elias encoding, though I did not realize it when I
invented it.
On reflection, my proposed encoding is too clever by half,
better to use Elias δ coding, with large arbitrary
limits on the represented numbers, rather than
clever custom coding for each field. For the intended purpose of wrapping packets, of collecting UDP packets into messages, and messages into channels, limit the range of representable values to the range j: 0 \< j \< 2\^64, and pack all the fields representing the place of this UDP packet in a bunch of messages in a bunch of channels into a single bitstream header that is then rounded up to an integral number of bytes.
We have two bitstream headers, one of which always starts with the number 5 to identify the protocol (unknown protocols are immediately ignored), and then another number to identify the encryption stream and the position in the encryption stream (no windowing). Then we decrypt the rest of the packet starting on a byte boundary. The decrypted packet then has additional bitstream headers.
For unsigned integers, we restrict the range to less than 2\^64-9. We then add 8 before encoding, and subtract 8 after decoding, so that our Elias δ encoded value always starts with two zero bits, which we always throw away. Thus the common values 0 to 7 inclusive are represented by a six bit value. I want to avoid wasting too much implied probability on the relatively low probability value of zero.
The restriction on the range is apt to produce unexpected errors, so I suppose we special case the additional 8 values, so that we can represent every signed integer.
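
A sketch in Python of the offset Elias δ coding described above, working on strings of `0` and `1` characters for readability rather than speed. The function names are hypothetical, and the sketch does not special case the eight values at the top of the range.

```python
def encode_unsigned(n):
    """Elias delta code of n + 8 with the two leading zero bits removed."""
    if not 0 <= n < 2**64 - 9:
        raise ValueError("out of the representable range")
    v = n + 8
    length = v.bit_length()                      # at least 4, since v >= 8
    zeros = length.bit_length() - 1              # at least 2
    gamma = "0" * zeros + format(length, "b")    # Elias gamma code of the length
    return (gamma + format(v, "b")[1:])[2:]      # low bits of v, two zeros dropped


def decode_unsigned(bits, pos=0):
    """Return (value, next position) for a value starting at bits[pos]."""
    zeros = 2                                    # the two discarded zero bits
    while bits[pos] == "0":
        zeros, pos = zeros + 1, pos + 1
    length = int(bits[pos:pos + zeros + 1], 2)
    pos += zeros + 1
    v = (1 << (length - 1)) | int(bits[pos:pos + length - 1], 2)
    return v - 8, pos + length - 1
```

The values 0 to 7 come out as the six bit patterns `100000` to `100111`, and 8 comes out as the seven bit pattern `1010000`.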
For signed integers, we convert to an unsigned integer\
`uint_fast64_t y; y = x < 0 ? 2*(uint_fast64_t)(-(x+1)) + 1 : 2*(uint_fast64_t)x;`\
and then represent it as a positive integer. The decoding algorithm has to know whether to call the routine for signed or unsigned. By using unsigned maths where values must always be positive, we save a bit. Which is a lot of farting around to save on one bit.
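
A sketch in Python of the signed to unsigned conversion and its inverse, assuming the intended mapping sends non negative values to even numbers and negative values to odd numbers (the names are hypothetical).

```python
def to_unsigned(x):
    """0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ..."""
    return 2 * x if x >= 0 else 2 * (-(x + 1)) + 1


def from_unsigned(y):
    """Inverse of to_unsigned."""
    return y >> 1 if y & 1 == 0 else -(y >> 1) - 1
```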
We would like a way to represent an arbitrarily large
number, a Huffman style representation of the
numbers.  This is not strictly Huffman encoding,
since we want to be able to efficiently encode and decode
large numbers, without using a table, and we do not have
precise knowledge of what the probabilities of numbers are
likely to be, other than that small numbers are
substantially more probable than large numbers.  In
the example above, we would like to be able to represent
numbers up to O(2^32^), but efficiently represent
the numbers one, and two, and reasonably efficiently
represent the numbers three and four.  So to be
strictly correct, “prefix free number encoding”. As we
shall see at the end, prefix free number encoding always
corresponds to Huffman encoding for some reasonable weights
but we are not worrying too much about weights, so are
not Huffman encoding.
### Converting to and from the representation
Assume X is a prefix free sequence of bit strings, that is to say, if we
are expecting a member of this sequence, we can tell where the member
ends.
Let \[m…n\] represent a sequence of integers m to n-1.
Then the function X→\[m…n\] is the function that converts a bit string of X
to the corresponding integer of \[m…n\], and similarly for \[m…n\]→X.
Thus X→\[m…n\] and \[m…n\]→X provide us with a prefix free representation of
numbers greater than or equal to m, and less than n.
Assume the sequence X has n elements, and we can generate and recognize
each element.
Let (X,k) be a new sequence, constructed by taking the first element of
X, and appending to it the 2^k^ bit patterns of length k, the
next element of X and appending to it the 2^k+1^ bit patterns of
length k+1, and so on and so forth.
Thus (X,k) is a function that gives us this new sequence from an existing sequence
and an integer.
The new sequence (X,k) will be a sequence of prefix free bit patterns
that has 2^n+k^ - 2^k^ elements.
We can proceed iteratively, and define a sequence ((X,j),k), which class
of sequences is useful and efficient for numbers that are typically quite
small, but could often be very large. We will more precisely
prescribe what sequences are useful and efficient for what purposes when
we relate our encoding to Huffman coding.
To generate the m+1[th]{.small} element of (X,k), where X is a
sequence that has n elements:
Let j = m + 2^k^
Let p = floor(log~2~(j)), that is to say, p is the position of
the high order bit of j, zero if j is one, one if j is two
or three, two if j is four, five, six, or seven, and so on and so forth.
We encode p into its representation using the encoding \[k…n+k\]→X, and
append to that the low order p bits of j.
To do the reverse operation, decode from the prefix free representation to
the zero based sequence position, to perform the function (X,k)→\[0…2^n+k^-2^k^\],
we extract p from the bit stream using the decoding of X→\[k…n+k\], then
extract the next p bits of the bit stream, and construct m from 2^p^-2^k^
plus the number represented by those bits.
Now all we need is an efficient sequence X for small numbers. 
Let (n) be such a sequence with n values.\
The first bit pattern of (n) is 0\
The next bit pattern of (n) is 10\
The next bit pattern of (n) is 110\
The next bit pattern of (n) is 1110\
…\
The next to last bit pattern of (n) is 11…110, containing n-2 one bits
and one zero bit.\
The last bit pattern of (n) breaks the sequence, for it is 11…11,
containing n-1 one bits and no zero bit.
The reason why we break the sequence, not permitting the
representation of unboundedly large numbers, is that
computers cannot handle unboundedly large numbers. One
must always specify a bound, or else some attacker will
cause our code to crash, producing results that we did not
anticipate, that the attacker may well be able to make use
of.
Perhaps a better solution is to waste a bit, thereby
allowing future expansion. We use a representation
that can represent arbitrarily large numbers, but clients
and servers can put some arbitrary maximum on the size of
the number. If that maximum proves too low, future clients
can just expand it without breaking backward compatibility.
This is similar to the fact that different file systems
have different arbitrary maxima for the nesting of
directories, the length of paths, and the length of
directory names. Provided the maxima are generous
it does not matter that they are not the same.
Thus the numbers 1 to 2 are represented by \[1…3\] →
(2), 1 being the pattern “0”, and 2 being the
pattern “1”
The numbers 0 to 5 are represented by \[0…6\] → (6), being the patterns\
“0”, “10”, “110”, “1110”, “11110”, “11111”
Thus \[0…6\] → (6)(3) is a bit pattern that represents the number
3, and it is “1110”
This representation is only useful if we expect our numbers
to be quite small.
\[0…6\] → ((2),1) is the sequence “00”, “01”,
“100”, “101”, “110”, “111” representing the
numbers zero to five, that is, the numbers 0 to
less than 2^2+1^ - 2^1^
\[1…15\] → ((3),1) is similarly the sequence\
“00”, “01”,\
“1000”, “1001”, “1010”, “1011”,\
“11000”, “11001”, “11010”, “11011”, “11100”, “11101”, “11110”, “11111”,\
representing the numbers one to fourteen, that is, the
numbers 1 to less than 1 + 2^3+1^ - 2^1^
We notice that ((n),k) has 2^n+k^ - 2^k^
patterns, and the shortest patterns are of length 1+k, and the
longest patterns of length 2n+k-2
This representation in general requires twice as many bits
to represent large numbers as the usual, non self
terminating representation does (assuming k to be small)
We can iterate this process again, to get the bit string sequence:\
(((n),j),k)\
which sequence has 2\^(2^n+j^ - 2^j^ + k) - 2^k^
elements. 
This representation is asymptotically efficient for very
large numbers, making further iterations pointless.
((5),1) has 62 elements, starting with a two bit pattern, and ending
with a nine bit pattern. Thus (((5),1),2) has
2^64^-4 elements, starting with a four bit pattern, and finishing
with a 72 bit pattern. 
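
A sketch in Python of the construction, to check the counts above. The names `unary` and `extend` are hypothetical: `unary(n)` is the sequence (n) and `extend(X, k)` is the sequence (X,k). Each returns the number of elements together with an encoder and a decoder working on zero based positions and on strings of `0` and `1` characters.

```python
def unary(n):
    """The sequence (n): '0', '10', '110', ..., and finally n-1 one bits."""
    def encode(i):
        if not 0 <= i < n:
            raise ValueError("out of range")
        return "1" * i + ("0" if i < n - 1 else "")

    def decode(bits, pos=0):
        i = 0
        while i < n - 1 and bits[pos] == "1":
            i, pos = i + 1, pos + 1
        if i < n - 1:
            pos += 1                      # consume the terminating zero bit
        return i, pos

    return n, encode, decode


def extend(code, k):
    """The sequence (X,k), where code is the triple describing X."""
    size, encode_x, decode_x = code

    def encode(m):
        j = m + (1 << k)
        p = j.bit_length() - 1            # floor(log2(j))
        # element p-k of X, followed by the low order p bits of j
        return encode_x(p - k) + format(j, "b")[1:]

    def decode(bits, pos=0):
        i, pos = decode_x(bits, pos)
        p = i + k
        low = int(bits[pos:pos + p], 2) if p else 0
        return (1 << p) - (1 << k) + low, pos + p

    return (1 << (size + k)) - (1 << k), encode, decode
```

With these, `extend(unary(2), 1)` lists the six patterns of ((2),1) given earlier, and `extend(extend(unary(5), 1), 2)` reports 2^64^-4 elements, with patterns running from four bits to 72 bits.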
### prefix free encoding as Huffman coding
Now let us consider a Huffman representation of the
numbers when we assign the number `n` the
weight `1/(n*(n+1)) = 1/n - 1/(n+1)`
In this case the weight of the numbers in the range `n ... m` is `1/n - 1/(m+1)`
So our bit patterns are:\
0 (representing 1)\
100 101 representing 2 to 3\
11000 11001 11010 11011 representing 4 to 7\
1110000 1110001 1110010  1110011 1110100 1110101
1110110 1110111 representing 8 to 15
We see that the Huffman coding of the numbers that are
weighted as having probability `1/(n*(n+1))`
is our friend \[1…\] → ((n),0), where n is very large.
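
As a rough check of this correspondence, a short worked sum (my own arithmetic, using only the weights defined above). The block of numbers from $2^p$ to $2^{p+1}-1$ has total weight
$$\sum_{n=2^p}^{2^{p+1}-1}\left(\frac{1}{n}-\frac{1}{n+1}\right)=\frac{1}{2^p}-\frac{1}{2^{p+1}}=2^{-(p+1)}$$
Spread over the $2^p$ numbers in the block, that is on average $2^{-(2p+1)}$ each, which corresponds to a code word of $2p+1$ bits: $p$ one bits, a zero bit, and the $p$ low order bits of the number, as in the patterns listed above.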
Thus this is good in a situation where we are quite unlikely to encounter
a big number.  However a very common situation, perhaps the most
common situation, is that we are quite likely to encounter numbers smaller
than a given small amount, but also quite likely to encounter numbers
larger than a given huge amount, that is, that the probability of encountering a
number in the range 0…5 is somewhat comparable to the probability of
encountering a number in the range 5000…50000000.
We want an encoding that corresponds to a Huffman encoding where numbers are logarithmically distributed up to some enormous limit, corresponding to an encoding where for all n, n bit numbers are represented with an only slightly larger number of bits, n+O(log(n)) bits.
In such a case, we should represent such values by members of a
prefix free sequence `((,j),k)`