---
title: Variable Length Quantity
---
I originally implemented variable length quantities following the standard.
And then I realized that an sql index represented as a Merkle-patricia tree inherently sorts in byte string order.
That is fine if we represent integers as fixed length integers in big endian format,
but variable length quantities do not sort correctly in byte string order if we follow the standard.
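For example, assuming "the standard" here means the usual continuation-bit variable length quantity (seven payload bits per byte, most significant group first, high bit set on every byte except the last, as in the MIDI format), a quick sketch in Python shows the problem:

```python
def standard_vlq(n: int) -> bytes:
    """The usual continuation-bit VLQ, assumed here to be 'the standard':
    seven bits per byte, big endian groups, high bit set on all but the last byte."""
    out = [n & 0x7F]
    n >>= 7
    while n:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(out))

# 16383 encodes as ff 7f and 16384 as 81 80 00,
# so as byte strings 16383 sorts *after* 16384:
assert standard_vlq(16383) > standard_vlq(16384)
```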
So: to represent variable length signed numbers in byte string sortable order, the integers must correspond one to one to the byte strings, a strictly sequential sequence of integers with no gaps corresponding to a strictly sequential sequence of byte strings with no gaps:
# For positive signed integers
If the leading bits are $10$, it represents a number in the range\
$0$ ... $2^6-1$ So only one byte (two bits of header, six bits to represent $2^{6}$ different
values as the trailing six bits of an ordinary eight bit
positive integer).
If the leading bits are $110$, it represents a number in the range\
$2^6$ ... $2^6+2^{13}-1$ So two bytes long (three bits of header, thirteen bits to represent $2^{13}$ different values as the trailing thirteen bits of an ordinary sixteen bit positive integer in big endian format).
if the leading bits are $1110$, it represents a number in the range\
$2^6+2^{13}$ ... $2^6+2^{13}+2^{20}-1$ So three bytes long
(four bits of header, twenty bits to represent $2^{20}$ different
values as the trailing twenty bits of an ordinary thirty two bit
positive integer in big endian format).
if the leading bits are $1111\,0$, it represents a number in the range\
$2^6+2^{13}+2^{20}$ ... $2^6+2^{13}+2^{20}+2^{27}-1$ So four bytes long
(five bits of header, twenty seven bits to represent $2^{27}$ different
values as the trailing twenty seven bits of an ordinary thirty two bit
positive integer in big endian format).
if the leading bits are $1111\,10$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}-1$
So five bytes long.
if the leading bits are $1111\,110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}-1$
So six bytes long.
if the leading bits are $1111\,1110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}-1$
So seven bytes long.
The reason for these complicated offsets is to ensure that the byte strings are strictly sequential.
if the bits of the first byte are $1111\,1111$, we change representations.
Instead the number is represented by a variable
length quantity that is a count of the
bytes in the rest of the byte string, and the rest of the byte string is the number itself in its
natural binary big endian form, with the leading zero bytes discarded.
So we are no longer using these complicated offsets for the number itself,
but are using them for the byte count.
This change in representation simplifies coding and speeds up the transformation,
but costs an extra byte for numbers larger than $2^{48}$ and less than $2^{55}$.
And so on and so forth in the same pattern for positive signed numbers of unlimited size.
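As a minimal sketch of the positive side in Python (the name `encode_positive` and the loop structure are mine for illustration, not the wallet's actual code):

```python
def encode_positive(n: int) -> bytes:
    """Encode a non-negative integer as a byte string that sorts
    in the same order as the integers it came from."""
    assert n >= 0
    base = 0                        # smallest value representable in k bytes
    for k in range(1, 8):           # direct forms: header of k ones then a zero
        payload_bits = 7 * k - 1    # bits left after the k+1 header bits
        if n < base + (1 << payload_bits):
            header = ((1 << (k + 1)) - 2) << payload_bits   # k ones, then a zero
            return (header | (n - base)).to_bytes(k, "big")
        base += 1 << payload_bits
    # first byte 1111 1111: the rest is a byte count, then the number itself
    # in its natural big endian form with leading zero bytes discarded
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return b"\xff" + encode_positive(len(body)) + body
```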
## examples
The bytestring 0xCABC corresponds to the integer 0x0AFC.\
The bytestring 0xEABEEF corresponds to the integer 0x0ADF2F.
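These check out against the hypothetical `encode_positive` sketch above:

```python
>>> encode_positive(0x0AFC).hex()
'cabc'
>>> encode_positive(0x0ADF2F).hex()
'eabeef'
```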
# For negative signed integers
If the leading bits are $01$, it represents a number in the range\
$-2^6$ ... $-1$ So only one byte (two bits of header,
six bits to represent $2^6$ different values as the
trailing six bits of an ordinary eight bit negative integer).
If the leading bits are $001$, it represents a number in the range\
$-2^6-2^{13}$ ... $-2^6-1$ So two bytes (three bits of header,
thirteen bits to represent $2^{13}$ different values as the trailing
thirteen bits of an ordinary sixteen bit negative integer in big endian format).
if the leading bits are $0001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}$ ... $-2^6-2^{13}-1$ So three bytes long.
if the leading bits are $0000\,1$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}$ ... $-2^6-2^{13}-2^{20}-1$
So four bytes long (five bits of header, twenty seven bits to represent
$2^{27}$ different values as the trailing twenty seven bits of
an ordinary thirty two bit negative integer in big endian format).
if the leading bits are $0000\,01$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}$ ... $-2^6-2^{13}-2^{20}-2^{27}-1$
So five bytes long.
if the leading bits are $0000\,001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-1$
So six bytes long.
if the leading bits are $0000\,0001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-1$
So seven bytes long.
if the bits of the first byte are $0000\,0000$, we change representations.
Instead the number is represented by a variable length quantity that is
*zero minus the count* of bytes in the rest of the byte string,
and the rest of the byte string is the negative number itself in its natural binary big endian form,
with the leading minus one bytes discarded.
So we are no longer using these complicated offsets for the number itself,
but are using them for the byte count.
We use the negative of the count in order to get the correct
sort order on the underlying byte strings, so that they can be
represented in a Merkle-patricia tree representing an index.
And so on and so forth in the same pattern for negative signed numbers of unlimited size.
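A matching sketch for the negative side, under the same caveats as the positive sketch above (illustrative names and structure, not the wallet's code):

```python
def encode_negative(v: int) -> bytes:
    """Encode a negative integer as a byte string that sorts in the same
    order as the integers, and below every non-negative encoding."""
    assert v < 0
    base = 0                       # the k byte form covers -(base + 2**(7*k-1)) ... -(base + 1)
    for k in range(1, 8):          # direct forms: header of k zeros then a one
        payload_bits = 7 * k - 1
        if v >= -(base + (1 << payload_bits)):
            payload = v + base + (1 << payload_bits)        # zero for the most negative value
            return ((1 << payload_bits) | payload).to_bytes(k, "big")
        base += 1 << payload_bits
    # first byte 0000 0000: the rest is zero minus a byte count, then the number
    # in two's complement big endian form with leading 0xFF bytes discarded
    nbytes = ((~v).bit_length() + 8) // 8      # minimum signed big endian length
    body = v.to_bytes(nbytes, "big", signed=True)
    return b"\x00" + encode_negative(-nbytes) + body
```

Every negative encoding starts with a byte below 0x80 and every non-negative encoding with a byte of 0x80 or above, so negative numbers sort below positive numbers, as required.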
# bitstrings
Bitstrings in a Merkle-patricia tree representing an sql index
are typically very short, so should be represented by a
variable length quantity, except for the leaf edge,
which is fixed size and large, so should not be
represented by a variable length quantity.
We use the integer zero to represent this special case,
the integer one to represent the zero length bit string,
integers two and three to represent the one bit bitstring,
integers four to seven to represent the two bit bitstring,
and so on and so forth.
In other words, we represent it as the integer obtained
by prepending a leading one bit to the bit string.
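A sketch of this mapping, assuming for illustration that a bitstring is held as a text string of '0' and '1' characters; the resulting integer is then stored as a variable length quantity as above:

```python
def bitstring_to_int(bits: str) -> int:
    """Prepend a leading one bit: '' -> 1, '0' -> 2, '1' -> 3, '00' -> 4, ...
    The integer 0 is reserved for the fixed size leaf edge."""
    return int("1" + bits, 2)

def int_to_bitstring(n: int) -> str:
    """Inverse mapping: strip the leading one bit."""
    assert n >= 1
    return bin(n)[3:]              # drop the '0b' prefix and the leading one bit
```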
# Dewey decimal sequences
The only thing we ever want to do with Dewey decimal sequences is $<=>$,
and the numbers in them are always positive and less than $10^{14}$, so we represent each sequence as
its numbers encoded as variable length quantities, terminated by the number $-1$,
and compare the sequences as bytestrings.
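A sketch of the sequence encoding, using the hypothetical `encode_positive` and `encode_negative` sketches above. The terminator $-1$ encodes to the single byte 0x7F, which is below the first byte of any positive element, so a sequence sorts before any longer sequence that extends it:

```python
def encode_dewey(parts) -> bytes:
    """Encode a Dewey decimal sequence such as 3.14.15 -> [3, 14, 15]
    so that the sequences compare correctly as byte strings."""
    assert all(0 < p < 10**14 for p in parts)
    encoded = b"".join(encode_positive(p) for p in parts)
    return encoded + encode_negative(-1)       # terminator sorts below any element

# for example 3.14 sorts before 3.14.15, which sorts before 3.15:
assert encode_dewey([3, 14]) < encode_dewey([3, 14, 15]) < encode_dewey([3, 15])
```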