---
title: Variable Length Quantity
---

I originally implemented variable length quantities following the standard.

Then I realized that an SQL index represented as a Merkle-patricia tree inherently sorts in byte string order.
That is fine if we represent integers as fixed length integers in big endian format,
but it does not correctly sort variable length quantities if we follow the standard.

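To make the sorting problem concrete, here is a minimal sketch. It assumes "the standard" means the MIDI-style variable length quantity (seven payload bits per byte, most significant group first, high bit set on every byte except the last). That reading is an assumption, and `vlq_encode` is a name made up for this illustration.

```python
def vlq_encode(n: int) -> bytes:
    """Encode a non-negative integer as a MIDI-style VLQ (assumed 'standard')."""
    if n < 0:
        raise ValueError("this sketch handles non-negative integers only")
    groups = [n & 0x7F]                    # last byte: continuation bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)   # earlier bytes: continuation bit set
        n >>= 7
    return bytes(reversed(groups))


small, large = 256, 16384
print(vlq_encode(small).hex())   # 8200
print(vlq_encode(large).hex())   # 818000
# The larger integer produces the smaller byte string,
# so byte string order is not integer order.
print(vlq_encode(large) < vlq_encode(small))   # True
```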
So, to represent variable length signed numbers in byte string sortable order, the integer sequence must correspond one to one to the byte string sequence: a strictly sequential sequence of integers with no gaps corresponds to a strictly sequential sequence of byte strings with no gaps.

# For positive signed integers

If the leading bits are $10$, it represents a number in the range\
$0$ ... $2^6-1$, so only one byte (two bits of header, six bits to represent $2^{6}$ different
values as the trailing six bits of an ordinary eight bit
positive integer).

If the leading bits are $110$, it represents a number in the range\
$2^6$ ... $2^6+2^{13}-1$, so two bytes.

If the leading bits are $1110$, it represents a number in the range\
$2^6+2^{13}$ ... $2^6+2^{13}+2^{20}-1$, so three bytes long
(four bits of header, twenty bits to represent $2^{20}$ different
values as the trailing twenty bits of an ordinary thirty two bit
positive integer in big endian format).

If the leading bits are $1111\,0$, it represents a number in the range\
$2^6+2^{13}+2^{20}$ ... $2^6+2^{13}+2^{20}+2^{27}-1$, so four bytes long
(five bits of header, twenty seven bits to represent $2^{27}$ different
values as the trailing twenty seven bits of an ordinary thirty two bit
positive integer in big endian format).

If the leading bits are $1111\,10$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}-1$,
so five bytes long.

If the leading bits are $1111\,110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}-1$,
so six bytes long.

If the leading bits are $1111\,1110$, it represents a number in the range\
$2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}-1$,
so seven bytes long.

The reason for these complicated offsets is to ensure that the byte strings are strictly sequential.

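The one to seven byte forms above can be sketched as follows. This is an illustration of the ranges as stated, not a reference implementation, and the names `encode_short` and `decode_short` are made up here. A form of $k$ bytes has $k$ one bits and a zero bit as header and $7k-1$ payload bits, and the payload is the number minus the sum of the sizes of all shorter forms.

```python
def encode_short(n: int) -> bytes:
    """Encode 0 <= n < 2^6 + 2^13 + ... + 2^48 in the one to seven byte forms."""
    if n < 0:
        raise ValueError("non-negative integers only")
    offset = 0                                   # count of values in shorter forms
    for k in range(1, 8):                        # k = total length in bytes
        payload_bits = 7 * k - 1                 # 6, 13, 20, ..., 48
        if n < offset + (1 << payload_bits):
            header = ((1 << k) - 1) << (7 * k)   # k one bits, then a zero bit
            return (header | (n - offset)).to_bytes(k, "big")
        offset += 1 << payload_bits
    raise ValueError("too large for the short forms")


def decode_short(data: bytes) -> int:
    """Inverse of encode_short; the length is the number of leading one bits."""
    k = 0
    while k < 8 and data[0] & (0x80 >> k):
        k += 1
    if k == 0 or k == 8 or len(data) < k:
        raise ValueError("not a short form positive encoding")
    payload_mask = (1 << (7 * k - 1)) - 1
    offset = sum(1 << (7 * j - 1) for j in range(1, k))
    return offset + (int.from_bytes(data[:k], "big") & payload_mask)


for n in (0, 63, 64, 8255, 8256):
    print(n, encode_short(n).hex())   # 80, bf, c000, dfff, e00000

# Consecutive integers give strictly increasing byte strings.
print(all(encode_short(i) < encode_short(i + 1) for i in range(20000)))   # True
```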
If the bits of the first byte are $1111\,1111$, we change representations.
Instead, the number is represented by a variable
length quantity that is a count of the
bytes in the rest of the byte string, and the rest of the byte string is the number itself in its
natural binary big endian form, with the leading zero bytes discarded.
So we no longer use these complicated offsets for the number itself,
but we do use them for the byte count.

This change in representation simplifies coding and speeds up the transformation,
but costs an extra byte for numbers larger than $2^{48}$ and less than $2^{55}$.

And so on and so forth in the same pattern for positive signed numbers of unlimited size.

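A minimal sketch of the $1111\,1111$ form, under one simplifying assumption: the byte count is small enough to fit the one byte form (`0x80 | count`, so fewer than 64 bytes). The names `SHORT_LIMIT` and `encode_long` are hypothetical.

```python
# First integer that no longer fits the one to seven byte forms.
SHORT_LIMIT = sum(1 << (7 * k - 1) for k in range(1, 8))


def encode_long(n: int) -> bytes:
    """Encode n >= SHORT_LIMIT as 0xFF, a byte count, then big endian bytes."""
    if n < SHORT_LIMIT:
        raise ValueError("this number has a short form")
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")   # no leading zero bytes
    count = len(body)
    if count >= 64:
        raise ValueError("sketch only handles counts that fit the one byte form")
    return b"\xff" + bytes([0x80 | count]) + body


print(encode_long(SHORT_LIMIT).hex())   # ff87 followed by seven bytes
print(encode_long(1 << 56).hex())       # ff880100000000000000
# More bytes means a larger count byte, so longer numbers sort later.
print(encode_long((1 << 56) - 1) < encode_long(1 << 56))   # True
```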
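## examples

The bytestring 0xCABC corresponds to the integer 0x0AFC
(the thirteen bit payload 0x0ABC plus the offset $2^6$ = 0x40).\
The bytestring 0xEABEEF corresponds to the integer 0x0ADF2F
(the twenty bit payload 0x0ABEEF plus the offset $2^6+2^{13}$ = 0x2040).
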
# For negative signed integers

If the leading bits are $01$, it represents a number in the range\
$-2^6$ ... $-1$, so only one byte (two bits of header,
six bits to represent $2^6$ different values as the
trailing six bits of an ordinary eight bit negative integer).

If the leading bits are $001$, it represents a number in the range\
$-2^6-2^{13}$ ... $-2^6-1$, so two bytes (three bits of header,
thirteen bits to represent $2^{13}$ different values as the trailing
thirteen bits of an ordinary sixteen bit negative integer in big endian format).

If the leading bits are $0001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}$ ... $-2^6-2^{13}-1$, so three bytes long.

If the leading bits are $0000\,1$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}$ ... $-2^6-2^{13}-2^{20}-1$,
so four bytes long (five bits of header, twenty seven bits to represent
$2^{27}$ different values as the trailing twenty seven bits of
an ordinary thirty two bit negative integer in big endian format).

If the leading bits are $0000\,01$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}$ ... $-2^6-2^{13}-2^{20}-2^{27}-1$,
so five bytes long.

If the leading bits are $0000\,001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-1$,
so six bytes long.

If the leading bits are $0000\,0001$, it represents a number in the range\
$-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-1$,
so seven bytes long.

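The negative one to seven byte forms mirror the positive ones, and can be sketched like this. The name `encode_negative_short` is made up for the illustration. A form of $k$ bytes has $k$ zero bits and a one bit as header, and the payload counts upward from the most negative value of that form, so that $-1$ gets the largest byte string.

```python
def encode_negative_short(n: int) -> bytes:
    """Encode -(2^6 + 2^13 + ... + 2^48) <= n <= -1 in the one to seven byte forms."""
    if n >= 0:
        raise ValueError("negative integers only")
    offset = 0                                    # magnitude covered by shorter forms
    for k in range(1, 8):                         # k = total length in bytes
        payload_bits = 7 * k - 1                  # 6, 13, 20, ..., 48
        low = -(offset + (1 << payload_bits))     # most negative value of this form
        if n >= low:
            marker = 1 << payload_bits            # k zero bits, then a one bit
            return (marker | (n - low)).to_bytes(k, "big")
        offset += 1 << payload_bits
    raise ValueError("too negative for the short forms")


for n in (-1, -64, -65, -8256, -8257):
    print(n, encode_negative_short(n).hex())   # 7f, 40, 3fff, 2000, 1fffff

# Consecutive integers give strictly increasing byte strings,
# and -1 (0x7f) sorts just below zero (0x80).
print(all(encode_negative_short(i) < encode_negative_short(i + 1)
          for i in range(-20000, -1)))   # True
```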
If the bits of the first byte are $0000\,0000$, we change representations.
Instead, the number is represented by a variable length quantity that is
*zero minus the count* of bytes in the rest of the byte string, and the rest of the byte string
is the negative number itself in its natural binary big endian form,
with the leading minus one bytes (0xFF) discarded.
So we no longer use these complicated offsets for the number itself,
but we do use them for the byte count.
We use the negative of the count in order to get the correct
sort order on the underlying byte strings, so that they can be
represented in a Merkle-patricia tree representing an index.

And so on and so forth in the same pattern for negative signed numbers of unlimited size.

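A minimal sketch of the $0000\,0000$ form, assuming the byte count is at most 64 so that zero minus the count fits the one byte negative form (`0x80 - count`). The names `SHORT_LIMIT` and `encode_negative_long` are hypothetical.

```python
# Magnitude of the most negative integer that still has a one to seven byte form.
SHORT_LIMIT = sum(1 << (7 * k - 1) for k in range(1, 8))


def encode_negative_long(n: int) -> bytes:
    """Encode n < -SHORT_LIMIT as 0x00, minus the byte count, then the bytes."""
    if n >= -SHORT_LIMIT:
        raise ValueError("this number has a short form")
    # Bytes that remain once the leading 0xFF bytes of the two's complement
    # big endian form are dropped.
    count = ((-n - 1).bit_length() + 7) // 8
    if count > 64:
        raise ValueError("sketch only handles counts that fit the one byte form")
    body = (n + (1 << (8 * count))).to_bytes(count, "big")
    return b"\x00" + bytes([0x80 - count]) + body


print(encode_negative_long(-(1 << 56)).hex())       # 007900000000000000
print(encode_negative_long(-(1 << 56) - 1).hex())   # 0078feffffffffffffff
# More bytes means a smaller count byte, so more negative numbers sort earlier.
print(encode_negative_long(-(1 << 56) - 1) < encode_negative_long(-(1 << 56)))   # True
```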
# bitstrings

Bitstrings in a Merkle-patricia tree representing an SQL index
are typically very short, so should be represented by a
variable length quantity, except for the leaf edge,
which is fixed size and large, so should not be
represented by a variable length quantity.

We use the integer zero to represent this special case (the leaf edge),
the integer one to represent the zero length bit string,
integers two and three to represent the one bit bit strings,
integers four to seven to represent the two bit bit strings,
and so on and so forth.

In other words, we represent a bit string as the integer obtained
by prepending a leading one bit to the bit string.

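The mapping between bit strings and integers can be sketched as below, with bit strings written as Python strings of `'0'` and `'1'` characters for readability. The function names are made up for this illustration.

```python
def bitstring_to_int(bits: str) -> int:
    """Prepend a one bit to the bit string and read the result as an integer."""
    if any(c not in "01" for c in bits):
        raise ValueError("bit strings contain only '0' and '1'")
    return int("1" + bits, 2)


def int_to_bitstring(n: int) -> str:
    """Inverse mapping; zero is reserved for the leaf edge and has no bit string."""
    if n < 1:
        raise ValueError("zero is the leaf edge marker, not a bit string")
    return bin(n)[3:]          # drop the '0b' prefix and the prepended one bit


for bits in ("", "0", "1", "00", "11"):
    print(repr(bits), bitstring_to_int(bits))   # 1, 2, 3, 4, 7
```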
# Dewey decimal sequences

The only thing we ever want to do with Dewey decimal sequences is $<=>$,
and their components are always positive numbers less than $10^{14}$, so we represent them as
a sequence of variable length numbers terminated by the number minus one
and compare them as bytestrings.
Since minus one sorts below every positive number, a sequence sorts before any longer sequence that it is a prefix of.

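A minimal sketch of comparing Dewey decimal sequences as bytestrings. Since $10^{14} < 2^{47}$, every component fits the one to seven byte positive forms, and minus one is the single byte 0x7F. The names `encode_component`, `TERMINATOR` and `encode_dewey` are hypothetical.

```python
def encode_component(n: int) -> bytes:
    """Short form encoding of 0 <= n < 10**14, following the ranges given earlier."""
    if not 0 <= n < 10**14:
        raise ValueError("component out of range")
    offset = 0
    for k in range(1, 8):                        # k = total length in bytes
        payload_bits = 7 * k - 1
        if n < offset + (1 << payload_bits):
            header = ((1 << k) - 1) << (7 * k)   # k one bits, then a zero bit
            return (header | (n - offset)).to_bytes(k, "big")
        offset += 1 << payload_bits
    raise AssertionError("unreachable for n < 10**14")


TERMINATOR = b"\x7f"   # minus one in the one byte negative form


def encode_dewey(parts) -> bytes:
    """Concatenate the encoded components and append the terminator."""
    return b"".join(encode_component(p) for p in parts) + TERMINATOR


sequences = [[2], [1, 10], [1, 2, 1], [1, 2], [1, 100]]
# Sorting by bytestring gives the same order as sorting the sequences themselves.
print(sorted(sequences, key=encode_dewey))
# [[1, 2], [1, 2, 1], [1, 10], [1, 100], [2]]
```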