that sort as bitstrings
title |
---|
Variable Length Quantity |
I originally implemented variable length quantities following the standard.
Then I realized that an SQL index represented as a Merkle-patricia tree inherently sorts in byte string order. That is fine if we represent integers as fixed length integers in big endian format, but variable length quantities that follow the standard do not sort correctly in byte string order.
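A minimal sketch of the problem, assuming "the standard" means the usual big endian base 128 variable length quantity (seven data bits per byte, high bit set on every byte except the last). The function names `vlq_standard` and `fixed_be` are made up for this illustration.

```python
def vlq_standard(n: int) -> bytes:
    """Big endian base 128 VLQ: high bit set on all bytes but the last."""
    if n < 0:
        raise ValueError("unsigned values only")
    groups = [n & 0x7F]                   # last byte, continuation bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)  # earlier bytes, continuation bit set
        n >>= 7
    return bytes(reversed(groups))


def fixed_be(n: int) -> bytes:
    """Fixed length 32 bit big endian integer."""
    return n.to_bytes(4, "big")


values = [127, 128, 255, 256, 16383, 16384]

# Fixed length big endian byte strings sort in numeric order.
fixed_sorted = sorted(values, key=fixed_be)

# Standard VLQ byte strings do not: 16384 encodes as 81 80 00,
# which sorts before 256 (82 00) and before 16383 (ff 7f).
vlq_sorted = sorted(values, key=vlq_standard)

print(fixed_sorted)  # [127, 128, 255, 256, 16383, 16384]
print(vlq_sorted)    # [127, 128, 255, 16384, 256, 16383]
```

The length of a standard variable length quantity is not visible in its first byte, so a longer, larger number can begin with a smaller byte than a shorter, smaller one.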
So, to represent variable length signed numbers in byte string sortable order:
For positive signed integers
If the leading bits are $10$, it represents a number in the range $0$ ... $2^6-1$. So only one byte.
If the leading bits are $110$, it represents a number in the range $2^6$ ... $2^6+2^{13}-1$. So two bytes.
If the leading bits are $1110$, it represents a number in the range $2^6+2^{13}$ ... $2^6+2^{13}+2^{20}-1$. So three bytes long.
If the leading bits are $1111\,0$, it represents a number in the range $2^6+2^{13}+2^{20}$ ... $2^6+2^{13}+2^{20}+2^{27}-1$. So four bytes long (five bits of header, twenty seven bits to represent $2^{27}$ different values as the trailing twenty seven bits of an ordinary thirty two bit positive integer in big endian format).
If the leading bits are $1111\,10$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}-1$. So five bytes long.
If the leading bits are $1111\,110$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}+2^{34}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}-1$. So six bytes long.
If the leading bits are $1111\,1110$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}-1$. So seven bytes long.
If the leading bits are $1111\,1111\,0$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}-1$. So eight bytes long.
If the leading bits are $1111\,1111\,10$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}-1$. So nine bytes long (ten bits of header, sixty two bits to represent $2^{62}$ different values as the trailing sixty two bits of an ordinary sixty four bit positive integer in big endian format).
If the leading bits are $1111\,1111\,110$, it represents a number in the range $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}$ ... $2^6+2^{13}+2^{20}+2^{27}+2^{34}+2^{41}+2^{48}+2^{55}+2^{62}+2^{69}-1$. So ten bytes long.
And so on and so forth in the same pattern for positive signed numbers of unlimited size.
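The reason for these complicated offsets is to ensure that the byte strings are strictly sequential: each length picks up exactly where the previous length ends, so no integer has more than one encoding.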
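The pattern above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the name `encode_nonneg` is made up, and the sketch assumes that an encoding of $k$ bytes always has a header of $k$ one bits followed by a zero bit, leaving $7k-1$ payload bits.

```python
def encode_nonneg(n: int) -> bytes:
    """Encode n >= 0 so that byte string order matches numeric order."""
    if n < 0:
        raise ValueError("n must be non-negative")
    length = 1   # total bytes in the encoding
    offset = 0   # smallest value that needs `length` bytes
    # A k byte encoding carries 7k - 1 payload bits, so it covers
    # 2**(7k - 1) values starting at the running offset.
    while n - offset >= 1 << (7 * length - 1):
        offset += 1 << (7 * length - 1)
        length += 1
    payload_bits = 7 * length - 1
    # Header: `length` one bits followed by a zero bit (10, 110, 1110, ...).
    header = ((1 << length) - 1) << 1
    word = (header << payload_bits) | (n - offset)
    return word.to_bytes(length, "big")


print(encode_nonneg(0).hex())                       # 80
print(encode_nonneg(2**6 - 1).hex())                # bf
print(encode_nonneg(2**6).hex())                    # c000
print(encode_nonneg(2**6 + 2**13).hex())            # e00000
print(encode_nonneg(2**6 + 2**13 + 2**20).hex())    # f0000000
```

Because the offset is subtracted before the payload is written, the last value of one length and the first value of the next length are adjacent both as integers and as byte strings.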
For negative signed integers
If the leading bits are $01$, it represents a number in the range $-2^6$ ... $-1$. So only one byte (two bits of header, six bits to represent $2^6$ different values as the trailing six bits of an ordinary eight bit negative integer).
If the leading bits are $001$, it represents a number in the range $-2^6-2^{13}$ ... $-2^6-1$. So two bytes (three bits of header, thirteen bits to represent $2^{13}$ different values as the trailing thirteen bits of an ordinary sixteen bit negative integer in big endian format).
If the leading bits are $0001$, it represents a number in the range $-2^6-2^{13}-2^{20}$ ... $-2^6-2^{13}-1$. So three bytes long.
If the leading bits are $0000\,1$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}$ ... $-2^6-2^{13}-2^{20}-1$. So four bytes long (five bits of header, twenty seven bits to represent $2^{27}$ different values as the trailing twenty seven bits of an ordinary thirty two bit negative integer in big endian format).
If the leading bits are $0000\,01$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}$ ... $-2^6-2^{13}-2^{20}-2^{27}-1$. So five bytes long.
If the leading bits are $0000\,001$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-1$. So six bytes long.
If the leading bits are $0000\,0001$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-1$. So seven bytes long.
If the leading bits are $0000\,0000\,1$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-1$. So eight bytes long.
If the leading bits are $0000\,0000\,01$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-2^{62}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-1$. So nine bytes long (ten bits of header, sixty two bits to represent $2^{62}$ different values as the trailing sixty two bits of an ordinary sixty four bit negative integer in big endian format).
If the leading bits are $0000\,0000\,001$, it represents a number in the range $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-2^{62}-2^{69}$ ... $-2^6-2^{13}-2^{20}-2^{27}-2^{34}-2^{41}-2^{48}-2^{55}-2^{62}-1$. So ten bytes long.
And so on and so forth in the same pattern for negative signed numbers of unlimited size.
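The negative half mirrors the positive half: with the ranges above, the encoding of a negative $n$ works out to the bitwise complement of the positive encoding of $-n-1$. The following sketch relies on that observation to handle both signs. The names `encode` and `decode` are hypothetical, and `decode` assumes its argument holds exactly one encoded quantity.

```python
def encode(n: int) -> bytes:
    """Encode any integer so that byte string order matches numeric order."""
    m = n if n >= 0 else ~n              # ~n == -n - 1, so m >= 0
    length, offset = 1, 0
    while m - offset >= 1 << (7 * length - 1):
        offset += 1 << (7 * length - 1)
        length += 1
    payload_bits = 7 * length - 1
    header = ((1 << length) - 1) << 1    # 10, 110, 1110, ...
    word = (header << payload_bits) | (m - offset)
    if n < 0:
        word ^= (1 << (8 * length)) - 1  # flip every bit: 01, 001, 0001, ...
    return word.to_bytes(length, "big")


def decode(data: bytes) -> int:
    """Inverse of encode; `data` must hold exactly one encoded quantity."""
    if not data:
        raise ValueError("empty input")
    length = len(data)
    word = int.from_bytes(data, "big")
    negative = data[0] < 0x80            # leading bit 0 marks a negative number
    if negative:
        word ^= (1 << (8 * length)) - 1  # undo the complement
    payload_bits = 7 * length - 1
    header = ((1 << length) - 1) << 1
    if word >> payload_bits != header:
        raise ValueError("header does not match length")
    # Offset for this length: the count of all values covered by shorter lengths.
    offset = sum(1 << (7 * k - 1) for k in range(1, length))
    m = (word & ((1 << payload_bits) - 1)) + offset
    return ~m if negative else m


print(encode(-1).hex())             # 7f
print(encode(-2**6).hex())          # 40
print(encode(-2**6 - 1).hex())      # 3fff
print(decode(bytes.fromhex("3fff")))  # -65
```

Sorting a list of integers by `encode` gives the same order as sorting them numerically, which is the property the index needs.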
bitstrings
Bitstrings in a Merkle-patricia tree representing an SQL index are typically very short, so they should be represented by a variable length quantity. The exception is the leaf edge, which is fixed size and large, so it should not be represented by a variable length quantity.
We use the integer zero to represent this special case, the integer one to represent the zero length bitstring, integers two and three to represent the two one bit bitstrings, integers four to seven to represent the four two bit bitstrings, and so on and so forth.
In other words, we represent a bitstring as the integer obtained by prepending a leading one bit to it.
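A short sketch of this mapping, with bitstrings written as Python strings of `'0'` and `'1'` characters for readability. The names `LEAF_EDGE`, `bitstring_to_int`, and `int_to_bitstring` are made up for the example.

```python
LEAF_EDGE = 0  # reserved: marks the fixed size leaf edge, not a bitstring


def bitstring_to_int(bits: str) -> int:
    """Prepend a one bit and read the result as a binary number."""
    if any(c not in "01" for c in bits):
        raise ValueError("bitstring may contain only '0' and '1'")
    return int("1" + bits, 2)


def int_to_bitstring(n: int) -> str:
    """Inverse mapping: drop the prepended one bit."""
    if n < 1:
        raise ValueError("zero is reserved for the leaf edge")
    return bin(n)[3:]  # strip the '0b' prefix and the leading one bit


print(bitstring_to_int(""))    # 1
print(bitstring_to_int("0"))   # 2
print(bitstring_to_int("1"))   # 3
print(bitstring_to_int("00"))  # 4
print(bitstring_to_int("11"))  # 7
print(int_to_bitstring(5))     # 01
```

The prepended one bit preserves the length of the bitstring, so `"0"` and `"00"` map to different integers even though both have the numeric value zero.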