---
#katex
title: Number encoding
sidebar: true
...

I have spent far too much time implementing and thinking about
variable length quantities.

I became deeply confused about the suitable representation of such
quantities in patricia trees, because I lost track of the levels of
encapsulation.

A patricia vertex represents a prefix with a bitcount.
A patricia tree represents a prefix free index. A patricia vertex
encapsulates a bitcount of the prefix, and is encapsulated
by a bytecount of the vertex.

There is mud all over my web pages resulting from this confusion.

If we ever have a patricia tree of integers, variable length number encoding
means that a prefix is still just a bit string and a bit count:
the common bits of the variable length encoded integers of its children.

Because the variable length representations of the integers have normal byte order,
we do not have to worry that some of them are different lengths, nor
about where the binary point is relative to the start of the prefix, a point
on which I was endlessly confused.

# history of mucking around.

I originally implemented variable length quantities following the standard.

Then I realized that an sql index represented as a Merkle-patricia tree
inherently sorts in bytestring order. That is fine if we represent integers
as fixed length integers in big endian format, but it does not correctly
sort variable length quantities encoded according to the standard.

# The problem to be solved

As computers and networks grow, any fixed length field
in a protocol tends to become obsolete. Therefore, for future
upwards compatibility, we want to have variable precision
numbers.

Secondly, to represent integers within a Merkle-patricia tree representing a database index,
we want all values to be left field aligned, rather than right field aligned,
which requires some form of variable length encoding that preserves the order
relationship between integers, so that the bytestring order is the same as
the integer order.

## Merkle-patricia dag

### patricia tree

A patricia tree inherently represents a collection of prefix free bitstrings,
which means the represented bitstrings must be self terminating,
which for fixed length bitstrings is just reaching a certain number of bits,
and for C strings is eight zero bits, byte aligned.

But the vertices of a patricia tree represent a collection of prefixes.
Each vertex represents some left aligned bits and a count of those bits.
The count of the bits is extrinsic to the bits: they are inherently not self terminating.
The vertices themselves are also not self terminating; they have an extrinsic byte count or bit count
which is not itself part of the vertex, but part of the data structure within which
the vertex is stored or represented.

We could *represent* the vertices by a self terminating bytestring or bitstring, but only
if we wanted to put the vertices of a patricia tree inside *another* patricia tree, which
seems like a stupid thing to do under most circumstances.
And what is being represented is itself inherently not self terminating.

The leaves of the patricia tree represent a data structure
whose *index* is in the patricia tree, the index being a fixed length set of fields,
each field itself a self terminating string of bits. Thus the leaf is not a
patricia vertex, but an object of a different kind (unless of course the
index fully represents all the information of the object).

The links inside a vertex also represent a short string of bits:
the bits that the vertex pointed to has in addition to the bits
that the pointing vertex already has.

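To make the levels of encapsulation concrete, here is a minimal sketch of such a vertex in C++, assuming binary branching; the type and field names are illustrative, not from any existing codebase:

```cpp
// A minimal sketch of a patricia vertex as described above; the bitcount
// is extrinsic to the prefix bits, and the vertex's own byte count lives
// in whatever structure stores the vertex, not in the vertex itself.
#include <cstdint>
#include <vector>

struct PatriciaVertex {
    std::vector<uint8_t> prefix;       // left aligned bits of the prefix
    uint32_t bitcount;                 // how many of those bits are meaningful;
                                       // not recoverable from prefix alone
    struct Link {
        std::vector<uint8_t> extra;    // bits the child has beyond this prefix
        uint32_t extra_bitcount;
        uint64_t child;                // handle of the child vertex or leaf
    };
    Link zero, one;                    // branch on the first differing bit
};
```
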
### Merkle dag

We intend to have a vast Merkle dag, and a vast collection of immutable
append only data structures. Each new block in the append only data
structure is represented by a hash, whose preimage is a Merkle vertex. A
path through the Merkle dag is represented by a consecutive sequence of integers, which
represent not a Merkle-patricia tree, but a sequence of immutable Merkle-patricia
trees that represent the mutable Merkle-patricia tree that is the current state of
the blockchain.

## Compression algorithm preserving sort order

We want to represent integers by byte strings whose
lexicographic order reflects their order as integers,
which is to say, strings that when sorted as a left aligned field sort like integers
represented as a right aligned field. (A Merkle-patricia
tree has a hard time with right aligned fields, and we do not
want to represent integers by a fixed length field, because the fixed
length will always be either too big or too small, as has given TCP no end
of grief.)

To do this we have a field that is a count of the bytes minus one,
and the size of that field is encoded in unary.

### representation of unsigned integers as variable length quantities.

An unsigned integer value that fits in a single byte starts with $0$, that being unary for a zero
width byte count field. This leaves seven bits to represent the unsigned integer.

Thus an unsigned integer value in the range $0\le n \lt 2^7$ is represented by the
unsigned eight bit integer itself.

An unsigned integer in the range $2^7\le n \lt 2^{13}$ starts with the bits $10\,0$,
$10$ being unary for a one bit wide field, and $0$ being that one bit, indicating
two bytes. Thus it is represented by the integer itself $+0\text{x}\,8000$
in big endian format.

There are intrinsics and efficient library code to do endian conversions. We want
big endian format so that bytestring sort order sorts correctly in the tree.

bits.h declares the gcc intrinsics:\
`uint32_t __builtin_bswap32 (uint32_t x)`\
`uint64_t __builtin_bswap64 (uint64_t x)`

A 16 bit swap is just a bit-rotate.

intrin.h declares the equivalent functions:\
`uint16_t _byteswap_ushort(uint16_t value);`\
`uint32_t _byteswap_ulong(uint32_t value);`\
`uint64_t _byteswap_uint64(uint64_t value);`

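Where neither set of intrinsics is available, the 16 bit case can be written directly; a minimal sketch, with an illustrative name:

```cpp
// A 16 bit byte swap is a rotate by eight bits, as noted above.
#include <cstdint>

inline uint16_t byteswap16(uint16_t x) {
    return uint16_t((x << 8) | (x >> 8));   // rotate by eight = byte swap
}
```
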
If the representation is less than $0\text{x}\,8080$ then it does not represent
an integer value, and reading such data should result in an exception that
ends processing of the data, or in special case handling for non integers.

An unsigned integer in the range $2^{13}\le n \lt 2^{21}$ starts with the bits $10\,1$,
$10$ being unary for a one bit wide field, and $1$ being that one bit, indicating
three bytes. Thus it is represented by the integer itself $+0\text{x}\,a0\,0000$
in big endian format.

If the representation is less than $0\text{x}\,a0\,2000$ then it does not represent
an integer value, and reading such data should result in an exception that
ends processing of the data, or in special case handling for non integers.

An unsigned integer in the range $2^{21}\le n \lt 2^{27}$ starts with the bits $110\,00$,
$110$ being unary for a two bit wide field, and $00$ being those two bits, indicating
four bytes. Thus it is represented by the integer itself $+0\text{x}\,c000\,0000$
in big endian format.

If the representation is less than $0\text{x}\,c020\,0000$ then it does not represent
an integer value, and reading such data should result in an exception that
ends processing of the data, or in special case handling for non integers.

And similarly, five byte values start with $110\,01$, representing values in the range $2^{27}\le n \lt 2^{35}$; six byte values start with $110\,10$, representing values in the range $2^{35}\le n \lt 2^{43}$; and seven byte values start with $110\,11$, representing values in the range $2^{43}\le n \lt 2^{51}$.

Similarly, eight byte values start with $1110\,000$, $1110$ being unary for
a three bit wide field and $000$ being those three bits, representing values in the range $2^{51}\le n \lt 2^{57}$; nine byte values start with $1110\,001$, representing values in the range $2^{57}\le n \lt 2^{65}$; ten byte values start with $1110\,010$; and so on and so forth to $1110\,111$, representing a fifteen byte value in the range $2^{105}\le n \lt 2^{113}$.

And so on and so forth, except that for some time to come the reader implementation is going to throw an exception if it attempts to read a value larger than $2^{64}-1$.

Eventually we will allow 128 bit values, requiring a nine bit header, but not for quite some time.

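A minimal sketch of this encoding in C++, stopping at eight byte representations (so values up to $2^{57}-1$); the function and variable names are illustrative, not from any existing codebase:

```cpp
// Encode n per the scheme above: a unary count of field width, a zero,
// a field holding the byte count, then the payload, all big endian.
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> encode_unsigned(uint64_t n) {
    unsigned w = 0;                                    // w = floor(log2 k)
    for (unsigned k = 1; k <= 8; ++k) {                // k = total byte count
        while ((2u << w) <= k) ++w;
        unsigned header_bits  = 2 * w + 1;             // w ones, a zero, w field bits
        unsigned payload_bits = 8 * k - header_bits;   // 7, 13, 21, 27, 35, ...
        if (n < (uint64_t{1} << payload_bits)) {
            uint64_t header = (((uint64_t{1} << w) - 1) << (w + 1))  // unary part
                            | (k - (uint64_t{1} << w));              // field part
            uint64_t rep = (header << payload_bits) | n;
            std::vector<uint8_t> out(k);
            for (unsigned i = 0; i < k; ++i)           // emit big endian bytes
                out[i] = uint8_t(rep >> (8 * (k - 1 - i)));
            return out;
        }
    }
    throw std::overflow_error("nine byte and longer forms not sketched");
}
```

As a check against the rules above, `encode_unsigned(0x80)` yields the bytes `{0x80, 0x80}`, the smallest valid two byte representation $0\text{x}\,8080$.
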
### representation of signed integers as variable length quantities.

To represent signed integers, the header starts with a one bit for
positive quantities and a zero bit for negative quantities, to ensure that
bytestring order on the representation agrees with the order of the values
represented. Following that bit, we proceed as for unsigned integers, except
that for negative values the header bits are inverted, to get correct
sort order, so that more negative values sort before less negative values
and more positive values sort after less positive values.

Thus a single byte value representing positive signed integers in the
range $0\le n \lt 2^6$ starts with a leading one bit, representing the sign,
followed by a zero bit (unary for a zero width count field), and a single
byte value representing negative integers in the range $-2^6\le n \lt 0$
starts with a leading zero bit, representing the sign, followed by a one
bit, being inverted unary for a zero width count field.

Thus the representation for a signed integer $n$ in the range $-2^6\le n \lt 2^6$
is the one byte signed integer itself $\oplus 0\text{x}\,80$.

A two byte value, representing positive signed integers in the range
$2^6\le n \lt 2^{12}$, starts with the bits $1\,10\,0$, $10$ being unary
for a field one bit long, as for unsigned integers.

A two byte value, representing negative signed integers in the range
$-2^{12}\le n \lt -2^6$, starts with the bits $0\,01\,1$, as for unsigned
integers but inverted.

Thus the representation for a signed integer $n$ in the range $(-2^{12}\le n \lt -2^6)\,\lor\,(2^6\le n \lt 2^{12})$
is the two byte signed integer itself $\oplus 0\text{x}\,c000$
in big endian format.

With, as usual, "not an integer value" exceptions being thrown if the
representation falls in the central excluded range.

And similarly, the representation for a signed integer $n$ in the range
$(-2^{20}\le n \lt -2^{12})\,\lor\,(2^{12}\le n \lt 2^{20})$ is the three
byte signed integer itself $\oplus 0\text{x}\,d0\,0000$
in big endian format.

And so on and so forth for signed integers of unlimited size.

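A minimal sketch of these rules for one, two and three byte representations, using the XOR constants derived above; `encode_signed` is an illustrative name, and wider representations are omitted:

```cpp
// Each width XORs the two's complement value with a constant that flips
// the sign bit and sets the header bits (inverting them for negatives).
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> encode_signed(int32_t n) {
    if (n >= -(1 << 6) && n < (1 << 6))            // one byte: xor 0x80
        return { uint8_t((n & 0xff) ^ 0x80) };
    if (n >= -(1 << 12) && n < (1 << 12)) {        // two bytes: xor 0xc000
        uint16_t rep = uint16_t(n) ^ 0xc000;
        return { uint8_t(rep >> 8), uint8_t(rep) };
    }
    if (n >= -(1 << 20) && n < (1 << 20)) {        // three bytes: xor 0xd00000
        uint32_t rep = (uint32_t(n) & 0xffffff) ^ 0xd00000;
        return { uint8_t(rep >> 16), uint8_t(rep >> 8), uint8_t(rep) };
    }
    throw std::overflow_error("wider representations not sketched");
}
```

For example, `encode_signed(-1)` yields `{0x7f}` and `encode_signed(0)` yields `{0x80}`, so negatives sort immediately before non-negatives as bytestrings.
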
# bitstrings

It might be convenient to represent the data as a pile of edges,
rather than a pile of vertices, thus solving the problem that
the tree must always start with an edge, not a vertex.
This duplicates the start position of every edge,
but the duplication does not matter, because both the patricia
representation of an index and the standard and usual database
representation of an index compress out the leading duplication.

So we are representing an sql index by a table whose primary key is the
bitstring of the start position, and whose values are the
start position and the end position.
The patricia edges of this table live in the same table;
their values are merely distinguished from actual leaf values.

Variable length bitstrings are represented as variable
length bytestrings by appending a one bit followed by
zero to seven zero bits.

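A minimal sketch of that termination rule; the name `terminate_bitstring` is illustrative:

```cpp
// Append a one bit to the bitstring, then pad with zero bits to the next
// byte boundary, making the bitstring self terminating as a bytestring.
#include <cstdint>
#include <vector>

// bits are given most significant first; the count need not be a multiple of 8
std::vector<uint8_t> terminate_bitstring(const std::vector<bool>& bits) {
    std::vector<uint8_t> out;
    uint8_t cur = 0;
    unsigned used = 0;                  // bits filled in the current byte
    auto push = [&](bool b) {
        cur = uint8_t((cur << 1) | (b ? 1 : 0));
        if (++used == 8) { out.push_back(cur); cur = 0; used = 0; }
    };
    for (bool b : bits) push(b);
    push(true);                         // the terminating one bit
    if (used)                           // zero padding to the byte boundary
        out.push_back(uint8_t(cur << (8 - used)));
    return out;
}
```

The empty bitstring becomes the single byte `0x80` (a one bit then seven zero bits), and a bitstring of seven one bits becomes `0xFF` (no padding needed).
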
In the table we may compress the end values by discarding
all leading bytes except the overlap byte.

Thus the actual table, containing only the leaf values,
is a virtual table based on a select statement that
excludes the internal edges of the patricia tree from
the table of all edges, and concatenates the compressed
value with the index to form the absolute value.

It is very common for the end value to be very short.

We could save a byte (which is a premature optimization)
as follows:

If $S$ is the length of the bitfield in bits:

If $0\le S \lt 5$, it is represented by the variable
length integer obtained by prepending a set bit to the bitfield.

If $5\le S$, we represent the bit sequence as a byte
sequence prepended with the byte count plus 48
(leaving a gap of sixteen impermissible values for future expansion).

# Dewey decimal sequences.

The only operation we ever want to do with Dewey
decimal sequences is $<=>$, and they are always
positive numbers less than $10^{34}$, so we represent
them as a sequence of variable length positive
numbers terminated by a byte that corresponds
to the header of an impermissibly large number, the
byte `0xFF`, and compare them as bytestrings.

Albeit we could add, subtract, multiply, and divide
Dewey decimal sequences as polynomials,
which would require signed integer sequences;
but I cannot see any use case for this,
while unsigned integer sequences have the advantage
that the ones used to sort and identify things are always
positive in practice, and one may consider a UTF
string to be a very long Dewey decimal sequence.

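A minimal sketch of this, reusing the hypothetical `encode_unsigned` sketch from earlier; it assumes every component of the sequence fits the unsigned encoding:

```cpp
// Concatenate the variable length encodings of the components, then
// append 0xFF, the header of an impermissibly large number, as the
// terminator; the results are compared as plain bytestrings.
#include <cstdint>
#include <vector>

std::vector<uint8_t> encode_unsigned(uint64_t n);   // sketched earlier

std::vector<uint8_t> encode_dewey(const std::vector<uint64_t>& seq) {
    std::vector<uint8_t> out;
    for (uint64_t part : seq) {
        auto enc = encode_unsigned(part);
        out.insert(out.end(), enc.begin(), enc.end());
    }
    out.push_back(0xff);                // terminator byte
    return out;
}

// encode_dewey({1,2,3}) < encode_dewey({1,10}) under std::vector's
// lexicographic operator<, matching the intended <=> on the sequences.
```
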
## Use case

QR codes and prefix free number encoding are useful in cases where we want data to be self describing – this bunch of bits is to be interpreted in a certain way, used in a certain action, means one thing and not another thing. At present there is no standard for self description. QR codes are given meanings by the application, and could carry completely arbitrary data whose meaning and purpose comes from outside, from the context.

Ideally, a QR code should make a connection, and that connection should then launch an interactive environment – the url case, where the url downloads a javascript app to address a particular database entry on a particular host.

A fixed length field is always in danger of
running out, so one needs a committee to allocate numbers.
With an arbitrary length field there is always plenty of
headroom; we can just let people use what numbers seem good
to them, and if there is a collision, well, one or both of
the colliders can move to another number.

For example, the hash of a public key structure has to contain an algorithm
identifier for the hashing algorithm, to accommodate the possibility that
in future the existing algorithm becomes too weak and we must introduce
new algorithms while retaining compatibility with the old. There could
potentially be quite a lot of algorithms, though in practice initially there
will be only one, and it will be a long time before there are two.

When I say "arbitrarily large" I do not mean truly arbitrarily large, since that creates the possibility that someone could break something by sending a number bigger than the software can handle. There needs to be an absolute limit, such as sixty four bits, on representable numbers. But the limit should be larger than is ever likely to have a legitimate use.

# Other Solutions

## Zero byte encoding

Cap'n Proto compresses out zero bytes, and uses an encoding such that uninformative and predictable fields are zero.

## 62 bit compressed numbers

QUIC expresses a sixty two bit number as a field of one, two, four, or eight bytes, using the two high bits of the first byte to encode the field length. This is the fastest encoding to encode and decode.

## VLQ Leading bit as number boundary

It seems to me that the most efficient reasonably fast and elegant
solution is a variant on utf8 encoding, though not quite as fast as the
encoding used by QUIC:

Split the number into seven bit fields. To each of the leading fields a one
bit is prepended, making an eight bit byte. To the last field, a zero bit is
prepended. A sketch of both directions follows.

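A minimal sketch of this scheme, most significant group first, with an imposed limit as discussed in the next paragraph; the names are illustrative:

```cpp
// Seven bit groups, high bit set on every byte except the last, so the
// clear high bit marks the number boundary.
#include <cstdint>
#include <stdexcept>
#include <vector>

std::vector<uint8_t> vlq_encode(uint64_t n) {
    std::vector<uint8_t> out;
    int shift = 63 / 7 * 7;                        // start at the top group
    while (shift > 0 && ((n >> shift) & 0x7f) == 0) shift -= 7;
    for (; shift > 0; shift -= 7)                  // leading groups: high bit set
        out.push_back(uint8_t(0x80 | ((n >> shift) & 0x7f)));
    out.push_back(uint8_t(n & 0x7f));              // final group: high bit clear
    return out;
}

uint64_t vlq_decode(const uint8_t* p, size_t len) {
    uint64_t n = 0;
    for (size_t i = 0; i < len; ++i) {
        if (i == 10) throw std::overflow_error("limit exceeded"); // imposed limit
        n = (n << 7) | (p[i] & 0x7f);
        if (!(p[i] & 0x80)) return n;              // clear high bit ends the number
    }
    throw std::runtime_error("truncated number");
}
```

For example, 128 encodes as the two bytes `{0x81, 0x00}`.
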
This has the capability to represent very large values, which is potentially
dangerous. The implementation has to impose a limit, but the limit can
be very large, and can be increased without breaking compatibility, and
without all implementations needing to change their limit in the same
way at the same time.

The problem with this representation is that the sort order as a bitstring
differs from the sort order of the underlying integers, which is going to
result in problems if these are used to define paths in a Merkle-patricia dag.

## Prefix Free Number Encoding

In this class of solutions, numbers are embedded as variable sized groups of bits within a bitstream, in a way that makes it possible to find the boundary between one number and the next. It is used in data compression, but seldom used in compressed data transmission, because it is far too slow.

This class of problem is that of a
[universal code for integers](http://en.wikipedia.org/wiki/Universal_code_%28data_compression%29).

The particular coding I propose here is a variation on
Elias encoding, though I did not realize it when I
invented it.

On reflection, my proposed encoding is too clever by half;
better to use Elias δ coding, with large arbitrary
limits on the represented numbers, rather than
clever custom coding for each field. For the intended purpose of wrapping packets, of collecting UDP packets into messages, and messages into channels, limit the range of representable values to the range $0 \lt j \lt 2^{64}$, and pack all the fields representing the place of this UDP packet in a bunch of messages in a bunch of channels into a single bitstream header that is then rounded up to an integral number of bytes.

We have two bitstream headers, one of which always starts with the number 5 to identify the protocol (packets with unknown protocol numbers are immediately ignored), followed by another number that identifies the encryption stream and the position in the encryption stream (no windowing). Then we decrypt the rest of the packet, starting on a byte boundary. The decrypted packet then has additional bitstream headers.

For unsigned integers, we restrict the range to less than $2^{64}-9$. We then add 8 before encoding, and subtract 8 after decoding, so that our Elias δ encoded value always starts with two zero bits, which we always throw away. Thus the common values 0 to 7 inclusive are represented by a six bit value – I want to avoid wasting too much implied probability on the relatively low probability value of zero.

The restriction on the range is apt to produce unexpected errors, so I suppose we special case the additional 8 values, so that we can represent every sixty four bit integer.

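A minimal sketch of Elias δ encoding with that +8 bias, emitting '0'/'1' characters rather than packed bits for clarity; `elias_delta` is an illustrative name:

```cpp
// Elias δ: emit γ(bitlength(x)), then x without its leading one bit.
#include <cstdint>
#include <string>

std::string elias_delta(uint64_t x) {            // requires x >= 1
    unsigned len = 0;                            // bitlength of x
    for (uint64_t t = x; t; t >>= 1) ++len;
    unsigned lenlen = 0;                         // bitlength of len
    for (unsigned t = len; t; t >>= 1) ++lenlen;
    std::string out(lenlen - 1, '0');            // unary prefix of γ(len)
    for (unsigned i = lenlen; i-- > 0;)          // len in binary, leading 1 first
        out += ((len >> i) & 1) ? '1' : '0';
    for (unsigned i = len - 1; i-- > 0;)         // x without its leading 1 bit
        out += ((x >> i) & 1) ? '1' : '0';
    return out;
}

// Per the scheme above: encode n as elias_delta(n + 8) and drop the first
// two characters, which are always '0'; e.g. n = 0 gives
// elias_delta(8) = "00100000", transmitted as the six bits "100000".
```
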
For signed integers, we convert to an unsigned integer:\
`uint_fast64_t y = (x < 0) ? 2*(uint_fast64_t)(-(x+1)) + 1 : 2*(uint_fast64_t)x;`\
mapping 0, -1, 1, -2, 2, … to 0, 1, 2, 3, 4, …, and then represent the result as a positive integer. The decoding algorithm has to know whether to call the routine for signed or unsigned. By using unsigned maths where values must always be positive, we save a bit. Which is a lot of farting around to save on one bit.

We would like a way to represent an arbitrarily large
number, a Huffman style representation of the
numbers. This is not strictly Huffman encoding,
since we want to be able to efficiently encode and decode
large numbers without using a table, and we do not have
precise knowledge of what the probabilities of numbers are
likely to be, other than that small numbers are
substantially more probable than large numbers. In
the example above, we would like to be able to represent
numbers up to O(2^32^), but efficiently represent
the numbers one and two, and reasonably efficiently
represent the numbers three and four. So, to be
strictly correct, “prefix free number encoding”. As we
shall see at the end, prefix free number encoding always
corresponds to Huffman encoding for some reasonable weights
– but we are not worrying too much about weights, so are
not Huffman encoding.

### Converting to and from the representation

Assume X is a prefix free sequence of bit strings – that is to say, if we
are expecting a member of this sequence, we can tell where the member
ends.

Let \[m…n\] represent the sequence of integers m to n-1.

Then the function X→\[m…n\] is the function that converts a bit string of X
to the corresponding integer of \[m…n\], and similarly for \[m…n\]→X.

Thus X→\[m…n\] and \[m…n\]→X provide us with a prefix free representation of
numbers greater than or equal to m, and less than n.

Assume the sequence X has n elements, and we can generate and recognize
each element.

Let ℓ(X,k) be a new sequence, constructed by taking the first element of
X and appending to it the 2^k^ bit patterns of length k, then taking the
next element of X and appending to it the 2^k+1^ bit patterns of
length k+1, and so on and so forth.

ℓ is a function that gives us this new sequence from an existing sequence
and an integer.

The new sequence ℓ(X,k) will be a sequence of prefix free bit patterns
that has 2^n+k^ - 2^k^ elements.

We can proceed iteratively, and define a sequence ℓ(ℓ(X,j),k), which class
of sequences is useful and efficient for numbers that are typically quite
small, but could often be very large. We will more precisely
prescribe what sequences are useful and efficient for what purposes when
we relate our encoding to Huffman coding.

To generate the m+1[th]{.small} element of ℓ(X,k), where X is a
sequence that has n elements:

Let j = m + 2^k^

Let p = floor(log~2~(j)), that is to say, p is the position of
the high order bit of j: zero if j is one, one if j is two
or three, two if j is four, five, six, or seven, and so on and so forth.

We encode p into its representation using the encoding \[k…n+k\]→X, and
append to that the low order p bits of j.

To do the reverse operation, decoding from the prefix free representation to
the zero based sequence position, performing the function ℓ(X,k)→\[0…2^n+k^-2^k^\],
we extract p from the bit stream using the decoding of X→\[k…n+k\], then
extract the next p bits of the bit stream, and construct m as 2^p^-2^k^
plus the number represented by those bits.

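A minimal sketch of the encoding direction for X = ℒ(n) (the sequence defined just below), emitting '0'/'1' characters; `encode_position` is an illustrative name, and it can be checked against the worked \[0…6\] → ℓ(ℒ(2),1) example later in the text:

```cpp
// Encode a zero based position m as the corresponding element of
// ℓ(ℒ(n),k): the element p-k of ℒ(n), then the low order p bits of j.
#include <cstdint>
#include <stdexcept>
#include <string>

std::string encode_position(uint64_t m, unsigned n, unsigned k) {  // assumes n >= 2
    uint64_t j = m + (uint64_t{1} << k);
    unsigned p = 0;                                // p = floor(log2 j)
    while ((j >> (p + 1)) != 0) ++p;
    if (p >= n + k) throw std::overflow_error("m too large for this sequence");
    // [k…n+k]→ℒ(n): element p-k of ℒ(n) is p-k one bits then a zero bit,
    // except the last element, which is n-1 one bits with no zero bit.
    unsigned i = p - k;
    std::string out(i, '1');
    if (i < n - 1) out += '0';
    for (unsigned b = p; b-- > 0;)                 // low order p bits of j
        out += ((j >> b) & 1) ? '1' : '0';
    return out;
}

// encode_position(3, 2, 1) == "101", the fourth element of ℓ(ℒ(2),1).
```
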
Now all we need is an efficient sequence X for small numbers.

Let ℒ(n) be such a sequence with n values.\
The first bit pattern of ℒ(n) is 0\
The next bit pattern of ℒ(n) is 10\
The next bit pattern of ℒ(n) is 110\
The next bit pattern of ℒ(n) is 1110\
…\
The next to last bit pattern of ℒ(n) is 11…110, containing n-2 one bits
and one zero bit.\
The last bit pattern of ℒ(n) breaks the sequence, for it is 11…11,
containing n-1 one bits and no zero bit.

The reason why we break the sequence, not permitting the
representation of unboundedly large numbers, is that
computers cannot handle unboundedly large numbers – one
must always specify a bound, or else some attacker will
cause our code to crash, producing results that we did not
anticipate, that the attacker may well be able to make use
of.

Perhaps a better solution is to waste a bit, thereby
allowing future expansion. We use a representation
that can represent arbitrarily large numbers, but clients
and servers can put some arbitrary maximum on the size of
the number. If that maximum proves too low, future clients
can just expand it without breaking backward compatibility.
This is similar to the fact that different file systems
have different arbitrary maxima for the nesting of
directories, the length of paths, and the length of
directory names. Provided the maxima are generous
it does not matter that they are not the same.

Thus the numbers 1 to 2 are represented by \[1…3\] →
ℒ(2), 1 being the pattern “0”, and 2 being the
pattern “1”.

The numbers 0 to 5 are represented by \[0…6\] → ℒ(6), being the patterns\
“0”, “10”, “110”, “1110”, “11110”, “11111”

Thus \[0…6\] → ℒ(6)(3) is the bit pattern that represents the number
3, and it is “1110”.

This representation is only useful if we expect our numbers
to be quite small.

\[0…6\] → ℓ(ℒ(2),1) is the sequence “00”, “01”,
“100”, “101”, “110”, “111”, representing the
numbers zero to five, that is, the numbers 0 to
less than 2^2+1^ – 2^1^.

\[1…15\] → ℓ(ℒ(3),1) is similarly the sequence\
“00”, “01”,\
“1000”, “1001”, “1010”, “1011”,\
“11000”, “11001”, “11010”, “11011”, “11100”, “11101”, “11110”, “11111”,\
representing the numbers one to fourteen, that is, the
numbers 1 to less than 1 + 2^3+1^ – 2^1^.

We notice that ℓ(ℒ(n),k) has 2^n+k^ – 2^k^
patterns, the shortest patterns are of length 1+k, and the
largest patterns of length 2n+k-2.

This representation in general requires twice as many bits
to represent large numbers as the usual, non self
terminating representation does (assuming k to be small).

We can iterate this process again, to get the bit string sequence
ℓ(ℓ(ℒ(n),j),k), which sequence has $2^{2^{n+j}-2^j+k} - 2^k$
elements.

This representation is asymptotically efficient for very
large numbers, making further iterations pointless.

ℓ(ℒ(5),1) has 62 elements, starting with a two bit pattern, and ending
with a nine bit pattern. Thus ℓ(ℓ(ℒ(5),1),2) has
2^64^-4 elements, starting with a four bit pattern, and finishing
with a 72 bit pattern.

### prefix free encoding as Huffman coding

Now let us consider a Huffman representation of the
numbers when we assign the number `n` the
weight `1/(n*(n+1)) = 1/n – 1/(n+1)`

In this case the weight of the numbers in the range `n … m` is `1/n – 1/(m+1)`

So our bit patterns are:\
0 (representing 1)\
100 101 representing 2 to 3\
11000 11001 11010 11011 representing 4 to 7\
1110000 1110001 1110010 1110011 1110100 1110101 1110110 1110111 representing 8 to 15

(The numbers in the range $2^p\le n \lt 2^{p+1}$ have total weight $2^{-p-1}$,
so each of the $2^p$ of them gets probability about $2^{-2p-1}$, hence a codeword
of $2p+1$ bits, which is exactly what the patterns above provide.)

We see that the Huffman coding of the numbers
weighted as having probability `1/(n*(n+1))`
is our friend \[1…\] → ℓ(ℒ(n),0), where n is very large.

Thus this is good in a situation where we are quite unlikely to encounter
a big number. However a very common situation, perhaps the most
common situation, is that we are quite likely to encounter numbers smaller
than a given small amount, but also quite likely to encounter numbers
larger than a given huge amount – that the probability of encountering a
number in the range 0…5 is somewhat comparable to the probability of
encountering a number in the range 5000…50000000.

We want an encoding that corresponds to a Huffman encoding where numbers are logarithmically distributed up to some enormous limit, corresponding to an encoding where, for all n, n bit numbers are represented with an only slightly larger number of bits, n+O(log(n)) bits.

In such a case, we should represent such values by members of a
prefix free sequence `ℓ(ℓ(ℒ(n),j),k)`