# Sharing the pool

Every distributed system needs a shared data pool, if only so that peers can find out who is around.

The peers each have connections to several others, and from time to time each notifies the other of data he does not know the other has. Trouble is, there is no canonical order for new data, so when forming a new connection, each has to notify the other of all his data. Which could be a great deal of data, most of which, though far from all of which, both of them already have.

A common way of dealing with this problem is Bloom filters and cuckoo filters. You send the other peer a filter summarizing your data, from which he can detect which items of his you definitely do not have.

A Bloom filter is usually described as using k hashes, though in fact it is quicker to use one hash, divide its bits into two halves to get two hashes, G and H, and then construct k hashes F_i=G+i*H for i=1,\dots,k.
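
A minimal sketch of that construction, assuming each item already has a 64 bit hash and the filter is addressed modulo m (names here are illustrative, not taken from the codebase):

```cpp
#include <cstdint>
#include <vector>

// Derive k Bloom filter bit positions from one 64 bit hash: split it into
// two 32 bit halves G and H, then compute F_i = G + i*H modulo the filter
// size m, for i = 1..k.
std::vector<uint64_t> bloom_indices(uint64_t hash64, unsigned k, uint64_t m) {
    uint64_t G = hash64 >> 32;         // high half
    uint64_t H = hash64 & 0xffffffff;  // low half
    std::vector<uint64_t> idx;
    idx.reserve(k);
    for (unsigned i = 1; i <= k; ++i)
        idx.push_back((G + i * H) % m);
    return idx;
}
```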

The recipient generates his own Bloom filter in the corresponding way, and any time he sets a bit that is not set in the filter sent to him, he knows he needs to send that item.

Each could send the other his filter. Or one could receive a filter and send back a run length compressed version containing just those bits corresponding to hashes he failed to generate, whereupon the original sender regenerates his filter, or consults the big ass hash table he kept from generating it, and sends the missing items.
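
A sketch of the recipient's side of that exchange, reusing the bloom_indices helper sketched above; item_hash is a stand-in for however the pool derives a 64 bit digest from an item:

```cpp
#include <cstdint>
#include <vector>

// Return the items the recipient has that the sender definitely lacks: any
// item that would set a bit the received filter does not have set cannot be
// in the sender's pool, so it must be sent.
template <typename Item, typename HashFn>
std::vector<Item> items_to_send(const std::vector<Item>& mine,
                                const std::vector<bool>& received_filter,
                                unsigned k, HashFn item_hash) {
    const uint64_t m = received_filter.size();
    std::vector<Item> missing;
    for (const Item& item : mine) {
        bool sender_may_have = true;
        for (uint64_t bit : bloom_indices(item_hash(item), k, m))
            if (!received_filter[bit]) { sender_may_have = false; break; }
        if (!sender_may_have) missing.push_back(item);
    }
    return missing;
}
```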

## Efficient filters

A Bloom filter is defined by k, the number of hashes; m, the number of bits; and n, the number of elements.

k\text{, the number of hashes,} \approx \bigg\lceil\frac{m\>\ln(2)}{n}\bigg\rceil

Sending a Bloom filter is equivalent to sending a list of hashes of size \log_2(n/k)+m, run length compressed down to only about three bits bigger than m, regardless of how big n gets.

Which is great, but if the number of hashes gets very large, large enough for \log_2(\text{the number of hashes}) to matter, you are likely in a situation where almost all hashes match, and thus carry no useful information for performing the merge. You are also likely in a situation where you want few or no false positives, so m has to be far larger than is required to efficiently merge the bulk of the data. The result is that you repeatedly send a huge amount of redundant and useless data in order to synchronize, though vastly less than if everyone repeatedly sent everyone their entire collection of data.

For a 1% false positive rate, you need only about ten bits for every item, regardless of how many items.
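
To spell out the arithmetic behind that figure, using the standard formulas for a Bloom filter with false positive rate p:

\frac{m}{n} = \frac{-\ln(p)}{(\ln 2)^2} = \frac{-\ln(0.01)}{(\ln 2)^2} \approx \frac{4.61}{0.48} \approx 9.6\text{ bits per item},\qquad k \approx \frac{m\,\ln(2)}{n} \approx 9.6 \times 0.69 \approx 7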

Now suppose you had m items, and you wanted a one percent false positive rate, and you sent a fragment of each hash itself: you would need \log_2(m)+7 bits per item. Which is bigger, but not hugely bigger.
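
One way to see where the extra seven bits come from: the fragment has to be long enough that a random probe collides with any of the m stored fragments only about one time in a hundred,

m \times 2^{-(\log_2(m)+7)} = 2^{-7} = \frac{1}{128} \approx 1\%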

Now suppose you sorted the reduced hashes in order, and sent the difference between each hash and the next, using an efficient encoding (the distribution of differences would be exponentially declining).

Suppose you want a one in sixty four false positive rate. Then you want the average difference to be around sixty four. So a difference in the range 0 to 63 would consist of a 0 bit followed by six bits representing a value from 0 to 63, a value in the range 64 to 127 would consist of a 1 bit followed by a zero bit followed by six bits, a value in the range 128 to 191 would consist of two 1 bits followed by a zero bit followed by six bits -- and what do you know, we are looking at about the same size as a bloom filter. We are sending the low order part of the difference in binary, and the high order part in unary.
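
A sketch of that encoding, assuming the truncated hashes are already sorted and the scale height has been rounded to sixty four, i.e. six binary bits per difference; the vector<bool> bit stream is purely illustrative:

```cpp
#include <cstdint>
#include <vector>

// Encode sorted, truncated hashes as gaps: the high part of each gap in
// unary (a run of 1 bits closed by a 0), then the low six bits in binary,
// matching the 0..63 / 64..127 / 128..191 scheme described above.
std::vector<bool> encode_gaps(const std::vector<uint64_t>& sorted_hashes) {
    std::vector<bool> out;
    uint64_t prev = 0;
    for (uint64_t h : sorted_hashes) {
        uint64_t gap = h - prev;
        prev = h;
        for (uint64_t q = gap >> 6; q > 0; --q) out.push_back(true);  // unary part
        out.push_back(false);                                         // terminating 0
        for (int i = 5; i >= 0; --i) out.push_back((gap >> i) & 1);   // six binary bits
    }
    return out;
}
```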

Generalizing, we use a binary representation that can represent a value close to the scale height, and a unary representation of the multiple of the range covered by the binary representation.

If the scale height is \bigcirc(1), it is a bit mask. If the scale height is \bigcirc(2), and we ignore zero differences (hash collisions) so that the minimal distance is 1, the binary part shrinks to a single bit and most of each difference is carried in unary.

Indeed, one can combine both methods for a sparse bloom filter that is sent with zero runs compressed.

## Compression of exponentially distributed values

You find the expected scale height, the amount that causes the probability of a difference to diminish by half, round it to the nearest power of two, and express quantities as a unary multiple of that amount plus a fixed width binary offset to that unary quantity. It might be convenient to pack the unary values and the fixed width values separately. If the scale height is very small, representing an item where runs are rare (or never happen by definition, because the item represented is a boundary), this is nothing but unary, a bit mask.
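
The matching decoder for the sketch above, generalized to b binary bits so that the scale height is 2^b; packing the unary and fixed width values into separate streams, as suggested, is omitted for brevity:

```cpp
#include <cstdint>
#include <vector>

// Decode the gap stream produced by encode_gaps back into sorted hash
// fragments; b is the width of the binary part.
std::vector<uint64_t> decode_gaps(const std::vector<bool>& bits, unsigned b) {
    std::vector<uint64_t> hashes;
    uint64_t value = 0;
    size_t pos = 0;
    while (pos < bits.size()) {
        uint64_t quotient = 0;
        while (bits[pos]) { ++quotient; ++pos; }    // unary run of 1 bits
        ++pos;                                      // skip the terminating 0
        uint64_t low = 0;
        for (unsigned i = 0; i < b; ++i)            // b binary bits
            low = (low << 1) | uint64_t(bits[pos++]);
        value += (quotient << b) | low;             // accumulate gaps into absolute values
        hashes.push_back(value);
    }
    return hashes;
}
```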

## Minimizing filter size

But if both sides have vast collections of identical or near identical transactions, as is highly likely because they probably just synchronized with the same people, each item in a filter is going to convey very little information. Further, you can never be sure that you are completely synchronized except by setting a lot of bits for each item.

## Merkle Patricia tree

So, you build a Merkle Patricia tree.

And then you want to transmit a filter that represents the upper portion of the tree where the likelihood of a discrepancy between Bob's tree and Carol's tree is around fifty percent. When you see a discrepancy, you go deeper into that part of the tree on the next sub round. A large part of the time, the discrepancy will be a single transaction. When you have isolated all the discrepancies, rinse and repeat. Eventually the root hashes will agree, so the snapshot that Bob's concurrent process took is now synchronized to Carol, and the snapshot that Carol's concurrent process took is now synchronized to Bob. But new transactions have probably arrived, so it is time to take the next snapshot.

You discover how deep that is by initially sending the full filter of vertex and leaf hashes for just a portion of the address space covered by the tree. From what shows up, you will be roughly right about the filter depth in the next round.
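
A much simplified sketch of the descent within one round; Node, Peer, and child_hashes are hypothetical stand-ins, and the batching of a whole tree layer into a single filter, described above, is left out:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical Merkle Patricia vertex: a short Murmur-style hash plus child
// pointers; a leaf stands for a single transaction.
struct Node {
    uint64_t hash;
    bool is_leaf;
    std::vector<const Node*> children;
};

// Hypothetical view of the peer: ask for the child hashes of the vertex at a
// given position in the tree.
struct Peer {
    virtual std::vector<uint64_t> child_hashes(const std::vector<int>& path) = 0;
    virtual ~Peer() = default;
};

// Starting from a vertex already known to disagree, descend into every
// subtree whose hash differs from the peer's, and record the positions of
// the leaves reached, often a single transaction the peer lacks.
void find_discrepancies(const Node* node, std::vector<int>& path, Peer& peer,
                        std::vector<std::vector<int>>& discrepancies) {
    if (node->is_leaf) { discrepancies.push_back(path); return; }
    std::vector<uint64_t> theirs = peer.child_hashes(path);
    for (size_t i = 0; i < node->children.size(); ++i) {
        if (i < theirs.size() && theirs[i] == node->children[i]->hash) continue;
        path.push_back(int(i));
        find_discrepancies(node->children[i], path, peer, discrepancies);
        path.pop_back();
    }
}
```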

You do want to use a cryptographically strong hash for the identifier of each transaction, because that is global public information, and we do not want people to be able to cook up transactions that will force hash collisions, because that would enable them to engage in Byzantine defection. But you want to use Murmur for the vertices of the tree that represents the transactions that Bob does not yet know whether Carol already has, since that is bilateral information maintained by the concurrent process that is managing Bob's connection with Carol, so Byzantine defection is impossible. When Bob's concurrent process managing the connection with Carol whips up a Merkle Patricia tree, it should use Murmur3, because there will be a lot of such processes generating a lot of Merkle Patricia trees, but only one cryptographic hash representing each transaction. Lots of such trees are generated, and lots discarded.

The official release of Murmur3 is in the SMHasher test suite, and is obsolete now that C++20 defines system, machine, and compiler independent access to bit operations and to endianness.

Since we are hashing strong hashes, probably even Murmur3 is overkill.

Instead, if we want to hash two 128 bit hashes into one 128 bit hash:

Write the two 128 bit hashes in terms of four 64 bit values, as U_0 2^{64} + U_1 and V_0 2^{64}+V_1; then the resulting 128 bit hash is the two 64 bit values:

(U_0g^3+U_1g^2+V_0g+V_1)\%2^{64}
(U_1g^3+V_0g^2+V_1g+U_0)\%2^{64}

where g=11400714819323198485, the odd number nearest to 2^{64} divided by the golden ratio.

Which would be a disastrously weak hash if our starting values were highly ordered, but is likely to suffice because our starting values are strongly random. Needless to say, it has absolutely no resistance to cryptographic attack, but such an attack is pointless, because our starting values are cryptographically strong, our resulting values don't involve any public commitments and we intend to reveal the preimage in due course.
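
A sketch of that combination, relying on unsigned 64 bit arithmetic already being modulo 2^{64}; the struct name is just for illustration:

```cpp
#include <cstdint>

// g = 2^64 divided by the golden ratio, rounded to the nearest odd number.
constexpr uint64_t g = 11400714819323198485ull;

struct Hash128 { uint64_t hi, lo; };  // hi*2^64 + lo

// Computes (U_0 g^3 + U_1 g^2 + V_0 g + V_1) mod 2^64 and
// (U_1 g^3 + V_0 g^2 + V_1 g + U_0) mod 2^64, as above.
Hash128 combine(Hash128 u, Hash128 v) {
    uint64_t g2 = g * g, g3 = g2 * g;  // powers of g, already reduced mod 2^64
    return { u.hi * g3 + u.lo * g2 + v.hi * g + v.lo,
             u.lo * g3 + v.hi * g2 + v.lo * g + u.hi };
}
```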

Come to think of it, we can get away with 64 bit hashes, provided we subsample the underlying cryptographically strong 256 bit hashes differently each time, since we do not need to get absolutely perfect synchronization in any one synchronization event. We can live with the occasional rare Merkle patricia tree that gives the same hash for two different sets of transactions. The error will be cleaned up in the next synchronization event.

Thus the hash of two 64 bit hashes, U and V, is (Ug+V)\%2^{64}.
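
Which, under the same conventions as the sketch above, is a one liner:

```cpp
#include <cstdint>

// Hash of two 64 bit child hashes U and V: (U*g + V) mod 2^64.
inline uint64_t combine64(uint64_t u, uint64_t v) {
    return u * 11400714819323198485ull + v;
}
```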

But when we synchronize to the total canonical order, we do need 256 bit cryptographically strong hashes, since concocting two sets of transactions that have the same hash could be used for Byzantine defection. But we only have to construct that tree once.