wallet/docs/blockchain_structure_on_disk.md

4.4 KiB
Raw Blame History

title
Block chain structure on disk.

The question is: One enormous SQLite file, or actually store the chain as a collection of files?

In the minimum viable product, the blockchain will be quite small, and it will be workable to put it one big SQLite file. The trouble with one enormous SQLite file is that when it gets big enough, we face a high and steadily increasing risk of one sector on the enormous disk going bad, corrupting the entire database. SQLite does not handle the loss of a single sector gracefully.

We will eventually need our own database structure designed around Merkle-patricia trees, append only data structures, and accommodating a near certainty of sectors and entire disks continually going bad. When one hundred disks have to be added every year, entire disks will be failing every day or so, and sectors will be failing every second.

Eventually, a typical peer will have several big racks of disks. When we replace the world monetary system, twenty servers each with twenty disks, two hundred thousand transaction inputs and outputs a second, (for each transaction minimally involves one input and two outputs, a change output and a payment output, and usually a lot more. Each signature is sixty four bytes. Each input and output is at least forty bytes. So, say, on average two inputs and two outputs per payment say, perhaps 288 bytes per payment, and we will want to do one hundred thousand payments per second. So, about nine hundred terabytes a year. With 2020 disk technology, that is about seventy five twelve terabyte hard drives per year, costing about one hundred and fifty hard drives per year costing fifty five thousand dollars per year, to store all the transactions of the world forever.

If we are constructing one block per five minutes, each block is about ten gigabytes. Sqlite3 cannot possibly handle that the blocks are going to have to be dispersed over many drives and many physical computers. We are going to have to go to our own custom low level format, in which a block is distributed over many drives and many servers, the upper part of the block Merkle-patricia tree duplicated on every shard, but the the lower branches of the tree each in a separate shard. Instead of a file structure with many files on one enormous disk, we have one enormous data structure on servers, each server with many disks.

Optimal solution is to store recently accessed data in one big SQLite file, while also storing the data in a large collection of blocks, once it has become subject to wide consensus. Older blocks, fully incorporated in the current consensus, get written to disk in our own custom Merkle-patricia tree format, with append only Merkle-patricia tree node locations, a sequential append only collection of binary trees in postfix tree format.

Each file, incorporating a range of blocks, has its location on disk, time, size, and the roots of its Merkle-patricia trees recorded in the SQL database. On program launch, the size, touch time, and root has of newest block in the file are checked. If there is a discrepancy, we do a full check of the Merkle-patricia tree, editing it as necessary to an incomplete Merkle-patricia tree, download missing data from peers, and rebuild the blocks, thus winding up with a newer touch dates. Our per peer configuration file tells us where to find the block files, and if they are not stored where expected, we rebuild. If stored where expected, but touch dates unavailable or incorrect (perhaps because this is the first time the program launched) then the entire system of Merkle-patricia trees is validated, making sure the data on disk is consistent.

How do we tell the one true blockchain, from some other evil blockchain? Well, the running definition is consensus, that you can interact with other peers because they agree on the running root hash. So you downloaded this software from somewhere, and when you downloaded it, you got the means to contact a bunch of peers, whom we suppose agree, and each have evidence that other peers agree. And, having downloaded what they agree on, you then treat it as gospel and as more authoritative that what others say, so long a touch dates, file sizes, locations, and the hash of the most recent block in the file are consistent, and the internal contents of each file are consistent with root of the most recent tree.