wallet/docs/blockchain_structure_on_disk.md

70 lines
4.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: Block chain structure on disk.
---
The question is: One enormous SQLite file, or actually store the chain as a collection of files?
In the minimum viable product, the blockchain will be quite small, and it will be workable to put it one big SQLite file.
The trouble with one enormous SQLite file is that when it gets big enough, we face a high and steadily increasing risk of one sector on the enormous disk going bad, corrupting the entire database. SQLite does not handle the loss of a single sector gracefully.
We will eventually need our own database structure designed around
Merkle-patricia trees, append only data structures, and accommodating a near
certainty of sectors and entire disks continually going bad. When one hundred
disks have to be added every year, entire disks will be failing every day or
so, and sectors will be failing every second.
Eventually, a typical peer will have several big racks of disks. When we
replace the world monetary system, twenty servers each with twenty disks, two
hundred thousand transaction inputs and outputs a second, (for each
transaction minimally involves one input and two outputs, a change output and
a payment output, and usually a lot more. Each signature is sixty four bytes.
Each input and output is at least forty bytes. So, say, on average two inputs
and two outputs per payment say, perhaps 288 bytes per payment, and we will
want to do one hundred thousand payments per second. So, about nine hundred
terabytes a year. With 2020 disk technology, that is about seventy five twelve
terabyte hard drives per year, costing about one hundred and fifty hard drives
per year costing fifty five thousand dollars per year, to store all the
transactions of the world forever.
If we are constructing one block per five minutes, each block is about ten
gigabytes. Sqlite3 cannot possibly handle that the blocks are going to have
to be dispersed over many drives and many physical computers. We are going to
have to go to our own custom low level format, in which a block is distributed
over many drives and many servers, the upper part of the block Merkle-patricia
tree duplicated on every shard, but the the lower branches of the tree each in
a separate shard. Instead of a file structure with many files on one enormous
disk, we have one enormous data structure on servers, each server with many
disks.
Optimal solution is to store recently accessed data in one big SQLite file,
while also storing the data in a large collection of blocks, once it has become
subject to wide consensus. Older blocks, fully incorporated in the current
consensus, get written to disk in our own custom Merkle-patricia tree format,
with append only Merkle-patricia tree node locations, [a sequential append only
collection of binary trees in postfix tree format](
merkle_patricia-dac.html#a-sequential-append-only-collection-of-postfix-binary-trees).
Each file, incorporating a
range of blocks, has its location on disk, time, size, and the roots of its
Merkle-patricia trees recorded in the SQL database. On program launch, the
size, touch time, and root has of newest block in the file are checked. If
there is a discrepancy, we do a full check of the Merkle-patricia tree, editing
it as necessary to an incomplete Merkle-patricia tree, download missing data
from peers, and rebuild the blocks, thus winding up with a newer touch dates.
Our per peer configuration file tells us where to find the block files, and if
they are not stored where expected, we rebuild. If stored where expected, but
touch dates unavailable or incorrect (perhaps because this is the first time the
program launched) then the entire system of Merkle-patricia trees is validated,
making sure the data on disk is consistent.
How do we tell the one true blockchain, from some other evil blockchain?
Well, the running definition is consensus, that you can interact with other
peers because they agree on the running root hash. So you downloaded this
software from somewhere, and when you downloaded it, you got the means to
contact a bunch of peers, whom we suppose agree, and each have evidence that
other peers agree. And, having downloaded what they agree on, you then treat
it as gospel and as more authoritative that what others say, so long a touch
dates, file sizes, locations, and the hash of the most recent block in the file
are consistent, and the internal contents of each file are consistent with root
of the most recent tree.