---
title: Block chain structure on disk.
---

The question is: one enormous SQLite file, or should we store the chain as a collection of files?

In the minimum viable product, the blockchain will be quite small, and it will be workable to put it in one big SQLite file.

The trouble with one enormous SQLite file is that when it gets big enough, we face a high and steadily increasing risk of one sector on the enormous disk going bad, corrupting the entire database. SQLite does not handle the loss of a single sector gracefully.

We will eventually need our own database structure, designed around Merkle-patricia trees and append only data structures, and accommodating a near certainty of sectors and entire disks continually going bad. When one hundred disks have to be added every year, entire disks will be failing every day or so, and sectors will be failing every second.

Eventually, a typical peer will have several big racks of disks. When we replace the world monetary system, that means something like twenty servers, each with twenty disks, handling two hundred thousand transaction inputs and outputs a second, for each transaction minimally involves one input and two outputs, a change output and a payment output, and usually a lot more. Each signature is sixty four bytes. Each input and output is at least forty bytes. So, say, on average two inputs and two outputs per payment: perhaps 288 bytes per payment, and we will want to do one hundred thousand payments per second. That is about nine hundred terabytes a year. With 2020 disk technology, that is about seventy five twelve terabyte hard drives per year of raw capacity, perhaps one hundred and fifty hard drives per year allowing for redundancy, costing about fifty five thousand dollars per year, to store all the transactions of the world forever.

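The arithmetic is easy to check. A minimal sketch in C++, using the byte sizes and rates assumed above; the twelve terabyte drive price is an assumption of rough 2020 pricing, not a figure stated here:

```cpp
// Back of the envelope arithmetic for the storage figures above.
#include <cstdio>

int main() {
    // Sizes assumed in the text.
    const double signature_bytes = 64;   // one signature per input
    const double in_out_bytes    = 40;   // minimum size of an input or output
    const double inputs = 2, outputs = 2;

    const double bytes_per_payment =
        inputs * signature_bytes + (inputs + outputs) * in_out_bytes;  // 288

    const double payments_per_second = 100'000;
    const double bytes_per_second = bytes_per_payment * payments_per_second;
    const double seconds_per_year = 365.25 * 24 * 3600;

    std::printf("bytes per payment:   %.0f\n", bytes_per_payment);
    std::printf("gigabytes per block: %.1f\n", bytes_per_second * 300 / 1e9);  // five minute block
    std::printf("terabytes per year:  %.0f\n", bytes_per_second * seconds_per_year / 1e12);

    // Assumed 2020-ish figure, not from the text: roughly $367 per twelve terabyte drive.
    const double drive_terabytes = 12, drive_dollars = 367;
    const double drives_per_year =
        bytes_per_second * seconds_per_year / 1e12 / drive_terabytes;
    std::printf("drives per year:     %.0f raw, %.0f duplicated, about $%.0f\n",
                drives_per_year, 2 * drives_per_year, 2 * drives_per_year * drive_dollars);
}
```
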
If we are constructing one block per five minutes, each block is about ten gigabytes. SQLite cannot possibly handle that – the blocks are going to have to be dispersed over many drives and many physical computers. We are going to have to go to our own custom low level format, in which a block is distributed over many drives and many servers, the upper part of the block's Merkle-patricia tree duplicated on every shard, but the lower branches of the tree each in a separate shard. Instead of a file structure with many files on one enormous disk, we have one enormous data structure on servers, each server with many disks.

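A minimal sketch of how that split might look, assuming shards are chosen by the leading bits of the Merkle-patricia key; the four bit selector and the struct layout are illustrative, not a settled format:

```cpp
// Illustrative shard placement for a block's Merkle-patricia tree.
// Nodes near the root (short key prefixes) are replicated on every shard;
// each deeper subtree lives on exactly one shard, chosen by key prefix.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kShardBits  = 4;                         // 16 shards, illustrative
constexpr std::size_t kShardCount = std::size_t(1) << kShardBits;

struct NodeId {
    std::array<std::uint8_t, 32> key_prefix;  // prefix of the Merkle-patricia key
    std::size_t prefix_bits;                  // how many bits of the prefix are significant
};

// Upper nodes, whose prefix is shorter than the shard selector,
// are duplicated on every shard.
bool replicated_everywhere(const NodeId& n) {
    return n.prefix_bits < kShardBits;
}

// Lower nodes go to the shard named by their first kShardBits bits.
std::size_t shard_of(const NodeId& n) {
    return n.key_prefix[0] >> (8 - kShardBits);
}
```
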
The optimal solution is to store recently accessed data in one big SQLite file, while also storing the data in a large collection of blocks once it has become subject to wide consensus. Older blocks, fully incorporated in the current consensus, get written to disk in our own custom Merkle-patricia tree format, with append only Merkle-patricia tree node locations, [a sequential append only collection of binary trees in postfix tree format](merkle_patricia-dac.html#a-sequential-append-only-collection-of-postfix-binary-trees).

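Because that format writes nodes in postfix order, a parent's children are already on disk when the parent is appended. A minimal sketch of a node record under that assumption; the field names and widths are illustrative, not the settled layout:

```cpp
// Illustrative on-disk record for an append only Merkle-patricia node.
// The file is append only and written in postfix order, so a parent's
// children sit at smaller offsets and can be referenced by backward
// offsets that never change once written.
#include <array>
#include <cstdint>

struct NodeRecord {
    std::array<std::uint8_t, 32> hash;  // hash of this node, over its children and key data
    std::uint64_t left_back_offset;     // bytes back to the left child, 0 if leaf
    std::uint64_t right_back_offset;    // bytes back to the right child, 0 if leaf
    std::uint16_t key_bits;             // length of the key fragment held by this node
    // key fragment and, for leaves, the transaction data follow the fixed header
};
```
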
Each file, incorporating a range of blocks, has its location on disk, time, size, and the roots of its Merkle-patricia trees recorded in the SQL database. On program launch, the size, touch time, and root hash of the newest block in each file are checked. If there is a discrepancy, we do a full check of the Merkle-patricia tree, editing it as necessary down to an incomplete Merkle-patricia tree, download missing data from peers, and rebuild the blocks, thus winding up with newer touch dates. Our per peer configuration file tells us where to find the block files, and if they are not stored where expected, we rebuild. If they are stored where expected, but touch dates are unavailable or incorrect (perhaps because this is the first time the program has launched), then the entire system of Merkle-patricia trees is validated, making sure the data on disk is consistent.

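A minimal sketch of that bookkeeping and the launch-time quick check, assuming an illustrative SQLite table and column names; a failed quick check is what triggers the full Merkle-patricia validation and rebuild:

```cpp
// Illustrative SQLite bookkeeping for the block files, and the launch check:
// compare each file's recorded size and touch time against the filesystem,
// falling back to a full Merkle-patricia validation on any discrepancy.
#include <cstdint>
#include <filesystem>
#include <string>
#include <system_error>

const char* const kBlockFileTable = R"(
CREATE TABLE IF NOT EXISTS block_files (
    path        TEXT PRIMARY KEY,  -- where the block file lives on disk
    first_block INTEGER,           -- range of blocks the file incorporates
    last_block  INTEGER,
    size        INTEGER,           -- expected file size in bytes
    touch_time  INTEGER,           -- expected modification time
    root_hash   BLOB               -- root of the newest Merkle-patricia tree in the file
);
)";

struct BlockFileRow {
    std::string path;
    std::uintmax_t size;
    std::filesystem::file_time_type touch_time;
};

// True if the quick check passes; otherwise the caller must walk the
// Merkle-patricia trees, fetch missing data from peers, and rebuild.
bool quick_check(const BlockFileRow& row) {
    std::error_code ec;
    if (!std::filesystem::exists(row.path, ec) || ec) return false;
    if (std::filesystem::file_size(row.path, ec) != row.size || ec) return false;
    if (std::filesystem::last_write_time(row.path, ec) != row.touch_time || ec) return false;
    return true;  // the root hash of the newest block still needs to be verified
}
```
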
How do we tell the one true blockchain from some other evil blockchain? Well, the running definition is consensus: you can interact with other peers because they agree on the running root hash. So you downloaded this software from somewhere, and when you downloaded it, you got the means to contact a bunch of peers, whom we suppose agree, and who each have evidence that other peers agree. And, having downloaded what they agree on, you then treat it as gospel and as more authoritative than what others say, so long as touch dates, file sizes, locations, and the hash of the most recent block in each file are consistent, and the internal contents of each file are consistent with the root of the most recent tree.