---
title: Block chain structure on disk.
...

The question is: One enormous SQLite file, or actually store the chain as a collection of files?

In the minimum viable product, the blockchain will be quite small, and it will
be workable to put it in one big SQLite file.

The trouble with one enormous SQLite file is that when it gets big enough, we
face a high and steadily increasing risk of one sector on the enormous disk
going bad, corrupting the entire database. SQLite does not handle the loss of
a single sector gracefully.

We will eventually need our own database structure, designed around
Merkle-patricia trees and append-only data structures, and able to accommodate
a near certainty of sectors and entire disks continually going bad. When one
hundred disks have to be added every year, entire disks will be failing every
day or so, and sectors will be failing every second.

Eventually, a typical peer will have several big racks of disks. When we
replace the world monetary system, a typical peer might run twenty servers,
each with twenty disks, handling two hundred thousand transaction inputs and
outputs a second, for each transaction minimally involves one input and two
outputs, a change output and a payment output, and usually a lot more. Each
signature is sixty four bytes, and each input and output is at least forty
bytes. So, say, on average two inputs and two outputs per payment, perhaps
288 bytes per payment, and we will want to do one hundred thousand payments
per second: about nine hundred terabytes a year. With 2020 disk technology,
that is about seventy five twelve terabyte hard drives per year of raw
capacity, more like one hundred and fifty hard drives in practice, costing
fifty five thousand dollars per year, to store all the transactions of the
world forever.

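As a sanity check on those round numbers, here is a minimal back-of-the-envelope
calculation; the 288 bytes per payment and one hundred thousand payments per
second are the figures assumed above, everything else is plain arithmetic:

```cpp
#include <cstdio>

int main() {
    // Figures from the text above: 288 bytes per payment,
    // one hundred thousand payments per second.
    constexpr double bytes_per_payment   = 288.0;
    constexpr double payments_per_second = 100'000.0;
    constexpr double seconds_per_year    = 365.25 * 24 * 3600;

    constexpr double bytes_per_year     = bytes_per_payment * payments_per_second * seconds_per_year;
    constexpr double terabytes_per_year = bytes_per_year / 1e12;      // ~900 TB/year
    constexpr double drives_per_year    = terabytes_per_year / 12.0;  // twelve terabyte drives

    std::printf("%.0f TB/year, about %.0f twelve terabyte drives of raw capacity\n",
                terabytes_per_year, drives_per_year);
    return 0;
}
```
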
If we are constructing one block per five minutes, each block is about ten
gigabytes. SQLite3 cannot possibly handle that – the blocks are going to have
to be dispersed over many drives and many physical computers. We are going to
have to go to our own custom low-level format, in which a block is distributed
over many drives and many servers, the upper part of the block's
Merkle-patricia tree duplicated on every shard, but the lower branches of the
tree each in a separate shard. Instead of a file structure with many files on
one enormous disk, we have one enormous data structure spread over many
servers, each server with many disks.

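A rough sketch of what such a sharded block might look like in memory,
assuming illustrative type and field names rather than any actual on-disk
format:

```cpp
#include <array>
#include <cstdint>
#include <vector>

using Hash = std::array<std::uint8_t, 32>;

// The upper levels of a block's Merkle-patricia tree are small, so every
// shard (and every server holding a shard) carries a full copy of them.
struct UpperTree {
    Hash root;                 // root hash of the whole block
    std::vector<Hash> nodes;   // top few levels, serialized breadth-first
};

// Each lower branch of the tree lives in exactly one shard, so the block as a
// whole is spread over many drives and many servers.
struct Shard {
    UpperTree upper;                         // duplicated on every shard
    std::uint32_t branch_index;              // which lower branch this shard holds
    std::vector<std::uint8_t> branch_nodes;  // that branch, serialized
};

struct DistributedBlock {
    Hash root;                  // the consensus root for this block
    std::vector<Shard> shards;  // one entry per drive or server holding a piece
};
```
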
The optimal solution is to store recently accessed data in one big SQLite
file, while also storing the data in a large collection of blocks once it has
become subject to wide consensus. Older blocks, fully incorporated in the
current consensus, get written to disk in our own custom Merkle-patricia tree
format, with append-only Merkle-patricia tree node locations, [a sequential
append only collection of binary trees in postfix tree format](
merkle_patricia-dac.html#a-sequential-append-only-collection-of-postfix-binary-trees).

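On the SQLite side, a sketch of the catalog that could track those block
files; the table and column names here are illustrative assumptions, not a
settled schema:

```cpp
#include <sqlite3.h>
#include <cstdio>

// Hypothetical catalog table: one row per append-only block file, recording
// where it lives and the Merkle-patricia root it commits to.
static const char* kBlockFileCatalog =
    "CREATE TABLE IF NOT EXISTS block_files ("
    "  file_id     INTEGER PRIMARY KEY,"
    "  path        TEXT    NOT NULL,"   /* location on disk                 */
    "  first_block INTEGER NOT NULL,"   /* range of blocks in this file     */
    "  last_block  INTEGER NOT NULL,"
    "  file_size   INTEGER NOT NULL,"   /* size recorded at last write      */
    "  touch_time  INTEGER NOT NULL,"   /* touch time recorded at last write */
    "  root_hash   BLOB    NOT NULL"    /* root of the newest tree in file  */
    ");";

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("chain.sqlite", &db) != SQLITE_OK) return 1;

    char* err = nullptr;
    if (sqlite3_exec(db, kBlockFileCatalog, nullptr, nullptr, &err) != SQLITE_OK) {
        std::fprintf(stderr, "schema error: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}
```
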
Each file, incorporating a range of blocks, has its location on disk, touch
time, size, and the roots of its Merkle-patricia trees recorded in the SQL
database. On program launch, the size, touch time, and root hash of the newest
block in each file are checked. If there is a discrepancy, we do a full check
of the Merkle-patricia tree, reducing it where necessary to an incomplete
Merkle-patricia tree, downloading the missing data from peers, and rebuilding
the blocks, thus winding up with newer touch dates. Our per-peer configuration
file tells us where to find the block files, and if they are not stored where
expected, we rebuild. If they are stored where expected, but the touch dates
are unavailable or incorrect (perhaps because this is the first time the
program has launched), then the entire system of Merkle-patricia trees is
validated, making sure the data on disk is consistent.

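A sketch of that launch-time check, assuming a hypothetical catalog entry type
and leaving the expensive tree verification and peer download as named stubs:

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical row from the SQL catalog of block files.
struct CatalogEntry {
    std::string path;               // where the per-peer config says the file is
    std::uintmax_t size;            // size recorded at last write
    fs::file_time_type touch_time;  // touch time recorded at last write
};

// Assumed stubs for the expensive paths: a full walk of the file's
// Merkle-patricia tree, and a rebuild that fetches missing data from peers.
bool full_tree_check(const CatalogEntry&);
void rebuild_from_peers(const CatalogEntry&);

// Cheap launch-time check: the file exists where expected, with the recorded
// size and touch time.  (The root hash of the newest block in the file would
// also be recomputed and compared here.)
bool quick_check(const CatalogEntry& e) {
    std::error_code ec;
    return fs::exists(e.path, ec)
        && fs::file_size(e.path, ec) == e.size
        && fs::last_write_time(e.path, ec) == e.touch_time;
}

// If the cheap check fails, fall back to the full check, and if that finds
// holes, download the missing data and rebuild, which leaves the file with a
// newer touch date to be recorded back into the catalog.
void check_block_file(const CatalogEntry& e) {
    if (quick_check(e)) return;
    if (!full_tree_check(e)) rebuild_from_peers(e);
}
```
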
How do we tell the one true blockchain from some other evil blockchain? Well,
the running definition is consensus: you can interact with other peers because
they agree on the running root hash. So you downloaded this software from
somewhere, and when you downloaded it, you got the means to contact a bunch of
peers, who we suppose agree, and who each have evidence that other peers
agree. And, having downloaded what they agree on, you then treat it as gospel
and as more authoritative than what others say, so long as the touch dates,
file sizes, locations, and the hash of the most recent block in each file are
consistent, and the internal contents of each file are consistent with the
root of the most recent tree.