wallet/docs/paxos_protocol.md

363 lines
19 KiB
Markdown
Raw Normal View History

---
title: Paxos
---
Paxos addresses the arrow theorem, and the difficulty of having a reliable
broadcast channel.
You want a fifty percent vote, so you dont want anyone voting for two
candidates, at leas not until one vote has been invalidated by timeout,
and you want to somehow have a single arbitrarily selected agenda
to vote up or down.
And, having been voted up, you dont want anything else to be voted up,
so that you can definitely know when an agenda has been selected.
But Paxos assumes that many of these problems such as who is eligible to vote
and what their vote is worth, have solutions that have been somewhat
arbitrarily predetermined by the engineer setting things up, and that we dont
have the problem of the Roman Senate and popular assembly, where only about a
third of the Senate actually showed up to vote, and an insignificant number of
those eligible to vote in popular assembly showed up, most of them clients of
senators, so Paxos is not worried about people gaming the system to exclude
voters they do not want, nor worried about people gaming the somewhat
arbitrary preselection of the agenda to be voted up and down.
# Analysing [Paxos Made Simple] in terms of Arrow and Reliable Broadcast
[Paxos Made Simple]:./paxos-simple.pdf
The trouble with Lamports proposal, described in [Paxos Made Simple] is that
it assumes no byzantine failure, and that therefore the reliable broadcast
channel is trivial, and it assumes that any proposal will be acceptable, that
all anyone cares about is converging on one proposal, therefore it always
converges on the first proposal accepted by one acceptor.
> 1. A proposer chooses a new proposal number n and sends a `prepare n`
> request to each member of some set of acceptors, asking it to respond
> with:
> a. A promise never again to accept a proposal numbered less than
> `n`, and
> b. The proposal with the highest number less than `n` that it has
> accepted, if any.
> 2. If the proposer receives the requested responses from a majority of
> the acceptors, then it can issue a `accept n v` request with number
> `n` and value `v`, where `v` is the value of the highest-numbered
> proposal among the responses, or is any value selected by the proposer
> if the responders reported no proposals accepted.
So the proposer either signs on with the existing possible consensus `v`, and
notifies a bunch of acceptors with the consensus, or initiates new possible
consensus `v`
The assumption that `n` can be arbitrary seems to assume the proposers are
all agents of the same human, so do not care which proposal is accepted. But
we intend that they be agents of different humans. But let us figure out
how everything fits together before critiquing that.
> if an acceptor ignores a `prepare` or `accept` request because it has already
> received a prepare request with a higher number, then it should probably
> inform the proposer, who should then abandon its proposal. This is a
> performance optimization that does not affect correctness.
If a majority of acceptors accept some proposal, then we have a result. But
we do not yet have everyone, or indeed anyone, knowing the result.
Whereupon we have the reliable broadcast channel problem, which Lamport hand
waves away. The learners are going to learn it. Somehow. And once we have
a result accepted, we then happily go on to the next round.
Well, suppose the leaders proposal is just intolerable? Lamport assumes a
level of concord that is unlikely to exist.
Well screw them, they have to propose the same value already accepted by an
acceptor. So we are going to get a definite result. Worst case outcome, is
that proposers keep issuing new higher numbered proposals before any proposal
is accepted.
Lamport is assuming no byzantine failure but, assuming no byzantine failure,
this is going to generate a definite result sooner or later.
But because we could have overlapping proposals with no acceptances, Lamport
concludes, we need a “distinguished proposer”, a leader,
the primus inter pares, the Chief executive officer.
As an efficiency requirement, not a requirement to reach consensus.
Trouble is that Lamports Paxos algorithm is that as soon as it becomes known
that one acceptor has accepted one proposal, everyone should converge to it.
But suppose everyone becomes aware of the proposal, and 49% of acceptors
think it is great, and 51% of acceptors think it sucks intolerably?
If a leaders proposal could be widely heard, and widely rejected, we have a
case not addressed by Lamports Paxos protocol.
Lamport does not appear to think about the case that the leaders proposals
are rejected because they are just objectionable.
# Analysing [Practical Byzantine Fault Tolerance]
[Practical Byzantine Fault Tolerance]:./byzantine_paxos.pdf
[Practical Byzantine Fault Tolerance] differs from [Paxos Made Simple] in
having “views” where the change of leader (what they call “the primary”) is
accomplished by a change of “view”, and in having three phases, pre-prepare,
prepare, and accept, instead of two phases, prepare and accept.
Pre-prepare is the “primary” (leader, CEO, primus inter pares) notifying the
“replicas” (peers) of the total order of a client message.
Prepare is the “replicas” (peers, reliable broadcast channel) notifying each
other of association between total order and message digest.
Accept is the “replicas” and the client learning that $33\%+1$ of the
“replicas” (peers, reliable broadcast channel) agree on the total order of the
clients message.
# Analysing Raft Protocol
The [raft protocol] is inherently insecure against Byzantine attacks, because
the leader is fully responsible for managing log replication on the other
servers of the cluster
[raft protocol]: https://ipads.se.sjtu.edu.cn/_media/publications/wang_podc19.pdf
We obviously want the pool to be replicated peer to peer, with the primus
inter pares (leader, what [Practical Byzantine Fault Tolerance] call
“the primary”) organizing the peers to vote for one block after they have
already come close to agreement, and the only differences are transactions
not yet widely circulated, or disagreements over which of two conflicting
spends is to be incorporated in the next block.
I am pretty sure that this mapping of byzantine Paxos to blockchain is
garbled, confused, and incorrect. I am missing something, misunderstanding
something, there are a bunch of phases that matter which I am leaving out,
unaware I have left them out. I will have to revisit this.
The Paxos protocol should be understood as a system wherein peers agree on a
total ordering of transactions. Each transaction happens within a block, and
each block has a sequential integer identifier. Each transaction within a
valid block must be non conflicting with every other transaction within a
block and consistent with all past transactions, so that although the the
block defines a total order on every transaction within the block, all
transactions can be applied in parallel.
The problem is that we need authoritative agreement on what transactions are
part of block N.
Proposed transactions flood fill through the peers. A single distinguished
entity must propose a block, the pre-prepare message, notice of this
proposal, the root hash of the proposal, flood fills through the peers, and
peers notify each other of this proposal and that they are attempting to
synchronize on it. Synchronizing on the block and validating it are likely to
require huge amounts of bandwidth and processing power, and will take
significant time.
If a peer successfully synchronize, he issues a prepare message. If something
is wrong with the block, he issues a nack, a vote against the proposed
block, but the nack is informational only. It signals that peers should get
ready for a view change, but it is an optimization only.
If a peer receives a voting majority of prepare messages, he issues a commit
message.
And that is the Paxos protocol for that block of transactions. We then go
into the Paxos protocol for the block of prepare messages that proves a
majority voted “prepare”. The block of prepare messages chains to the block
of transactions, and the block of transactions chains to the previous block
of prepare messages.
And if time goes by, and we have not managed a commit, perhaps because there
are lot of nacks due to bad transactions, perhaps because the primus inter
pares claims to have transactions that not everyone has, and then is unable
to provide them, (maybe the internet went down for it) peers become open to
another pre-prepare message from the next in line to be primus inter pares.
In order that we can flood fill, we have to be able to simultaneously
synchronize on several different views of the pool of transactions. If
synchronization on a proposed block is stalling, perhaps because of missing
transactions, we end up synchronizing on multiple proposed blocks proposed by
multiple entities. Which is great if one runs to completion, and the others
do not, but we are likely to run into multiple proposed blocks succeeding. In
which case, we have what [Castro and Liskov](./byzantine_paxos.pdf)call a
view change.
If things are not going anywhere, a peer issues a view change message which
flood fills around the pool, nominating the next peer in line for primus
inter pares, or the peer nominated in existing pool of view change messages.
When it has a majority for a view change in its pool, it will then pay
attention to a pre-prepare message from the new primus inter pares, (which it
may well have already received) and we the new primus inter pares restarts
the protocol for deciding on the next block, issuing a new pre-prepare
message (which is likely one that many of the peers have already synchronized
on).
A peer has a pre-pare message. He has a recent vote that the entity that
issued the pre-pare message is primus inter pares. He has, or synchronizes
on, that block whose root hash is the one specified in that pre-prepare
message issued by that primus-inter-pares. When he has the block, and it is
valid, then away we go. He flood fills a prepare message, which prepare
message chains to the block, and to the latest vote for primus inter pares.
Each new state of the blockchain has a final root hash at its peak, and each
peer that accepts the new state of that blockchain has a pile of commits that
add up to a majority committing to this new peak. But it is probable that
different peers will have different piles of commits. Whenever an entity
wants to change the blockchain state (issues a pre-prepare message), it will
propose an addition to the blockchain that contains one particular pile of
commits validating the previous state of the blockchain. Thus each block
contains one specific view of the consensus validating the previous root. But
it does not contain the consensus validating itself. That is in the pool,
not yet in the blockchain.
A change of primus inter pares is itself a change in blockchain state, so
when the new primus inter pares issues a new proposed commit, which is to say
a new pre-prepare message, it is going to have, in addition to the pile of
commits that they may have already synchronized on, a pile of what [Castro
and Liskov](./byzantine_paxos.pdf)call view change messages, one specific
view of that pile of view change messages. A valid block is going to have a
valid pre-prepare message from a valid primus inter pares, and, if necessary,
the vote for that primus inter pares. But the change in primus inter pares,
for each peer, took place when that peer had a majority of votes for the
primus inter pares, and the commit to the new block took place when that peer
had the majority of votes for the block. For each peer, the consensus
depends on the votes in his pool, not the votes that subsequently get
recorded in the block chain.
So, a peer will not normally accept a proposed transaction from a client if
it already has a conflicting transaction. There is nothing in the protocol
enforcing this, but if the double spend is coming from Joe Random Scammer,
that is just extra work for that peer and all the other peers.
But once a peer has a accepted a double spend transaction, finding consistency
is likely to be easier in the final blockchain, where that transaction simply
has no outputs. Otherwise for the sake of premature optimization, we
complicate the algorithm for reaching consensus.
# Fast Byzantine multi paxos
# Generating the next block
But in the end, we have to generate the next block, which includes some transactions and not others
In the bitcoin blockchain, the transactions flood fill through all miners,
one miner randomly wins the right to decide on the block, forms the block,
and the block floodfills through the blockchain.
In our chain, the primus inter pares proposes a block, peers synchronize on
it, which should be fast because they have already synchronized with each
other, and if there is nothing terribly wrong with it, send in their
signatures. When the primus inter pares has enough signatures, he then sends
out the signature block, containing all the signatures.
If we have a primus inter pares, and his proposal is acceptable, then things
proceed straight forwardly through the same synchronization process as we use
in flood fill, except that we are focusing on flood filling older
information to make sure everyone is in agreement..
After a block is agreed upon, the peers focus on flood filling all the
transactions that they possess around. This corresponds to the client phase
of Fast Byzantine Collapsed MultiPaxos. When the time for the next block
arrives, they stop floodfilling, apart from floodfilling any old transactions
that they received before the previous block was agreed, but which were not
included in the previous block, and focus on achieving agreement on those old
transactions, plus a subset of their new transactions. They try to achieve
agreement by postponing new transactions, and flooding old transactions around.
When a peer is in the Paxos phase, a synchronization event corresponds to
what [Castro and Liskov 4.2](./byzantine_paxos.pdf)call a group commit.
Synchronization events with the primary inter pares result in what [Castro
and Liskov](./byzantine_paxos.pdf) call a pre-prepare multicast, though if
the primus inter pares times out, or its proposal is rejected, we then go
into what [Castro and Liskov 4.4](./byzantine_paxos.pdf) call view changes.
In their proposal, there is a designated sequence, and if one primus inter
pares fails, then you go to the next, and the next, thus reducing the
stalling problem when two entities are trying to be primus inter pares.
They assume an arbitrary sequence number, pre-assigned. We will instead go
through the leading signatories of the previous block, with a succession of
timeouts. “Hey, the previous primus inter pares has not responded. Let\s
hear it from the leading signatory of the previous block. Hey. No response
from him either. Let us try the second leading signatory”. Trying for a
different consensus nominator corresponds to what Castro and Liskov call
“view change”
The primus inter pares, Castro and Liskov\s primary, Castro and Liskov\s
view, issues a proposed root hash for the next block (a short message).
Everyone chats to everyone else, announcing that they are attempting to
synchronize on that root hash, also short messages, preparatory to long
messages as they attempt to synchronize. If they succeed in generating the
proposed block, and there is nothing wrong with the block, they send
approvals (short messages) to everyone else. At some point the primus inter
pares wraps up a critical mass of approvals in an approval block (a
potentially long message). When everyone has a copy of the proposed block,
and the approval block, then the block chain has added another immutable
block.
Castro and Liskovs pre prepare message is the primary telling everyone
“Hey, let us try for the next block having this root hash”
Castro and Liskovs prepare message is each peer telling all the other peers
“Trying to synchronize on this announcement”, confirming that the primus
inter pares is active, and assumed to be sane.
Castro and Liskovs commit message is each peer telling all the other peers
“I see a voting majority of peers trying to synchronize on this root hash”.
At this point Castro and Liskovs protocol is complete but of course
there is no guarantee that we will be able to synchronize on this root hash -
it might contain invalid transactions, it might reference transactions that
get lost because peers go down or internet communications are interrupted. So
our protocol is not complete.
The peers have not committed to the block. They have committed to commit to
the block if they have it and it is valid.
So, we have a Paxos process where they agree to try for a block of
transactions with one specific root hash. Then everyone tries to sync on the
block. Then we have a second Paxos process where they agree that they have a
block of signatories agreeing that they have the block of transactions and it
is valid.
The block of transactions Merkle chains to previous blocks of signatories and
transactions, and the block of signatories Merkle chains to previous blocks
of transactions and signatories. Each short message of commitment contains a
short proof that it chains to previous commitments, which a peer already has
and has already committed to.
When a peer has the block of transactions that a voting majority of peers
have agreed to commit to, and the block is valid, it announces it has the
block, and that the block has majority support, and it goes into the flood
fill transactions phase.
In the flood fill transactions phase, it randomly synchronizes its pool data
with other random peers, where each peer synchronizes with the other peer by
each peer giving the other the pool items it does not yet possess.
It also enters into a Paxos consensus phase where peers agree on the
authoritative block of signatures, and the time for the next block of
transactions, so that each not only has a majority of signatures, a but the
same block of signatures forming a majority. When the time for forming the
next block arrives, it switches to flood filling only old data around, the
point being to converge on common set of transactions.
After convergence on a common set of transactions has been going for a while,
they expect an announcement of a proposed consensus on those transactions by
primus inter pares, the pre-prepare message of the Paxos protocol.
They then proceed to Paxos consensus on the intended block, by way of the
prepare and commit messages, followed, if all goes well, by a voting majority
of peers announcing that they have the block and it is valid.
Our implementation of Byzantine Paxos differs from that of Castro and Liskov
in that a block corresponds to what they call a stable checkpoint, and also
corresponds to what they call a group transaction. If every peer swiftly
received every transaction, and then rejected conflicting transactions, then
it would approximate their protocol, but you cannot say a transaction is
rollback proof until you reach a stable checkpoint, and if rollbacks are
possible, people are going to game the system. We want authoritative
consensus on what transactions have been accepted more than we want prompt
response, as decentralized rollback is apt to be chaotic, unpredictable, and
easily gamed. They prioritized prompt response, while allowing the
possibility of inconsistent response which was OK in their application,
because no one wanted to manipulate the system into giving inconsistent
responses, and inconsistent responses did not leave angry clients out of
money.