---
title: Paxos
---
Paxos addresses Arrow’s theorem, and the difficulty of having a reliable
broadcast channel.

You want a better than fifty percent vote, so you don’t want anyone voting
for two candidates, at least not until one vote has been invalidated by
timeout, and you want to somehow have a single arbitrarily selected agenda
to vote up or down.

And, having been voted up, you don’t want anything else to be voted up, so
that you can definitely know when an agenda has been selected.

But Paxos assumes that many of these problems, such as who is eligible to
vote and what their vote is worth, have solutions that have been somewhat
arbitrarily predetermined by the engineer setting things up, and that we
don’t have the problem of the Roman Senate and popular assembly, where
only about a third of the Senate actually showed up to vote, and an
insignificant number of those eligible to vote in the popular assembly
showed up, most of them clients of senators. So Paxos is not worried about
people gaming the system to exclude voters they do not want, nor about
people gaming the somewhat arbitrary preselection of the agenda to be
voted up and down.
# Analysing [Paxos Made Simple] in terms of Arrow and Reliable Broadcast
[Paxos Made Simple]: ./paxos-simple.pdf

The trouble with Lamport’s proposal, described in [Paxos Made Simple], is
that it assumes no Byzantine failure, and that therefore the reliable
broadcast channel is trivial. It also assumes that any proposal will be
acceptable, that all anyone cares about is converging on one proposal, and
therefore it always converges on the first proposal accepted by an
acceptor.
> 1. A proposer chooses a new proposal number `n` and sends a `prepare n`
>    request to each member of some set of acceptors, asking it to respond
>    with:
>    a. a promise never again to accept a proposal numbered less than
>       `n`, and
>    b. the proposal with the highest number less than `n` that it has
>       accepted, if any.
> 2. If the proposer receives the requested responses from a majority of
>    the acceptors, then it can issue an `accept n v` request with number
>    `n` and value `v`, where `v` is the value of the highest-numbered
>    proposal among the responses, or is any value selected by the
>    proposer if the responders reported no proposals accepted.
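The acceptor’s side of this exchange is small. Here is a minimal sketch in
Python of the promise and accept rules quoted above; the class and field
names are mine, not Lamport’s:

```python
# Minimal sketch of a single Paxos acceptor, following the quoted rules.
# Names are illustrative; this is not Lamport's notation.

from typing import Optional, Tuple

class Acceptor:
    def __init__(self) -> None:
        self.promised: int = -1                          # highest n promised so far
        self.accepted: Optional[Tuple[int, str]] = None  # (n, value) last accepted

    def on_prepare(self, n: int):
        """`prepare n`: promise not to accept anything below n, and report
        the highest-numbered proposal already accepted, if any."""
        if n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return None   # ignore (or nack) a lower-numbered prepare

    def on_accept(self, n: int, value: str) -> bool:
        """`accept n v`: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False
```

A proposer that gets promises from a majority must re-propose the
highest-numbered accepted value reported back to it, which is what makes
the protocol converge on the first value any acceptor accepted.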
So the proposer either signs on to the existing possible consensus `v`,
notifying a bunch of acceptors of that consensus, or initiates a new
possible consensus `v`.

The assumption that `n` can be arbitrary seems to assume the proposers are
all agents of the same human, so do not care which proposal is accepted.
But we intend that they be agents of different humans. Still, let us
figure out how everything fits together before critiquing that.
> if an acceptor ignores a `prepare` or `accept` request because it has
> already received a prepare request with a higher number, then it should
> probably inform the proposer, who should then abandon its proposal. This
> is a performance optimization that does not affect correctness.

If a majority of acceptors accept some proposal, then we have a result.
But we do not yet have everyone, or indeed anyone, knowing the result.
Whereupon we have the reliable broadcast channel problem, which Lamport
hand waves away. The learners are going to learn it. Somehow. And once we
have a result accepted, we happily go on to the next round.

Well, suppose the leader’s proposal is just intolerable? Lamport assumes a
level of concord that is unlikely to exist.

Well, screw them: they have to propose the same value already accepted by
an acceptor. So we are going to get a definite result. The worst case
outcome is that proposers keep issuing new higher numbered proposals
before any proposal is accepted.

Lamport is assuming no Byzantine failure – but, assuming no Byzantine
failure, this is going to generate a definite result sooner or later.

But because we could have overlapping proposals with no acceptances,
Lamport concludes, we need a “distinguished proposer”, a leader, the
primus inter pares, the chief executive officer. As an efficiency
requirement, not a requirement to reach consensus.

The trouble with Lamport’s Paxos algorithm is that as soon as it becomes
known that one acceptor has accepted one proposal, everyone should
converge to it.

But suppose everyone becomes aware of the proposal, and 49% of acceptors
think it is great, and 51% of acceptors think it sucks intolerably?

If a leader’s proposal could be widely heard, and widely rejected, we have
a case not addressed by Lamport’s Paxos protocol.

Lamport does not appear to consider the case where the leader’s proposals
are rejected because they are just objectionable.
# Analysing [Practical Byzantine Fault Tolerance]
[Practical Byzantine Fault Tolerance]: ./byzantine_paxos.pdf

[Practical Byzantine Fault Tolerance] differs from [Paxos Made Simple] in
having “views”, where a change of leader (what they call “the primary”) is
accomplished by a change of “view”, and in having three phases,
pre-prepare, prepare, and commit, instead of two phases, prepare and
accept.

Pre-prepare is the “primary” (leader, CEO, primus inter pares) notifying
the “replicas” (peers) of the total order of a client message.

Prepare is the “replicas” (peers, reliable broadcast channel) notifying
each other of the association between total order and message digest.

Commit is the “replicas” and the client learning that a quorum of the
“replicas” (peers, reliable broadcast channel) agree on the total order of
the client’s message – $2f+1$ of $3f+1$ replicas for the replicas
themselves, and $f+1$ matching replies (a third plus one) for the client.
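A rough sketch of what one replica tracks across those three phases,
assuming $n = 3f+1$ replicas; the class shape and message arguments are my
own simplification, not Castro and Liskov’s wire format:

```python
# Sketch of one replica's bookkeeping for pre-prepare / prepare / commit,
# assuming n = 3f + 1 replicas. Illustrative only.

class Replica:
    def __init__(self, n_replicas: int) -> None:
        self.f = (n_replicas - 1) // 3      # maximum faulty replicas tolerated
        self.prepares: dict = {}            # (view, seq, digest) -> set of senders
        self.commits: dict = {}

    def on_pre_prepare(self, view: int, seq: int, digest: str) -> None:
        # The primary has assigned total order (view, seq) to the message `digest`.
        # A correct replica now multicasts a prepare for it (sending not shown).
        self.prepares.setdefault((view, seq, digest), set())

    def on_prepare(self, view: int, seq: int, digest: str, sender: int) -> None:
        peers = self.prepares.setdefault((view, seq, digest), set())
        peers.add(sender)
        if len(peers) >= 2 * self.f:        # 2f prepares + the pre-prepare = prepared
            pass                            # multicast commit (not shown)

    def on_commit(self, view: int, seq: int, digest: str, sender: int) -> None:
        peers = self.commits.setdefault((view, seq, digest), set())
        peers.add(sender)
        if len(peers) >= 2 * self.f + 1:    # committed: execute and reply to client
            pass
```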
# Analysing Raft Protocol
The [raft protocol] is inherently insecure against Byzantine attacks,
because the leader is fully responsible for managing log replication on
the other servers of the cluster.

[raft protocol]: https://ipads.se.sjtu.edu.cn/_media/publications/wang_podc19.pdf

We obviously want the pool to be replicated peer to peer, with the primus
inter pares (leader, what [Practical Byzantine Fault Tolerance] calls
“the primary”) organizing the peers to vote for one block after they have
already come close to agreement, when the only differences are
transactions not yet widely circulated, or disagreements over which of two
conflicting spends is to be incorporated in the next block.

I am pretty sure that this mapping of Byzantine Paxos to blockchain is
garbled, confused, and incorrect. I am missing something, misunderstanding
something; there are a bunch of phases that matter which I am leaving out,
unaware I have left them out. I will have to revisit this.

The Paxos protocol should be understood as a system wherein peers agree on
a total ordering of transactions. Each transaction happens within a block,
and each block has a sequential integer identifier. Each transaction
within a valid block must be non-conflicting with every other transaction
within the block and consistent with all past transactions, so that
although the block defines a total order on every transaction within the
block, all transactions can be applied in parallel.

The problem is that we need authoritative agreement on what transactions
are part of block N.
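The non-conflict rule is what makes parallel application safe. A minimal
sketch, assuming a simple spent-output transaction model (the `Tx` and
`OutPoint` types here are hypothetical, not the actual chain format):

```python
# Sketch of the "non-conflicting within the block" rule: no two transactions
# in a block may spend the same output, and every spent output must be unspent
# in the state left by previous blocks. Types are hypothetical.

from typing import Iterable, List, Set, Tuple

OutPoint = Tuple[str, int]   # (txid, output index)

class Tx:
    def __init__(self, txid: str, inputs: Iterable[OutPoint]) -> None:
        self.txid = txid
        self.inputs: List[OutPoint] = list(inputs)

def block_is_conflict_free(txs: Iterable[Tx], unspent: Set[OutPoint]) -> bool:
    spent_in_block: Set[OutPoint] = set()
    for tx in txs:
        for outpoint in tx.inputs:
            if outpoint in spent_in_block or outpoint not in unspent:
                return False      # double spend within the block, or already spent
            spent_in_block.add(outpoint)
    return True                   # order inside the block does not matter,
                                  # so the transactions can be applied in parallel
```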
Proposed transactions flood fill through the peers. A single distinguished
entity must propose a block: the pre-prepare message. Notice of this
proposal, the root hash of the proposal, flood fills through the peers,
and peers notify each other of this proposal and that they are attempting
to synchronize on it. Synchronizing on the block and validating it are
likely to require huge amounts of bandwidth and processing power, and will
take significant time.

If a peer successfully synchronizes, he issues a prepare message. If
something is wrong with the block, he issues a nack, a vote against the
proposed block, but the nack is informational only. It signals that peers
should get ready for a view change, but it is an optimization only.

If a peer receives a voting majority of prepare messages, he issues a
commit message.
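Putting the last three paragraphs together, one peer’s handling of a
proposed block might look roughly like this; the names and the simple
majority threshold are placeholders for whatever rule the real peer state
machine uses:

```python
# Sketch of one peer's reaction to a proposed block, per the description above.
# The simple majority threshold is a placeholder for the real voting-weight rule.

from typing import Optional, Set

class BlockVote:
    def __init__(self, n_peers: int) -> None:
        self.majority = n_peers // 2 + 1
        self.prepared_by: Set[int] = set()   # peers that sent prepare for this root hash

    def on_synchronized(self, block_is_valid: bool) -> str:
        # We managed to assemble the proposed block from our pool.
        if block_is_valid:
            return "prepare"                 # flood fill a prepare message
        return "nack"                        # informational only; hints at a view change

    def on_prepare(self, sender: int) -> Optional[str]:
        self.prepared_by.add(sender)
        if len(self.prepared_by) >= self.majority:
            return "commit"                  # a voting majority prepared: issue commit
        return None
```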
And that is the Paxos protocol for that block of transactions. We then go
into the Paxos protocol for the block of prepare messages that proves a
majority voted “prepare”. The block of prepare messages chains to the
block of transactions, and the block of transactions chains to the
previous block of prepare messages.

And if time goes by and we have not managed a commit, perhaps because
there are a lot of nacks due to bad transactions, perhaps because the
primus inter pares claims to have transactions that not everyone has and
then is unable to provide them (maybe its internet connection went down),
peers become open to another pre-prepare message from the next in line to
be primus inter pares.

In order that we can flood fill, we have to be able to simultaneously
synchronize on several different views of the pool of transactions. If
synchronization on a proposed block is stalling, perhaps because of
missing transactions, we end up synchronizing on multiple proposed blocks
proposed by multiple entities. Which is great if one runs to completion
and the others do not, but we are likely to run into multiple proposed
blocks succeeding. In which case, we have what
[Castro and Liskov](./byzantine_paxos.pdf) call a view change.

If things are not going anywhere, a peer issues a view change message
which flood fills around the pool, nominating the next peer in line for
primus inter pares, or the peer nominated in its existing pool of view
change messages. When it has a majority for a view change in its pool, it
will then pay attention to a pre-prepare message from the new primus inter
pares (which it may well have already received), and the new primus inter
pares restarts the protocol for deciding on the next block, issuing a new
pre-prepare message (which is likely one that many of the peers have
already synchronized on).
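A sketch of that view change bookkeeping; the nomination rule (next in
line, or whoever the existing view change messages nominate) and the stall
timeout are placeholders:

```python
# Sketch of view change handling as described above. The nomination rule and
# the timeout that triggers on_stall are placeholders.

from typing import Dict, Optional, Set

class ViewChange:
    def __init__(self, n_peers: int) -> None:
        self.majority = n_peers // 2 + 1
        self.votes: Dict[int, Set[int]] = {}      # nominee -> peers voting for it

    def on_stall(self, next_in_line: int) -> int:
        # Nothing is committing: nominate the next peer in line (or join an
        # existing nomination) and flood fill the view change message.
        return next_in_line

    def on_view_change(self, nominee: int, sender: int) -> Optional[int]:
        voters = self.votes.setdefault(nominee, set())
        voters.add(sender)
        if len(voters) >= self.majority:
            return nominee    # now accept a pre-prepare from this primus inter pares
        return None
```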
A peer has a pre-prepare message. He has a recent vote that the entity
that issued the pre-prepare message is primus inter pares. He has, or
synchronizes on, the block whose root hash is the one specified in that
pre-prepare message issued by that primus inter pares. When he has the
block, and it is valid, then away we go. He flood fills a prepare message,
which prepare message chains to the block, and to the latest vote for
primus inter pares.

Each new state of the blockchain has a final root hash at its peak, and
each peer that accepts the new state of that blockchain has a pile of
commits that add up to a majority committing to this new peak. But it is
probable that different peers will have different piles of commits.
Whenever an entity wants to change the blockchain state (issues a
pre-prepare message), it will propose an addition to the blockchain that
contains one particular pile of commits validating the previous state of
the blockchain. Thus each block contains one specific view of the
consensus validating the previous root. But it does not contain the
consensus validating itself. That is in the pool, not yet in the
blockchain.

A change of primus inter pares is itself a change in blockchain state, so
when the new primus inter pares issues a new proposed commit, which is to
say a new pre-prepare message, it is going to have, in addition to the
pile of commits that peers may have already synchronized on, a pile of
what [Castro and Liskov](./byzantine_paxos.pdf) call view change messages,
one specific view of that pile of view change messages. A valid block is
going to have a valid pre-prepare message from a valid primus inter pares,
and, if necessary, the vote for that primus inter pares. But the change in
primus inter pares, for each peer, took place when that peer had a
majority of votes for the primus inter pares, and the commit to the new
block took place when that peer had the majority of votes for the block.
For each peer, the consensus depends on the votes in his pool, not the
votes that subsequently get recorded in the blockchain.

So, a peer will not normally accept a proposed transaction from a client
if it already has a conflicting transaction. There is nothing in the
protocol enforcing this, but if the double spend is coming from Joe Random
Scammer, that is just extra work for that peer and all the other peers.

But once a peer has accepted a double spend transaction, finding
consistency is likely to be easier in the final blockchain, where that
transaction simply has no outputs. Otherwise, for the sake of a premature
optimization, we complicate the algorithm for reaching consensus.
# Fast Byzantine Multi-Paxos
# Generating the next block
But in the end, we have to generate the next block, which includes some
transactions and not others.

In the bitcoin blockchain, the transactions flood fill through all miners,
one miner randomly wins the right to decide on the block, forms the block,
and the block flood fills through the network.

In our chain, the primus inter pares proposes a block, peers synchronize
on it, which should be fast because they have already synchronized with
each other, and, if there is nothing terribly wrong with it, send in their
signatures. When the primus inter pares has enough signatures, he then
sends out the signature block, containing all the signatures.
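The signature block is just the proposer bundling enough of those
signatures; a sketch, with the threshold left as a placeholder and the
signature verification assumed to have happened on receipt:

```python
# Sketch of the primus inter pares assembling the signature block once enough
# peers have signed the proposed block's root hash. The threshold is a
# placeholder, and signatures are assumed already verified on receipt.

from typing import Dict, List, Optional

def assemble_signature_block(signatures: Dict[int, bytes],
                             threshold: int) -> Optional[List[bytes]]:
    """Return the signature block once `threshold` distinct peers have signed,
    otherwise None (keep waiting)."""
    if len(signatures) < threshold:
        return None
    return [sig for _peer, sig in sorted(signatures.items())]
```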
If we have a primus inter pares, and his proposal is acceptable, then
things proceed straightforwardly through the same synchronization process
as we use in flood fill, except that we are focusing on flood filling
older information to make sure everyone is in agreement.

After a block is agreed upon, the peers focus on flood filling around all
the transactions that they possess. This corresponds to the client phase
of Fast Byzantine Collapsed MultiPaxos. When the time for the next block
arrives, they stop flood filling, apart from flood filling any old
transactions that they received before the previous block was agreed but
which were not included in the previous block, and focus on achieving
agreement on those old transactions, plus a subset of their new
transactions. They try to achieve agreement by postponing new
transactions, and flooding old transactions around.

When a peer is in the Paxos phase, a synchronization event corresponds to
what [Castro and Liskov 4.2](./byzantine_paxos.pdf) call a group commit.
Synchronization events with the primus inter pares result in what [Castro
and Liskov](./byzantine_paxos.pdf) call a pre-prepare multicast, though if
the primus inter pares times out, or its proposal is rejected, we then go
into what [Castro and Liskov 4.4](./byzantine_paxos.pdf) call view
changes. In their proposal, there is a designated sequence, and if one
primus inter pares fails, then you go to the next, and the next, thus
reducing the stalling problem when two entities are trying to be primus
inter pares.

They assume an arbitrary sequence number, pre-assigned. We will instead go
through the leading signatories of the previous block, with a succession
of timeouts. “Hey, the previous primus inter pares has not responded.
Let’s hear it from the leading signatory of the previous block. Hey, no
response from him either. Let us try the second leading signatory.” Trying
for a different consensus nominator corresponds to what Castro and Liskov
call “view change”.
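So the succession order is just the signatory list of the previous block,
walked with a timeout per candidate. A sketch, with hypothetical names and
an arbitrary timeout value:

```python
# Sketch of choosing the next consensus nominator from the previous block's
# leading signatories, stepping down the list on each timeout. The timeout
# value is arbitrary; each step corresponds to a Castro and Liskov view change.

import time
from typing import List, Optional

def next_primus_inter_pares(candidates: List[int],
                            last_block_time: float,
                            timeout: float = 10.0,
                            now: Optional[float] = None) -> int:
    """Candidates are the previous primus inter pares followed by the leading
    signatories of the previous block. Each gets `timeout` seconds in turn."""
    if now is None:
        now = time.time()
    step = int(max(0.0, now - last_block_time) // timeout)
    return candidates[min(step, len(candidates) - 1)]
```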
The primus inter pares, Castro and Liskov’s primary, the primary of the
current view, issues a proposed root hash for the next block (a short
message). Everyone chats to everyone else, announcing that they are
attempting to synchronize on that root hash, also short messages,
preparatory to long messages as they attempt to synchronize. If they
succeed in generating the proposed block, and there is nothing wrong with
the block, they send approvals (short messages) to everyone else. At some
point the primus inter pares wraps up a critical mass of approvals in an
approval block (a potentially long message). When everyone has a copy of
the proposed block, and the approval block, then the blockchain has added
another immutable block.

Castro and Liskov’s pre-prepare message is the primary telling everyone
“Hey, let us try for the next block having this root hash”.

Castro and Liskov’s prepare message is each peer telling all the other
peers “Trying to synchronize on this announcement”, confirming that the
primus inter pares is active, and assumed to be sane.

Castro and Liskov’s commit message is each peer telling all the other
peers “I see a voting majority of peers trying to synchronize on this root
hash”. At this point Castro and Liskov’s protocol is complete – but of
course there is no guarantee that we will be able to synchronize on this
root hash: it might contain invalid transactions, it might reference
transactions that get lost because peers go down or internet
communications are interrupted. So our protocol is not complete.

The peers have not committed to the block. They have committed to commit
to the block if they have it and it is valid.

So, we have a Paxos process where they agree to try for a block of
transactions with one specific root hash. Then everyone tries to sync on
the block. Then we have a second Paxos process where they agree that they
have a block of signatories agreeing that they have the block of
transactions and it is valid.

The block of transactions Merkle chains to previous blocks of signatories
and transactions, and the block of signatories Merkle chains to previous
blocks of transactions and signatories. Each short message of commitment
contains a short proof that it chains to previous commitments, which a
peer already has and has already committed to.
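So the chain alternates between the two block kinds, each committing to
the hash of the one before it. A sketch of just the chaining, with the
real header fields reduced to a couple of hashes:

```python
# Sketch of the alternating chain: a transaction block commits to the previous
# signature block, and a signature block commits to the transaction block it
# approves. Real headers are reduced to a few hashed fields.

import hashlib
from dataclasses import dataclass

def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

@dataclass
class TxBlock:
    prev_sig_block_hash: bytes   # chains to the previous signature block
    tx_merkle_root: bytes        # root hash the pre-prepare message announced

    def block_hash(self) -> bytes:
        return h(self.prev_sig_block_hash, self.tx_merkle_root)

@dataclass
class SigBlock:
    tx_block_hash: bytes         # chains to the transaction block it approves
    sig_merkle_root: bytes       # root of the pile of commit signatures

    def block_hash(self) -> bytes:
        return h(self.tx_block_hash, self.sig_merkle_root)
```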
When a peer has the block of transactions that a voting majority of peers
have agreed to commit to, and the block is valid, it announces that it has
the block and that the block has majority support, and it goes into the
flood fill transactions phase.

In the flood fill transactions phase, it randomly synchronizes its pool
data with other random peers, each peer giving the other the pool items it
does not yet possess.
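In its simplest form that pairwise synchronization is just a set
difference in both directions; a sketch, ignoring the bandwidth-saving
reconciliation a real implementation would need:

```python
# Sketch of one round of pairwise pool synchronization: each peer sends the
# other the pool items the other does not yet have. A real implementation
# would exchange digests first rather than whole item sets.

from typing import Dict, Tuple

Pool = Dict[str, bytes]   # item id (e.g. transaction hash) -> item body

def synchronize(a: Pool, b: Pool) -> Tuple[int, int]:
    a_to_b = {k: v for k, v in a.items() if k not in b}
    b_to_a = {k: v for k, v in b.items() if k not in a}
    b.update(a_to_b)
    a.update(b_to_a)
    return len(a_to_b), len(b_to_a)   # how many items flowed in each direction
```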
It also enters into a Paxos consensus phase where peers agree on the
authoritative block of signatures, and on the time for the next block of
transactions, so that each not only has a majority of signatures, but the
same block of signatures forming a majority. When the time for forming the
next block arrives, it switches to flood filling only old data around, the
point being to converge on a common set of transactions.

After convergence on a common set of transactions has been going for a
while, they expect an announcement of a proposed consensus on those
transactions by the primus inter pares, the pre-prepare message of the
Paxos protocol.

They then proceed to Paxos consensus on the intended block, by way of the
prepare and commit messages, followed, if all goes well, by a voting
majority of peers announcing that they have the block and it is valid.

Our implementation of Byzantine Paxos differs from that of Castro and
Liskov in that a block corresponds to what they call a stable checkpoint,
and also corresponds to what they call a group transaction. If every peer
swiftly received every transaction, and then rejected conflicting
transactions, then it would approximate their protocol, but you cannot say
a transaction is rollback proof until you reach a stable checkpoint, and
if rollbacks are possible, people are going to game the system. We want
authoritative consensus on what transactions have been accepted more than
we want prompt response, as decentralized rollback is apt to be chaotic,
unpredictable, and easily gamed. They prioritized prompt response, while
allowing the possibility of inconsistent response – which was OK in their
application, because no one wanted to manipulate the system into giving
inconsistent responses, and inconsistent responses did not leave angry
clients out of money.