---
title: Replacing TCP, SSL, DNS, CAs, and TLS
sidebar: true
...

# related

[Client Server Data Representation](client_server.html){target="_blank"}

# Existing work

[µTP]:https://github.com/bittorrent/libutp
"libutp - The uTorrent Transport Protocol library"
{target="_blank"}

[µTP], the Micro Transport Protocol, has already been written, and it is just a
matter of copying it and embedding it where possible, and forking it if
unavoidable. DDoS resistance looks like it is going to need forking.

It implements LEDBAT, a congestion control protocol designed for applications
that download bulk data in the background, pushing the network close to its
limits, while still playing nice with TCP.

Implementing consensus over [µTP] is going to need [QUIC] style streams,
that can slow down or fail without the whole connection slowing down or
failing, though it might be easier to implement consensus that just calls
µTP for some tasks.

I have not investigated what implementing short fixed length streams over
[µTP] would involve. Bittorrent already necessarily does something mighty
like that. Maybe it just sequentializes everything. Which kind of makes
sense, a single concurrent process managing each connection is easier to
program and comprehend, even if it cannot give optimal performance.
Obviously it must have a request response layer, documented only in
source code. The question then is how it maps that layer onto a µTP
connection. You are going to have to copy, not just µTP, but that layer,
which should be part of µTP, but probably is not. You will have to
factor out what they have probably not cleanly factored.

Their request response layer is probably somewhat documented in
[BEP0055]. I suspect that what I need is not just µTP, but the largest
common factors of [BEP0055].

[BEP0055]:https://www.bittorrent.org/beps/bep_0055.html
"BEP0055"
{target="_blank"}

[`ut_holepunch` extension message]:http://bittorrent.org/beps/bep_0010.html
"BEP0010"
{target="_blank"}

[libtorrent source code]:https://github.com/arvidn/libtorrent/blob/c1ade2b75f8f7771509a19d427954c8c851c4931/src/bt_peer_connection.cpp#L1421
"bt_peer_connection.cpp"
{target="_blank"}

µTP does not itself implement hole punching, but interoperates smoothly
with libtorrent's [BEP0055]'s [`ut_holepunch` extension message], which is
only documented in [libtorrent source code].

A tokio-rust based µTP system is under development, but very far from
complete last time I looked. Rewriting µTP in Rust seems pointless. Just
call it from a single tokio thread that gives effect to a hundred thousand
concurrent processes. There are several projects afoot to rewrite µTP in
Rust, all of them stalled in a grossly broken and incomplete state.

[QUIC has grander design objectives]:https://docs.google.com/document/d/1RNHkx_VvKWyWg6Lr8SZ-saqsQx7rFV-ev2jRFUoVD34/edit
{target="_blank"}

[QUIC has grander design objectives], and is a well thought out, well
designed, and well tested implementation of no end of very good and
much needed ideas and technologies, but relies heavily on enemy
controlled cryptography.

But there are some things I want to do that it really cannot do: consensus
between a small number of peers, by invitation, with each peer directly
connected to each of the others, the small set of peers being part of the
consensus known to all peers, and all peers always online and responding
appropriately, or else they get kicked out (Practical Byzantine Fault
*In*tolerant consensus). It might be efficient to use a different
algorithm to construct consensus, and then use µTP to download the bulk data.

# Existing documentation

There is a great pile of RFCs on issues that arise with using UDP and ICMP
to communicate, which contain much useful information.

[RFC5405](https://datatracker.ietf.org/doc/html/rfc5405#section-3), [RFC6773](https://datatracker.ietf.org/doc/html/rfc6773), [datagram congestion control](https://datatracker.ietf.org/doc/html/rfc5596), [RFC5595](https://datatracker.ietf.org/doc/html/rfc5595), [UDP Usage Guideline](https://datatracker.ietf.org/doc/html/rfc8085)

There is a formalized congestion signalling system, `ECN`, explicit
congestion notification. Most servers ignore ECN. On a small proportion of
routes, 1%, ECN tagged packets are dropped.

Raw sockets provide greater control than UDP sockets, and allow you to
do ICMP like things through ICMP.

I also have a discussion on NAT hole punching, [peering through nat](nat.html), that
summarizes various people's experience.

To get an initial estimate of the path MTU, connect a datagram socket to
the destination address using connect(2) and retrieve the MTU by calling
getsockopt(2) with the IP_MTU option. But this can only give you an
upper bound. To find the actual MTU, you have to set the don't fragment
flag (which is these days generally set by default on UDP) and empirically
track the largest packet that makes it on this connection. Which TCP does.
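
The connect(2) and getsockopt(2) recipe above can be sketched as follows. This is a minimal, Linux only sketch: the numeric fallbacks for `IP_MTU`, `IP_MTU_DISCOVER` and `IP_PMTUDISC_DO` are assumed from the Linux headers, and `initial_path_mtu` is a name made up for illustration. The value returned is only the kernel's current estimate, an upper bound, not a measured path MTU.

```python
import socket
import sys

# Linux values; Python's socket module does not always export these names.
IP_MTU = getattr(socket, "IP_MTU", 14)
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)


def initial_path_mtu(host, port):
    """Return the kernel's current path MTU estimate, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None  # IP_MTU is a Linux socket option
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Ask for the don't fragment flag on everything this socket sends.
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
        # connect() on a datagram socket sends nothing, it only fixes the route.
        sock.connect((host, port))
        return sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
    except OSError:
        return None
    finally:
        sock.close()
```

Calling `initial_path_mtu("127.0.0.1", 9)` queries the loopback route without sending a packet. Finding the real path MTU still requires sending probes and tracking the largest one that gets through.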

MTU (packet size) and MSS (data size, $MTU-40$) is a
[messy problem](https://www.cisco.com/c/en/us/support/docs/ip/generic-routing-encapsulation-gre/25885-pmtud-ipfrag.html),
which can be sidestepped by always sending packets
of size 576 containing 536 bytes of data.
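
As a small illustration of the fixed size approach, the sketch below cuts a message into pieces that each fit a 576 byte packet. The names `SAFE_PACKET`, `HEADER_ALLOWANCE` and `split_payload` are made up for this example, and the 40 byte allowance simply follows the $MTU-40$ figure above.

```python
SAFE_PACKET = 576                               # packet size assumed to pass everywhere
HEADER_ALLOWANCE = 40                           # the MTU - 40 allowance used above
SAFE_PAYLOAD = SAFE_PACKET - HEADER_ALLOWANCE   # 536 bytes of data


def split_payload(data, size=SAFE_PAYLOAD):
    """Cut data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]
```

A 1200 byte message becomes two full 536 byte pieces and one 128 byte remainder.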

## first baby steps

To try and puzzle this out, I need to build a client server that can listen on
an arbitrary port and tell me about the messages it receives, and can send
messages to an arbitrary hostname:port or network address:port. When it
receives a packet that is formatted for it, it will display the
information in that packet and obey the command in that packet. The command
will typically be to send a reply that depicts what is in the packet it
received, which probably got transformed by passing through multiple NATs,
and/or to display what is in the packet, which is typically a depiction of
how the packet to which this packet is a reply got transformed.

This test program sounds an awful lot like ICMP, which is best accessed
through raw sockets. Might be a good idea to give it the capability to send
ICMP, UDP, and fake TCP.
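
A first cut of such a test program, UDP only, might look like the sketch below. The JSON report format and the names `serve_once`, `probe` and `demo` are assumptions made for illustration, not a wire format proposed by this document. The server tells the sender what source address and port it saw, which is exactly the information that NAT rewrites.

```python
import json
import socket
import threading


def serve_once(sock):
    """Receive one datagram and reply with a description of what arrived."""
    data, (src_ip, src_port) = sock.recvfrom(2048)
    report = {"seen_ip": src_ip, "seen_port": src_port, "length": len(data)}
    sock.sendto(json.dumps(report).encode(), (src_ip, src_port))


def probe(server_addr, payload=b"probe", timeout=2.0):
    """Send one datagram and return the server's report plus our local port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        client.settimeout(timeout)
        client.sendto(payload, server_addr)
        reply, _ = client.recvfrom(2048)
        report = json.loads(reply)
        report["local_port"] = client.getsockname()[1]
        return report


def demo():
    """Run reflector and probe on loopback, where nothing rewrites the packet."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
    worker.start()
    report = probe(server.getsockname())
    worker.join(timeout=2.0)
    server.close()
    return report
```

On loopback `seen_port` equals `local_port`. Across a NAT the two would typically differ, and that difference is what the test program is meant to reveal.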

Raw sockets provide the lowest level access to the network available from
userspace. An immense pile of obscure and complicated stuff is in kernel.

# What the API should look like

It should be a consensus API for consensus among a small number of
peers, rather than a message API, message response being the special case
of consensus between two peers, and broad consensus being constructed
out of a large number of small invitation based consensi.

A peer explicitly joins the small group when its request is acked by a
majority, and rejected by no one.
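
The join rule can be written down directly. In this sketch `join_accepted` and the shape of `votes` are assumptions for illustration: members that have not answered are simply absent from the mapping.

```python
def join_accepted(votes, group_size):
    """Decide whether a join request has succeeded.

    votes maps member id -> True (ack) or False (reject).
    group_size is the number of current members of the small group.
    """
    if any(vote is False for vote in votes.values()):
        return False  # a single rejection is a veto
    acks = sum(1 for vote in votes.values() if vote)
    return acks * 2 > group_size  # strict majority of the whole group
```

With five members, three acks and two silent members admit the peer, two acks do not, and any rejection blocks the join regardless of the ack count.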

On the other hand this involves re-inventing networking from scratch, as
compared to simply copying QUIC, or some other reliable UDP system.

Total rewrites, however desirable and necessary, always fail.

So on reflection this is a blue sky proposal, likely to involve immense delay.

I need to think about the way things should be done - but I don't want to
get lost in the weeds. I have repeatedly wasted a great deal of time
re-inventing stuff from scratch, only to find that when I was finished, I had
something vastly inferior to what already existed, so I wound up tossing
my work, and using someone else's library with minimum adaptation.

Many a time I see something encrusted with ancient history, where backward
compatibility means they cannot fix old mistakes. I design something new
and fresh, and vastly superior, and discover that there were one hundred
and one issues that the old history encrusted thing had encountered and
dealt with, and that I had not foreseen. Not all of that mighty pile of code
is crap to work around past mistakes which must continue to be supported.
A lot of it deals with issues I had not foreseen having to deal with, and
had not planned a path to dealing with.

When implementing stuff from scratch, all too often one discovers there
are no end of reasons for all the stuff one thought bad and unnecessary in
existing libraries.

But on with the vision. Though it will likely be vastly faster to just fix
someone else's library to have real security.

Although the api represents messages, rather than connections, it will
implicitly have a very large number of connections, in that a connection is
your current state with a counterparty, expected protocols (message types) and all that.

For an app to poll a very large number of connections over the network,
`select` does not cut the mustard. Network apis have been evolving, each in
its own idiosyncratic way, to the app making O(1) additions and deletions to
the list of counterparties on the network whose messages it is listening to,
and getting notifications that are O(number of events) rather than
O(number of counterparties).

The way this should be done is a linked list of data structures containing
events, which the app can poll locklessly, or wait on (with a timer event
guaranteed to appear in the list eventually if it is waiting on it). If the app
fails to free anything from the list after an unreasonably long time,
suggesting that the app has shut down ungracefully or crashed, and there
are rather too many things on the list, the process that is putting things on
the list will start by pushing back on the parties sending messages to the
app, and end by shutting down their connections and discarding their data.
The network events live entirely in memory and are volatile. If they
represent long lived relationships, it is up to the app to commit the
information that they represent to disk.
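
The push back and discard behaviour can be sketched as below. This is a minimal single process sketch, not a lockless implementation: a `collections.deque` stands in for the linked list, and the class name `EventList`, the two limits, and the `stale_after` timeout are all assumptions chosen for illustration.

```python
import time
from collections import deque


class EventList:
    """In-memory event list with push back when the app stops draining it."""

    def __init__(self, soft_limit=1000, hard_limit=10000,
                 stale_after=30.0, clock=time.monotonic):
        self._events = deque()
        self._soft = soft_limit
        self._hard = hard_limit
        self._stale_after = stale_after
        self._clock = clock
        self._last_freed = clock()

    def push(self, event):
        """Network side. Returns 'ok', 'push_back' or 'dropped'."""
        stalled = self._clock() - self._last_freed > self._stale_after
        if stalled and len(self._events) >= self._hard:
            return "dropped"        # app looks dead: discard the data
        self._events.append(event)
        if stalled and len(self._events) >= self._soft:
            return "push_back"      # tell the sender to slow down
        return "ok"

    def poll(self):
        """App side. Returns the next event or None, and never blocks."""
        if not self._events:
            return None
        self._last_freed = self._clock()
        return self._events.popleft()
```

Each `poll` costs O(1) and touches only events that actually happened, which is the O(number of events) property asked for above. Nothing here is written to disk, matching the rule that events are volatile.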

Every message has a public key of sender, a public key of recipient, and
potentially an in-regards-to hash, a reply-to hash, and an in-reply-to hash.
Some or all of these hashes may be null. It seldom makes sense for all of
them to be null, and it seldom makes sense for all of them to be non null.
Usually reply-to is null, and it does not always make sense for it to be non
null.
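
As a sketch of the top level message shape, assuming keys and hashes are carried as raw bytes (the `payload` field and the `references` helper are additions for illustration, not part of the description above):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    sender: bytes                          # public key of sender
    recipient: bytes                       # public key of recipient
    payload: bytes                         # opaque to this layer
    in_regards_to: Optional[bytes] = None  # hash, may be null
    reply_to: Optional[bytes] = None       # hash, usually null
    in_reply_to: Optional[bytes] = None    # hash, may be null

    def references(self):
        """Return the non-null hashes, keyed by field name."""
        names = ("in_regards_to", "reply_to", "in_reply_to")
        return {n: getattr(self, n) for n in names if getattr(self, n) is not None}
```

Note that no network address or port appears anywhere in the structure. At this level the counterparty is identified only by public key.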

The reply-to field opens up a very large can of worms, in that its main use
is to reference a third party message that came from a third party server,
with its own type information and sender public key, and then how does the
sender know the recipient has or can obtain that message?

Every hash and every public key represents a potential endpoint, and thus
represents an additive type, or rather gives the system potential clues on
how to discover a mutually known additive type. (Reflect on the slow and
chaotic semi automated complexity of how the many protocols involved in
sending and receiving an email message are discovered, every time, for
every email message.)

Some of the time, the message type is only known from one of these
hashes – they imply the type information, without which the recipient
would not know how to parse the message, and the recipient has to be able
to recognize them before he can recognize anything else. And some of the
time, figuring out the message type from these hashes is non trivial or just
flat out fails. No general automatic one size fits all procedure can work on
every mysterious second party hash. This is a problem that has to be dealt
with ad hoc use case by use case, protocol by protocol, message type by
message type.

Not all messages can be sent reliably, but the sender gets a notification
event – failed, succeeded, replied to, or unlikely to be known. The
sender can immediately find out either the likely timing of such
notification, or that the likely timing of such notification is unknown, and
usually an unknown timing generates an exception.

The api is potentially multilayered – the message may well get translated
to a multitude of similarly structured messages, that set up the connection,
find out information about the recipient, all that stuff, and when those
messages go on the wire, they do not necessarily have any of this stuff –
commonly they just have the network address, the port, and some numbers
that uniquely identify the context. These numbers are unique to the
connection, but unlike the hashes from which they are derived, they are not
globally unique: they are sequential identifiers, not hashes. But at the top
level, the network address, the port, and all that stuff is just not
represented, except implicitly in that the public key of the recipient may
well get looked up in a hash table that may well have the network address
and the port.

On the wire, network address and port serves the function of in-regards-to,
and will wrap stuff that provides a finer grained function of in-regards-to
and in-reply-to -- as I said, multilayered, with the hashes being internally
mapped to data that serves equivalent functionality. Network address
and port being the outermost layer on the wire.

On the wire, once a connection is established, the sender and recipient
public keys are implicit in the IP header, and the rest is opaque payload,
maximum payload being 1 KiB. Inside the payload, the representation
depends on the message type, which was established when the connection
was established – the in-reply-to of the contained message is the unique
sequential nonce of the message being replied to, rather than the hash of
that message.
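
The mapping between globally unique hashes at the api level and short sequential nonces on the wire can be sketched as a per connection table. The class and method names are hypothetical, and SHA-256 is used only as a stand-in for whatever hash the protocol settles on.

```python
import hashlib


class ConnectionContext:
    """Per-connection table mapping message hashes to sequential nonces."""

    def __init__(self):
        self._next = 0
        self._nonce_by_hash = {}
        self._hash_by_nonce = {}

    def register(self, message_bytes):
        """Record a message sent or received on this connection."""
        digest = hashlib.sha256(message_bytes).digest()
        if digest not in self._nonce_by_hash:
            self._nonce_by_hash[digest] = self._next
            self._hash_by_nonce[self._next] = digest
            self._next += 1
        return self._nonce_by_hash[digest]

    def to_wire(self, in_reply_to_hash):
        """Api level hash -> small nonce that goes in the packet."""
        return self._nonce_by_hash[in_reply_to_hash]

    def from_wire(self, nonce):
        """Nonce from the packet -> api level hash."""
        return self._hash_by_nonce[nonce]
```

The nonce means something only inside one connection, so two connections can both use nonce 0 for entirely different messages, while the hash seen by the application stays globally unique.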

In the api, the application and api know the message type, because
otherwise the api just would not work. But on the rare occasions when the
message is represented globally, outside the api, *then* it needs a message type header.

# TCP is broken

TCP was designed in more trusting times, when the name system
consisted of a widely shared hosts file, and everyone trusted everyone.

Over the years people have piled warts on top of TCP and warts on top of
warts to fix one problem after another, and every fix results in additional round trips.

Thus “Cloudflare is checking your browser, you will be redirected shortly”

Every additional round trip before a web page comes up results in a
significant loss of viewers. Hence HTTP/2. Which fails to fix the DDoS and
Cloudflare problem.

TCP is a major problem, which is slowing down the internet. DDoS
protection and the certificate mess are warts growing on top of warts.

Any business that resists corporate cancer is going to come under DDoS,
and if it employs a DDoS resistance service, that service is likely to place
pressure on the business to do political stuff that is counterproductive to
pursuing a profit. And even if it does not, the DDoS service slows down
people trying to view the business website.

If the TCP replacement fixes those warts, you get more views.

# Domain name system and SSL are broken

Any organization that has a certificate authority in its pocket can perform
a man in the middle attack on an SSL connection, though the CAA domain
name record somewhat mitigates this problem.

We also need to replace the TCP/SSL/CA/DNS system because
there is money in it. A great deal of money.

The trouble with an ICO (initial coin offering), is that the issuer has no
obligation to do anything other than take the money and run. We are
moving to an economy where much of the value is “goodwill”, “goodwill”
being names with reputations and relationships. The blockchain (or
blockdag, since blockdags theoretically have better scaling than
blockchains) could be used to render this value liquid in IPOs by having
both names and money on the blockchain.

With atomic transactions between blockchains, plus names on the blockchain
with money, a replacement for TCP/SSL/CAs/DNS could support sovereign
corporations on the blockchain, so that an ICO could be an IPO (Initial
Public Offering). If the blockchain is a name service as well as a money
service, it could give the investors ownership of the name. The owners of
examplecorp shares get to designate the board public key, and the board gets to
designate the public key of CEO@examplecorp from time to time, thus
rendering the value of a name potentially liquid.

Cryptocurrency exchanges are run by crooks, and are full of crooks each
trying to scam all the other crooks.

If you don’t know who the pigeon is, you are the pigeon.

A healthy cryptocurrency market needs to leave the cryptocurrency
exchanges behind, replacing them with atomic blockchain transactions
between separate blockchains. The exchanges are dangerously centralized, and
linked to a corruptly regulated finance and accounting system, which
corruption we saw with the Great Minority Mortgage Meltdown and the
Mortgage backed Security market from 2005 November to 2007, and saw
with MF Global. Jon Corzine did worse than embezzle client funds. He
embezzled client funds legally.

Demand for crypto currencies is driven in substantial part by the fact that
recent regulations have cheerfully set aside laws on fiduciary duty that are
millennia old. The exchanges cheerfully adhere to such regulations as they
find dangerously convenient, while taking advantage of cryptocurrency to
avoid those regulations that they find inconvenient.

The banks, the stock exchanges, and the big accounting firms are regulated
agencies whose regulators are in their pocket. The crypto currency exchanges
are semi regulated, taking advantage of regulations written for those who
have regulators in their pocket.

The cryptocurrency market needs to get rid of exchanges, starting with
cryptocurrency exchanges, and proceeding to get rid of stock exchanges.

An exchange exists to provide an escrow that faithfully observes
its fiduciary duty. And there have been a great many recent examples of such
entities getting up to no good, and in the case of the mortgage backed
security market, up to no good with enormous amounts of money.

A cryptocurrency with a name system could eat their lunch, greatly enriching
its founders in the process.

# Networking itself is broken

But that is too hard a problem to fix.

I had to sweat hard setting up Wireguard, because it pretends to be just
another `network adaptor` so that it can sweep away a pile of issues as out
of scope, and reading up posts and comments referencing these issues, I
suspect that almost no one understands these issues, or at least no one who
understands these issues is posting about them. They have a magic
incomprehensible incantation which works for them in their configuration,
and do not understand why it does not work for someone else in a subtly
different configuration.

## Internet protocol: too many layers of abstraction

I have to talk internet protocol to reach other systems over the internet, but
internet protocol is a messy pile of ad hoc bits of software built on top of
ad hoc bits of software, and the reason it is hard to understand the nuts and
bolts when you actually try to do anything useful is that you do not
understand, and indeed almost no one understands, what is actually going
on at the level of network adaptors and internet switches. When you send a
UDP packet, you are already at a high level of abstraction, and the
complexity that these abstractions are intended to hide leaks.

And because you do not understand the intentionally hidden complexity
that is leaking, it bites you.

### Adaptors and switches

A private network consists of a bunch of `network adaptors` all connected to
one `ethernet switch` and its configuration consists of configuring
the software on each particular computer with each particular `network adaptor`
to be consistent with the configuration of each of the others connected to
the same `ethernet switch`, unless you have a `DHCP server` attached to the
network, in which case each of the machines gets a random, and all too
often changing, configuration from that `DHCP server`, but at least it is
guaranteed to be consistent with the configuration of each of the other
`network adaptors` attached to that one `ethernet switch`. Why do DHCP
configurations not live forever, why do they not acknowledge the machine's
human readable name, why does the ethernet switch not have a human
readable name, and why does the DHCP server have a network address
related to that of the ethernet switch, but not a human readable name
related to that of the ethernet switch?

What happens when you have several different network adaptors in one computer?

Obviously an IP address range has to be associated with each network
adaptor, so that the computer can dispatch packets to the correct adaptor.
And when the network adaptor receives a packet, the computer has to
figure out what to do with it. And what it does with it is the result of a pile
of undocumented software executing a pile of undocumented scripts.

If you manually configure each particular machine connected to an
ethernet switch, the configuration consists of arcane magic formulae
interpreted by undocumented software that differs between one system and the next.

This rapidly becomes apparent when you have to deal with more than one
adaptor, connected to more than one switch.

Each physical or virtual network adaptor is driven by a device driver,
which is different for each physical device and operating system. From the
point of view of the software, the device driver api *is* the network adaptor
programmer interface, and it does not care about which device driver it is,
so all network adaptors must have the same programmer interface. And
what is that interface?

Networking is a wart built on top of warts built on top of warts. IPv6 was
intended to clean up this mess, but kind of collapsed under rule by
committee, developing a multitude of arcane, overly complicated, and overly
clever cancers of its own, different from, and in part incompatible
with, the vast pile of cruft that has grown on top of IPv4.

The committee wanted to throw away the low order sixty four bits of
address space to use to post information for the NSA to mop up, and then
other people said to themselves, "this seems like a useless way to abuse
the low order sixty four bits, so let us abuse it for something else. After all,
no one is using it, nor can they use it because it is being abused". But
everyone whose internet facing host has been assigned a single address,
which means he has actually been assigned $2^{64}$ addresses because he has
sixty four bits of useless address space, needs to use it, since he probably
wants to connect a private in house network through his single internet
facing host, and would like to be free to give some of his in house hosts
globally routable addresses.

In which case he has a private network address space, which is a random
subnet of fd00::/8, and a 64 bit subnet of the global address space, and what
he wants is that he can assign an in house computer a globally routable
address, whereupon anything it sends that has a destination that is not on
his private network address space, nor his subnet of the globally routable
address space, gets sent to the internet facing network interface.
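
The forwarding rule in the last paragraph can be sketched with the standard `ipaddress` module. The two prefixes are made up for illustration (a random looking unique local prefix and a prefix from the 2001:db8::/32 documentation range), and `pick_interface` is a hypothetical name.

```python
import ipaddress

# Hypothetical prefixes: a random private subnet of fd00::/8, and a
# globally routable /64 taken from the documentation range.
PRIVATE_NET = ipaddress.ip_network("fd12:3456:789a:1::/64")
GLOBAL_NET = ipaddress.ip_network("2001:db8:1234:5678::/64")


def pick_interface(destination, private_net=PRIVATE_NET, global_net=GLOBAL_NET):
    """Return 'lan' for in house destinations, 'wan' for everything else."""
    dest = ipaddress.ip_address(destination)
    if dest in private_net or dest in global_net:
        return "lan"  # stays on the in house network
    return "wan"      # goes out the internet facing interface
```

A destination inside either prefix stays in house, including hosts that carry globally routable addresses, and everything else leaves through the internet facing interface.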

Further, he would like every computer on his network to be automatically
assigned a globally routable address if it uses a name in the global system,
or a private fd00::/8 address if it is using a name not in the global system, so
that the first time his computer tries to access the network with the domain
name he just assigned, it gets a unique network address which will never
change, and a reverse dns that can only be accessed through an address on
his private network. And if he assigns it a globally accessible name, he
would like the global dns servers and reverse dns servers to automatically
learn that address.

This is, at present, doable by the DDI, which updates both your DHCP
server and your DNS server. Except that hardly anyone has an in house
DNS server that serves up his globally routable addresses. The I in DDI
stands for IP Address Manager or IPAM. In practice, everyone relies on
named entities having extremely durable network addresses which are a
pain and a disaster to dynamically update, or they use dynamic DNS, not IPAM.

What would be vastly more useful and usable is that your internet facing
peer routed globally routable packets to and from your private network,
and machines booting up on your private network automatically received
static addresses corresponding to their names.

Globally routable subnets can change, because of physical changes in the
global network, but this happens so rarely that a painful changeover is
acceptable. The IPv6 fix for automatically accommodating this issue is a
cumbersome disaster, and everyone winds up embedding their globally
routable IPv6 subnet address in a multitude of mystery magic incantations,
which, in the event of a change, have to be painstakingly hunted down and
changed one by one, so the IPv6 automatic configuration system is just a
great big wart in a dinosaur's asshole. It throws away half the address
space, and seldom accomplishes anything useful.

# Distributed Denial of Service attack

At present, resistance to Distributed Denial of Service attacks rests on
dangerously powerful central authorities, in particular Cloudflare, whose
service in addition to being dangerously centralized, is expensive and poor.

The TCP replacement needs an adjustable proof of work (pow) handshake
as the first part of the connection handshake, the proof of work request
being the first server packet in the four packet handshake.

First packet, client requests connection, second packet, server requests
work, and supplies a durable and a short lived public key, third packet,
client supplies work and offers transient public key, making
communication possible, plus the message it is trying to send the server, or
the first part of that message.

The work demanded goes up as the server load increases, thus fixing the
horrors of DDoS protection.
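
How the demanded work tracks load is not specified here. As one possible shape, assuming work is measured in leading zero bits of a hash, so that each extra bit doubles the client's expected effort, a linear mapping from load to bits would look like this. The function name and the 0 to 24 bit range are assumptions.

```python
def required_work_bits(load, min_bits=0, max_bits=24):
    """Map server load (fraction of capacity, 0.0 to 1.0) to proof of work bits."""
    load = min(max(load, 0.0), 1.0)  # clamp out of range measurements
    return min_bits + round((max_bits - min_bits) * load)
```

An idle server demands no work at all, so ordinary clients pay nothing. A saturated server demands 24 bits, roughly sixteen million hash attempts per connection under these assumed parameters.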

## Key agreement

Key agreement needs to be part of the TCP replacement handshake, rather
than a layer on top, to reduce round tripping.

The name system needs to be integrated with the key system, so that you get
the key when you get the network address associated with the name, and
the key/name pairing needs to be blockchain secured, so you don’t have one
thousand certificate authorities each with the authority to mount a man in the middle attack.

## replacement handshake for publicly identified server

The TCP replacement handshake needs to be a four phase handshake.

1. Client->Server: Give me a connection, here are my parameters, here is my
    session key.

1. Server->Client: Here is a proof of work request, my parameters, and a keyed
    hash of your and my parameters. Ask again with proof of work, the same
    parameters, and the keyed hash.

    Server then throws away the request, allocating no memory.

1. Client->Server: OK, here I am again, with all that stuff you asked for.

    This includes a konce (key used once, single use elliptic point), and
    assumes that the client reliably knows the server public key in
    advance. This protocol is inappropriate to signons that are restricted
    to identified entities, because we probably do not want everyone to
    know who is identified.

1. Server checks the Poly1305 authentication to ensure that this is a
    real client reply to a real and recent server reply. Then it checks the
    proof of work.

    If the proof of work passes, Server allocates memory, generates and stores a
    session key, and stores connection parameters, the client and server
    session keys among them.

1. Server->Client: OK, here is my session key, authenticated but not
    signed by my permanent key, and stuff, now you can start sending
    actual data.

Thus we can integrate TCP handshake and encryption handshake and the
innumerable DDoS protection handshakes “Cloudflare is checking your browser,
oops, your browser did not pass, here is a captcha” at the cost of one single
additional trip, half a round trip.

Instead of the person establishing the connection fuming while round trip
after round trip goes through, we get all that stuff at the cost of one
additional half round trip.
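
The stateless part of this handshake, where the server issues a challenge, forgets it, and later recognizes its own challenge coming back, can be sketched as below. This is an illustration under stated assumptions, not the protocol itself: HMAC-SHA256 from the standard library stands in for the Poly1305 authenticator, the work is leading zero bits of SHA-256, key agreement is omitted entirely, and every function name is hypothetical.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)  # known only to the server


def _encode(*parts):
    """Length prefix each part so field boundaries are unambiguous."""
    return b"".join(len(p).to_bytes(2, "big") + p for p in parts)


def _tag(client_params, server_params, seq, bits):
    message = _encode(client_params, server_params,
                      seq.to_bytes(8, "big"), bits.to_bytes(1, "big"))
    return hmac.new(SERVER_SECRET, message, hashlib.sha256).digest()


def leading_zero_bits(digest):
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def server_challenge(client_params, server_params, seq, bits):
    """Packet 2: built from the request, after which the server stores nothing."""
    return {"seq": seq, "bits": bits, "server_params": server_params,
            "tag": _tag(client_params, server_params, seq, bits)}


def client_solve(challenge):
    """Packet 3: search for a nonce that gives the demanded zero bits."""
    counter = 0
    while True:
        nonce = counter.to_bytes(8, "big")
        digest = hashlib.sha256(challenge["tag"] + nonce).digest()
        if leading_zero_bits(digest) >= challenge["bits"]:
            return nonce
        counter += 1


def server_verify(client_params, challenge, nonce):
    """Check the echoed tag first, then the work, before allocating anything."""
    expected = _tag(client_params, challenge["server_params"],
                    challenge["seq"], challenge["bits"])
    if not hmac.compare_digest(expected, challenge["tag"]):
        return False
    digest = hashlib.sha256(challenge["tag"] + nonce).digest()
    return leading_zero_bits(digest) >= challenge["bits"]
```

Because the tag covers the difficulty and both parties' parameters, a client cannot lower the demanded work or swap parameters without the check failing. The server only needs `SERVER_SECRET` to validate the reply, so an attacker sending first packets costs it no memory.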
|
||
|
||
### pow implementation

Each sequential proof of work request contains a 64 bit sequential integer.
The integer starts at a random 63 bit value, to ensure that every possible
successful proof of work ever used is unique in the universe. The
sequential integer is treated as a windowed value into a 512 bit integer,
whose high order part is an unshared secret that remains unchanged for the
duration.

From that 512 bit value, the server generates a unique XChaCha20 512 bit
value, 256 bits of which are used to generate a Poly1305 authenticator for
the proof of work request. If it receives a completed proof of work request
containing the authentication, it knows it comes from an entity at that
network address that was able to receive the proof of work request.
Knowing it is talking to real network addresses, it can derank network
addresses that create excessive burdens, so that they cannot slow down
everyone else, only themselves.

When it receives the completed proof of work, it first checks the sequence
number to ensure it is a recently issued request for work, then checks if
there is already a channel allocated for that pow, using a table of doubly
linked lists of recently allocated channels, indexed by the low order part of
the pow sequence number. If it discovers it has already passed that proof of
work and allocated a channel, it moves that proof of work to the head of the
list, so that the next check will be instant, just in case it is about to
receive a million copies of that proof of work. Then it checks for revealed
bits from those generated by XChaCha20. Then it checks the work and the
Poly1305 authentication.

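A minimal sketch of the duplicate check, assuming a table of buckets indexed by the low order bits of the pow sequence number, each bucket kept in most recently seen order. `RecentChannels`, `BUCKET_BITS`, and the use of Python lists in place of doubly linked lists are illustrative assumptions, not the wallet's actual data structures.

```python
BUCKET_BITS = 8  # assumed table size: 2**8 buckets


class RecentChannels:
    """Buckets of recently allocated channels, keyed by pow sequence number."""

    def __init__(self):
        # A Python list stands in for each doubly linked list.
        self.buckets = [[] for _ in range(1 << BUCKET_BITS)]

    def _bucket(self, seq):
        # Index by the low order part of the sequence number.
        return self.buckets[seq & ((1 << BUCKET_BITS) - 1)]

    def lookup(self, seq):
        """Return the channel for seq, moving it to the head of its bucket."""
        bucket = self._bucket(seq)
        for i, (s, channel) in enumerate(bucket):
            if s == seq:
                if i:  # move to front, so a flood of repeats hits at once
                    bucket.insert(0, bucket.pop(i))
                return channel
        return None

    def allocate(self, seq, channel):
        self._bucket(seq).insert(0, (seq, channel))


table = RecentChannels()
table.allocate(0x1234, "channel A")
table.allocate(0x5634, "channel B")     # same low byte, same bucket
first_hit = table.lookup(0x1234)        # found behind B, moved to the head
head_after = table.buckets[0x34][0][0]
miss = table.lookup(0x9934)             # same bucket, never allocated
```

A repeated proof of work is found at the head of its bucket on the second and later lookups, so a million copies cost a million one-step comparisons and no allocation.
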
Checking if there is already a channel allocated overlaps and intersects
with the presence notification protocol. We want to have a very large number
of inactive presences without secrets or network addresses in the database,
a large number of long lived active presences in memory, with secrets that
are not paged to disk (`sodium_allocarray`), and a considerably smaller
number of considerably shorter lived channels with flow control and
buffering. A presence can only exchange short messages that fit in one
packet, and only one message can be active in any round trip time. You
open a presence, and the presence can then open a channel.

We probably want to do the checks in whatever order is empirically most
efficient for the type of DDoS attacks that we encounter in practice, the most
common probably being garbage random values that bear no particular
resemblance to valid connection attempts.

The next problem will be valid connections that then make excessive
demands. These get deranked by the next layer, and they will then have to
make a new connection, which will face increasing pow and discrimination
against their network address.

## replacement handshake for limited circulation server

In this case the server is the gateway for a group, possibly many groups,
whose unique id is not widely known. It is analogous to a closely kept email address.

The TCP replacement handshake needs to be a four phase handshake.

1. Client->Server: Give me a connection, here are my parameters,
    here is a clue about what private group I want to connect to.

1. Server->Client: Here is a proof of work request, my parameters,
    including a use once elliptic point, and a keyed hash of your and
    my parameters. Ask again with proof of work, the same parameters,
    and the keyed hash.

    Server then throws away the request, allocating no memory.

1. Client->Server: OK, here I am again, with all that stuff you asked for.

    At this point, client has given server a clue about which private
    group it wants to connect to, and server has given client a clue
    about which private group it expects membership of, and therefore
    what public key the client should attempt to communicate with.

1. Server checks the keyed hash to ensure that this is a real client
    reply to a real and recent server reply. Then it checks the proof of
    work.

    If the proof of work passes, Server allocates memory.

    Then it generates a transient secret from the konces (keys used
    once, single use elliptic points), and uses it to decrypt the client's
    durable public key, verifying that the client does indeed know the
    transient scalar. If the client durable key is OK and sign on is allowed, it
    constructs a shared secret from all four keys, the sum of two secrets
    multiplying the sum of two elliptic points, and we now have an
    encrypted stream associated with the port number and network addresses.

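The four key shared secret can be checked with a toy model. This is a sketch of the algebra only: it assumes the multiplicative group modulo a prime as a stand-in for the elliptic curve, so "adding two points" becomes multiplying two group elements and "scalar times point" becomes exponentiation. The names `keypair` and `shared_secret` are hypothetical, and nothing here is secure or taken from the wallet code.

```python
import secrets

# Toy stand-in for the curve: integers modulo a prime. NOT secure.
P = 2**255 - 19
G = 2


def keypair():
    """Return (secret scalar, public element)."""
    secret = secrets.randbelow(P - 2) + 1
    return secret, pow(G, secret, P)


def shared_secret(my_transient, my_durable, their_transient_pub, their_durable_pub):
    # (sum of my two secrets) times (sum of their two points)
    point_sum = (their_transient_pub * their_durable_pub) % P
    return pow(point_sum, my_transient + my_durable, P)


client_t, client_t_pub = keypair()   # client konce
client_d, client_d_pub = keypair()   # client durable key
server_t, server_t_pub = keypair()   # server konce
server_d, server_d_pub = keypair()   # server durable key

client_view = shared_secret(client_t, client_d, server_t_pub, server_d_pub)
server_view = shared_secret(server_t, server_d, client_t_pub, client_d_pub)
```

Both sides arrive at the same value because each computes the generator raised to the product of the two sums of secrets, while neither side ever sends a secret.
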
# Summary of the replacement

Thus we can integrate TCP handshake and encryption handshake and the
innumerable DDoS protection handshakes “Cloudflare is checking your browser,
oops, your browser did not pass, here is a captcha” at the cost of one single
additional trip, half a round trip.

Instead of the person establishing the connection fuming while round trip
after round trip goes through, we get all that stuff at the cost of one
additional half round trip.

# messages, not streams

TCP sockets are designed for synchronous procedural programming, on
machines with very limited memory processing limitless streams. They are
now almost always used for message processing from event oriented
asynchronous code, with a messaging layer on top of the endless stream
layer. The replacement needs to have the application layer sending messages
and receiving messages in events. The application layer should not have
to deal with sockets and streams. Rather, it sends a message to a destination
identified by its durable public key, and gets a reply, where the reply
might be that the socket could not be opened, or that the socket was open but
the reply timed out, among other things. When sending a message, there is a
time to wait for response before giving up, and a time for the socket that
may be created to live idle.

# Proposed replacement

[QUIC] is the current TCP replacement. HTTP/3 runs on top of it.

[QUIC]: https://github.com/private-octopus/picoquic

We have no alternative but to interface to the vast HTTP/2 and HTTP/3
ecosystem. The wallet is going to have to talk as a client to legacy
HTTP/3 server devices, and accept their CA certificates, preferably subject to
Zooko scrutiny, and legacy HTTP/3 client devices are going to have to talk
to our wallet (after their wallet has downloaded a Zooko based certificate
from the server wallet).

Talking HTTP/3 means being wide open to DDoS attack, so that you are
forced to use Cloudflare. When a device with our version of QUIC talks to
another device with our version of QUIC, it has to implement our DDoS
resistance, and Zooko in place of CA. But when it talks to a legacy
HTTP/3 device, it has to lay itself wide open to DDoS attack and CA
interception.

Backwards compatibility with insecure systems always creates a massive
security hole. On the one hand, every build from scratch project dies. On
the gripping hand, every attempt to do fax over the internet failed and was
eventually replaced by pdf attachments to email. Backwards compatibility
was simply too crippling, and backwards compatibility with QUIC is
going to cripple security.

Instead of putting the secure system transparently as an alternate protocol
within the insecure system, you non transparently put the insecure system
as a downgrade protocol within the secure system, which means our
version of QUIC simply is not going to talk to older versions of QUIC
unless you take some special measures to tell it to do so or enable it to do
so for that particular communication end point.

The least friction interface would be that every time a new SSL name is
encountered, we get a window saying "This authority claims that this is
this entity. Trust this authority for this entity?" And if there is a change of
authority, complain. Wrap backwards compatibility in Zooko vouched
certificates, pinned certificates, and the CAA record indicating who is the
right issuer for the SSL certificate.

We have to have downgrade capability, but it has to be an afterthought,
slipped in as a special path and special case, as user friendly as possible,
but no friendlier.

QUIC's one way streams are messages.

Its two way streams are backwards compatibility with TCP.

It solves the long fat pipe problem with flexible window size.

It puts multiple objects and messages in one stream, so that one message
does not have to wait for lost packets in another message to be resolved.

TCP flow control is constructed around pushback - that the sender should
not send data faster than the receiver is able and willing to handle it.
Normally there is one thread, or pool of threads, handling the data
received. To prevent DDoS, we should probably only have one unit of
pushback per pair of network addresses. If someone has a slow receiver
thread pool, and a fast receiver thread pool communicating with the same
machine, he needs to break the slow receiver communication into lots of
small requests and replies, hence one channel per pair of network
addresses.

QUIC implements everything you need to have one channel per pair of
network addresses, multiplexing many request-replies into a single stream,
many channels in one channel, but does not in fact implement one channel
per pair of network addresses in the sense of one unit of packet flow
control and one unit of DDoS monitoring, per pair of network addresses.

Finer grained flow control should be implemented as request reply on
messages that may well be much larger than a packet, but much smaller than
memory.

In the request reply model, if the requests and replies are reasonably short,
pushback does not matter, and becomes a representation of flow control. It
is seldom sane to download enormous blocks of data as a single message,
and we probably just should not do it - restrict replies to what can
reasonably fit into memory, so that a very large message that the receiver
is processing one chunk at a time has to get acks of its submessages,
separate from the flow control system.

What the LEMP stack does with request headers is dynamically allocate
8KiB buffers, stuff headers into a part or whole of an 8KiB buffer, and if a
header is bigger than 8KiB, arbitrarily truncate it, which suggests that this
is a tactic to minimize the overheads of dynamically allocating many
moderate sized buffers of variable size. Experimenting, I find that
dynamic allocation tends to be the major cost in many programs, but if
you do it LEMP style, dynamic allocation is unlikely to be a significant cost.

QUIC has a pile of feature bloat:

+ The push feature is married to html, and belongs in the webserver
  and the browser, not in the protocol. Something sending a request
  message should be aware it might have several messages in reply,
  depending on the kind of the request, and simply have a message
  handler that can deal with many messages.

+ We don’t really need the unique and sequential message id if finding and
  interpreting the message id is part of how the response handler handles the
  messages – best to hand that off as far down into the endpoints as possible.

+ Its data format, header and frames, is married to html, which is
  always sending repetitious and redundant information, treating
  related fragments of html as absolutely distinct.
  It implements HTTP specific header compression, QPACK (the HTTP/3
  counterpart of HTTP/2's HPACK).

It suffers from the SSL/TLS problem of a thousand certificate authorities, NSA
friendly encryption, and, being funded in large part by Cloudflare, has no
substantial defense against DDoS.

It fails to support rendezvous routing.

But, it has already struggled with and solved a thousand problems whose
solutions I have been confusedly struggling with. So the obvious solution
is to adopt QUIC, rip out the domain name system, add DDoS resistance,
rip out NSA friendly encryption in favour of the standard and
recommended Libsodium packet encryption (XChaCha20-Poly1305), and, for
immortality, rip out the 62 bit compressed integers in favour of unlimited
precision windowed integers (with a negotiated limit on precision that
will in practice always be 64 bits for the next several centuries).

XChaCha20 is not the fastest on a long stream, but it has key agility, can
encrypt arbitrary length values, including a single bit, and is as
fast as ChaCha20 without any limits on the nonce.

QUIC’s messaging is excessively married to HTTP. We need a generic
messaging system where every message has a short number indicating
the destination handler, and you can generate a handler, a code continuation,
and get a number assigned to it on the fly, so that you can send a message,
and the reply goes to your code continuation.

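A minimal sketch of handler numbers, assuming a per connection table that hands out small integers on the fly. `HandlerTable`, `register`, `dispatch`, and the message fields are hypothetical names chosen for illustration.

```python
import itertools


class HandlerTable:
    """Maps short numbers to handlers (code continuations)."""

    def __init__(self):
        self._next = itertools.count(1)
        self._handlers = {}

    def register(self, handler, once=True):
        """Assign the next free number to handler and return it."""
        number = next(self._next)
        self._handlers[number] = (handler, once)
        return number

    def dispatch(self, number, payload):
        """Deliver an incoming message to the handler its number names."""
        entry = self._handlers.get(number)
        if entry is None:
            return False            # unknown or expired handler: drop
        handler, once = entry
        if once:
            del self._handlers[number]
        handler(payload)
        return True


table = HandlerTable()
replies = []

# Sending side: create a continuation on the fly, and put its number in
# the outgoing request so that the peer can address the reply to it.
reply_to = table.register(lambda payload: replies.append(payload))
outgoing = {"handler": 7, "reply_to": reply_to, "body": b"balance?"}

# Later the reply arrives, carrying the number we handed out.
delivered = table.dispatch(outgoing["reply_to"], b"42 coins")
repeated = table.dispatch(outgoing["reply_to"], b"42 coins")
```

The second dispatch is dropped because a single use continuation is removed from the table when it fires.
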
We need to lift as much of the [QUIC] design as possible, and also make things
act much like TCP, so that existing NATs will not notice anything has
changed. Thus packets will continue to be sent to and from a widely known
port that is usually below 1024 on the server, from a random port on the
client in the range 49152--65535. A connection will continue to require a
three phase handshake which creates a socket, albeit our sockets will be very
different.

With a rendezvous, both peers will use the same port number in the range
1024--49151.

The rendezvous handshake will look like the TCP handshake Syn Syn-Ack Ack,
but they will both send syn packets, both send syn-ack packets, and both
send ack packets. Their syn packets will be timed so that, if the timing
is done right, both are sent just before the other peer’s packet is
expected to be received.

Our sockets will always have a shared secret associated, which proves
identity and enables encrypted communication, but which cannot be used to
prove identity to a third party. The initial handshake will exchange
transient public keys, which will generate a transient shared secret,
which is used to encrypt the exchange of durable public keys, which
establish a shared secret based on both the durable and transient keys,
establishing forward secrecy, and failing to establish identity to third
parties.

Since setting up a shared secret is costly, this creates the opportunity for
syn flood attacks, therefore the syn-ack will always be a syn cookie,
structured rather like existing syn cookies, a cryptographic hash of the syn
based on an unshared secret known only to the server, plus it will always
have a proof of work request, which may be zero, and it will have a list of
supported protocols if the protocol proposed in the initial syn is
unacceptable. The proof of work will be that the hash of the client ack
must have a certain number of zeros, and the ack
must contain the cryptographic cookie, and the data that the server checks
the cookie against.

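A minimal sketch of the stateless cookie and the proof of work check. It assumes keyed BLAKE2b from the Python standard library as a stand-in for the keyed hash (the real thing would be the Libsodium primitives), counts trailing zero bits, and uses `repr` as a placeholder serialization. All function and field names are hypothetical.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)   # unshared secret, known only to the server


def make_cookie(client_addr, client_params, server_params):
    """Keyed hash over everything the server declines to remember."""
    data = repr((client_addr, client_params, server_params)).encode()
    return hashlib.blake2b(data, key=SERVER_SECRET, digest_size=16).digest()


def trailing_zero_bits(digest):
    value = int.from_bytes(digest, "big")
    if value == 0:
        return len(digest) * 8
    return (value & -value).bit_length() - 1   # index of lowest set bit


def ack_hash(ack_fields, counter):
    data = repr(ack_fields).encode() + counter.to_bytes(8, "big")
    return hashlib.blake2b(data, digest_size=32).digest()


def do_work(ack_fields, difficulty):
    """Client: search for a counter that makes the ack hash end in zeros."""
    counter = 0
    while trailing_zero_bits(ack_hash(ack_fields, counter)) < difficulty:
        counter += 1
    return counter


def server_accepts(client_addr, ack_fields, counter):
    """Server: recompute the cookie from the echoed data, then check the work."""
    client_params, server_params, cookie = ack_fields
    expected = make_cookie(client_addr, client_params, server_params)
    if not hmac.compare_digest(cookie, expected):
        return False
    difficulty = server_params[1]   # authenticated by the cookie
    return trailing_zero_bits(ack_hash(ack_fields, counter)) >= difficulty


addr = ("203.0.113.5", 50000)
client_params = ("protocol-1",)
server_params = ("protocol-1", 8)            # protocol, zero bits demanded
cookie = make_cookie(addr, client_params, server_params)   # goes in the syn-ack
ack = (client_params, server_params, cookie)                # comes back in the ack
counter = do_work(ack, server_params[1])
accepted = server_accepts(addr, ack, counter)
spoofed = server_accepts(("198.51.100.9", 50000), ack, counter)
```

The server keeps nothing between the syn-ack and the ack. Because the demanded difficulty is covered by the cookie, the client cannot lower it, and an ack arriving from a different network address fails the cookie check.
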
TCP was designed around the case of the client sending an endless stream of
characters, typed with one finger, to a program on the server. We are
going to design around message response, with responses not necessarily
returning in order.

The client sends a message from a durable public key to a durable
public key. The creation and destruction of such connections is not
tightly linked to messaging. If a connection exists, it is used. If it does
not exist, it is created. It may be torn down after a while of being
unused, but the tear down is not tightly linked to message completion.

In TCP a count is kept of bytes sent and bytes received, with a syn or a
fin counting as one byte.

We need a count for each packet, since packets can arrive out of order,
repeated, or missing. The count values will be sequential nonces for the
encryption, and will start at one. As the count can potentially grow
quite large, the count value will be windowed, but, unlike TCP, the
windowed count represents a potentially much larger absolute count known
by both ends.

Negotiating a window size is hard, since you do not really know in advance
what window size will be needed. The thirty two bit window is adequate for
all normal uses, but fails in special and important uses.

We will specify the window size in each packet, with the high order bit of
each byte in the nonce indicating whether there is another seven bits in
the nonce window, so that we can dynamically adjust the window size. We
dynamically adjust the window size to be big enough to exclude ambiguity,
which for the first 128 packets, and on a connection that is not very busy,
all packets, will be seven windowed count bits and one window size bit.

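A minimal sketch of that encoding, assuming the most significant seven bit group is sent first and that the high order bit is set on every byte except the last. The byte order and the names `encode_windowed` and `decode_windowed` are assumptions for illustration.

```python
def encode_windowed(nonce, window_bits):
    """Send the low window_bits of nonce, seven bits per byte.

    window_bits is rounded up to a multiple of seven. The high order bit
    of every byte except the last is set, meaning seven more bits follow.
    """
    groups = max(1, -(-window_bits // 7))       # ceiling division
    out = bytearray()
    for i in reversed(range(groups)):           # most significant group first
        seven = (nonce >> (7 * i)) & 0x7F
        out.append(seven | (0x80 if i else 0))
    return bytes(out)


def decode_windowed(data):
    """Return (windowed value, window mask, bytes consumed)."""
    value = 0
    for used, byte in enumerate(data, start=1):
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, (1 << (7 * used)) - 1, used
    raise ValueError("truncated windowed nonce")


one_byte = encode_windowed(5, 7)                          # quiet connection
two_bytes = encode_windowed(300, 14)                      # wider window
wide = decode_windowed(two_bytes)
narrow = decode_windowed(encode_windowed(300, 7))         # only low 7 bits sent
```

The receiver learns the window mask from the number of bytes it had to read, which is the sense in which the window size travels in unary with each packet.
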
The window needs to be large enough to exclude the ambiguity of delayed
and duplicated packets wandering in late, so has to be several times
larger than the difference between the most recently acked value, and
the value that will fill the reception window. Thirty two times larger
should be ample. At the start, there are no early packets capable of
wandering in late, so big enough to hold the full count always suffices.

If `a` represents a recent nonce, `n`
represents the nonce, `w` represents the windowed nonce, and
`M` represents the window mask, communicated in each packet in
unary, then:

`w = n&M`

`n = ((w - a)&M) + a`

We use a window large enough to give the same answer on both the most
recently acked nonce, and the most recently sent nonce.

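A worked example of the two formulas, as a sketch. It assumes the true nonce is no smaller than the reference nonce `a` and no more than `M` beyond it; the function names are hypothetical.

```python
def window(n, mask):
    """w = n & M"""
    return n & mask


def reconstruct(w, a, mask):
    """n = ((w - a) & M) + a, valid whenever a <= n <= a + M."""
    return ((w - a) & mask) + a


n = 1100                  # true nonce of the packet
acked, sent = 1000, 1090  # most recently acked and most recently sent nonces

# An eight bit window gives the same answer from both reference points.
from_acked = reconstruct(window(n, 0xFF), acked, 0xFF)
from_sent = reconstruct(window(n, 0xFF), sent, 0xFF)

# A six bit window is too small: the two reference points disagree.
small_from_acked = reconstruct(window(n, 0x3F), acked, 0x3F)
small_from_sent = reconstruct(window(n, 0x3F), sent, 0x3F)
```

Disagreement between the two reference points is the signal that the window has to grow by another seven bits.
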
The nonce will serve the dual purpose of enabling the decryption of each
packet, and flow control. Each packet has a sequential nonce, we make sure
all packets are acked. Nonces on packets coming from the client refer to a
different shared secret than nonces on packets coming from the server.

## API

To send a message, you will construct a response handler if you are
expecting a response, and then call the api with a network address, a
public key of the recipient, an identifying secret key and public key of
the sender, a timeout for attempting to connect, and flags permitting
direct connection, rendezvous connection, retransmit, and store and
forward. If a response is expected for the message, give the expected
lifetime for the response handler, a nonce for the response handler and a
class identifier for the nonce (the nonce only has to be unique within
the class). You will probably use a different nonce population for
messages that have to be handled promptly, messages that have to be
handled within a session, and non volatile nonces that survive between
sessions. Nonce populations can be windowed per class identifier, with a
window large enough to accommodate the timeout, and a different class
identifier for volatile and non volatile nonces. The nonce is used once
within a window and within a class, but can be re-used in another class
and another window.

The application code is event oriented, like gui code. It is driven by a
message pump, with constructors creating event handlers, and the events
driving the event handler through the message pump, and an event handler, on
being fired, creates new event handlers and fires old event handlers.

When the application needs to perform a task that spans many events, it does
not call `yield` or `await`, but instead the event handler for each event
constructs or enables the next event handler. If it needs to push information
onto a stack between events, it has its own explicit stack for its own multi
event task, or creates a linked list of event handlers. Non volatile event
handlers must be trivial C++ classes, therefore cannot contain a `std::stack`.

State that would be on the stack in synchronous code is in the event
handler in asynchronous code. This potentially gets messy if you are
processing an endless stream of structured data whose structure is
orthogonal to message boundaries. Since we allow arbitrary length
messages, don’t do that.

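A minimal sketch of a multi event task driven by a message pump, where each handler enables the next and the state lives in the task object rather than on a call stack. `MessagePump`, `FetchTwoParts`, and the event names are hypothetical, and Python stands in for the C++ the wallet would use.

```python
from collections import deque


class MessagePump:
    def __init__(self):
        self.queue = deque()
        self.handlers = {}

    def on(self, event, handler):
        self.handlers[event] = handler

    def post(self, event, data=None):
        self.queue.append((event, data))

    def run(self):
        while self.queue:
            event, data = self.queue.popleft()
            handler = self.handlers.pop(event, None)   # handlers fire once
            if handler:
                handler(data)


pump = MessagePump()
log = []


class FetchTwoParts:
    """A task spanning two events. Its state is in the object, not on a stack."""

    def __init__(self, pump):
        self.pump = pump
        self.parts = []
        pump.on("part1", self.got_part1)        # enable the first handler only

    def got_part1(self, data):
        self.parts.append(data)
        self.pump.on("part2", self.got_part2)   # construct the next handler

    def got_part2(self, data):
        self.parts.append(data)
        log.append(b"".join(self.parts))


FetchTwoParts(pump)
pump.post("part2", b"too early")   # no handler enabled yet, so it is dropped
pump.post("part1", b"hello ")
pump.post("part2", b"world")
pump.run()
```

No handler ever blocks or yields. The task object stays alive only because the pump holds a reference to its next handler.
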
Notification of message failure may occur any time within the lifetime of
the response handler, but will mostly happen within the timeout for
attempting to connect.

The usual flow of control will be to create an event handler, assign a nonce
to it (fire it) and then it gets triggered when the event actually
happens, and is then usually destroyed. Events will usually create and
fire new events and trigger events that existed before they were created,
rather than changing their state.

Below the api, additional messages, using low numbered message response
classes, may be constructed for encryption and flow control. If an
encrypted connection exists, it will use that without constructing
additional messages. If it does not exist, it will construct it.

Constructing an encrypted connection provides perfect forward secrecy
between one connection and the next by generating new random session keys
each time.

## Reliability and flow control

TCP achieves reliable transmission with acks and nacks.

The original design simply acked that all bytes (not exactly bytes, because
syn and fin each count as one) had been received up to a certain byte. If the
transmitter has transmitted stuff, and not received an ack for what it
transmitted, it sends a nack, after a timeout. The receiver may resend acks.

This mechanism worked fine on short thin pipes, but if you have a million
packets in flight, and packet three hundred thousand gets lost, you then
have to resend seven hundred thousand packets to replace one packet. So the
duplicate ack possibility was tortured to create a half assed version of
selective acknowledgment. If the receiver receives packet 100, and 101,
but not packet 99, it sends duplicate acks for packet 98. If the sender
receives three duplicate acks for packet 98, it retransmits packet 99. (Two
duplicate acks could be just the normal randomness.)

[QUIC], however, has a fix for this built in.

Obviously true selective acknowledgment is better. The receiver acks the
most recent received packet, and sends a list of missing packets prior to
this (acks a windowed value for the most recent packet, and the difference
between packet nonces for missing packets). The sender resends the missing
packets, except for the most recent missing packets. If they are still
missing, they will be caught on the next ack.

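A minimal sketch of that selective acknowledgment, assuming the receiver remembers a `floor` below which everything has already been acked, and the sender skips gaps within a small `reorder_slack` of the newest packet. Both parameters and both function names are assumptions, not part of any existing library.

```python
def build_ack(received, floor):
    """Receiver: newest nonce seen, plus gaps as distances back from it.

    Everything up to and including floor was acked earlier.
    """
    newest = max(received)
    missing = [newest - n for n in range(floor + 1, newest) if n not in received]
    return newest, missing


def to_resend(ack, reorder_slack):
    """Sender: resend reported gaps, except the most recent ones, which
    may merely be arriving out of order."""
    newest, missing = ack
    return sorted(newest - d for d in missing if d > reorder_slack)


received = {98, 100, 101, 104}           # 99, 102 and 103 have not arrived
ack = build_ack(received, floor=97)
resend_now = to_resend(ack, reorder_slack=2)
```

Packets 102 and 103 are left alone for now. If they are still missing when the next ack is built, they will have fallen outside the slack and will be resent then.
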
In each ack, the receiver tells the sender how much more data it can
receive before it sends the next ack. This prevents the receiver from
being flooded, but a more common problem is the pipe being flooded.

To handle pipe flooding, the sender has a timer. If it sends stuff, and
does not get an ack, it backs off, it sets the timer to a slower rate, and
retransmits with a nack. The initial value of the timer is
$SRTT + \max(G, 4 \times RTTVAR)$, where $SRTT$ is the smoothed round trip
time, $RTTVAR$ is the round trip time variance, and $G$ is the clock
granularity.

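A minimal sketch of the timer calculation, using the smoothing constants of the standard TCP retransmission timer (RFC 6298: gain 1/8 for the smoothed round trip time, 1/4 for the variance, multiplier 4). The class name is hypothetical, and the one second floor that RFC 6298 also recommends is left out.

```python
import math


class RetransmitTimer:
    ALPHA, BETA, K = 1 / 8, 1 / 4, 4

    def __init__(self, granularity):
        self.g = granularity      # clock granularity G, in seconds
        self.srtt = None          # smoothed round trip time
        self.rttvar = None        # round trip time variance

    def sample(self, rtt):
        """Fold in one round trip measurement and return the new timeout."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.timeout()

    def timeout(self):
        return self.srtt + max(self.g, self.K * self.rttvar)


timer = RetransmitTimer(granularity=0.010)
first = timer.sample(0.100)    # 0.100 + 4 * 0.050
second = timer.sample(0.100)   # variance decays on a steady pipe
backed_off = second * 2        # doubling after a timeout with no ack
```

On a steady pipe the variance term shrinks with each sample, so the timeout converges towards the round trip time plus the clock granularity.
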
TCP flow control focuses on getting a segment complete and acknowledged,
so it can move on to the next segments. It may have a great many packets
in flight, but does not have too many segments in flight. The backoff
algorithm is linked with the push segments algorithm. You only push the
segment the receiver has asked for in his previous acknowledgment. So you
typically have the segment you are finalizing, the segment that is in
flight, and the segment that the receiver asked for.

The algorithm is that the sender gets an ack that acknowledges what the
receiver has received, and tells the sender how much more the receiver can
receive. Whereupon the sender resends anything missing, and resumes pushing
new stuff up to the limit that the receiver has specified, spread out
roughly evenly over the timer period. Which implies that the receiver
should ask wisely, as well as the sender send wisely.

Implementing our own flow control sounds like a lot of work. Need to lift
[QUIC]’s flow control, and drop our own encryption and attack resistance
into it, while letting it worry about flow control. I can hack into its library,
while I cannot hack into the TCP library.

I have been analysing how TCP works, with a view to what needs fixing. Time to
analyse how something works for which I have a library and example code.

Best (because smallest and least married to HTTP3) is [picoquic].

[picoquic]: https://github.com/private-octopus/picoquic

The TCP state machine assumes that the server opens a connection on receiving
a syn, sends a syn-ack to the client, whereupon the client acks the
connection. But if we are using syn cookies, we are using a different state
machine, where the connection is in fact only opened on receiving the server
syn-ack cookie in the client ack. So the server has to acknowledge the
connection, which would make it a four step handshake instead of a three step
handshake. To avoid this, we have a rule that the client only opens a
connection when it has data ready to send. It then gets a server cookie, and
sends the cookie-ack with some data, which data the server acks.

With the cookie ack, we get a round trip time and offset between server
steady time and client steady time. If we see unstable round trip times,
we suspect the pipe is overloaded, and back off our estimate of max
bandwidth. For flow control, we maintain an estimate of pipe length and
width. Sudden pipe widenings indicate an overflow condition, because pipes
may respond to overflow by massively discarding packets, or massively
backing up packets, or quite possibly both. We maintain a probability
estimate of the pipe behaviour.

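A minimal sketch of how one exchange yields both numbers. It assumes the path delay is the same in both directions and that the peer stamps its reply immediately, which is the usual simplification; the function name and the example times are hypothetical.

```python
import math


def estimate(sent_local, remote_stamp, received_local):
    """Return (round trip time, remote clock minus local clock).

    sent_local and received_local are read from our own steady clock,
    remote_stamp is the peer's steady time carried in its reply.
    """
    rtt = received_local - sent_local
    offset = remote_stamp - (sent_local + received_local) / 2
    return rtt, offset


# Client sends at 10.000 on its own clock. The server clock runs 500 s
# ahead, each direction takes 40 ms, and the reply arrives at 10.080.
rtt, offset = estimate(10.000, 510.040, 10.080)
```

Steady clocks have arbitrary origins, so only the offset and changes in the round trip time carry information. A run of samples whose round trip time jumps around is the instability that suggests an overloaded pipe.
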
## Outline protocol

A packet protocol that establishes an encrypted connection on top of
unreliable packets with minimal round trips without increasing fragility to
DoS.

For servers, public keys, globally human readable names, the key owning the
name, and the temporary key signed by the key owning the name, will usually
be public and widely known, but this also supports the case of
communication where this information is only known to the parties, and the
server does not want to make the connection between a network address and a
public key widely known.

To establish a connection, we need to set a bunch of values specific to
this particular channel, and also create a shared secret that
eavesdroppers and active attackers cannot discover.

The client is the party that initiates the communication, the server is
the party that responds.

I assume a mode that provides both authentication and encryption – if a
packet decrypts into a valid message, this shows it originated from an
entity possessing the shared secret. This does not provide signing – the
recipient cannot prove to a third party that he received it, rather than
making it up.

For the moment I ignore the hard question of server key distribution,
glibly invoking Zooko’s triangle without proposing an implementation of
the other two points and three sides of the triangle or a solution to the
problem of managing distributed reputations in Zooko’s triangle. (Be
warned that whenever people charge ahead without solving the key
distribution problem, the result is a disaster.)

1. Client 🠆 Server: Equivalent to the syn of the three phase TCP
handshake.

> Client’s network address and port on which client will receive
> packets, protocol identifier, and client steady time that the
> message was sent.

If the requested protocol is not OK, we go into protocol negotiation,
server responds with a list of protocols and protocol versions that it will
accept, in the form of a list of lists of numbers.

Assuming it is OK, which it probably will be, server allocates nothing,
prepares nothing, but sends the equivalent of a TCP syn-ack cookie,
containing, among other things, a cryptographic hash of the information
that was received and sent, based on a private secret known only to the
server. It sends a transient public key, which changes every few minutes
or so, plus a short windowed id for that transient public key, and a demand
for proof of work, which may be zero. The proof of work is that the
client’s ack, equivalent of the third phase of the TCP handshake, has to
hash to a value ending in `n` zero bits, where `n`
may be zero.

This cryptographic hash based on an unshared secret will be sent to client,
and then back to server, unchanged. Its function is to avoid the necessity for
the server to allocate memory or perform asymmetric cryptographic operations
for a client that has not yet validated. Instead the state information is sent
back and forth.

1. Server 🠆 Client: Equivalent to the syn-ack of the three phase TCP handshake.

Cryptographic hash based on unshared secret, server steady time,
transient public key, server windowed identifier of server transient
public key, proof of work demand, and any channel parameters.

The proof of work is trivial if the server is not under load, but is
increased as the server load approaches the maximum the server is
capable of, in order to throttle demand.

Client computes transient handshake shared secret as its transient private
key times the server shared transient public key. It returns in the clear
a copy of the cryptographic hash that the server sent to it, the data in
the clear needed to validate the hash, performs the proof of work, and
sends its public key, which may be a per server durable public key, always
used when accessing this server on this identity, encrypted using the
transient key, and the public key it wants to talk to on the server.

Subsequent information is not encrypted using the transient keys, but using
the sum of transient plus secret keys.

This implies that the client has to know the public key that the server is
|
||
using, which may be a key signed by the master public key that owns the
|
||
name authorizing that new key, which key changes about as often as the
|
||
server IP changes, and is therefore distributed in the same channel as the
|
||
network address associated with global human names is distributed. If the
|
||
client gets it wrong, then the server ignores the information encrypted to
|
||
the wrong public key, and responds with the authentication of its new
|
||
public key, signed by the master public key of its globally unique name,
|
||
encrypted using the transient secret – this is usually public information,
|
||
but since by this point we have established a shared secret and allocated
|
||
memory, might as well send it securely, for sometimes it is going to be
|
||
private information.
|
||
|
||
1. Client 🠆 Server: Equivalent to the final ack of the three phase TCP
|
||
handshake.
|
||
|
||
Sends in the clear server hash as received, any data needed to
|
||
reconstruct the hash, and transient secret key. Then, encrypted to
|
||
transient keys, the hash of the identifier of the public key it wants to
|
||
talk to, its durable public key, and client steady time at which this was
|
||
sent, so that both sides have an estimate of the round trip time and the
|
||
offset between server steady time and client steady time.
|
||
|
||
Server checks the proof of work, checks the cryptographic hash against the
|
||
data in the clear, *then* creates an entry in its hash table for this
|
||
connection, with the shared secret being the transient keys plus the public
|
||
keys.
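
The stateless check described above can be sketched as follows. This is a minimal illustration, not the wire format: the field layout, the 32 byte key length, the use of HMAC-SHA256, and all function and parameter names are assumptions made for the example. It treats the "transient public key" in the hash as the client's transient key, which is also an assumption.

```python
import hashlib
import hmac
import os
import struct
import time

# Hypothetical: the "unshared secret", known only to the server.
SERVER_SECRET = os.urandom(32)


def make_cookie(server_time_ms: int, client_transient_pub: bytes,
                key_window_id: int, pow_bits: int, params: bytes = b"") -> bytes:
    """Hash sent in the syn-ack. The server stores nothing per client."""
    if len(client_transient_pub) != 32:  # fixed length keeps the encoding unambiguous
        raise ValueError("expected a 32 byte transient public key")
    msg = struct.pack(">QIB", server_time_ms, key_window_id, pow_bits)
    msg += client_transient_pub + params
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).digest()


def check_cookie(cookie: bytes, server_time_ms: int, client_transient_pub: bytes,
                 key_window_id: int, pow_bits: int, params: bytes = b"",
                 now_ms=None, max_age_ms: int = 30_000):
    """Recompute the hash from the data the client returned in the clear.

    Returns the elapsed server steady time in ms (a round trip estimate)
    if the cookie is genuine and fresh, otherwise None.
    """
    expected = make_cookie(server_time_ms, client_transient_pub,
                           key_window_id, pow_bits, params)
    if not hmac.compare_digest(cookie, expected):
        return None
    if now_ms is None:
        now_ms = time.monotonic_ns() // 1_000_000  # steady clock, not wall clock
    age = now_ms - server_time_ms
    if age < 0 or age > max_age_ms:
        return None
    return age
```

Only after `check_cookie` succeeds would the server create its hash table entry, so a flood of forged or replayed final acks costs one hash each and no memory.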

We have two protocols, one for the authenticated phase, and one for the
unauthenticated phase. The client has to know one of the unauthenticated
protocols offered by the server, or else protocol negotiation will fail in
the abnormal case that protocol negotiation is needed. Normally there will
only be one protocol for secured but unauthenticated communication during
setup, but we make provision by having two protocols, trivially different,
and three protocols, trivially different, for the authenticated phase.

You will notice that the server only allocates memory and performs asymmetric
encryption computation *after* the client has successfully performed proof of
work and shown that it is indeed capable of receiving data sent to the
advertised network address.
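
A proof of work demand that is trivial when idle and stiffens near capacity could look like the sketch below. The hash (SHA-256), the leading zero bits puzzle, the fourth power load curve, the 20 bit ceiling, and every name here are assumptions chosen for illustration; the text above only requires that difficulty rise as load approaches the maximum.

```python
import hashlib
from itertools import count


def difficulty_bits(load: float, max_bits: int = 20) -> int:
    """Map server load in [0, 1] to a number of required leading zero bits."""
    load = min(max(load, 0.0), 1.0)
    # Assumed curve: stays near zero until the server is heavily loaded.
    return round(max_bits * load ** 4)


def _leading_zero_bits(digest: bytes) -> int:
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: bytes, nonce: int, bits: int) -> bool:
    """Server side: one hash, no stored state."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return _leading_zero_bits(digest) >= bits


def solve(challenge: bytes, bits: int) -> int:
    """Client side: expected work doubles with each extra bit demanded."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce
```

Using the server's syn-ack hash as `challenge` would tie the work to one handshake attempt, so solved puzzles cannot be stockpiled in advance.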

In the normal case, the client requests one way authenticated encryption in
the syn, where the server authenticates but the client does not, and the
server may, and usually will, offer in the syn-ack only two way
authenticated encryption, where the client provides an identity unique to
that server and user’s current default name, but which cannot be used to
identify the default name, nor the same user accessing a different
website. This allows the server to see that the same user is accessing
different resources, how many uniques the server has, and what each unique
is doing, but does not enable the servers to put their heads together and
see that the same user is doing things on one server, and also on another
server.
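
One way to get an identity that is stable per server yet unlinkable across servers is to derive a separate key seed for each (default name, server) pair from a single master secret. The sketch below shows only that derivation; HMAC-SHA256 as the derivation function and all names are assumptions, and the seed would then feed whatever key generation the chosen signature scheme uses, which is not shown.

```python
import hashlib
import hmac


def per_server_seed(master_secret: bytes, default_name: str,
                    server_id: bytes) -> bytes:
    """Derive the seed for the durable key used with one server.

    Same inputs always give the same seed, so the server sees one stable
    identity. Without master_secret, seeds for different servers cannot be
    linked to each other or to default_name.
    """
    name = default_name.encode("utf-8")
    # Length prefix keeps (name, server_id) pairs from colliding.
    msg = len(name).to_bytes(2, "big") + name + server_id
    return hmac.new(master_secret, msg, hashlib.sha256).digest()
```

Two servers comparing the public keys a user presented to each would see unrelated values, which is the property the paragraph above asks for.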

Now we have a shared secret, protocol negotiated, client logged in, in
one round trip plus the third one way trip carrying the actual data – the
same number of round trips as when setting up an unencrypted
unauthenticated TCP connection.

You will notice there is no explicit step checking that both have the
same shared secret – this is because we assume that each packet sent is
also authenticated by the shared secret, so if they do not have the same
secret, nothing will authenticate.
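
The implicit key confirmation can be illustrated with per packet authentication tags. This sketch shows authentication only, not encryption, and the packet layout (8 byte sequence number, payload, 16 byte truncated HMAC-SHA256 tag) and the names are assumptions for the example.

```python
import hashlib
import hmac

TAG_LEN = 16  # assumed tag length in bytes


def seal(shared_secret: bytes, seq: int, payload: bytes) -> bytes:
    """Attach a tag computed over the sequence number and payload."""
    header = seq.to_bytes(8, "big")
    tag = hmac.new(shared_secret, header + payload, hashlib.sha256).digest()
    return header + payload + tag[:TAG_LEN]


def open_packet(shared_secret: bytes, packet: bytes):
    """Return (seq, payload) if the tag verifies, else None."""
    if len(packet) < 8 + TAG_LEN:
        return None
    body, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(shared_secret, body, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        return None
    return int.from_bytes(body[:8], "big"), body[8:]
```

If the two sides derived different secrets, the first data packet fails `open_packet`, which serves as the missing confirmation step at no extra round trip.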

# Critiques of TCP/SSL

TLS does the job so badly that using a different method is just as plausible.
People fight to avoid TLS already; they’d rather send stuff in the clear if
they could. So just solve the problems they have.

In Web Services we frequently require message layer security in addition to
transport layer security because a Web Service transaction might involve more
than two endpoints and messages that are stored and forwarded etc. This is why
WS-\* is not TLS. (It is unfortunately horribly baroque but that was not my
doing).

The problem that occurred with TLS was the assumption that the job
was to secure the reliable stream connection mechanics of TCP. False
assumption.

Pretty much nobody uses streams by design; they use datagrams. And they use
them in a particular fashion: request-response. Where we went wrong with TCP
was that this was the easiest way to handle the mechanics of getting the
response back to the agent that sent the request. Without TCP, one had to deal
with the raw incoming datagrams and allocate them to the different sending
agents.
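
The bookkeeping that TCP was being used to avoid is small. The sketch below allocates incoming datagrams to the agent that sent the matching request by tagging each request with an identifier; the 4 byte identifier prefix, the class, and its method names are assumptions for illustration, and retransmission and timeouts are left out.

```python
import itertools
import struct
from typing import Callable, Dict


class RequestDemux:
    """Match each incoming datagram to the agent that sent the request."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send                  # e.g. a bound UDP socket's send
        self._ids = itertools.count(1)
        self._pending: Dict[int, Callable[[bytes], None]] = {}

    def request(self, payload: bytes,
                on_reply: Callable[[bytes], None]) -> int:
        """Send a request and remember who is waiting for the answer."""
        req_id = next(self._ids)
        self._pending[req_id] = on_reply
        self._send(struct.pack(">I", req_id) + payload)
        return req_id

    def datagram_received(self, datagram: bytes) -> bool:
        """Deliver a reply; returns False for unknown or duplicate ids."""
        if len(datagram) < 4:
            return False
        (req_id,) = struct.unpack(">I", datagram[:4])
        handler = self._pending.pop(req_id, None)
        if handler is None:
            return False
        handler(datagram[4:])
        return True
```

Replies may arrive in any order, and a duplicate reply is dropped because its identifier is no longer pending.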

A second problem was that the design was too intertwined with commercial PKI
so certs were hung on the side as a millstone for server authentication and
discarded on the client side, leaving passwords to fill that gap. A mess, which
is an opportunity for redesign, frequently exploited by many designs already.

SSL came at this and built a message (record) interface on top of TCP (because
that was convenient for defining a crypto layer), and then a (mainly) stream
interface on top of its message interface – because programmers were by now
familiar with streams, not records.

And so … here we are. Living in a city built on top of generations of
older cities. Dig down and see the accreted layers.

What *is* the “right” (easiest to use correctly, hardest to use
incorrectly, with good performance, across a large number of distinct
application APIs) underlying interface for a secure network link? The fact
that the first thing pretty much all APIs do is create a message structure
on top of TCP makes it clear that “pure stream” isn’t it. Record-oriented
designs derived from 80-column punch cards are unlikely to be the answer
either. What a “clean slate” interface would look like is an interesting
question, and perhaps it’s finally time to explore it.

# General and unorganized comments

µTP, Micro Transport Protocol, is a Bittorrent near drop in replacement for TCP
that provides lower priority bulk downloads in the background. The library is
not well documented (header file plus examples), but as far as I can see, it
provides a reasonably clean separation between Bittorrent and the transport
mechanism.

Google has a TCP/SSL replacement, [QUIC], which avoids round tripping and
renegotiation by integrating the security layer with the reliability layer,
and by supporting multiple asynchronous streams within a stream.

Layering a new peer-to-peer packet network over the Internet is simply
what the Internet is designed for. UDP is broken in a few ways, but not
in ways that can’t be fixed. It’s simply a matter of time before a new virtual
packet layer is deployed – probably one in which authentication and
encryption are inherent.

For authentication and encryption to be inherent, it needs to connect
between public keys, and needs to be based on Zooko’s triangle. It also
needs to penetrate firewalls, and do protocol negotiation with an
unlimited number of possible protocols – avoiding the internet names and
numbers authority.

Ian Grigg: “Good protocols divide into two parts, the first of which says
to the second, trust this key completely!”.

This might well be the basis of a better problem factorization than the
layer factorization – divide the task by the way trust is embodied, rather
than on the basis of layered communication.

Trust is an application level issue, not a communication layer issue,
but neither do we want each application to roll its own trust cryptography
– which at present web servers are forced to do. (Insert my standard rant
against SSL/TLS).

Most web servers are vulnerable to attacks akin to the session cookie
fixation attack, because each web page reinvents session cookie handling,
and even experts in cryptography are apt to get it wrong.

The correct procedure is to generate and issue a strongly unguessable
random https only cookie on successful login, representing the fact that
the possessor of this cookie has proven his association with a particular
database record, but very few people, including very few experts in
cryptography, actually do it this way. Association between a client
request and a database record needs to be part of the security system. It
should not be something each web page developer is expected to build on top
of the security system.
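
The procedure described above can be sketched with the Python standard library. The in-memory `SESSIONS` table, the cookie name `session`, and the function names are assumptions for the example; a real server would keep the association in its database and expire it.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical store: token -> database record id of the logged in user.
SESSIONS = {}


def issue_session_cookie(user_id: int) -> str:
    """Call only after a successful login. Returns a Set-Cookie value.

    The token is generated by the server, never taken from the client,
    which is what defeats session fixation.
    """
    token = secrets.token_urlsafe(32)      # 256 bits of randomness
    SESSIONS[token] = user_id
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["secure"] = True     # sent over https only
    cookie["session"]["httponly"] = True   # not readable from page scripts
    cookie["session"]["samesite"] = "Strict"
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


def user_for_request(cookie_header: str):
    """Map a request's Cookie header back to the database record, or None."""
    morsel = SimpleCookie(cookie_header).get("session")
    return SESSIONS.get(morsel.value) if morsel else None
```

Putting this in one shared place, rather than in each page's code, is the point of the paragraph above.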

TCP constructs a reliable pipeline stream connection out of unreliable
packet connections.

There are a bunch of problems with TCP. No provision was made for
protocol negotiation and so any upgrade has to be fully backwards
compatible. A number of fixes have been made, for example the long
fat pipe problem has been fixed by window size negotiation, which is semi
incompatible and leads to flaky behaviour with old style routers, but the
transaction problem remains intolerable. The transaction problem has
been reduced by protocol level workarounds, such as “Keep alive” for HTTP,
but these are not entirely satisfactory. The fix for syn flooding
works, but causes some minor unnecessary degradation of performance under
syn flood attacks, because the syn cookie is limited to 48 bits – needs to
be 128 bits both to deal with the syn flood attack, and to prevent TCP
hijacking.

TCP is inefficient over wireless, because interference problems are
rather different to those provided for in the TCP model. This
problem is pretty much insoluble because of the lack of protocol
negotiation.

There are cases intermediate between TCP and UDP, which require
different balances of timeliness, reliability, streaming, and record
boundary distinction. DCCP and SCTP have been introduced to deal with
these intermediate cases, SCTP for when one has many independent
transactions running over a single connection, and DCCP for data where
time sensitivity matters more than reliability such as voice over
IP. SCTP would have been better for HTML and HTTP than TCP is,
though it is a bit difficult to change now. Problems such as
password-authenticated key agreement transaction to a banking site require
something that resembles encrypted SCTP, analogous to the way that TLS is
encrypted TCP, but nothing like that exists as yet. Standards exist for
encrypted DCCP, though I think the standards are unsatisfactory and
suspect that each vendor will implement his own incompatible version, each
of which will claim to conform to the standard.

But a new threat has arrived: TCP man in the middle forgery.

Connection providers, such as Comcast, frequently sell more bandwidth
than they can deliver. To curtail customer demands, they forge
connection shutdown packets (reset packets), to make it appear that the
nodes are misbehaving, when in fact it is the connection between nodes,
the connection that Comcast provides, that is misbehaving. Similarly, the
great firewall of China forges reset packets when Chinese connect to web
sites that contain information that the Chinese government does not
approve of. Not only does the Chinese government censor, but it is able to
use a mechanism that conceals the fact of censorship.

The solution to all these problems is to have protocol negotiation,
standard encryption, and flow control inside the encryption.

A problem with the OSI Layer model is that as one piles one layer on top
of another, one is apt to get redundant round trips.

According to [google research], 400 milliseconds of added delay
reduces usage by 0.76%, or roughly two percent per second of delay.
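
Taking the quoted figure at face value, and assuming the effect scales linearly with delay (an assumption made here, not a claim from the cited post), the per second rate works out as:

$$
\frac{0.76\%}{0.4\ \text{s}} = 1.9\%\ \text{per second} \approx 2\%\ \text{per second}
$$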

[google research]: http://googleresearch.blogspot.com/2009/06/speed-matters.html

Redundant round trips become an ever more serious problem as bandwidths
and processor speeds increase, but round trip times remain constant,
indeed increase as we become increasingly global and increasingly rely on
space based communications.

Used to be that the biggest problem with encryption was the asymmetric
encryption calculations – the PKI model has lots and lots of redundant and
excessive asymmetric encryptions. It also has lots and lots of redundant
round trips. Now that we can use the NVIDIA GPU with CUDA as a very high
speed cheap massively parallel cryptographic coprocessor, excessive PKI
calculations should become less of a problem, but excess round trips are
an ever increasing problem.

Any significant authentication and encryption overhead will result in
people being too clever by half, and only using encryption and
authentication where it is needed, with the result that they invariably
screw up and fail to use it where it is needed – for example the login on
the http page. So we have to lower the cost of encrypted authenticated
communications, so that people can simply encrypt and authenticate
everything without needing to think about it.

To get stuff right, we have to ditch the OSI layer model, but simply
ditching it without replacement will result in problems. It exists for a
reason, and we have to replace it with something else.
|