forked from cheng/wallet
wallet/docs/design/TCP.md
reaction.la a247a1d30c
No end of changes, lost track.
Switched to Deva V for greater consistency between mono spaced and serif
2024-02-06 15:32:06 +10:00


---
title: Replacing TCP, SSL, DNS, CAs, and TLS
sidebar: true
...
# related
[Client Server Data Representation](client_server.html){target="_blank"}
# Existing work
[µTP]:https://github.com/bittorrent/libutp
"libutp - The uTorrent Transport Protocol library"
{target="_blank"}
[µTP], Micro Transport Protocol has already been written, and it is just a
matter of copying it and embedding it where possible, and forking it if
unavoidable. DDOS resistance looks like it is going to need forking.
It implements ledbat, a protocol designed for applications that download
bulk data in the background, pushing the network close to its limits, while
still playing nice with TCP.
Implementing consensus over [µTP] is going to need [QUIC] style streams,
that can slow down or fail without the whole connection slowing down or
failing, though it might be easier to implement consensus that just calls
µTP for some tasks.
I have not investigated what implementing short fixed length streams over
[µTP] would involve. Bittorrent already necessarily does something mighty
like that. Maybe it just sequentializes everything. Which kind of makes
sense, a single concurrent process managing each connection is easier to
program and comprehend, even if it cannot give optimal performance.
Obviously it must have a request response layer, documented only in
source code. The question then is how it maps that layer onto a µTP
connection. You are going to have to copy, not just µTP, but that layer,
which should be part of µTP, but probably is not. You will have to
factor out what they probably did not factor cleanly.
Their request response layer is probably somewhat documented in
[BEP0055]. I suspect that what I need is not just µTP, but the largest
common factors of [BEP0055].
[BEP0055]:https://www.bittorrent.org/beps/bep_0055.html
"BEP0055"
{target="_blank"}
[`ut_holepunch` extension message]:http://bittorrent.org/beps/bep_0010.html
"BEP0010"
{target="_blank"}
[libtorrent source code]:https://github.com/arvidn/libtorrent/blob/c1ade2b75f8f7771509a19d427954c8c851c4931/src/bt_peer_connection.cpp#L1421
"bt_peer_connection.cpp"
{target="_blank"}
µTP does not itself implement hole punching, but interoperates smoothly
with libtorrent's [BEP0055] [`ut_holepunch` extension message], which is
only documented in [libtorrent source code].
A tokio-rust based µTP system is under development, but very far from
complete last time I looked. Rewriting µTP in rust seems pointless. Just
call it from a single tokio thread that gives effect to a hundred thousand
concurrent processes. There are several projects afoot to rewrite µTP in
rust, all of them stalled in a grossly broken and incomplete state.
[QUIC has grander design objectives]:https://docs.google.com/document/d/1RNHkx_VvKWyWg6Lr8SZ-saqsQx7rFV-ev2jRFUoVD34/edit
{target="_blank"}
[QUIC has grander design objectives], and is a well thought out, well
designed, and well tested implementation of no end of very good and
much needed ideas and technologies, but relies heavily on enemy
controlled cryptography.
However, there are some things I want to do that it really cannot do:
consensus between a small number of peers, by invitation, with each peer
directly connected to each of the others, the small set of peers that are
part of the consensus known to all peers, and all peers always online and
responding appropriately, or else they get kicked out (Practical Byzantine
Fault *In*tolerant consensus). It might be efficient to use a different
algorithm to construct consensus, and then use µTP to download the bulk data.
# Existing documentation
There is a great pile of RFCs on issues that arise with using udp and icmp
to communicate, which contain much useful information.
[RFC5405](https://datatracker.ietf.org/doc/html/rfc5405#section-3), [RFC6773](https://datatracker.ietf.org/doc/html/rfc6773), [datagram congestion control](https://datatracker.ietf.org/doc/html/rfc5596), [RFC5595](https://datatracker.ietf.org/doc/html/rfc5595), [UDP Usage Guideline](https://datatracker.ietf.org/doc/html/rfc8085)
There is a formalized congestion signalling system, `ECN`, explicit congestion
notification. Most servers ignore ECN. On a small proportion of routes, 1%,
ECN tagged packets are dropped.
Raw sockets provide greater control than UDP sockets, and allow you to
do ICMP like things through ICMP.
I also have a discussion on NAT hole punching, [peering through nat](nat.html), that
summarizes various people's experience.
To get an initial estimate of the path MTU, connect a datagram socket to
the destination address using connect(2) and retrieve the MTU by calling
getsockopt(2) with the IP_MTU option. But this can only give you an
upper bound. To find the actual MTU, you have to set the don't fragment flag
(which is these days generally set by default on UDP) and empirically
track the largest packet that makes it on this connection. Which TCP does.
MTU (packet size) and MSS (data size, $MTU-40$) are a
[messy problem](https://www.cisco.com/c/en/us/support/docs/ip/generic-routing-encapsulation-gre/25885-pmtud-ipfrag.html)
which can be side stepped by always sending packets
of size 576 containing 536 bytes of data.
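
A minimal Python sketch of the two approaches just described. It assumes Linux, where `IP_MTU` is socket option 14 in `<linux/in.h>`; Python's `socket` module may not export that name, so the fallback constant is an assumption. The function names are hypothetical.

```python
import socket
import sys

# IP_MTU is Linux specific. Python may not export it, so fall back to the
# value from <linux/in.h>. Assumption: meaningless on other systems.
IP_MTU = getattr(socket, "IP_MTU", 14)

IP_AND_TCP_HEADERS = 40   # the 40 in MSS = MTU - 40 (IPv4 + TCP headers)
SAFE_MTU = 576            # packet size the text proposes always using


def payload_size(mtu=SAFE_MTU, headers=IP_AND_TCP_HEADERS):
    """Data bytes that fit in one packet of the given MTU."""
    return mtu - headers


def path_mtu_upper_bound(host, port):
    """Kernel's current path MTU estimate, or None if unavailable.

    This is only an upper bound. The real path MTU has to be found by
    sending don't-fragment packets and tracking which sizes get through.
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect((host, port))   # connect(2) on UDP sends nothing
            return s.getsockopt(socket.IPPROTO_IP, IP_MTU)
    except OSError:
        return None
```

`payload_size()` reproduces the 576 and 536 figures above. `path_mtu_upper_bound` only reports what the kernel currently believes about the route.
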
## first baby steps
To try and puzzle this out, I need to build a client server that can listen on
an arbitrary port, and tell me about the messages it receives, and can send
messages to an arbitrary hostname:port or network address:port, and
which, when it receives a packet that is formatted for it, will display the
information in that packet, and obey the command in that packet, which
will typically be a command to send a reply that depicts what is in the
packet it received, which probably got transformed by passing through
multiple nats, and/or a command to display what is in the packet, which is
typically a depiction of how the packet to which this packet is a reply got
transformed.
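
A rough sketch of the reflector half of such a test program, in Python over plain UDP. The JSON report format, the field names, and port 40000 are all assumptions made for illustration. A real tool would also want raw sockets and ICMP.

```python
import json
import socket


def describe(data, addr):
    """JSON report of a datagram as the receiver saw it."""
    return json.dumps({
        "seen_ip": addr[0],        # source address after any NAT rewriting
        "seen_port": addr[1],      # source port after any NAT rewriting
        "length": len(data),
        "payload_hex": data.hex(),
    }).encode()


def reflect_forever(bind_addr=("0.0.0.0", 40000)):
    """Print every datagram received and send its description back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(bind_addr)
        while True:
            data, addr = s.recvfrom(2048)
            report = describe(data, addr)
            print(report.decode())
            s.sendto(report, addr)
```

The sender compares `seen_ip` and `seen_port` in the reply with its own local address and port to see how the packet was transformed on the way.
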
This test program sounds an awful lot like ICMP, which is best accessed
through raw sockets. Might be a good idea to give it the capability to send
ICMP, UDP, and fake TCP.
Raw sockets provide the lowest level access to the network available from
userspace. An immense pile of obscure and complicated stuff is in kernel.
# What the API should look like
It should be a consensus API for consensus among a small number of
peers, rather than message API, message response being the special case
of consensus between two peers, and broad consensus being constructed
out of a large number of small invitation based consensi.
A peer explicitly joins the small group when its request is acked by a
majority, and rejected by no one.
On the other hand this involves re-inventing networking from scratch, as
compared to simply copying http/2, or some other reliable UDP system.
Total rewrites, however desirable and necessary, always fail.
So on reflection this is a blue sky proposal - likely to involve immense delay:
I need to think about the way things should be done - but I don't want to
get lost in the weeds. I have repeatedly wasted a great deal of time
re-inventing stuff from scratch, only to find that when I was finished, I had
something vastly inferior to what already existed, so I wound up tossing
my work, and using someone else's library with minimum adaptation.
Many a time I see something is encrusted with ancient history, backward
compatibility means they cannot fix old mistakes, I design something new
and fresh, and vastly superior, and discover that there were one hundred
and one issues that old history encrusted thing had encountered and dealt
with, and I had not foreseen, that not all of that mighty pile of code is crap
to work around past mistakes which must continue to be supported, but a
lot of it is issues I had not foreseen having to deal with, and had not
planned a path to dealing with them.
When implementing stuff from scratch, all too often one discovers there
are no end of reasons for all the stuff one thought bad and unnecessary in
existing libraries.
But on with the vision. Though it will likely be vastly faster to just fix
someone else's library to have real security.
Although the api represents messages, rather than connections, it will
implicitly have a very large number of connections, in that a connection is
your current state with a counterparty, expected protocols (message types) and all that.
For an app to poll a very large number of connections over the network,
`select` does not cut the mustard. Network apis have been evolving, each in
its own idiosyncratic way, to the app making O(1) additions and deletions to
list of counterparties on the network whose messages it is listening to,
and getting notifications that are O(number of events) rather than
O(number of counterparties).
The way this should be done is a linked list of data structures containing
events, which the app can poll locklessly, or wait on (with a timer event
guaranteed to appear in the list eventually if it is waiting on it). If the app
fails to free anything from the list after an unreasonably long time,
suggesting that the app has shut down ungracefully or crashed, and there
are rather too many things on the list, the process that is putting things on
the list will start by pushing back on the parties sending messages to the
app, and end by shutting down their connections and discarding their data.
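
A toy sketch of that event list with its two stages of overload handling. It ignores the "unreasonably long time" condition and the timer event, and the two limits are hypothetical tuning knobs. `collections.deque` stands in for the linked list. Its `append` and `popleft` are thread safe in CPython, which is the nearest cheap equivalent to lockless polling.

```python
from collections import deque


class EventList:
    """In-memory list of network events with two overload thresholds."""

    def __init__(self, soft_limit=1000, hard_limit=10000):
        self.events = deque()
        self.soft_limit = soft_limit   # above this, push back on senders
        self.hard_limit = hard_limit   # at this, drop connections and data

    def post(self, event):
        """Called by the network side. Returns what to do with the sender."""
        if len(self.events) >= self.hard_limit:
            return "drop"              # shut the connection, discard the data
        self.events.append(event)
        if len(self.events) > self.soft_limit:
            return "push_back"         # ask the sender to slow down
        return "ok"

    def poll(self):
        """Called by the app. O(1) per event, None when nothing is pending."""
        return self.events.popleft() if self.events else None
```

The cost to the app is proportional to the number of events it takes off the list, not to the number of counterparties it is listening to.
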
The network events live entirely in memory and are volatile. If they
represent long lived relationships, it is up to the app to commit the
information that they represent to disk.
Every message has a public key of sender, a public key of recipient, and
potentially an in-regards-to hash, a reply-to hash, and an in-reply-to hash.
Some or all of these hashes may be null. It seldom makes sense for all of
them to be null, and it seldom makes sense for all of them to be non null.
Usually reply-to is null, and it does not always make sense for it to be non
null.
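
As a sketch, the top level message could be represented like this. The field names follow the text. The `payload` field and the helper method are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    sender: bytes                          # public key of sender
    recipient: bytes                       # public key of recipient
    payload: bytes                         # assumption: opaque message body
    in_regards_to: Optional[bytes] = None  # hash, may be null
    reply_to: Optional[bytes] = None       # hash, usually null
    in_reply_to: Optional[bytes] = None    # hash, may be null

    def hashes_present(self):
        """Names of the hash fields that are not null."""
        names = ("in_regards_to", "reply_to", "in_reply_to")
        return [n for n in names if getattr(self, n) is not None]
```

A plain reply would set only `in_reply_to`, leaving the other two hashes null.
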
The reply-to field opens up a very large can of worms, in that its main use
is to reference a third party message that came from a third party server,
with its own type information and sender public key, and then how does the
sender know the recipient has or can obtain that message?
Every hash and every public key represents a potential endpoint, and thus
represents an additive type, or rather gives the system potential clues on
how to discover a mutually known additive type. (Reflect on the slow and
chaotic semi automated complexity of how the many protocols involved in
sending and receiving an email message are discovered, every time, for
every email message.)
Some of the time, the message type is only known from one of these
hashes: they imply the type information, without which the recipient
would not know how to parse the message, and the recipient has to be able
to recognize them before he can recognize anything else. And some of the
time, figuring out the message type from these hashes is non trivial or just
flat out fails. No general automatic one size fits all procedure can work on
every mysterious second party hash. This is a problem that has to be dealt
with ad hoc use case by use case, protocol by protocol, message type by
message type.
Not all messages can be sent reliably, but the sender gets a notification
event: failed, succeeded, replied to, or outcome unlikely to be known. The
sender can immediately find out either the likely timing of such a
notification, or that the likely timing is unknown. Usually, finding that
the likely timing is unknown generates an exception.
The api is potentially multilayered: the message may well get translated
to a multitude of similarly structured messages, that set up the connection,
find out information about the recipient, all that stuff, and when those
messages go on the wire, they do not necessarily have any of this stuff.
Commonly they just have the network address, the port, and some numbers
that uniquely identify the context, which numbers are unique to the
connection, but unlike the hashes from which they are derived, not
globally unique, are sequential identifiers, not hashes. But at the top level,
the network address, the port, and all that stuff is just not represented,
except implicitly in that the public key of the recipient may well get
looked up in a hash table that may well have the network address and the port.
On the wire, network address and port serves the function of in-regards-to,
and will wrap stuff that provides a finer grained function of in-regards-to
and in-reply-to -- as I said, multilayered, with the hashes being internally
mapped to data that serves equivalent functionality. Network address
and port being the outermost layer on the wire.
On the wire, once a connection is established, the sender and recipient
public keys are implicit in the ip header, and the rest is opaque payload,
maximum payload being 1 KiB. Inside the payload, the representation
depends on the message type, which was established when the connection
was established. The in-reply-to of the contained message is the unique
sequential nonce of the message being replied to, rather than the hash of
that message.
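
One hypothetical wire layout, purely for illustration: an 8 byte sequential nonce for this message, an 8 byte nonce for the message being replied to (zero meaning none), then the body, all within the 1 KiB payload. None of these field sizes come from the design. They are assumptions.

```python
import struct

MAX_PAYLOAD = 1024               # 1 KiB, from the text
HEADER = struct.Struct("!QQ")    # hypothetical: own nonce, in-reply-to nonce


def pack(seq, in_reply_to, body):
    """Build one payload: two sequential nonces followed by the body."""
    packet = HEADER.pack(seq, in_reply_to) + body
    if len(packet) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 1 KiB")
    return packet


def unpack(packet):
    """Split a payload back into (seq, in_reply_to, body)."""
    seq, in_reply_to = HEADER.unpack_from(packet)
    return seq, in_reply_to, packet[HEADER.size:]
```

The nonces are unique only within the connection, which is why they can be short sequential integers rather than hashes.
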
In the api, the application and api know the message type, because
otherwise the api just would not work. But on the rare occasions when the
message is represented globally, outside the api, *then* it needs a message type header.
# TCP is broken
TCP was designed in more trusting times, when the name system
consisted of a widely shared hosts file, and everyone trusted everyone.
Over the years people have piled warts on top of TCP and warts on top of
warts to fix one problem after another, and every fix results in additional round trips.
Thus “Cloudflare is checking your browser, you will be redirected shortly”.
Every additional round trip before a web page comes up results in a
significant loss of viewers. Hence HTTP/2. Which fails to fix the DDoS and
Cloudflare problem.
TCP is a major problem, which is slowing down the internet. DDoS
protection and the certificate mess are warts growing on top of warts.
Any business that resists corporate cancer is going to come under DDoS,
and if it employs a DDoS resistance service, that service is likely to place
pressure on the business to do political stuff that is counterproductive to
pursuing a profit. And even if it does not, the DDoS service slows down
people trying to view the business website.
If the TCP replacement fixes those warts, you get more views.
# Domain name system and SSL is broken
Any organization that has a certificate authority in its pocket can perform
a man in the middle attack on an SSL connection, though the CAA domain
name record somewhat mitigates this problem.
We also need to replace the TCP/SSL/CA/DNS system because
there is money in it. A great deal of money.
The trouble with an ICO (initial coin offering), is that the issuer has no
obligation to do anything other than take the money and run. We are
moving to an economy where much of the value is “goodwill”, “goodwill”
being names with reputations and relationships. The blockchain (or
blockdag, since blockdags theoretically have better scaling than
blockchains) could be used to render this value liquid in IPOs by having
both names and money on the blockchain.
Atomic transactions between blockchains, plus names on the blockchain
with money, a replacement for TCP/SSL/CAs/DNS could support sovereign
corporations on the blockchain, so that an ICO could be an IPO (Initial
Public Offering). If the blockchain is a name service as well as a money
service, it could give the investors ownership of the name. The owners of
examplecorp shares get to designate the board public key, and the board gets to
designate the public key of CEO@examplecorp from time to time, thus
rendering the value of a name potentially liquid.
Cryptocurrency exchanges are run by crooks, and are full of crooks each
trying to scam all the other crooks.
If you don't know who the pigeon is, you are the pigeon.
A healthy cryptocurrency market needs to leave the cryptocurrency
exchanges behind, replacing them with atomic blockchain transactions
between separate blockchains. They are dangerously centralized, and
linked to a corruptly regulated finance and accounting system, which
corruption we saw with the Great Minority Mortgage Meltdown and the
Mortgage backed Security market from 2005 November to 2007, and saw
with MF Global. Jon Corzine did worse than embezzle client funds. He
embezzled client funds legally.
Demand for crypto currencies is driven in substantial part by the fact that
recent regulations have cheerfully set aside laws on fiduciary duty that are
millennia old. The exchanges cheerfully adhere to such regulations as they
find dangerously convenient, while taking advantage of cryptocurrency to
avoid those regulations that they find inconvenient.
The banks, the stock exchanges, and the big accounting firms are regulated
agencies whose regulators are in their pocket. The crypto currency exchanges
are semi regulated, taking advantage of regulations written for those who
have regulators in their pocket.
The cryptocurrency market needs to get rid of exchanges, starting with
cryptocurrency exchanges, and proceeding to get rid of stock exchanges.
An exchange exists to provide an escrow that faithfully observes
its fiduciary duty. And there have been a great many recent examples of such
entities getting up to no good, and in the case of the mortgage backed
security market, up to no good with enormous amounts of money.
A cryptocurrency with a name system could eat their lunch, greatly enriching
its founders in the process.
# Networking itself is broken
But that is too hard a problem to fix.
I had to sweat hard setting up Wireguard, because it pretends to be just
another `network adaptor` so that it can sweep away a pile of issues as out
of scope, and reading up posts and comments referencing these issues, I
suspect that almost no one understands these issues, or at least no one who
understands these issues is posting about them. They have a magic
incomprehensible incantation which works for them in their configuration,
and do not understand why it does not work for someone else in a subtly
different configuration.
## Internet protocol: too many layers of abstraction
I have to talk internet protocol to reach other systems over the internet, but
internet protocol is a messy pile of ad hoc bits of software built on top of
ad hoc bits of software, and the reason it is hard to understand the nuts and
bolts when you actually try to do anything useful is that you do not
understand, and indeed almost no one understands, what is actually going
on at the level of network adaptors and internet switches. When you send a
udp packet, you are already at a high level of abstraction, and the
complexity that these abstractions are intended to hide leaks.
And because you do not understand the intentionally hidden complexity
that is leaking, it bites you.
### Adaptors and switches
A private network consists of a bunch of `network adaptors` all connected to
one `ethernet switch` and its configuration consists of configuring
the software on each particular computer with each particular `network adaptor`
to be consistent with the configuration of each of the others connected to
the same `ethernet switch`, unless you have a `DHCP server` attached to the
network, in which case each of the machines gets a random, and all too
often changing, configuration from that `DHCP server`, but at least it is
guaranteed to be consistent with the configuration of each of the other
`network adaptors` attached to that one `ethernet switch`. Why do DHCP
configurations not live forever, why do they not acknowledge the machine's
human readable name, why does the ethernet switch not have a human
readable name, and why does the DHCP server have a network address
related to that of the ethernet switch, but not a human readable name
related to that of the ethernet switch?
What happens when you have several different network adaptors in one computer?
Obviously an IP address range has to be associated with each network
adaptor, so that the computer can dispatch packets to the correct adaptor.
And when the network adaptor receives a packet, the computer has to
figure out what to do with it. And what it does with it is the result of a pile
of undocumented software executing a pile of undocumented scripts.
If you manually configure each particular machine connected to an
ethernet switch, the configuration consists of arcane magic formulae
interpreted by undocumented software that differs between one system and the next.
As rapidly becomes apparent when you have to deal with more than one
adaptor, connected to more than one switch.
Each physical or virtual network adaptor is driven by a device driver,
which is different for each physical device and operating system. From the
point of view of the software, the device driver api *is* the network adaptor
programmer interface, and it does not care about which device driver it is,
so all network adaptors must have the same programmer interface. And
what is that interface?
Networking is a wart built on top of warts built on top of warts. IP6 was
intended to clean up this mess, but kind of collapsed under rule by
committee, developing a multitude of arcane, overly complicated, and overly
clever cancers of its own, different from, and in part incompatible
with, the vast pile of cruft that has grown on top of IP4.
The committee wanted to throw away the low order sixty four bits of
address space to use to post information for the NSA to mop up, and then
other people said to themselves, "this seems like a useless way to abuse
the low order sixty four bits, so let us abuse it for something else. After all,
no one is using it, nor can they use it because it is being abused". But
everyone whose internet facing host has been assigned a single address,
which means has actually been assigned $2^{64}$ addresses because he has
sixty four bits of useless address space, needs to use it, since he probably
wants to connect a private in house network through his single internet
facing host, and would like to be free to give some of his in house hosts
globally routable addresses.
In which case he has a private network address space, which is a random
subnet of `fd00::/8`, and a 64 bit subnet of the global address space, and what
he wants is that he can assign an in house computer a globally routable
address, whereupon anything it sends that has a destination that is not on
his private network address space, nor his subnet of the globally routable
address space, gets sent to the internet facing network interface.
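
The dispatch rule in that paragraph can be sketched with Python's `ipaddress` module. The two prefixes are illustrative assumptions: a made up unique local prefix and a subnet of the IPv6 documentation range standing in for the assigned global subnet.

```python
import ipaddress

# Assumptions for illustration only.
PRIVATE_NET = ipaddress.ip_network("fd12:3456:789a::/48")  # private subnet
GLOBAL_NET = ipaddress.ip_network("2001:db8:0:1::/64")     # assigned /64


def next_hop(destination):
    """Which side of the internet facing host a packet should leave by."""
    addr = ipaddress.ip_address(destination)
    if addr in PRIVATE_NET or addr in GLOBAL_NET:
        return "in_house"
    return "internet_facing"
```

Anything whose destination is in neither the private subnet nor the owner's own global subnet goes out the internet facing interface.
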
Further, he would like every computer on his network to be automatically
assigned a globally routable address if it uses a name in the global system,
or a private `fd00::/8` address if it is using a name not in the global system, so
that the first time his computer tries to access the network with the domain
name he just assigned, it gets a unique network address which will never
change, and a reverse dns that can only be accessed through an address on
his private network. And if he assigns it a globally accessible name, he
would like the global dns servers and reverse dns servers to automatically
learn that address.
This is, at present, doable by the DDI, which updates both your DHCP
server and your DNS server. Except that hardly anyone has an in house
DNS server that serves up his globally routable addresses. The I in DDI
stands for IP Address Manager or IPAM. In practice, everyone relies on
named entities having extremely durable network addresses which are a
pain and a disaster to dynamically update, or they use dynamic DNS, not IPAM.
What would be vastly more useful and usable is that your internet facing
peer routed globally routable packets to and from your private network,
and machines booting up on your private network automatically received
static addresses corresponding to their names.
Globally routable subnets can change, because of physical changes in the
global network, but this happens so rarely that a painful changeover is
acceptable. The IP6 fix for automatically accommodating this issue is a
cumbersome disaster, and everyone winds up embedding their globally
routable IP6 subnet address in a multitude of mystery magic incantations,
which, in the event of a change, have to be painstakingly hunted down and
changed one by one, so the IP6 automatic configuration system is just a
great big wart in a dinosaur's asshole. It throws away half the address
space, and seldom accomplishes anything useful.
# Distributed Denial of Service attack
At present, resistance to Distributed Denial of Service attacks rests on
dangerously powerful central authorities, in particular Cloudflare, whose
service in addition to being dangerously centralized, is expensive and poor.
The TCP replacement needs an adjustable proof of work (pow) handshake
as the first part of the connection handshake, the proof of work request
being the first server packet in the four packet handshake.
First packet, client requests connection, second packet, server requests
work, and supplies a durable and a short lived public key, third packet,
client supplies work and offers transient public key, making
communication possible, plus the message it is trying to send the server, or
the first part of that message.
The work demanded goes up as the server load increases, thus fixing the
horrors of DDoS protection.
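
A minimal sketch of an adjustable proof of work, hashcash style. SHA-256, the leading zero bits rule, the `load` measure, and the bit limits are all assumptions. The text does not specify which work function to use.

```python
import hashlib
import itertools


def leading_zero_bits(digest):
    """Number of zero bits at the front of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def difficulty_for_load(load, base_bits=8, max_bits=24):
    """Map a hypothetical 0.0..1.0 server load to required zero bits."""
    load = min(max(load, 0.0), 1.0)
    return base_bits + round(load * (max_bits - base_bits))


def solve(challenge, bits):
    """Client side: search for a nonce. Cost doubles with each extra bit."""
    for counter in itertools.count():
        nonce = counter.to_bytes(8, "big")
        digest = hashlib.sha256(challenge + nonce).digest()
        if leading_zero_bits(digest) >= bits:
            return nonce


def verify(challenge, nonce, bits):
    """Server side: one hash, however hard the work was."""
    digest = hashlib.sha256(challenge + nonce).digest()
    return leading_zero_bits(digest) >= bits
```

The asymmetry is the point: the server pays one hash to check, while the client's expected cost is about $2^{bits}$ hashes, and the server raises `bits` as its load rises.
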
## Key agreement
Key agreement needs to be part of the TCP replacement handshake, rather
than a layer on top, to reduce round tripping.
The name system needs to be integrated with the key system, so that you get
the key when you get the network address associated with the name, and
the key/name pairing needs to be blockchain secured, so you don't have one
thousand certificate authorities each with the authority to mount a man in the middle attack.
## replacement handshake for publicly identified server
The TCP replacement handshake needs to be a four phase handshake.
1. Client->Server: Give me a connection, here are my parameters, here is my
session key.
1. Server->Client: Here is a proof of work request, my parameters, and a keyed
hash of your and my parameters. Ask again with proof of work, the same
parameters, and the keyed hash.
Server then throws away the request, allocating no memory.
1. Client->Server: OK, here I am again, with all that stuff you asked for.
This includes a konce (key used once, single use elliptic point), and
assumes that the client reliably knows the server public key in
advance. This protocol is inappropriate to signons that are restricted
to identified entities, because we probably do not want everyone to
know who is identified.
1. Server checks the poly1305 authentication to ensure that this is a
real client reply to a real and recent server reply. Then it checks the
proof of work.
If the proof of work passes, Server allocates memory, generates and stores a
session key, and stores connection parameters, the client and server
session keys among them.
1. Server->Client: OK, here is my session key, authenticated but not
signed by my permanent key, and stuff, now you can start sending
actual data.
Thus we can integrate TCP handshake and encryption hand shake and the
innumerable DDoS protection handshakes “Cloudflare is checking your browser,
oops, your browser did not pass, here is a captcha” at the cost of one single
additional trip, half a round trip.
Instead of the person establishing the connection fuming while round trip
after round trip goes through, we get all that stuff at the cost of one
additional half round trip.
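
The "server then throws away the request, allocating no memory" step works because the keyed hash lets the server recognize its own reply when the client echoes it back. A sketch, with HMAC-SHA256 swapped in for Poly1305 so that it runs on the Python standard library alone. The function names and the way the parameters are encoded are assumptions.

```python
import hashlib
import hmac
import os

SERVER_SECRET = os.urandom(32)   # unshared secret, kept only in memory


def keyed_hash(client_params, server_params, client_addr):
    """Keyed hash binding both parameter sets to the client's address."""
    parts = [client_addr.encode(), client_params, server_params]
    # Length prefix each part so different splits cannot collide.
    msg = b"".join(len(p).to_bytes(2, "big") + p for p in parts)
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).digest()


def second_packet(client_params, server_params, client_addr):
    """Server reply. Nothing about the client is stored after this."""
    return server_params, keyed_hash(client_params, server_params, client_addr)


def third_packet_is_genuine(client_params, server_params, client_addr, tag):
    """Recompute the keyed hash from what the client sent back."""
    expected = keyed_hash(client_params, server_params, client_addr)
    return hmac.compare_digest(expected, tag)
```

Only after this check and the proof of work check pass does the server allocate memory for the connection.
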
### pow implementation
Each sequential proof of work request contains a 64 bit sequential integer.
The integer starts at a random 63 bit value, to ensure that every possible
successful proof of work ever used is unique in the universe. The
sequential integer is treated as a windowed value into a 512 bit integer,
whose high order part is an unshared secret that remains unchanged for the
duration.
From that 512 bit value, the server generates a unique XChaCha20 512 bit
value, 256 bits of which are used to generate a Poly1305 authenticator for
the proof of work request. If it receives a completed proof of work request
containing the authentication, it knows it comes from an entity at that
network address that was able to receive the proof of work request.
Knowing it is talking to real network addresses, it can derank network
addresses that create excessive burdens, so that they cannot slow down
everyone else, only themselves.
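
A sketch of the sequence number scheme. SHA-256 stands in for the XChaCha20 step and HMAC-SHA256 for the Poly1305 authenticator, so this shows the structure only, not the intended primitives. The class name, the `window` parameter, and the 448 bit size of the secret high order part are assumptions.

```python
import hashlib
import hmac
import secrets


class PowIssuer:
    def __init__(self, window=1 << 16):
        self.secret = secrets.token_bytes(56)   # high order 448 bits, unshared
        self.next_seq = secrets.randbits(63)    # random 63 bit starting point
        self.window = window                    # how many requests count as recent

    def _one_time_key(self, seq):
        # Stand-in for XChaCha20: derive a 256 bit key from the 512 bit
        # value (secret high part followed by the 64 bit sequence number).
        return hashlib.sha256(self.secret + seq.to_bytes(8, "big")).digest()

    def issue(self, request):
        """Return (sequence number, authenticator) for a work request."""
        seq = self.next_seq
        self.next_seq += 1
        key = self._one_time_key(seq)
        return seq, hmac.new(key, request, hashlib.sha256).digest()

    def check(self, seq, request, tag):
        """True only for a recently issued, unmodified work request."""
        if not (self.next_seq - self.window <= seq < self.next_seq):
            return False
        key = self._one_time_key(seq)
        expected = hmac.new(key, request, hashlib.sha256).digest()
        return hmac.compare_digest(expected, tag)
```

A valid authenticator shows that whoever returns it was able to receive the work request, and so is reachable at the network address it claims.
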
When it receives the completed proof of work, it first checks the sequence
number to ensure it is a recently issued request for work, then checks if
there is already a channel allocated for that pow, using a table of doubly
linked lists of recently allocated channels, indexed by the low order part of
the pow sequence number. If it discovers it has already passed that proof of
work and allocated a channel, it moves that proof of work to the head of the list,
so that the next check will be instant, just in case it is about to receive a
million copies of that proof of work. Then it checks for revealed bits from
those generated by XChaCha20. Then it checks the work and the
Poly1305 authentication.
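
The table of lists with move to front can be sketched as follows. `OrderedDict` is used as a stand-in for a hand written doubly linked list. It already finds entries by hashing, so the move to front here only illustrates the ordering. The bucket count is an assumption.

```python
from collections import OrderedDict


class ChannelTable:
    """Recently allocated channels, bucketed by low order bits of the
    pow sequence number, most recently seen first within each bucket."""

    def __init__(self, bucket_bits=8):
        self.mask = (1 << bucket_bits) - 1
        self.buckets = [OrderedDict() for _ in range(self.mask + 1)]

    def lookup(self, seq):
        """Return the channel for this pow, or None if none is allocated."""
        bucket = self.buckets[seq & self.mask]
        channel = bucket.get(seq)
        if channel is not None:
            bucket.move_to_end(seq, last=False)   # move to head of list
        return channel

    def allocate(self, seq, channel):
        """Record a newly allocated channel at the head of its bucket."""
        bucket = self.buckets[seq & self.mask]
        bucket[seq] = channel
        bucket.move_to_end(seq, last=False)
```

If a million copies of one proof of work arrive, each is found at the head of its bucket and dismissed before any expensive check runs.
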
Checking if there is already a channel allocated overlaps and intersects
with presence notification protocol. We want to have a very large number
of inactive presences without secrets or network addresses in the database,
a large number of long lived active presences in memory, with secrets that
are not paged to disk (`sodium_allocarray`), and a considerably smaller
number of considerably shorter lived channels with flow control and
buffering. A presence can only exchange short messages that fit in one
packet, and only one message can be active in any round trip time. You
open a presence, and the presence can then open a channel.
We probably want to do the checks in whatever order is empirically most
efficient for the type of DDoS attacks that we encounter in practice, the most
common probably being garbage random values that bear no particular
resemblance to valid connection attempts.
The next problem will be valid connections that then make excessive
demands. These get deranked by the next layer, and they will then have to
make a new connection, which will face increasing pow and discrimination
against their network address.
## replacement handshake for limited circulation server
In this case the server is the gateway for a group, possibly many groups,
whose unique id is not widely known. It is analogous to a closely kept email address.
The TCP replacement handshake needs to be a four phase handshake.
1. Client->Server: Give me a connection, here are my parameters,
here is a clue about what private group I want to connect to.
1. Server->Client: Here is a proof of work request, my parameters,
including a use once elliptic point, and a keyed hash of your and
my parameters. Ask again with proof of work, the same parameters,
and the keyed hash.
Server then throws away the request, allocating no memory.
1. Client->Server: OK, here I am again, with all that stuff you asked for.
At this point, client has given server a clue about which private
group it wants to connect to, and server has given client a clue
about which private group it expects membership of, and therefore
what public key the client should attempt to communicate with.
1. Server checks the keyed hash to ensure that this is a real client
reply to a real and recent server reply. Then it checks the proof of
work.
If the proof of work passes, Server allocates memory.
Then it generates a transient secret from the konces (keys used
once, single use elliptic points), and uses it to decrypt the client’s
durable public key, verifying that the client does indeed know the
transient scalar. If the client durable key is OK, sign on allowed, it
constructs a shared secret from all four keys, the sum of two secrets
multiplying the sum of two elliptic points, and we now have an
encrypted stream associated with the port number and network addresses.
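The "sum of two secrets multiplying the sum of two elliptic points" in the last step can be written out as below. The symbols are notation introduced here for illustration and are not from the design itself: $G$ is the base point, $t_c, d_c$ are the client's transient and durable scalars, $t_s, d_s$ are the server's, and capital letters are the corresponding public points. This is only a sketch of the algebra, and ignores any hashing applied to the result afterwards.

$$
\begin{aligned}
T_c &= t_c G, \quad D_c = d_c G, \quad T_s = t_s G, \quad D_s = d_s G \\
S_{\text{client}} &= (t_c + d_c)(T_s + D_s) = (t_c + d_c)(t_s + d_s)\,G \\
S_{\text{server}} &= (t_s + d_s)(T_c + D_c) = (t_s + d_s)(t_c + d_c)\,G
\end{aligned}
$$

Both sides arrive at the same point, and each needs only its own two scalars and the other party's two public points to compute it.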
# Summary of the replacement
Thus we can integrate TCP handshake and encryption handshake and the
innumerable DDoS protection handshakes (“Cloudflare is checking your browser,
oops, your browser did not pass, here is a captcha”) at the cost of one single
additional trip, half a round trip.
Instead of the person establishing the connection fuming while round trip
after round trip goes through, we get all that stuff at the cost of one
additional half round trip.
# messages, not streams
TCP sockets are designed for synchronous procedural programming, on
machines with very limited memory processing limitless streams. They are
now almost always used for message processing from event oriented
asynchronous code, with a messaging layer on top of the endless stream
layer. The replacement needs to have application layer sending messages
and receiving messages in events. The application layer should not have
to deal with sockets and streams. Rather, it sends a message to destination
identified by its durable public key, and gets a reply, where the reply
might be that the socket could not be opened, or that the socket was open but
the reply timed out, among other things. When sending a message, there is a
time to wait for response before giving up, and a time for the socket that
may be created to live idle.
# Proposed replacement
[QUIC] is the current TCP replacement. It is the transport underneath HTTP/3.
[QUIC]: https://github.com/private-octopus/picoquic
We have no alternative but to interface to the vast HTTP/2 HTTP/3
ecosystem. The wallet is going to have to talk as a client to legacy server
http/3 devices, and accept their CA certificates, preferably subject to
Zooko scrutiny, and legacy http/3 client devices are going to have to talk
to our wallet (after their wallet has downloaded a zooko based certificate
from the server wallet).
Talking HTTP/3 means being wide open to DDOS attack, so that you are
forced to use Cloudflare. When a device with our version of QUIC talks to
another device with our version of QUIC, it has to implement our DDOS
resistance, and Zooko in place of CA. But when it talks to a legacy
HTTP/3 device, it has to lay itself wide open to DDOS attack and CA
interception.
Backwards compatibility with insecure systems always creates a massive
security hole. On the one hand, every build from scratch project dies. On
the gripping hand, every attempt to do fax over the internet failed and was
eventually replaced by pdf attachments to email. Backwards compatibility
was simply too crippling, and backwards compatibility with QUIC is
going to cripple security.
Instead of putting the secure system transparently as an alternate protocol
within the insecure system, you non transparently put the insecure system
as a downgrade protocol within the secure system, which means our
version of QUIC simply is not going to talk to older versions of QUIC
unless you take some special measures to tell it to do so or enable it to do
so for that particular communication end point.
The least friction interface would be that every time a new SSL name is
encountered, we get a window saying "This authority claims that this is
this entity. Trust this authority for this entity?" And if there is a change of
authority, complain. Wrap backwards compatibility in Zooko vouched
certificates, pinned certificates, and the CAA record indicating who is the
right issuer for the SSL certificate.
We have to have downgrade capability, but it has to be an afterthought,
slipped in as a special path and special case, as user friendly as possible,
but no friendlier.
QUIC's one way streams are messages.
Its two way streams are backwards compatibility with TCP.
It solves the long fat pipe problem with flexible window size.
It puts multiple objects and messages in one stream, so that one message
does not have to wait for lost packets in another message to be resolved.
TCP flow control is constructed around pushback - that the sender should
not send data faster than the receiver is able and willing to handle it.
Normally there is one thread, or pool of threads, handling the data
received. To prevent DDoS, we should probably only have one unit of
pushback per pair of network addresses. If someone has a slow receiver
thread pool, and a fast receiver thread pool communicating with the same
machine, he needs to break the slow receiver communication into lots of
small requests and replies, hence one channel per pair of network
addresses.
QUIC implements everything you need to have one channel per pair of
network addresses, multiplexing many request-replies into a single stream,
many channels in one channel, but does not in fact implement one channel
per pair of network addresses in the sense of one unit of packet flow
control and one unit of DDoS monitoring, per pair of network addresses.
Finer grained flow control should be implemented as request reply on
messages that may well be much larger than a packet, but much smaller than
memory.
In the request reply model, if the requests and replies are reasonably short,
pushback does not matter, and becomes a representation of flow control. It
is seldom sane to download enormous blocks of data as a single message,
and we probably just should not do it - restrict replies to what can
reasonably fit into memory, so that a very large message that the receiver
is processing one chunk at a time has to get acks of its submessages,
separate from the flow control system.
What the LEMP stack does with request headers is dynamically allocate
8KiB buffers, stuff headers into part or all of an 8KiB buffer, and if a
header is bigger than 8KiB, arbitrarily truncate it, which suggests that this
is a tactic to minimize the overheads of dynamically allocating many
moderate sized buffers of variable size. Experimenting, I find that
dynamic allocation tends to be the major cost in many programs, but if
you do it LEMP style, dynamic allocation is unlikely to be a significant cost.
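A minimal sketch of that tactic, assuming a simple free list of fixed 8KiB buffers. The names `BufferPool` and `store_header` are made up for illustration, and a real implementation would be in C or C++ rather than Python, but the shape is the same: allocate fixed size buffers rarely, reuse them often, and truncate what does not fit.

```python
class BufferPool:
    """Pool of fixed-size buffers that are reused instead of reallocated."""

    def __init__(self, buffer_size=8 * 1024):
        self.buffer_size = buffer_size
        self._free = []        # buffers waiting to be reused
        self.allocations = 0   # how many times we really allocated

    def acquire(self):
        if self._free:
            return self._free.pop()
        self.allocations += 1
        return bytearray(self.buffer_size)

    def release(self, buf):
        self._free.append(buf)


def store_header(pool, header: bytes):
    """Copy a header into a pooled buffer, truncating at the buffer size."""
    buf = pool.acquire()
    used = min(len(header), pool.buffer_size)
    buf[:used] = header[:used]   # same-length slice assignment, size unchanged
    return buf, used
```

After the first request, handling a header costs a list pop and a copy rather than a fresh allocation.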
QUIC has a pile of feature bloat:
+ The push feature is married to html, and belongs in the webserver
and the browser, not in the protocol. Something sending a request
message should be aware it might have several messages in reply,
depending on the kind of the request, and simply have a message
handler that can deal with many messages.
+ We don’t really need the unique and sequential message id. If finding
and interpreting the message id is part of how the response handler handles
the messages, it is best to hand that as far down into the endpoints as
possible.
+ Its data format, header and frames, is married to html, which is
always sending repetitious and redundant information, treating
related fragments of html as absolutely distinct.
It implements HTTP specific header compression, QPACK (the successor of
HTTP/2’s HPACK).
It suffers from the SSL/TLS problem of a thousand CA authorities, NSA
friendly encryption, and, being funded in large part by Cloudflare, has no
substantial defense against DDoS.
It fails to support rendezvous routing.
But, it has already struggled with and solved a thousand problems whose
solutions I have been confusedly struggling with. So the obvious solution
is to adopt QUIC, rip out the domain name system, add DDoS resistance,
rip out NSA friendly encryption in favour of the standard and
recommended libsodium packet encryption (XChaCha20-Poly1305), and, for
immortality, rip out the 62 bit compressed integers in favour of unlimited
precision windowed integers (with a negotiated limit on precision that
will in practice always be 64 bits for the next several centuries).
XChaCha20 is not the fastest on a long stream, but it has key agility, can
encrypt arbitrary length values, including a single bit, and is as
fast as ChaCha20 without any limits on the nonce.
QUIC’s messaging is excessively married to HTTP. We need a generic
messaging system where every message has a short number indicating
destination handler, and you can generate a handler, code continuation,
and get a number assigned to it on the fly, so that you can send a message,
and the reply goes to your code continuation.
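As a rough sketch of what "get a number assigned on the fly" could look like: the `Dispatcher` class below is hypothetical, and it assumes handlers are one shot, meaning a handler is removed once its reply arrives. Nothing here is taken from QUIC or from any existing library.

```python
import itertools


class Dispatcher:
    """Maps short handler numbers to code continuations."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._handlers = {}

    def register(self, continuation):
        """Assign the next free number to a continuation and return it."""
        handler_id = next(self._next_id)
        self._handlers[handler_id] = continuation
        return handler_id

    def deliver(self, handler_id, payload):
        """Route an incoming message; returns False if nobody is waiting."""
        continuation = self._handlers.pop(handler_id, None)
        if continuation is None:
            return False
        continuation(payload)
        return True
```

The sender puts the number returned by `register` in the outgoing message, and the peer echoes it in the reply so that `deliver` can find the continuation.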
We need to lift as much of the [QUIC] design as possible, and also make things
act much like TCP, so that existing NATs will not notice anything has
changed. Thus packets will continue to be sent to and from a widely known
port that is usually below 1024 on the server, from a random port on the
client in the range 49152--65535. A connection will continue to require a
three phase handshake which creates a socket, albeit our sockets will be very
different.
With a rendezvous, both peers will use the same port number in the range
1024-49151.
The rendezvous handshake will look like the TCP handshake Syn Syn-Ack Ack,
but they will both send syn packets, both send syn-ack packets, and both
send ack packets. Their syn packets will be timed so that, if the timing
is done right, both are sent just before the other peer’s packet is
expected to be received.
Our sockets will always have a shared secret associated, which proves
identity and enables encrypted communication, but which cannot be used to
prove identity to a third party. The initial handshake will exchange
transient public keys, which will generate a transient shared secret,
which is used to encrypt the exchange of durable public keys, which
establish a shared secret based on both the durable and transient keys,
establishing forward secrecy, and failing to establish identity to third
parties.
Since setting up a shared secret is costly, this creates the opportunity
for syn flood attacks, therefore the syn-ack will always be a syn cookie,
structured rather like existing syn cookies, a cryptographic hash of the syn
based on an unshared secret known only to the server, plus it will always
have a proof of work request, which may be zero, and it will have a list of
supported protocols if the protocol proposed in the initial syn is
unacceptable. The proof of work will be that the hash of the client ack
must have a certain number of zeros, and the ack
must contain the cryptographic cookie, and the data that the server checks
the cookie against.
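A minimal sketch of such a proof of work, under assumptions that are mine and not part of the design: SHA-256 as the hash, an 8 byte counter appended to the ack as the field the client is free to vary, and "a certain number of zeros" read as trailing zero bits of the digest. `ack_body` stands for the cookie plus the data the server checks it against.

```python
import hashlib
import itertools


def trailing_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the low end of a digest."""
    value = int.from_bytes(digest, "big")
    if value == 0:
        return len(digest) * 8
    return (value & -value).bit_length() - 1


def solve(ack_body: bytes, difficulty: int) -> bytes:
    """Client side: append a counter until the ack hashes as demanded."""
    for counter in itertools.count():
        candidate = ack_body + counter.to_bytes(8, "big")
        digest = hashlib.sha256(candidate).digest()
        if trailing_zero_bits(digest) >= difficulty:
            return candidate


def verify(ack: bytes, difficulty: int) -> bool:
    """Server side: one hash, no memory allocated for the connection."""
    return trailing_zero_bits(hashlib.sha256(ack).digest()) >= difficulty
```

The asymmetry is the point: the client does on average $2^n$ hashes for difficulty $n$, the server does one, and a difficulty of zero accepts any ack.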
TCP was designed around the case of the client sending an endless stream of
characters, typed with one finger, to a program on the server. We are
going to design around message response, with responses not necessarily
returning in order.
The client sends a message from a durable public key to a durable
public key. The creation and destruction of such connections is not
tightly linked to messaging. If a connection exists, it is used. If it does
not exist, it is created. It may be torn down after a while of being
unused, but the tear down is not tightly linked to message completion.
In TCP a count is kept of bytes sent and bytes received, with a syn or
fin counting as one byte.
We need a count for each packet, since packets can arrive out of order,
repeated, or missing. The count values will be sequential nonces for the
encryption, and will start at one. As the count can potentially grow
quite large, the count value will be windowed, but, unlike TCP, the
windowed count represents a potentially much larger absolute count known
by both ends.
Negotiating a window size is hard, since you do not really know in advance
what window size will be needed. The thirty two bit window is adequate for
all normal uses, but fails in special and important uses.
We will specify the window size in each packet, with the high order bit of
each byte in the nonce indicating whether there is another seven bits in
the nonce window, so that we can dynamically adjust the window size. We
dynamically adjust the window size to big enough to exclude ambiguity.
Which for the first 128 packets, and on a connection that is not very busy,
all packets, will be seven windowed count bits and one window size bit.
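One way the per byte size bit could be laid out, as a sketch only. It assumes the least significant seven bits go first, as in LEB128 style varints, which is my choice and is not specified above. `groups` is the number of seven bit groups the sender has decided the window needs.

```python
def encode_windowed(nonce: int, groups: int) -> bytes:
    """Encode the low 7*groups bits of nonce, seven bits per byte.

    The high bit of each byte says whether another byte follows.
    """
    out = bytearray()
    for i in range(groups):
        chunk = (nonce >> (7 * i)) & 0x7F
        more = 0x80 if i < groups - 1 else 0
        out.append(more | chunk)
    return bytes(out)


def decode_windowed(data: bytes):
    """Return (windowed value, window mask, number of bytes consumed)."""
    value = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << (7 * i)
        if not byte & 0x80:
            mask = (1 << (7 * (i + 1))) - 1
            return value, mask, i + 1
    raise ValueError("truncated windowed nonce")
```

A quiet connection never needs more than the one byte form, and the receiver learns the mask from the packet itself, so no window size negotiation is needed.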
The window needs to be large enough to exclude the ambiguity of delayed
and duplicated packets wandering in late, so has to be several times
larger than the difference between the most recently acked value, and
the value that will fill the reception window. Thirty two times larger
should be ample. At the start, there are no early packets capable of
wandering in late, so big enough to hold the full count always suffices.
If `a` represents a recent nonce, `n`
represents the nonce, `w` represents the windowed nonce, and
`M` represents the window mask, communicated in each packet in
unary, then:
`w = n & M`
`n = ((w - a) & M) + a`
We use a window large enough to give the same answer on both the most
recently acked nonce, and the most recently sent nonce.
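A small sketch of the windowing arithmetic, reading the reconstruction as `n = ((w - a) & M) + a`. It recovers `n` exactly whenever `a <= n <= a + M`, which is why the window has to be comfortably larger than the spread of packets in flight. The function names are made up for illustration.

```python
def window(n: int, mask: int) -> int:
    """What goes on the wire: the low bits of the full nonce."""
    return n & mask


def reconstruct(w: int, a: int, mask: int) -> int:
    """Recover the full nonce from its windowed form and a recent nonce a.

    Python's & on a negative number behaves like two's complement, so
    (w - a) & mask is the distance from a to n, modulo the window size.
    """
    return ((w - a) & mask) + a
```

With a seven bit window and `a = 1000`, every nonce from 1000 to 1127 round trips correctly, while 1128 aliases back to 1000. That aliasing is the ambiguity the window size has to exclude.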
The nonce will serve the dual purpose of enabling the decryption of each
packet, and flow control. Each packet has a sequential nonce, we make sure
all packets are acked. Nonces on packets coming from the client refer to a
different shared secret than nonces on packets coming from the server.
## API
To send a message, you will construct a response handler if you are
expecting a response, and then call the api with a network address, a
public key of the recipient, an identifying secret key and public key of
the sender, a timeout for attempting to connect, and flags permitting
direct connection, rendezvous connection, retransmit, and store and
forward. If a response is expected for the message, give the expected
lifetime for the response handler, a nonce for the response handler and a
class identifier for the nonce (the nonce only has to be unique within
the class). You will probably use a different nonce population for
messages that have to be handled promptly, messages that have to be
handled within a session, and non volatile nonces that survive between
sessions. Nonce populations can be windowed per class identifier, with a
window large enough to accommodate the timeout, and a different class
identifier for volatile and non volatile nonces. The nonce is used once
within a window and within a class, but can be re-used in another class
and another window.
The application code is event oriented, like gui code. It is driven by a
message pump, with constructors creating event handlers, and the events
driving the event handler through the message pump, and an event handler, on
being fired, creates new event handlers and fires old event handlers.
When the application needs to perform a task that spans many events, it does
not call `yield` or `await`, but instead the event handler for each event
constructs or enables the next event handler. If it needs to push information
onto a stack between events, it has its own explicit stack for its own multi
event task, or creates a linked list of event handlers. Non volatile event
handlers must be trivial C++ classes, therefore cannot contain an `std::stack`.
State that would be on the stack in synchronous code is in the event
handler in asynchronous code. This potentially gets messy if you are
processing an endless stream of structured data whose structure is
orthogonal to message boundaries. Since we allow arbitrary length
messages, dont do that.
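To make the "state lives in the handler, not on the stack" idea concrete, here is a toy message pump with a two event task. Everything in it is hypothetical (`MessagePump`, `SumTwoReplies`, the event names), and it assumes one shot handlers keyed by event name. The real thing would key on nonces and be written in C++.

```python
from collections import deque


class MessagePump:
    """Toy event loop: queued events are handed to one-shot handlers."""

    def __init__(self):
        self._queue = deque()
        self._handlers = {}

    def on(self, event_name, handler):
        self._handlers[event_name] = handler

    def post(self, event_name, data):
        self._queue.append((event_name, data))

    def run(self):
        while self._queue:
            name, data = self._queue.popleft()
            handler = self._handlers.pop(name, None)
            if handler is not None:   # events nobody waits for are dropped
                handler(data)


class SumTwoReplies:
    """Multi-event task whose running total is kept in the object."""

    def __init__(self, pump, done):
        self.pump, self.done, self.total = pump, done, 0
        pump.on("first", self.on_first)

    def on_first(self, value):
        self.total += value
        self.pump.on("second", self.on_second)   # enable the next handler

    def on_second(self, value):
        self.done(self.total + value)
```

There is no `await` anywhere: `on_first` finishes immediately, and the only thing linking it to `on_second` is the handler it registers and the total it leaves behind in the object.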
Notification of message failure may occur any time within the lifetime of
the response handler, but will mostly happen within the timeout for
attempting to connect.
The usual flow of control will be to create an event handler, assign a nonce
to it (fire it) and then it gets triggered when the event actually
happens, and is then usually destroyed. Events will usually create and
fire new events and trigger events that existed before they were created,
rather than changing their state.
Below the api, additional messages, using low numbered message response
classes, may be constructed for encryption and flow control. If an
encrypted connection exists, it will use that without constructing
additional messages. If it does not exist, will construct it.
Constructing an encrypted connection provides perfect forward secrecy
between one connection and the next by generating new random session keys
each time.
## Reliability and flow control
TCP achieves reliable transmission with acks and nacks.
The original design simply acked that all bytes (not exactly bytes, because
acks and nacks are counted) had been received up to a certain byte. If the
transmitter has transmitted stuff, and not received an ack for what it
transmitted it sends a nack, after a timeout. The receiver may resend acks.
This mechanism worked fine on short thin pipes, but if you have a million
packets in flight, and packet three hundred thousand gets lost, you then
have to send seven hundred thousand to replace one packet. So the
duplicate ack possibility was tortured to create a half assed version of
selective acknowledgment. If the receiver receives packet 100, and 101,
but not packet 99, it sends duplicate acks for packet 98. If the sender
receives three duplicate acks for packet 98, it retransmits packet 99. (Two
duplicate acks could be just the normal randomness.)
[QUIC], however, has a fix for this built in.
Obviously true selective acknowledgment is better. The receiver acks the
most recent received packet, and sends a list of missing packets prior to
this (acks a windowed value for the most recent packet, and the difference
between packet nonces for missing packets). The sender resends the missing
packets, except for the most recent missing packets. If they are still
missing, they will be caught on the next ack.
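A sketch of how such an ack might be built and consumed. The representation is an assumption of mine: the ack carries the most recent nonce plus a list of differences walking backwards through the missing nonces. `reorder_slack` is a hypothetical knob for how recent a gap must be before the sender leaves it for the next ack.

```python
def build_ack(received: set, lowest_unacked: int):
    """Receiver: return (most recent nonce, deltas between missing nonces)."""
    latest = max(received)
    missing = [n for n in range(lowest_unacked, latest) if n not in received]
    deltas, previous = [], latest
    for n in reversed(missing):      # walk back from the newest packet
        deltas.append(previous - n)
        previous = n
    return latest, deltas


def packets_to_resend(latest: int, deltas: list, reorder_slack: int):
    """Sender: expand the deltas, skipping gaps too recent to call lost."""
    resend, current = [], latest
    for delta in deltas:
        current -= delta
        if latest - current > reorder_slack:
            resend.append(current)
    return sorted(resend)
```

For received nonces `{1, 2, 3, 5, 6, 8, 10}` the ack is `(10, [1, 2, 3])`, meaning 9, 7 and 4 are missing. With a slack of 2 the sender resends 4 and 7 and waits another ack before giving up on 9.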
In each ack, the receiver tells the sender how much more data it can
receive before it sends the next ack. This prevents the receiver from
being flooded, but a more common problem is the pipe being flooded.
To handle pipe flooding, the sender has a timer. If it sends stuff, and
does not get an ack, it backs off, it sets the timer to a slower rate, and
retransmits with a nack. The initial value of the timer is
$\text{smoothed RTT} + \max(G, 4 \times \text{RTT variance})$, where $G$ is
the clock granularity.
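For reference, a sketch of the timer as standard TCP computes it (RFC 6298), which is where the smoothed RTT plus $\max(G, 4 \times \text{RTT variance})$ formula comes from. The smoothing constants 1/8 and 1/4, the one second timeout before any sample, and the doubling on timeout are all from that RFC. The class name and the default granularity are placeholders of mine.

```python
class RetransmitTimer:
    """Retransmission timeout estimator in the style of RFC 6298."""

    ALPHA, BETA = 1 / 8, 1 / 4   # smoothing gains from the RFC

    def __init__(self, granularity=0.001):
        self.g = granularity     # clock granularity G, in seconds
        self.srtt = None         # smoothed round trip time
        self.rttvar = None       # round trip time variance estimate
        self.backoff = 1

    def on_rtt_sample(self, rtt):
        if self.srtt is None:    # first measurement
            self.srtt, self.rttvar = rtt, rtt / 2
        else:                    # variance is updated before the mean
            self.rttvar = ((1 - self.BETA) * self.rttvar
                           + self.BETA * abs(self.srtt - rtt))
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.backoff = 1         # a fresh ack clears any backoff

    def on_timeout(self):
        self.backoff *= 2        # back off: slow the timer down

    def timeout(self):
        if self.srtt is None:
            return 1.0 * self.backoff
        return (self.srtt + max(self.g, 4 * self.rttvar)) * self.backoff
```

With a first sample of 100ms this gives a 300ms timeout, 600ms after one expiry, and 250ms once a second steady 100ms sample has shrunk the variance.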
TCP flow control focuses on getting a segment complete and acknowledged,
so it can move on to the next segments. It may have a great many packets
in flight, but does not have too many segments in flight. The backoff
algorithm is linked with the push segments algorithm. You only push the
segment the receiver has asked for in his previous acknowledgment. So you
typically have the segment you are finalizing, the segment that is in
flight, and the segment that the receiver asked for.
The algorithm is that the sender gets an ack that acknowledges what the
receiver has received, and tells the sender how much more the receiver can
receive. Whereupon the sender resends anything missing, and resumes pushing
new stuff up to the limit that the receiver has specified, spread out
roughly evenly over the timer period. Which implies that the receiver
should ask wisely, as well as the sender send wisely.
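The "spread out roughly evenly over the timer period" part can be sketched as a simple schedule. This assumes the receiver's limit is counted in packets rather than bytes, and that resends have already been placed at the front of `packets`. Both are simplifications of mine.

```python
def pacing_schedule(now, timer_period, packets, receiver_limit):
    """Return (send_time, packet) pairs spread evenly over one timer period.

    Only as many packets as the receiver asked for are scheduled; the
    rest wait for the next ack.
    """
    batch = packets[:receiver_limit]
    if not batch:
        return []
    interval = timer_period / len(batch)
    return [(now + i * interval, packet) for i, packet in enumerate(batch)]
```

Sending the batch in one burst would fill the same window, but pacing it avoids handing the pipe a spike that it may respond to by dropping packets.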
Implementing our own flow control sounds like a lot of work. We need to lift
[QUIC]’s flow control, and drop our own encryption and attack resistance
into it, while letting it worry about flow control. I can hack into its library,
while I cannot hack into the TCP library.
I have been analysing how TCP works, with a view to what needs fixing. Time to
analyse how something works for which I have a library and example code.
Best (because smallest and least married to HTTP3) is [picoquic].
[picoquic]: https://github.com/private-octopus/picoquic
The TCP state machine assumes that the server opens a connection on receiving
a syn, sends a syn-ack to the client, whereupon the client acks the
connection. But if we are using syn cookies, we are using a different state
machine, where the connection is in fact only opened on receiving the server
syn-ack cookie in the client ack. So the server has to acknowledge the
connection, which would make it a four step handshake instead of a three step
handshake. To avoid this, we have a rule that the client only opens a
connection when it has data ready to send. It then gets a server cookie, and
sends the cookie-ack with some data, which data the server acks.
With the cookie ack, we get a round trip time and offset between server
steady time and client steady time. If we see unstable round trip times,
we suspect the pipe is overloaded, and back off our estimate of max
bandwidth. For flow control, we maintain an estimate of pipe length and
width. Sudden pipe widenings indicate an overflow condition, because pipes
may respond to overflow by massively discarding packets, or massively
backing up packets, or quite possibly both. We maintain a probability
estimate of the pipe behaviour.
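A sketch of how the round trip time and clock offset might be estimated from the handshake timestamps, seen from the client side. It assumes a symmetric path and negligible server processing time, the same assumptions NTP style estimates make. The instability test and its 50% tolerance are a crude placeholder of mine, not the probability estimate described above.

```python
def estimate_rtt_and_offset(client_sent, server_stamp, client_received):
    """Estimate round trip time and (server clock - client clock).

    client_sent and client_received are client steady times around the
    exchange; server_stamp is the server steady time carried in the reply.
    """
    rtt = client_received - client_sent
    offset = server_stamp - (client_sent + client_received) / 2
    return rtt, offset


def is_unstable(rtt_samples, tolerance=0.5):
    """Flag a pipe whose latest RTT strays far above the smallest seen."""
    baseline = min(rtt_samples)
    return rtt_samples[-1] > baseline * (1 + tolerance)
```

Steady clocks have arbitrary starting points, so the offset is only useful for translating timestamps between the two peers, not as wall clock time.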
## Outline protocol
A packet protocol that establishes an encrypted connection on top of
unreliable packets with minimal round trips without increasing fragility to
DoS.
For servers, public keys, globally human readable names, the key owning the
name, and the temporary key signed by the key owning the name, will usually
be public and widely known, but this also supports the case of
communication where this information is only known to the parties, and the
server does not want to make the connection between a network address and a
public key widely known.
To establish a connection, we need to set a bunch of values specific to
this particular channel, and also create a shared secret that
eavesdroppers and active attackers cannot discover.
The client is the party that initiates the communication, the server is
the party that responds.
I assume a mode that provides both authentication and encryption: if a
packet decrypts into a valid message, this shows it originated from an
entity possessing the shared secret. This does not provide signing: the
recipient cannot prove to a third party that he received it, rather than
making it up.
For the moment I ignore the hard question of server key distribution,
glibly invoking Zookos triangle without proposing an implementation of
the other two points and three sides of the triangle or a solution to the
problem of managing distributed reputations in Zookos triangle.  (Be
warned that whenever people charge ahead without solving the key
distribution problem, the result is a disaster.)
1. Client 🠆 Server: Equivalent to the syn of the three phase TCP
handshake.
> Clients network address and port on which client will receive
> packets, protocol identifier, and client steady time that the
> message was sent.
If the requested protocol is not OK, we go into protocol negotiation,
server responds with a list of protocols and protocol versions that it will
accept, in the form of a list of lists of numbers.
Assuming it is OK, which it probably will be, server allocates nothing,
prepares nothing, but sends the equivalent of a TCP syn-ack cookie,
containing, among other things, a cryptographic hash of the information
that was received and sent, based on a private secret known only to the
server. It sends a transient public key, which changes every few minutes
or so, plus a short windowed id for that transient public key, and a demand
for proof of work, which may be zero. The proof of work is that the
client’s ack, equivalent of the third phase of the TCP handshake, has to
hash to a value ending in `n` zero bits, where `n`
may be zero.
This cryptographic hash based on an unshared secret will be sent to client,
and then back to server, unchanged. Its function is to avoid the necessity for
the server to allocate memory or perform asymmetric cryptographic operations
for a client that has not yet validated. Instead the state information is sent
back and forth.
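A minimal sketch of such a stateless cookie, assuming HMAC-SHA256 as the keyed hash and a thirty second validity window, neither of which is specified above. The field names are placeholders. The server stores nothing between the two calls: everything it needs to re-check comes back from the client in the clear.

```python
import hashlib
import hmac
import os
import struct

SERVER_SECRET = os.urandom(32)   # unshared secret, known only to the server
MAX_AGE = 30.0                   # seconds a cookie stays valid (assumed)


def make_cookie(client_addr: bytes, client_params: bytes,
                server_params: bytes, server_time: float) -> bytes:
    """Keyed hash over everything the server wants echoed back."""
    # A real format would length-prefix the fields instead of joining on "|".
    message = b"|".join([client_addr, client_params, server_params,
                         struct.pack(">d", server_time)])
    return hmac.new(SERVER_SECRET, message, hashlib.sha256).digest()


def check_cookie(cookie: bytes, client_addr: bytes, client_params: bytes,
                 server_params: bytes, server_time: float,
                 now: float) -> bool:
    """Recompute the hash from the echoed data; reject stale cookies."""
    if not 0 <= now - server_time <= MAX_AGE:
        return False
    expected = make_cookie(client_addr, client_params, server_params,
                           server_time)
    return hmac.compare_digest(cookie, expected)
```

Because the client's network address is inside the hash, a valid cookie also shows the client can receive packets at the address it claimed.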
1. Server 🠆 Client: Equivalent to the syn-ack of the three phase TCP handshake.
Cryptographic hash based on unshared secret, server steady time,
transient public key, server windowed identifier of server transient
public key, proof of work demand, and any channel parameters.
The proof of work is trivial if the server is not under load, but is
increased as the server load approaches the maximum the server is
capable of, in order to throttle demand.
Client computes transient handshake shared secret as its transient private
key times the server shared transient public key. It returns in the clear
a copy of the cryptographic hash that the server sent to it, the data in
the clear needed to validate the hash, performs the proof of work, and
sends its public key, which may be a per server durable public key, always
used when accessing this server on this identity, encrypted using the
transient key, and the public key it wants to talk to on the server.
Subsequent information is not encrypted using the transient keys, but using
the sum of transient plus secret keys.
This implies that the client has to know the public key that the server is
using, which may be a key signed by the master public key that owns the
name authorizing that new key, which key changes about as often as the
server IP changes, and is therefore distributed in the same channel as the
network address associated with global human names is distributed. If the
client gets it wrong, then the server ignores the information encrypted to
the wrong public key, and responds with the authentication of its new
public key, signed by the master public key of its globally unique name,
encrypted using the transient secret. This is usually public information,
but since by this point we have established a shared secret and allocated
memory, might as well send it securely, for sometimes it is going to be
private information.
1. Client 🠆 Server: Equivalent to the final ack of the three phase TCP
handshake.
Sends in the clear server hash as received, any data needed to
reconstruct the hash, and transient public key. Then, encrypted to
transient keys, the hash of the identifier of the public key it wants to
talk to, its durable public key, and client steady time at which this was
sent, so that both sides have an estimate of the round trip time and the
offset between server steady time and client steady time.
Server checks the proof of work, checks the cryptographic hash against the
data in the clear, *then* creates an entry in its hash table for this
connection, with the shared secret being the transient keys plus the public
keys.
We have two protocols, one for the authenticated phase, and one for
unauthenticated phase. The client has to know one of the unauthenticated
protocols offered by the server, or else protocol negotiation will fail in
the abnormal case that protocol negotiation is needed. Normally there will
only be one protocol for secured but unauthenticated communication during
setup, but we make provision by having two protocols, trivially different,
and three protocols, trivially different for the authenticated phase.
You will notice that the server only allocates memory and performs asymmetric
encryption computation *after* the client has successfully performed proof of
work and shown that it is indeed capable of receiving data sent to the
advertised network address.
In the normal case, the client requests one way authenticated encryption in
the syn, where the server authenticates but the client does not, and the
server may, and usually will, offer in the syn-ack only two way
authenticated encryption, where the client provides an identity unique to
that server and user’s current default name, but which cannot be used to
identify the default name, nor the same user accessing a different
website. This allows the server to see that the same user is accessing
different resources, how many uniques the server has, and what each unique
is doing, but does not enable the servers to put their heads together and
see that the same user is doing things on one server, and also on another
server.
Now we have a shared secret, protocol negotiated, client logged in, in
one round trip plus the third one way trip carrying the actual data, the
same number of round trips as when setting up an unencrypted
unauthenticated TCP connection.
You will notice there is no explicit step checking that both have the
same shared secret. This is because we assume that each packet sent is
also authenticated by the shared secret, so if they do not have the same
secret, nothing will authenticate.
# Critiques of TCP/SSL
Does the job so badly that using a different method is just as plausible.
People fight to avoid TLS already, theyd rather send stuff in the clear if
they could.  So just solve the problems they have.
In Web Services we frequently require message layer security in addition to
transport layer security because a Web Service transaction might involve more
than two endpoints and messages that are stored and forwarded etc. This is why
WS-\* is not TLS. (It is unfortunately horribly baroque but that was not my
doing).
Problem that occurred with TLS was that there was an assumption that the job
was to secure the reliable stream connection mechanics of TCP.  False
assumption.
Pretty much nobody uses streams by design, they use datagrams.  And they use
them in a particular fashion: request-response.  Where we went wrong with TCP
was that this was the easiest way to handle the mechanics of getting the
response back to the agent that sent the request. Without TCP, one had to deal
with the raw incoming datagrams and allocate them to the different sending
agents.
A second problem was that the design was too intertwined with commercial PKI
so certs were hung on the side as a millstone for server authentication and
discarded as client side, leaving passwords to fill that gap.  A mess, which
is an opportunity for redesign, frequently exploited by many designs already.
SSL came at this and built a message (record) interface on top of TCP (because
that was convenient for defining a crypto layer), and then a (mainly) stream
interface on top of its message interface because programmers were by now
familiar with streams, not records.
And so … here we are.  Living in a city built on top of generations of
older cities.  Dig down and see the accreted layers.
What *is* the “right” (easiest to use correctly, hardest to use
incorrectly, with good performance, across a large number of distinct
application APIs) underlying interface for a secure network link? The fact
that the first thing pretty much all APIs do is create a message structure
on top of TCP makes it clear that “pure stream” isnt it.  Record-oriented
designs derived from 80-column punch cards are unlikely to be the answer
either.  What a “clean slate” interface would look like is an interesting
question, and perhaps its finally time to explore it.
# General and unorganized comments
µTP, Micro Transport Protocol is a Bittorrent near drop in replacement for TCP
that provides lower priority bulk downloads in the background. The library is
not well documented, (header file plus examples) but as far as I can see,
provides a reasonably clean separation between Bittorrent and the transport
mechanism.
Google has a TCP/SSL replacement, [QUIC], which avoids round tripping and
renegotiation by integrating the security layer with the reliability layer,
and by supporting multiple asynchronous streams within a stream.
Layering a new peer-to-peer packet network over the Internet is simply
what the Internet is designed for. UDP is broken in a few ways, but not
in ways that can’t be fixed. It’s simply a matter of time before a new
virtual packet layer is deployed, probably one in which authentication and
encryption are inherent.
For authentication and encryption to be inherent, it needs to connect
between public keys, and needs to be based on Zooko’s triangle. It also
needs to penetrate firewalls, and do protocol negotiation with an
unlimited number of possible protocols, avoiding that internet names and
numbers authority.
Ian Grigg: “Good protocols divide into two parts, the first of which says
to the second, trust this key completely!”.
This might well be the basis of a better problem factorization than the
layer factorization: divide the task by the way trust is embodied, rather
than on the basis of layered communication.
Trust is an application level issue, not a communication layer issue,
but neither do we want each application to roll its own trust cryptography,
which at present web servers are forced to do. (Insert my standard rant
against SSL/TLS).
Most web servers are vulnerable to attacks akin to the session cookie
fixation attack, because each web page reinvents session cookie handling,
and even experts in cryptography are apt to get it wrong.
The correct procedure is to generate and issue a strongly unguessable
random https only cookie on successful login, representing the fact that
the possessor of this cookie has proven his association with a particular
database record, but very few people, including very few experts in
cryptography, actually do it this way. Association between a client
request and a database record needs to be part of the security system. It
should not be something each web page developer is expected to build on top
of the security system.
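
The procedure described above can be sketched as follows. This is a minimal illustration, not a complete login system: the `SESSIONS` table, the cookie name `session`, and the function names are hypothetical, and a real server would keep the table in its database and expire old entries. The point is that the token is generated by the server at login time, never accepted from the client, and is marked so that it only travels over https.

```python
import secrets
from http.cookies import SimpleCookie

# Hypothetical in-memory session table: token -> database record id.
SESSIONS = {}


def issue_session_cookie(record_id):
    """Call only after a successful login.

    Returns the value for a Set-Cookie header carrying a fresh token.
    """
    token = secrets.token_urlsafe(32)  # 32 random bytes, about 256 bits
    SESSIONS[token] = record_id

    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["secure"] = True    # sent over https only
    cookie["session"]["httponly"] = True  # not readable by page scripts
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


def record_for(token):
    """Map a presented token back to its database record, or None."""
    return SESSIONS.get(token)
```

Because the token is created fresh on every login, an attacker who plants a cookie value in the victim's browser beforehand gains nothing: that value is never in the table.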
TCP constructs a reliable pipeline stream connection out of unreliable
packet connections.
There are a bunch of problems with TCP.  No provision was made for
protocol negotiation and so any upgrade has to be fully backwards
compatible.  A number of fixes have been made, for example the long
fat pipe problem has been fixed by window size negotiation, which is semi
incompatible and leads to flaky behaviour with old style routers, but the
transaction problem remains intolerable.  The transaction problem has
been reduced by protocol level workarounds, such as “Keep alive” for HTTP,
but these are not entirely satisfactory. The fix for syn flooding
works, but causes some minor unnecessary degradation of performance under
syn flood attacks, because the syn cookie is limited to 48 bits. It needs to
be 128 bits, both to deal with the syn flood attack and to prevent TCP
hijacking.
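
To make the 128 bit cookie concrete, here is a sketch of a stateless handshake cookie for a protocol that has room for one. It is not TCP's actual syn cookie format. The names `SERVER_SECRET`, `SLOT_SECONDS`, `make_cookie` and `check_cookie` are assumptions for illustration, and HMAC-SHA256 truncated to 16 bytes is one reasonable choice among several. The server stores nothing per client until the cookie comes back valid.

```python
import hashlib
import hmac
import secrets
import time

SERVER_SECRET = secrets.token_bytes(32)  # hypothetical per-server secret
SLOT_SECONDS = 64                        # assumed cookie lifetime granularity


def _tag(addr, port, slot):
    # Bind the cookie to the client address, port, and time slot.
    msg = f"{addr}|{port}|{slot}".encode()
    return hmac.new(SERVER_SECRET, msg, hashlib.sha256).digest()[:16]  # 128 bits


def _slot(now):
    return int((time.time() if now is None else now) // SLOT_SECONDS)


def make_cookie(addr, port, now=None):
    """Cookie sent in reply to a first packet; no server state is kept."""
    return _tag(addr, port, _slot(now))


def check_cookie(cookie, addr, port, now=None):
    """Accept cookies from the current time slot or the previous one."""
    slot = _slot(now)
    return any(
        hmac.compare_digest(cookie, _tag(addr, port, s))
        for s in (slot, slot - 1)
    )
```

An attacker who spoofs source addresses never sees the cookie, so cannot complete the handshake, and at 128 bits cannot guess it either.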
TCP is inefficient over wireless, because interference problems are
rather different to those provided for in the TCP model.  This
problem is pretty much insoluble because of the lack of protocol
negotiation.
There are cases intermediate between TCP and UDP, which require
different balances of timeliness, reliability, streaming, and record
boundary distinction. DCCP and SCTP have been introduced to deal with
these intermediate cases, SCTP for when one has many independent
transactions running over a single connection, and DCCP for data where
time sensitivity matters more than reliability, such as voice over
IP. SCTP would have been better for HTML and HTTP than TCP is,
though it is a bit difficult to change now. Problems such as a
password-authenticated key agreement transaction with a banking site require
something that resembles encrypted SCTP, analogous to the way that TLS is
encrypted TCP, but nothing like that exists as yet. Standards exist for
encrypted DCCP, though I think the standards are unsatisfactory and
suspect that each vendor will implement his own incompatible version, each
of which will claim to conform to the standard.
But a new threat has arrived:  TCP man in the middle forgery.
Connection providers, such as Comcast, frequently sell more bandwidth
than they can deliver.  To curtail customer demands, they forge
connection shutdown packets (reset packets), to make it appear that the
nodes are misbehaving, when in fact it is the connection between nodes,
the connection that Comcast provides, that is misbehaving. Similarly, the
great firewall of China forges reset packets when Chinese connect to web
sites that contain information that the Chinese government does not
approve of. Not only does the Chinese government censor, but it is able to
use a mechanism that conceals the fact of censorship.
The solution to all these problems is to have protocol negotiation,
standard encryption, and flow control inside the encryption.
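
A sketch of why putting connection control inside the cryptography defeats forged resets. The frame layout, the `DATA` and `RESET` codes, and the function names are hypothetical, and HMAC stands in for a real authenticated encryption mode, which the Python standard library does not provide. A reset that does not carry a valid tag under the session key is simply dropped, so a party in the middle who lacks the key cannot shut the connection down by forgery.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 16          # 128 bit authentication tag
DATA, RESET = 0, 1    # hypothetical frame kinds


def seal(key, seq, kind, body=b""):
    """Build a frame: 8 byte sequence number, 1 byte kind, body, then tag."""
    frame = seq.to_bytes(8, "big") + bytes([kind]) + body
    tag = hmac.new(key, frame, hashlib.sha256).digest()[:TAG_LEN]
    return frame + tag


def open_frame(key, packet):
    """Return (seq, kind, body), or None if the packet is not authentic."""
    if len(packet) < 9 + TAG_LEN:
        return None
    frame, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    expected = hmac.new(key, frame, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        return None  # forged or corrupted: ignore, do not tear down
    return int.from_bytes(frame[:8], "big"), frame[8], frame[9:]
```

A genuine peer calls `seal(key, seq, RESET)`; a forger has to guess the tag. A full design would also reject replayed sequence numbers, which this sketch leaves out.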
A problem with the OSI Layer model is that as one piles one layer on top
of another, one is apt to get redundant round trips.
According to [google research], a delay of 400 milliseconds reduces usage
by 0.76%, or roughly two percent per second of delay.
[google research]: http://googleresearch.blogspot.com/2009/06/speed-matters.html
Redundant round trips become an ever more serious problem as bandwidths
and processor speeds increase, but round trip times remain constant, and
indeed increase as we become increasingly global and increasingly rely on
space based communications.
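
A back of the envelope sketch of how layered handshakes add up. The round trip counts and the example round trip times are illustrative assumptions, not measurements: roughly a name lookup, a TCP handshake, and a classic full TLS handshake on one side, and a single combined handshake on the other.

```python
# Round trips needed before the first byte of application data can flow.
# These counts are illustrative assumptions, not measurements.
LAYERED = {"name lookup": 1, "TCP handshake": 1, "TLS handshake": 2}
INTEGRATED = {"combined handshake": 1}


def total_round_trips(layers):
    """Each layer waits for the one below it, so the round trips add."""
    return sum(layers.values())


def setup_delay_ms(layers, rtt_ms):
    """Connection setup delay for a given round trip time in milliseconds."""
    return total_round_trips(layers) * rtt_ms


# Assumed round trip times: nearby, intercontinental, geostationary satellite.
for rtt_ms in (20, 150, 600):
    layered = setup_delay_ms(LAYERED, rtt_ms)
    integrated = setup_delay_ms(INTEGRATED, rtt_ms)
    print(f"rtt {rtt_ms} ms: layered {layered} ms, integrated {integrated} ms")
```

More bandwidth and faster processors change none of these numbers. Only removing round trips does.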
Used to be that the biggest problem with encryption was the asymmetric
encryption calculations: the PKI model has lots and lots of redundant and
excessive asymmetric encryptions. It also has lots and lots of redundant
round trips. Now that we can use the NVIDIA GPU with CUDA as a very high
speed cheap massively parallel cryptographic coprocessor, excessive PKI
calculations should become less of a problem, but excess round trips are
an ever increasing problem.
Any significant authentication and encryption overhead will result in
people being too clever by half, and only using encryption and
authentication where they think it is needed, with the result that they
invariably screw up and fail to use it where it is needed, for example the
login on the http page. So we have to lower the cost of encrypted authenticated
communications, so that people can simply encrypt and authenticate
everything without needing to think about it.
To get stuff right, we have to ditch the OSI layer model, but simply
ditching it without replacement will result in problems. It exists for a
reason, and we have to replace it with something else.