finally figured out the connection distribution algorithm

reaction.la 2022-07-03 12:53:50 +10:00
parent 522b96336c
commit 8530ae73b0
2 changed files with 97 additions and 18 deletions

.gitmodules

```diff
@@ -1,9 +1,12 @@
 [submodule "libsodium"]
 	path = libsodium
-	url = cpal.pw:~/libsodium.git
+	url = git@rho.la:~/libsodium.git
+	branch = rho-fork
 [submodule "mpir"]
 	path = mpir
-	url = cpal.pw:~/mpir.git
+	url = git@rho.la:~/mpir.git
+	branch = rho-fork
 [submodule "wxWidgets"]
 	path = wxWidgets
-	url = cpal.pw:~/wxWidgets.git
+	url = git@rho.la:~/wxWidgets.git
+	branch = rho-fork
```


@@ -285,7 +285,12 @@ dangerous centralization would fail under the inevitable attack. It needs to
I will describe the Kademlia distributed hash table algorithm not in the
way that it is normally described and defined, but in such a way that we
can easily replace its metric by [social distance metric], assuming that we
can construct a suitable metric, which reflects what feeds a given host is
following, and what network addresses it knows and the feeds they are
following, a quantity over which a distance can be found that reflects how
close a peer is to an unstable network address: whether it knows that
address, or knows a peer that is likely to know a peer that is likely to
know that unstable network address.
A distributed hash table works by each peer on the network maintaining a
large number of live and active connections to computers such that the
@@ -305,13 +310,6 @@ Kademlia the $\log_2$ of the exclusive-or between his hash and your hash.
and open ports, and connections that are distant from you distant from
each other.
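
For comparison, the standard Kademlia metric referred to above is simple to state in code. The following is a generic illustration of that metric, not code from this project; the identifier width and names are arbitrary.

```python
def kademlia_distance(id_a: int, id_b: int) -> int:
    """Standard Kademlia metric: the bitwise exclusive-or of two node IDs."""
    return id_a ^ id_b


def kademlia_bucket(id_a: int, id_b: int) -> int:
    """Index of the highest differing bit, i.e. roughly log2 of the xor distance.

    This is the quantity the text above calls the log2 of the exclusive-or
    between one peer's hash and another's.
    """
    d = kademlia_distance(id_a, id_b)
    return d.bit_length() - 1 if d else -1  # -1 means the two IDs are identical
```
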
Social distance is costly and complex to calculate, and requires that
information on a public feed showing its social connections be widely
shared, which is a lot of information that everyone has to acquire and
store, and perform a heavy calculation on. If there are more than thirty
or a hundred entities, we need to use dimensional reduction. But we do not
need to do it very often.
The reason that the Kademlia distributed hash table cannot work in the
face of enemy action is that the shills who want to prevent something
from being found create a hundred entries with a hash close to their target
@@ -365,11 +363,12 @@ This handles public posts.
### Kademlia in social space
The vector of each identity is a sequence of ones and zeros, conceptually
of unbounded length and unboundedly large dimension, though in practice you
will not need anything beyond the first few hundred bits.
The vector of an identity is $+1$ for each one bit, and $-1$ for each zero bit.
We deterministically generate the vector by hashing the public key of the identity.
We don't use the entire two hundred fifty six dimensional vector, just
enough of it that the truncated vector of every identity that anyone might
be tracking has a very high probability of being approximately orthogonal
to the truncated vector of every other identity.
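
As a concrete sketch of the construction just described (illustrative only: SHA-256 stands in for whatever hash of the public key the implementation actually uses, and fifty dimensions is just a plausible truncation length):

```python
import hashlib


def identity_vector(public_key: bytes, dims: int = 50) -> list[int]:
    """Truncated +/-1 identity vector derived from a public key.

    SHA-256 is a stand-in for whatever hash the real system uses; dims is
    the number of leading bits kept, chosen large enough that the truncated
    vectors of any two tracked identities are almost certainly nearly
    orthogonal.
    """
    digest = hashlib.sha256(public_key).digest()
    bits = [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(dims)]
    return [1 if b else -1 for b in bits]
```
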
We do not have, and do not need, an exact consensus on how much of the
vector to actually use, but everyone needs to use roughly the same amount
@@ -381,15 +380,92 @@ Each party indicates what entities he can provide a direct link to by
publishing the sum of the vectors of the parties he can link to - and also
the sum of their sums, and also the sum of their ... to as many levels deep
as turns out to be needed in practice, which is likely to be two or three
such vector sums, maybe four or five. What is needed will depend on the
pattern of tracking that people engage in in practice.
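
A minimal sketch of what publishing these sums might look like, assuming each peer republishes the identity vectors of the peers it can link to together with their own published sums; the data layout and names are invented for illustration:

```python
def vector_sum(vectors: list[list[int]]) -> list[int]:
    """Componentwise sum of a non-empty list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def published_sums(direct_peers: list[dict], depth: int = 2) -> list[list[int]]:
    """Build the vectors a peer publishes, per the scheme described above.

    direct_peers: one dict per reachable peer, with that peer's identity
    vector under 'id_vec' and its own published sums under 'sums'.
    Level 0 of the result is the sum of the reachable identity vectors;
    level 1 is the sum of their level-0 sums (the sum of sums); and so on.
    """
    levels = [vector_sum([p["id_vec"] for p in direct_peers])]
    for level in range(1, depth):
        levels.append(vector_sum([p["sums"][level - 1] for p in direct_peers]))
    return levels
```
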
If everyone behind a firewall or with an unstable network address arranges
to notify a well known peer with a stable network address whenever his
address changes, and that peer, as part of the arrangement, includes him in
that peer's sum vector, and if the number of well known peers with stable
network addresses offering this service is not enormously large, they track
each other, and everyone tracks some of them, then we only need the sum and
the sum of sums.
When someone is looking to find how to connect to an identity, he goes
through the entities he can connect to, and looks at the dot product of
their sum vectors with the target identity vector.
He contacts the closest entity, or a close entity, and if that does not work
out, contacts another. The closest entity will likely be able to contact
the target, or contact an entity more likely to be able to contact the target.
* the identity vector represents the public key of a peer.
* the sum vector represents what identities a peer thinks he has valid connection information for.
* the sum of sum vectors indicates what identities the peers that he thinks he can connect to think that they can connect to.
* the sum of the sum of the sum vectors ...
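
Putting the last two paragraphs and the list above together, the lookup itself is a ranking by dot products. Here is a sketch under the same illustrative data layout as before; the scoring detail (shallower sums first, deeper sums as tie-breakers) is an assumption, since the text only says to contact the closest entity and fall back to the next:

```python
def dot(a: list[int], b: list[int]) -> int:
    return sum(x * y for x, y in zip(a, b))


def rank_peers_for_lookup(target_vec: list[int],
                          neighbours: dict[str, list[list[int]]]) -> list[str]:
    """Order reachable peers by how promising they look for reaching the target.

    neighbours maps a peer to its published [sum, sum of sums, ...] vectors.
    A large dot product with the first sum suggests the peer has direct
    connection information for the target; a large dot product with a deeper
    sum suggests it knows someone who is likely to know someone who does.
    """
    def score(sums: list[list[int]]) -> tuple:
        return tuple(dot(s, target_vec) for s in sums)

    return sorted(neighbours, key=lambda name: score(neighbours[name]), reverse=True)
```

The searcher contacts the first peer in this ranking and, if that does not work out, the next, exactly as described above.
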
A vector that provides the paths to connect to a billion entities, each of
them redundantly through a thousand different paths, is still sixty or so
thirty two bit signed integers, distributed in a normal distribution with a
variance of a million or so, but everyone has to store quite a lot of such
vectors. Small devices such as phones can get away with tracking a small
number of such integers, at the cost of needing more lookups, hence not being
very useful for other people to track for connection information.
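
A brief aside on why the coordinates stay small under the scale assumptions of the paragraph above: each coordinate of a published vector is a sum of independent $\pm 1$ contributions, one per identity reached, so

$$
s_i=\sum_{k=1}^{N}\varepsilon_k,\quad \varepsilon_k\in\{-1,+1\}
\quad\Longrightarrow\quad
\mathbb{E}[s_i]=0,\qquad \operatorname{Var}[s_i]=N,\qquad \sigma_{s_i}=\sqrt{N},
$$

so even for $N$ in the millions or billions each coordinate is a normally distributed value within a few standard deviations of zero, which fits comfortably in a thirty two bit signed integer.
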
To prevent hostile parties from jamming the network by registering
identities that closely approximate identities that they do not want people
to be able to look up, we need the system to work in such a way that
identities that lots of people want to look up tend to be heavily over
represented in sum of sums vectors relative to those that no one wants to
look up. If you repeatedly provide lookup services for a certain entity,
you should track the entity that had the last stable network address on the
path that proved successful to the target entity, so that peers that
provide useful tracking information are over represented, and entities that
provide useless tracking information are under represented.
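
One way to implement that preference, sketched with invented names (the text specifies the policy, not the data structure): remember, per target, the stably addressed peer on the most recent successful path, and consult it first next time.

```python
class LookupCache:
    """Hypothetical routing cache implementing the policy described above."""

    def __init__(self):
        # target identity -> the stably addressed peer that last led to it
        self._best_hop: dict[bytes, str] = {}

    def record_success(self, target_id: bytes, path: list[str], stable: set[str]) -> None:
        """Remember the last stably addressed peer on a path that reached target_id."""
        for peer in reversed(path):
            if peer in stable:
                self._best_hop[target_id] = peer
                return

    def preferred_hop(self, target_id: bytes):
        """Peer to try first for this target, if a previous lookup succeeded."""
        return self._best_hop.get(target_id)
```
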
If an entity makes publicly available network address information for an
identity whose vector is an improbably good approximation to an existing
widely looked up vector, a sybil attack is under way, and that information
needs to be ignored.
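
A sketch of how such a check could work, given that honestly generated truncated vectors are nearly orthogonal; the six sigma cutoff is an arbitrary illustrative threshold, not something the text specifies:

```python
import math


def looks_like_sybil(new_vec: list[int], popular_vecs: list[list[int]],
                     sigmas: float = 6.0) -> bool:
    """Flag a newly published identity vector that approximates a widely
    looked up vector improbably well.

    For independent +/-1 vectors of dimension d, the dot product has mean 0
    and standard deviation sqrt(d); a correlation many sigmas above that is
    evidence of a deliberately crafted identity.
    """
    d = len(new_vec)
    threshold = sigmas * math.sqrt(d)
    return any(abs(sum(a * b for a, b in zip(new_vec, v))) > threshold
               for v in popular_vecs)
```
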
To be efficient at very large scale, the network should contain a relatively
small number of large well connected devices, each of which tracks the
tracking information of a large number of other such computers, and a large
number of smaller, less well connected devices, that track their friends and
acquaintances, and also track well connected devices. Big fanout on the
interior vertices, smaller fanout on the exterior vertices, stable identities
on all devices, moderately stable network addresses on the interior vertices,
possibly unstable network addresses on the exterior vertices.
If we have a thousand identities that are making public the information
needed to make a connection to them, and everyone tracks all the peers that
provide third party lookup services, we need only the first sum, and only
about twenty dimensions.
But if everyone attempts to track all the connection information
for all peers that provide third party lookup services, there are soon going
to be a whole lot of shill, entryist, and spammer peers purporting to provide
such services, whereupon we will need white lists, grey lists, and human
judgement, and not everyone will track all peers who are providing third
party lookup services, whereupon we need the first two sums.
In that case a random peer searching for connection information for another
random peer first looks through those peers for which he has good connection
information, and does not find the target. Then he looks for someone
connected to the target, and may not find him; then he looks for someone
connected to someone connected to the target and, assuming that most genuine
peers providing tracking information are tracking most other peers providing
genuine tracking information, and that the peer doing the search has the
information for a fair number of peers providing genuine tracking
information, he will find him.
Suppose there are a billion peers for which tracking information exists. In that case, we need the first fifty dimensions, and possibly one more level of indirection in the lookup (the sum of the sum of the sum of vectors being tracked). Suppose a trillion peers, then about the first sixty dimensions, and possibly one more level of indirection in the lookup.
That is a quite large amount of data, but if who is tracking whom is stable, even if the network addresses are unstable, updates are infrequent and small.
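
One way to see why the dimension count grows so slowly with the number of peers: for independent $\pm1$ vectors $u, v$ of dimension $d$, Hoeffding's bound gives

$$
\Pr\big[\,|u\cdot v|>\varepsilon d\,\big]\le 2e^{-\varepsilon^{2}d/2}
\quad\Longrightarrow\quad
d\gtrsim\frac{4}{\varepsilon^{2}}\ln n
$$

for near-orthogonality to hold across all pairs among $n$ identities, so the required dimension grows only with the logarithm of the population, which is consistent with the fifty and sixty dimension figures above.
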
If everyone tracks ten thousand identities, and we have a billion identities
whose network address is being made public, and a million always up peers
with fairly stable network addresses, each of whom tracks one thousand
unstable network addresses and several thousand other peers who also
track large numbers of unstable addresses, then we need about fifty
dimensions and two sum vectors for each entity being tracked, about a
million integers in total -- too big to be downloaded in full every time,
but not a problem if downloaded in small updates, or downloaded in full
infrequently. The data can be substantially compressed by a compression
algorithm that takes advantage of the fact that the values of a vector have
a normal distribution around zero.
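
The "about a million integers" figure is just the product of the assumed scale parameters:

$$
\underbrace{10^{4}}_{\text{identities tracked}}\times\underbrace{50}_{\text{dimensions}}\times\underbrace{2}_{\text{sum vectors each}}
=10^{6}\ \text{integers}\approx 4\,\text{MB at thirty two bits each,}
$$

which is indeed too much to re-download every time, but small enough to hold locally and to refresh incrementally.
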
But suppose no one specializes in tracking unstable network addresses. If
your network address is unstable, you only provide updates to those following
your feed, and if you have a lot of followers, you have to get a stable
network address with a stable open port so that you do not have to update
them all the time. Then our list of identities whose connection information
we are tracking will be considerably smaller, but our level of indirection
will be considerably deeper - possibly needing to go six or so levels deep
in the sum of the sum of ... sum of identity vectors.
## Private messaging