finally figured out the connection distribution algorithm

reaction.la 2022-07-03 12:53:50 +10:00
parent 522b96336c
commit 8530ae73b0
2 changed files with 97 additions and 18 deletions

.gitmodules

```diff
@@ -1,9 +1,12 @@
 [submodule "libsodium"]
 	path = libsodium
-	url = cpal.pw:~/libsodium.git
+	url = git@rho.la:~/libsodium.git
+	branch = rho-fork
 [submodule "mpir"]
 	path = mpir
-	url = cpal.pw:~/mpir.git
+	url = git@rho.la:~/mpir.git
+	branch = rho-fork
 [submodule "wxWidgets"]
 	path = wxWidgets
-	url = cpal.pw:~/wxWidgets.git
+	url = git@rho.la:~/wxWidgets.git
+	branch = rho-fork
```


@@ -285,7 +285,12 @@ dangerous centralization would fail under the inevitable attack. It needs to
I will describe the Kademlia distributed hash table algorithm not in the
way that it is normally described and defined, but in such a way that we
can easily replace its metric by [social distance metric], assuming that we
can construct a suitable metric, which reflects what feeds a given host is
following, and what network addresses it knows and the feeds they are
following, a quantity over which a distance can be found that reflects how
close a peer is to an unstable network address: whether it knows that
address, or knows a peer that is likely to know a peer that is likely to
know that unstable network address.
A distributed hash table works by each peer on the network maintaining a
large number of live and active connections to computers such that the
@@ -305,13 +310,6 @@ Kademlia the $\log_2$ of the exclusive-or between his hash and your hash.
and open ports, and connections that are distant from you distant from
each other.
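
For comparison, the standard Kademlia metric referred to above is simple to state in code. The following is a generic illustration of that metric, not code from this project; the identifier width and names are arbitrary.

```python
def kademlia_distance(id_a: int, id_b: int) -> int:
    """Standard Kademlia metric: the bitwise exclusive-or of two node IDs."""
    return id_a ^ id_b


def kademlia_bucket(id_a: int, id_b: int) -> int:
    """Index of the highest differing bit, i.e. roughly log2 of the xor distance.

    This is the quantity the text above calls the log2 of the exclusive-or
    between one peer's hash and another's.
    """
    d = kademlia_distance(id_a, id_b)
    return d.bit_length() - 1 if d else -1  # -1 means the two IDs are identical
```
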
Social distance is costly and complex to calculate, and requires that
information on a public feed showing its social connections be widely
shared, which is a lot of information that everyone has to acquire and
store, and perform a heavy calculation on. If there are more than thirty
or a hundred entities, we need to use dimensional reduction. But we do not
need to do it very often.
The reason that the Kademlia distributed hash table cannot work in the
face of enemy action is that the shills who want to prevent something
from being found create a hundred entries with a hash close to their target
@@ -365,11 +363,12 @@ This handles public posts.
### Kademlia in social space
The vector of each identity is a sequence of ones and zeros, conceptually
of unbounded length and unboundedly large dimension, though in practice you
will not need anything beyond the first few hundred bits.
The vector of an identity is $+1$ for each one bit, and $-1$ for each zero bit.
We deterministically generate the vector by hashing the public key of the identity.
We don't use the entire two hundred fifty six dimensional vector, just
enough of it that the truncated vector of every identity that anyone might
be tracking has a very high probability of being approximately orthogonal
to the truncated vector of every other identity.
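
As a concrete sketch of the construction just described (illustrative only: SHA-256 stands in for whatever hash of the public key the implementation actually uses, and fifty dimensions is just a plausible truncation length):

```python
import hashlib


def identity_vector(public_key: bytes, dims: int = 50) -> list[int]:
    """Truncated +/-1 identity vector derived from a public key.

    SHA-256 is a stand-in for whatever hash the real system uses; dims is
    the number of leading bits kept, chosen large enough that the truncated
    vectors of any two tracked identities are almost certainly nearly
    orthogonal.
    """
    digest = hashlib.sha256(public_key).digest()
    bits = [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(dims)]
    return [1 if b else -1 for b in bits]
```
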
We do not have, and do not need, an exact consensus on how much of the
vector to actually use, but everyone needs to use roughly the same amount
@@ -381,15 +380,92 @@ Each party indicates what entities he can provide a direct link to by
publishing the sum of the vectors of the parties he can link to - and also
the sum of their sums, and also the sum of their ... to as many levels deep
as turns out to be needed in practice, which is likely to be two or three
such vector sums, maybe four or five. What is needed will depend on the
pattern of tracking that people engage in in practice.
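
A minimal sketch of what publishing these sums might look like, assuming each peer republishes the identity vectors of the peers it can link to together with their own published sums; the data layout and names are invented for illustration:

```python
def vector_sum(vectors: list[list[int]]) -> list[int]:
    """Componentwise sum of a non-empty list of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def published_sums(direct_peers: list[dict], depth: int = 2) -> list[list[int]]:
    """Build the vectors a peer publishes, per the scheme described above.

    direct_peers: one dict per reachable peer, with that peer's identity
    vector under 'id_vec' and its own published sums under 'sums'.
    Level 0 of the result is the sum of the reachable identity vectors;
    level 1 is the sum of their level-0 sums (the sum of sums); and so on.
    """
    levels = [vector_sum([p["id_vec"] for p in direct_peers])]
    for level in range(1, depth):
        levels.append(vector_sum([p["sums"][level - 1] for p in direct_peers]))
    return levels
```
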
If everyone behind a firewall or with an unstable network address arranges
to notify a well known peer with a stable network address whenever his
address changes, and that peer, as part of the arrangement, includes him in
that peer's sum vector, and if the number of well known peers with stable
network addresses offering this service is not enormously large, they track
each other, and everyone tracks some of them, then we only need the sum and
the sum of sums.
When someone is looking to find how to connect to an identity, he goes
through the entities he can connect to, and looks at the dot product of
their sum vectors with the target identity vector.
He contacts the closest entity, or a close entity, and if that does not work
out, contacts another. The closest entity will likely be able to contact
the target, or contact an entity more likely to be able to contact the target.
* the identity vector represents the public key of a peer.
* the sum vector represents what identities a peer thinks he has valid connection information for.
* the sum of sum vectors indicates what identities the peers that he thinks he can connect to think that they can connect to.
* the sum of the sum of the sum vectors ...
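
Putting the last two paragraphs and the list above together, the lookup itself is a ranking by dot products. Here is a sketch under the same illustrative data layout as before; the scoring detail (shallower sums first, deeper sums as tie-breakers) is an assumption, since the text only says to contact the closest entity and fall back to the next:

```python
def dot(a: list[int], b: list[int]) -> int:
    return sum(x * y for x, y in zip(a, b))


def rank_peers_for_lookup(target_vec: list[int],
                          neighbours: dict[str, list[list[int]]]) -> list[str]:
    """Order reachable peers by how promising they look for reaching the target.

    neighbours maps a peer to its published [sum, sum of sums, ...] vectors.
    A large dot product with the first sum suggests the peer has direct
    connection information for the target; a large dot product with a deeper
    sum suggests it knows someone who is likely to know someone who does.
    """
    def score(sums: list[list[int]]) -> tuple:
        return tuple(dot(s, target_vec) for s in sums)

    return sorted(neighbours, key=lambda name: score(neighbours[name]), reverse=True)
```

The searcher contacts the first peer in this ranking and, if that does not work out, the next, exactly as described above.
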
A vector that provides the paths to connect to a billion entities, each of
them redundantly through a thousand different paths, is still sixty or so
thirty two bit signed integers, distributed in a normal distribution with a
variance of a million or so, but everyone has to store quite a lot of such
vectors. Small devices such as phones can get away with tracking a small
number of such integers, at the cost of needing more lookups, hence not being
very useful for other people to track for connection information.
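
A brief aside on why the coordinates stay small under the scale assumptions of the paragraph above: each coordinate of a published vector is a sum of independent $\pm 1$ contributions, one per identity reached, so

$$
s_i=\sum_{k=1}^{N}\varepsilon_k,\quad \varepsilon_k\in\{-1,+1\}
\quad\Longrightarrow\quad
\mathbb{E}[s_i]=0,\qquad \operatorname{Var}[s_i]=N,\qquad \sigma_{s_i}=\sqrt{N},
$$

so even for $N$ in the millions or billions each coordinate is a normally distributed value within a few standard deviations of zero, which fits comfortably in a thirty two bit signed integer.
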
To prevent hostile parties from jamming the network by registering
identities that closely approximate identities that they do not want people
to be able to look up, we need the system to work in such a way that
identities that lots of people want to look up tend to be heavily over
represented in sum of sums vectors relative to those that no one wants to
look up. If you repeatedly provide lookup services for a certain entity,
you should track the entity that had the last stable network address on the
path that proved successful to the target entity, so that peers that
provide useful tracking information are over represented, and entities that
provide useless tracking information are under represented.
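
One way to implement that preference, sketched with invented names (the text specifies the policy, not the data structure): remember, per target, the stably addressed peer on the most recent successful path, and consult it first next time.

```python
class LookupCache:
    """Hypothetical routing cache implementing the policy described above."""

    def __init__(self):
        # target identity -> the stably addressed peer that last led to it
        self._best_hop: dict[bytes, str] = {}

    def record_success(self, target_id: bytes, path: list[str], stable: set[str]) -> None:
        """Remember the last stably addressed peer on a path that reached target_id."""
        for peer in reversed(path):
            if peer in stable:
                self._best_hop[target_id] = peer
                return

    def preferred_hop(self, target_id: bytes):
        """Peer to try first for this target, if a previous lookup succeeded."""
        return self._best_hop.get(target_id)
```
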
If an entity makes publicly available network address information for an
identity whose vector is an improbably good approximation to an existing
widely looked up vector, a sybil attack is under way, and that information
needs to be ignored.
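
A sketch of how such a check could work, given that honestly generated truncated vectors are nearly orthogonal; the six sigma cutoff is an arbitrary illustrative threshold, not something the text specifies:

```python
import math


def looks_like_sybil(new_vec: list[int], popular_vecs: list[list[int]],
                     sigmas: float = 6.0) -> bool:
    """Flag a newly published identity vector that approximates a widely
    looked up vector improbably well.

    For independent +/-1 vectors of dimension d, the dot product has mean 0
    and standard deviation sqrt(d); a correlation many sigmas above that is
    evidence of a deliberately crafted identity.
    """
    d = len(new_vec)
    threshold = sigmas * math.sqrt(d)
    return any(abs(sum(a * b for a, b in zip(new_vec, v))) > threshold
               for v in popular_vecs)
```
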
To be efficient at very large scale, the network should contain a relatively
small number of large well connected devices, each of which tracks the
tracking information of a large number of other such computers, and a large
number of smaller, less well connected devices, that track their friends and
acquaintances, and also track well connected devices. Big fanout on the
interior vertices, smaller fanout on the exterior vertices, stable identities
on all devices, moderately stable network addresses on the interior vertices,
possibly unstable network addresses on the exterior vertices.
If we have a thousand identities that are making public the information
needed to make a connection to them, and everyone tracks all the peers that
provide third party lookup services, we need only the first sum, and only
about twenty dimensions.
But if everyone attempts to track all the connection information
for all peers that provide third party lookup services, there are soon going
to be a whole lot of shill, entryist, and spammer peers purporting to provide
such services, whereupon we will need white lists, grey lists, and human
judgement, and not everyone will track all peers who are providing third
party lookup services, whereupon we need the first two sums.
In that case a random peer searching for connection information for another
random peer first looks through those peers for which he has good connection
information, and does not find the target. Then he looks for someone
connected to the target, and may not find him; then he looks for someone
connected to someone connected to the target and, assuming that most genuine
peers providing tracking information are tracking most other peers providing
genuine tracking information, and that the peer doing the search has the
information for a fair number of peers providing genuine tracking
information, he will find him.
Suppose there are a billion peers for which tracking information exists. In that case, we need the first fifty dimensions, and possibly one more level of indirection in the lookup (the sum of the sum of the sum of vectors being tracked). Suppose a trillion peers, then about the first sixty dimensions, and possibly one more level of indirection in the lookup.
That is a quite large amount of data, but if who is tracking whom is stable, even if the network addresses are unstable, updates are infrequent and small.
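
One way to see why the dimension count grows so slowly with the number of peers: for independent $\pm1$ vectors $u, v$ of dimension $d$, Hoeffding's bound gives

$$
\Pr\big[\,|u\cdot v|>\varepsilon d\,\big]\le 2e^{-\varepsilon^{2}d/2}
\quad\Longrightarrow\quad
d\gtrsim\frac{4}{\varepsilon^{2}}\ln n
$$

for near-orthogonality to hold across all pairs among $n$ identities, so the required dimension grows only with the logarithm of the population, which is consistent with the fifty and sixty dimension figures above.
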
If everyone tracks ten thousand identities, and we have a billion identities
whose network address is being made public, and a million always up peers
with fairly stable network addresses, each of whom tracks one thousand
unstable network addresses and several thousand other peers who also
track large numbers of unstable addresses, then we need about fifty
dimensions and two sum vectors for each entity being tracked, about a
million integers in total -- too big to be downloaded in full every time,
but not a problem if downloaded in small updates, or downloaded in full
infrequently. The data can be substantially compressed by a compression
algorithm that takes advantage of the fact that the values of a vector have
a normal distribution around zero.
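
The "about a million integers" figure is just the product of the assumed scale parameters:

$$
\underbrace{10^{4}}_{\text{identities tracked}}\times\underbrace{50}_{\text{dimensions}}\times\underbrace{2}_{\text{sum vectors each}}
=10^{6}\ \text{integers}\approx 4\,\text{MB at thirty two bits each,}
$$

which is indeed too much to re-download every time, but small enough to hold locally and to refresh incrementally.
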
But suppose no one specializes in tracking unstable network addresses. If
your network address is unstable, you only provide updates to those following
your feed, and if you have a lot of followers, you have to get a stable
network address with a stable open port so that you do not have to update
them all the time. Then our list of identities whose connection information
we are tracking will be considerably smaller, but our level of indirection
will be considerably deeper - possibly needing to go six or so levels deep
in the sum of the sum of ... sum of identity vectors.
## Private messaging