---
lang: en
title: Estimating frequencies from small samples
# katex
...

# The problem to be solved

## distributed hash table

The distributed hash table fails horribly in the face of a
significant likelihood of bad behaviour by the participants,
because you do not actually know the state of the network.
The usual procedure (the Bittorrent network) is to treat information as
unconditionally valid for two hours, then throw it away,
which is pretty useless if a participant is behind a NAT,
and a disastrous loss of data if he has a long lived network address.

We would like to accumulate on disk very long lived
and rarely changing data about long lived participants,
the backbone of the distributed hash table.

We also want to have an arrangement with peers behind a NAT,
that each will ping the other at certain times with a keep-alive,
and if the expected keep-alive fails to arrive, the ensuing nacks and acks
will re-open the hole in the firewall, and also give us
information on how often each needs to ping the other.
When one peer concludes the timing of the pings could be improved,
they renegotiate the schedule with each other,
so that peers behind a NAT with long lived holes do not need frequent pings.

At present, a random lookup serves the function of a keep-alive, resulting in
excessive churn in the DHT.

If we represent the state of the distributed hash table with
metalogistic distributions, the resulting distributed hash table
should be Byzantine fault tolerant.
(Because a Byzantine faulting peer eventually winds up being rated
as unreliable, and the backbone of the distributed hash table will
be long lived peers with long lived reputations, the reputation
being represented by a metalogistic distribution giving the likelihood
that the information supplied is correct.)

Each peer is identified by its durable public key. For each peer
there is its current network address, and a metalogistic distribution
of the longevity of that network address,
which no one keeps around for very long or distributes very far
if it does not indicate much longevity.

There is also a metalogistic distribution of the likelihood
that hole punching will be needed, and if likely to be needed,
a list of peers that might provide it,
and the likelihood that hole punching will work.
If the first peer in the list is up but fails, the next is not tried.
But if the first peer cannot be contacted, the next is contacted.

And, if hole punching is needed, a metalogistic distribution of
how long the hole is likely to last after punching.

And, most importantly, for our backbone of very long lived peers,
metalogistic distributions of the likelihood of Byzantine fault,
which will provide us with a Byzantine fault tolerant distributed hash table.

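To make the shape of this data concrete, here is a minimal sketch of such a
peer record, assuming a hypothetical `Metalog` type that carries a
metalogistic distribution and supports Bayesian updates; none of these names
come from the actual codebase.

```cpp
// Minimal sketch of a DHT peer record; Metalog is a stand-in for a
// real metalogistic distribution type, stubbed for the sketch.
#include <array>
#include <cstdint>
#include <vector>

struct Metalog {
    std::vector<double> a;  // metalog coefficients, as many as warranted
    // Bayesian update on a true/false observation; the posterior of a
    // metalog is again a metalog (stub body for the sketch)
    void observe(bool) {}
    // point estimate, for when a single probability is needed
    double expectation() const { return 0.5; }
};

using PeerId = std::array<std::uint8_t, 32>;  // durable public key

struct PeerRecord {
    PeerId durable_public_key;
    std::uint32_t network_address;      // current IPv4 address
    std::uint16_t port;
    Metalog address_longevity;          // how long this address is likely to last
    Metalog needs_hole_punch;           // likelihood hole punching will be needed
    std::vector<PeerId> hole_punchers;  // peers to try, in order
    Metalog hole_lifetime;              // how long a punched hole is likely to last
    Metalog honesty;                    // likelihood its information is correct
};
```
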
## protocol negotiation

We could also apply distributions to protocol negotiation,
though this is likely to be colossal overkill.

Because protocols need to be changed, improved, and fixed from time to
time, it is essential to have a protocol negotiation step at the start of
every networked interaction, and protocol requirements at the start of
every store and forward communication.

But we also want anyone, anywhere, to be able to introduce new
protocols, without having to coordinate with everyone else, as attempts to
coordinate the introduction of new protocols have ground to a halt, as
more and more people are involved in coordination and making decisions.
The IETF is paralyzed and moribund.

So we need a large enough address space that anyone can give his
protocol an identifier without fear of stepping on someone else’s identifier.
But this involves inefficiently long protocol identifiers, which can become
painful if we have lots of protocol negotiation, where one system asks
another system what protocols it supports. We might have lots of
protocols in lots of variants, each with long names.

So our system forms a guess as to the likelihood of a protocol, and then
sends or requests enough bits to reliably identify that protocol. But this
means it must estimate probabilities from limited data. If one’s data is
limited, priors matter, and thus a Bayesian approach is required.

### should not worry about protocol identifier size for a long time

The above is massive overkill.

A quick solution, far less clever than accurately guessing that
two entities are speaking the same language, is to find an integer such
that both parties have a Dewey decimal protocol identifier that
starts with the same integer, and then go with the smaller of the
two Dewey decimal protocol identifiers.
Dewey decimal numbers that start with the same integer should be different
versions of the same protocol, and if one party can handle the
higher numbered version,
he has to be able to handle all lower numbered versions of that same protocol.
Dewey decimal numbers that start with different integers
represent unrelated protocols.

So if the client says 7.3.2.2.1, and the server has only been
updated to 7.2.0, he replies 7.2.0, and both parties then go
with 7.2.0,
but if he only knows 6.3.3, 1.6.0 and 219.1.0, he replies
"fail unknown protocol."

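A sketch of this rule, assuming identifiers are held as sequences of
integers; the names here are illustrative, not from any actual
implementation.

```cpp
// Pick a protocol: same leading integer means same protocol family,
// and both sides go with the smaller identifier of the two.
#include <optional>
#include <vector>

using ProtocolId = std::vector<unsigned>;

std::optional<ProtocolId> negotiate(const ProtocolId& client,
                                    const std::vector<ProtocolId>& supported) {
    for (const ProtocolId& ours : supported) {
        if (!ours.empty() && !client.empty() && ours[0] == client[0]) {
            // lexicographic comparison picks the lower numbered version
            return ours < client ? ours : client;
        }
    }
    return std::nullopt;  // "fail unknown protocol"
}
```

With the example above, a client offering 7.3.2.2.1 against a server whose
best match is 7.2.0 lands on 7.2.0, and a server that knows only 6.3.3,
1.6.0 and 219.1.0 returns nothing.
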
People launching a new protocol pick an integer,
and if they are not sure what integers are in use,
they just pick a fairly large integer.
In time, we will wind up with a whole lot of integers that are "in use",
the vast majority of which are no longer in use,
and no one is sure which ones are no longer in use,
so for a new protocol, they pick a sufficiently large random number.
(Assuming we represent these integers by variable length quantities
so that we can go to unlimitedly large integers, or at least integers
in the range [0 to 283 trillion](./variable_length_quantity.html){target="_blank"},
which should be unlimited enough for anyone.
In the unlikely event that there are eventually ten million protocols
floating around the internet,
a random number in that range is unlikely to lead to a collision.)

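For concreteness, a generic seven-bits-per-byte encoding looks like the
sketch below; the linked page defines the exact scheme used here, so treat
this as illustrative only.

```cpp
// Generic variable length quantity: seven payload bits per byte,
// high bit set on every byte except the last.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> vlq_encode(std::uint64_t n) {
    std::vector<std::uint8_t> out;
    do {
        out.push_back(static_cast<std::uint8_t>(n & 0x7F));
        n >>= 7;
    } while (n != 0);
    // bytes came out little endian; emit big endian with
    // continuation bits on all but the final byte
    std::vector<std::uint8_t> result(out.rbegin(), out.rend());
    for (std::size_t i = 0; i + 1 < result.size(); ++i)
        result[i] |= 0x80;
    return result;
}
```
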
If there were ten million protocols floating around,
then the theoretically optimal way of representing
protocols would only be three or four bytes smaller,
so doing it this easy way is not a significant waste of space.

# Bayesian Prior

The Bayesian prior is the probability of a probability, or, if this recursion
is philosophically troubling, the probability of a frequency. We have an
urn containing a very large number of samples, from which we have taken
few or no samples. What proportion of samples in the urn will be
discovered to have property X?

Let our prior estimate of the probability that the proportion of samples in
the urn that are X is ρ be $Ρ_{prior}(ρ)$

This is the probability of a probability. The probability that a sample is X
is the sum, over all possible values of ρ, of the prior probability of that
value of ρ, weighted by ρ.

Then our estimate of the chance $P_X$ that the first sample will be X is
$$P_X = \int_0^1 ρ × Ρ_{prior}(ρ)\,dρ$$

Then if we take one sample out of the urn, and it is indeed X, then we
update all our priors by:
$$P_{new}(ρ) = \frac{ρ × Ρ_{prior}(ρ)}{P_X}$$

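A numerical sketch of this update, discretizing ρ on a grid; illustrative
only, since the document's real representation is the metalog discussed
below.

```cpp
// Hold a weight for each candidate frequency rho, predict by summing
// rho * P(rho), and renormalize after each observed draw.
#include <cstddef>
#include <vector>

struct GridPrior {
    std::vector<double> rho;     // grid of candidate frequencies in (0,1)
    std::vector<double> weight;  // P_prior(rho), summing to 1

    // probability that the next sample is X
    double p_next() const {
        double p = 0;
        for (std::size_t i = 0; i < rho.size(); ++i) p += rho[i] * weight[i];
        return p;
    }

    // Bayesian update on observing one sample that is (or is not) X
    void observe(bool is_x) {
        double norm = 0;
        for (std::size_t i = 0; i < rho.size(); ++i) {
            weight[i] *= is_x ? rho[i] : (1 - rho[i]);
            norm += weight[i];
        }
        for (double& w : weight) w /= norm;
    }
};
```
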
# Beta Distribution

Here I discuss the Beta distribution for a zero dimensional observable.
The observable is either true or false, green or not green, and $α$ and $β$
are continuous quantities, real numbers.

If the observable is a real number, then $α$ and $β$ are size two vectors, points
in a two dimensional space, and this is known as the gamma distribution.
If the observable is a two dimensional vector,
a point in a two dimensional space, then $α$ and $β$ are size three vectors,
points in a three dimensional space, and this is known as the delta distribution.

The Beta distribution is
$$P_{αβ}(ρ) = \frac{ρ^{α-1} × (1-ρ)^{β-1}}{B(α,β)}$$
where
$$B(α,β) = \frac{Γ(α) × Γ(β)}{Γ(α + β)}$$
$ρ$ is the *real* probability that the ball will be green, and $α$ and $β$
represent our prior for the likely probability of this probability.
In the delta distribution, we have the probability not of true or false, but
of a Poisson distribution, which is itself the probability of a value,
so we have another level of recursion.

$Γ(α) = (α − 1)!$ for positive integer α\
$Γ(1) = 1 = 0!$\
$B(1,1) = 1$\
$B(1,2) = ½$\
$Γ(α+1) = α Γ(α)$ for all α

Let us call this probability distribution the prior of our prior.

It is convenient to take our prior to be a Beta distribution, for if our prior
for the proportion of samples that are X is the Beta distribution $α,β$, and we
take three samples, one of which is X, and two of which are not X, then
our new distribution is the Beta distribution $α+1,β+2$.

If our distribution is the Beta distribution $α,β$, then the probability
that the next sample will be X is $$\frac{α}{α+β}$$

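In code, the whole update machinery is just two counters; a minimal sketch:

```cpp
// The Beta distribution as a reputation tracker: the entire state is
// two numbers, and update and prediction are one-liners.
struct Beta {
    double alpha;
    double beta;

    void observe(bool is_x) { (is_x ? alpha : beta) += 1; }

    // probability that the next sample is X
    double p_next() const { return alpha / (alpha + beta); }
};

// Starting from the uniform prior Beta{1, 1}, seeing n green samples and
// m not green gives p_next() == (n+1)/(n+m+2), the rule below.
```
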
If $α$ and $β$ are large, then the Beta distribution approximates a delta
function.

If $α$ and $β$ equal $1$, then the Beta distribution assumes all probabilities
equally likely.

That, of course, is a pretty good prior, which leads us to the conclusion
that if we have seen $n$ samples that are green, and $m$ samples that are not
green, then the probability of the next sample being green is $$\frac{n+1}{n+m+2}$$

Realistically, until we have seen diverse results there is a finite probability
that all samples are X, or all not X, but no Beta distribution describes this
case.

If our prior for the question “what proportion of men are mortal?” was a
beta distribution, we would not be convinced that all men are mortal until
we had first checked all men – thus a beta distribution is not always a
plausible prior, though it rapidly converges to a plausible prior as more
data comes in.

So perhaps a fairly good prior is half of one, and half of the other. The
principle of maximum entropy tells us to choose our prior to be $α=1$,
$β=1$, but in practice, we usually have some reason to believe all
samples are alike, so we need a prior that weights this possibility.

# Weight of evidence

The weight of evidence is the negative of the entropy of $P(ρ)$
$$\int_0^1 Ρ_{prior}(ρ) × \ln\big(Ρ_{prior}(ρ)\big)\,dρ$$
the lower the entropy, the more we know about the distribution $P(ρ)$,
hence the principle of maximum entropy – that our distribution should
faithfully represent the weight of our evidence, no stronger and no
weaker.

The principle of maximum entropy leaves us with the question of what
counts as evidence. To apply it, we need to take into account *all*
evidence, and everything in the universe has some relevance.

Thus to answer the question “what proportion of men are mortal” the
principle of maximum entropy, naively applied, leads to the conclusion
that we cannot be sure that all men are mortal until we have first checked
all men. If, however, we include amongst our priors the fact that
all men are kin, then that all men are X, or no men are X, has to have a
considerably higher prior weighting than the proposition that fifty
percent of men are X.

The Beta distribution is mathematically convenient, but
unrealistic. That the universe exists, and we can observe it,
already gives us more information than the uniform distribution, thus the
principle of maximum entropy is not easy to apply.

Further, in networks, we usually care about the current state of the
network, which is apt to change, thus we frequently need to apply a decay
factor, so that what was once known with extremely high probability, is now
only known with reasonably high probability. There is always some
unknown, but finite, substantial, and growing, probability of a large
change in the state of the network, rendering past evidence
irrelevant.

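One simple way to apply such a decay factor, sketched here on the
two-number Beta representation: shrink the evidence counts toward the prior
at each time step. The half-life scheme is an assumption for the sketch,
not something the document prescribes.

```cpp
// Beta counts with exponential forgetting: old observations weigh
// less than recent ones, while the Beta(1,1) prior stays intact.
#include <cmath>

struct DecayingBeta {
    double alpha = 1;
    double beta = 1;

    void observe(bool is_x) { (is_x ? alpha : beta) += 1; }

    // call once per time step: halve the weight of past evidence
    // every half_life steps
    void decay(double half_life) {
        double k = std::exp2(-1.0 / half_life);
        alpha = 1 + (alpha - 1) * k;
        beta  = 1 + (beta  - 1) * k;
    }
};
```
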
Thus any adequately flexible representation of the state of the network
has to be complex, a fairly large body of data, more akin to a spam filter
than a boolean.

# A more realistic prior

## The beta distribution corrected

Corrected for the expectation that our universe is orderly and predictable.

The Beta distribution has the interesting property that for each new test,
the Bayesian update of the Beta distribution is also a Beta distribution.

Suppose our prior, before we take any samples from the urn, is that the
probability that the proportion of samples in the urn that are X is ρ is
$$\frac{1}{3}P_{11} (ρ) + \frac{1}{3}δ(ρ) + \frac{1}{3}δ(1-ρ)$$

We are allowing for a substantial likelihood of all X, or all not X,
whereas the ordinary beta prior, in the absence of information, assumes that
the likelihood of all X or no X is infinitesimal. Realistically, it
is substantial.

If we draw out $m + n$ samples, and find that $m$ of them are X, and $n$ of
them are not X, then the $δ$ terms drop out, and our prior is, as usual, the
Beta distribution
$$P_{m+1,n+1}(ρ) = \frac{ρ^m × (1-ρ)^n }{B(m+1,n+1)}$$
if neither m nor n is zero.

But suppose we draw out n samples, and all of them are X, or none of
them are X.

Without loss of generality, we may suppose all of them are X.

Then what is our prior after n samples, all of them X?

After one sample, n=1, our new estimate is

$$\frac{2}{3} × \bigg(\frac{ρ}{B(1,1)} + δ(1−ρ)\bigg)$$
$$=\frac{1}{3}\frac{ρ}{B(2,1)} + \frac{2}{3}δ(1−ρ)$$

We see the beta distributed part of the probability distribution keeps
getting smaller, and the delta distributed part of the probability keeps
getting larger.

And our estimate that the second sample will also be X is
$$\frac{8}{9}$$

After two samples, n=2, our new estimate is

$$\frac{1}{4}\frac{ρ^2}{B(3,1)} + \frac{3}{4}δ(1−ρ)$$

with the beta distributed part now carrying probability $\frac{1}{4}$,
and the delta distributed part probability $\frac{3}{4}$.

And our estimate that the third sample will also be X is $\frac{15}{16}$

By induction, after n samples, all of them members of category X, our new
estimate for one more sample is
$$1-(n+2)^{-2}=\frac{(n+3)×(n+1)}{(n+2)^2}$$

Our estimate that the run will continue forever is
$$\frac{(n+1)}{n+2}$$

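The induction is easy to check numerically. A sketch that carries the
weight of the $δ(1−ρ)$ spike and the Beta part through successive X
observations, reproducing the numbers above (illustrative code only):

```cpp
// Mixture of delta(rho), delta(1-rho), and a Beta(a,b) part, updated
// on successive X draws. Prints 8/9, 15/16, ... and the delta weight
// (n+1)/(n+2).
#include <cstdio>

int main() {
    double w1 = 1.0 / 3.0;  // weight of delta(1 - rho): the "all X" spike
    double wb = 1.0 / 3.0;  // weight of the Beta(a, b) part
    // the delta(rho) spike has zero likelihood once an X is seen,
    // so it drops out on the first observation
    double a = 1, b = 1;
    for (int n = 1; n <= 4; ++n) {
        double mean = a / (a + b);     // P(X | beta part)
        double norm = w1 + wb * mean;  // total P(this draw is X)
        w1 /= norm;                    // posterior mixture weights
        wb = wb * mean / norm;
        a += 1;                        // beta part becomes Beta(a+1, b)
        double p_next = w1 + wb * a / (a + b);
        std::printf("n=%d  P(next X)=%f  P(run continues forever)=%f\n",
                    n, p_next, w1);
    }
}
```
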
This corresponds to our intuition on the question “all men are mortal”: if we find no immortals in one hundred men, we think it highly improbable that we will encounter any immortals in a billion men.

In contrast, if we assume the beta distribution, this implies that the likelihood of the run continuing forever is zero.

Similarly, to correct the delta distribution, we have to add the assumption
that certain trivial poisson distributions, which have infinitesimal
likelihood in the delta distribution, in fact have finite likelihood.
Though in many realistic cases this is not going to be the case,
for a real value is usually going to have some finite distribution,
and thus it is more likely to be OK to use the uncorrected
delta distribution.

## Haldane prior $α=0$ $β=0$

This is a pathological and unusable prior, because $B(0,0)$ is infinite.
We cannot actually do anything with it, but it has the desirable characteristic
that almost none or almost all are highly weighted, giving us much the same result as above.

It presupposes that we truly know nothing, but one has to start with some ground under one's feet.
One must always start with some leap of faith.

Since it is unusable, one may take $α=ɛ$, $β=ɛ$ where $ɛ$ is some small quantity; the smaller $ɛ$ is,
the smaller one's initial leap of faith. If $ɛ$ is small, this prior gives us a reasonable result for "all men are mortal".
Maybe immortals exist, but are enormously rare, and a small number of samples, assuming this prior,
gives us confidence that immortal men are extremely rare, but it still predicts very rare, not nonexistent.

$ɛ = \frac{1}{2}$ is a prior that generally gives results that are not too silly,
and avoids the complexity of our dual distribution with delta functions.

On the other hand, our dual distribution just gives us saner and more commonsensical conclusions.

## the metalog (metalogistic) distribution

The metalogistic distribution is like the Beta distribution in that
its Bayesian update is also a metalogistic distribution, but has more terms,
as many terms as are required for the nature of the thing being represented.

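Concretely, a metalog is usually specified through its quantile function, a
series in $\ln\frac{y}{1-y}$ and $(y-\frac{1}{2})$ with as many coefficients
as the data warrants. A sketch of the first six basis terms, after Keelin's
formulation; illustrative, not this project's code:

```cpp
// Metalog quantile function: the value at cumulative probability y.
#include <cmath>
#include <cstddef>
#include <vector>

double metalog_quantile(const std::vector<double>& a, double y) {
    double L = std::log(y / (1 - y));  // logistic transform of y
    double c = y - 0.5;
    double basis[] = {1, L, c * L, c, c * c, c * c * L};
    double q = 0;
    for (std::size_t i = 0; i < a.size() && i < 6; ++i)
        q += a[i] * basis[i];
    return q;
}
```
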
The Beta distribution plus two delta functions is a metalogistic distribution,
if we stretch the definition of the metalogistic distribution slightly.

The Beta distribution represents the probability of a probability
(since we are using it for its Bayesian update capability).
For example, we have a collection of urns containing red and not red balls,
and from time to time we draw a ball out of an urn and replace it,
whereupon the Beta distribution is our best guess
about the likelihood that it contains a certain ratio of red and not red balls
(also assuming the urns are enormously large,
and also always contain at least some red and at least some not red balls).

Suppose, however, the jars contain gravel, the size of each piece
of gravel in a jar being normally distributed, and we want to
estimate the mean size and standard deviation of the gravel in an urn,
rather than the ratio of red balls to not red balls.

(Well, the size $s$ cannot be normally distributed, because $s$ is strictly non negative,
but perhaps $\ln(s)$, or $s\ln(s)$, or $(s/a -a/s)$ is normally distributed, or
perhaps we should assume a Poisson distribution.)

Whereupon our Bayesian updates become more complex,
and our prior has to contain difficult to justify information
(no boulders or dust in the urns), but we are still doing Bayesian updates,
hence the Beta distribution, or rather its generalization
the metalogistic distribution, still applies.

Typically we want to exclude outliers as evidence, so we assume
a metalogistic distribution that is the sum of two terms, one
narrowly distributed, and one broadly distributed with a long tail.
