---
lang: en
title: Estimating frequencies from small samples
# katex
---
# The problem to be solved

Because protocols need to be changed, improved, and fixed from time to
time, it is essential to have a protocol negotiation step at the start of
every networked interaction, and protocol requirements at the start of
every store-and-forward communication.

But we also want anyone, anywhere, to be able to introduce new
protocols without having to coordinate with everyone else.  Attempts to
coordinate the introduction of new protocols have ground to a halt as
more and more people have become involved in coordination and decision
making.  The IETF is paralyzed and moribund.

So we need a large enough address space that anyone can give his
protocol an identifier without fear of stepping on someone else’s identifier.
But this involves inefficiently long protocol identifiers, which can become
painful if we have lots of protocol negotiation, where one system asks
another system what protocols it supports.  We might have lots of
protocols in lots of variants each with long names.

So our system forms a guess as to the likelihood of a protocol, and then
sends or requests enough bits to reliably identify that protocol.  But this
means it must estimate probabilities from limited data.  If one’s data is
limited, priors matter, and thus a Bayesian approach is required.
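
As a rough illustration of how an estimated likelihood might translate into identifier length, the sketch below uses the Shannon code length, $-\log_2 p$ bits, plus a fixed safety margin.  Both the function name `identifier_bits` and the margin of 8 bits are assumptions made for this example; the text does not specify the actual rule.

```python
import math


def identifier_bits(probability: float, margin_bits: int = 8) -> int:
    """Hypothetical rule: Shannon code length plus a safety margin.

    A protocol believed to occur with the given probability needs about
    -log2(probability) bits to single it out; the margin (an assumption)
    reduces the chance of confusing it with another protocol.
    """
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return math.ceil(-math.log2(probability)) + margin_bits


# A common protocol needs few bits, a rare one needs many.
print(identifier_bits(0.5))       # 1 bit of code length + 8 margin = 9
print(identifier_bits(2 ** -20))  # 20 bits of code length + 8 margin = 28
```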

# Bayesian Prior

The Bayesian prior is the probability of a probability, or, if this recursion
is philosophically troubling, the probability of a frequency.  We have an
urn containing a very large number of samples, from which we have taken
few or no samples.  What proportion of samples in the urn will be
discovered to have property X?

Let $Ρ_{prior}(ρ)$ be our prior estimate of the probability that the
proportion of samples in the urn that are X is ρ.

This is the probability of a probability.  The probability that a sample is X is the average of ρ, weighted by these prior probabilities of probabilities.

Then our estimate of the chance $P_X$ that the first sample will be X is
$$P_X = \int_0^1 ρ × Ρ_{prior}(ρ) dρ$$

Then if we take one sample out of the urn, and it is indeed X, then we
update all our priors by:
$$P_{new}(ρ)  = \frac{ρ × Ρ_{prior}(ρ)}{P_X}$$
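
The update can be checked numerically on a grid.  This is a minimal sketch, assuming the prior is stored as density values at evenly spaced midpoints of $[0,1]$ and taking $P_X$ to be the prior-weighted mean of ρ; the names `predictive` and `update_on_x` are made up for the example.

```python
def predictive(grid, density):
    """P_X: the mean of rho under the given density (midpoint rule)."""
    width = 1.0 / len(grid)
    return sum(rho * p for rho, p in zip(grid, density)) * width


def update_on_x(grid, density):
    """New density after drawing one sample that turned out to be X."""
    p_x = predictive(grid, density)
    return [rho * p / p_x for rho, p in zip(grid, density)]


N = 1000
grid = [(i + 0.5) / N for i in range(N)]  # midpoints of N equal cells
uniform = [1.0] * N                       # all proportions equally likely

posterior = update_on_x(grid, uniform)
print(predictive(grid, uniform))    # close to 1/2
print(predictive(grid, posterior))  # close to 2/3
```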

# Beta Distribution

The Beta distribution is
$$P_{αβ}(ρ) =   \frac{ρ^{α-1} × (1-ρ)^{β-1}}{B(α,β)}$$
where
$$B(α,β) = \frac{Γ(α) × Γ(β)}{Γ(α + β)}$$

$Γ(α) = (α − 1)!$ for positive integer α\
$Γ(1) = 1 =0!$\
$B(1,1) = 1$\
$B(1,2) = ½$\
$Γ(α+1) = α Γ(α)$ for all α

Let us call this probability distribution the prior of our prior.

It is convenient to take our prior to be a Beta distribution, for if our prior
for the proportion of samples that are X is the Beta distribution $α,β$, and we
take three samples, one of which is X, and two of which are not X, then
our new distribution is the Beta distribution $α+1,β+2$.

If our distribution is the Beta distribution $α,β$, then the probability
that the next sample will be X is $\frac{α}{α+β}$.

If $α$ and $β$ are large, then the Beta distribution approximates a delta
function centered on $\frac{α}{α+β}$.

If $α$ and $β$ equal $1$, then the Beta distribution assumes all probabilities
equally likely.

That, of course, is a pretty good prior, which leads us to the conclusion
that if we have seen $n$ samples that are green, and $m$ samples that are not
green, then the probability of the next sample being green is $\frac{n+1}{n+m+2}$.
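
The bookkeeping for a Beta prior amounts to adding counts.  The sketch below uses exact fractions; the class name `BetaPrior` and its method names are invented for illustration.

```python
from fractions import Fraction


class BetaPrior:
    """Beta distribution over the proportion of samples that are X."""

    def __init__(self, alpha=1, beta=1):
        # alpha = beta = 1 is the uniform prior.
        self.alpha = alpha
        self.beta = beta

    def observe(self, x_count, not_x_count):
        """Return the updated distribution after seeing the given counts."""
        return BetaPrior(self.alpha + x_count, self.beta + not_x_count)

    def probability_next_is_x(self):
        """Chance that the next sample is X: alpha / (alpha + beta)."""
        return Fraction(self.alpha, self.alpha + self.beta)


# Uniform prior, then three samples: one X and two not X.
updated = BetaPrior().observe(1, 2)
print(updated.alpha, updated.beta)      # 2 3
print(updated.probability_next_is_x())  # 2/5
```

Starting from the uniform prior, the result is $\frac{1+1}{1+2+2} = \frac{2}{5}$, the same value that the rule for green and not green samples gives.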

Realistically, until we have seen diverse results there is a finite probability
that all samples are X, or all not X, but no Beta distribution describes this
case.

If our prior for the question “what proportion of men are mortal?” was a
beta distribution, we would not be convinced that all men are mortal until
we had first checked all men  –  thus a beta distribution is not always a
plausible prior, though it rapidly converges to a plausible prior as more
data comes in.

So perhaps a fairly good prior is a mixture of the two: partly a Beta
distribution, and partly the possibility that all samples are alike.  The
principle of maximum entropy tells us to choose our prior to be $α=1$,
$β=1$, but in practice, we usually have some reason to believe all
samples are alike, so we need a prior that weights this possibility.

# Weight of evidence

The weight of evidence is the negative of the entropy of $P(ρ)$
$$\int_0^1 Ρ_{prior}(ρ) × \ln\big(Ρ_{prior}(ρ)\big) dρ$$
the lower the entropy, the more we know about the distribution P(ρ),
hence the principle of maximum entropy – that our distribution should
faithfully represent the weight of our evidence, no stronger and no
weaker.
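
To get a feel for the integral above, it can be evaluated numerically for a Beta distribution.  This is a sketch only: it uses a plain midpoint rule, assumes $α ≥ 1$ and $β ≥ 1$ so the density stays bounded, and the function names are hypothetical.

```python
import math


def log_beta_density(rho, alpha, beta):
    """Natural log of the Beta(alpha, beta) density at rho, 0 < rho < 1."""
    log_b = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(rho) + (beta - 1) * math.log(1 - rho) - log_b


def weight_of_evidence(alpha, beta, steps=10_000):
    """Integral of P * ln(P) over [0, 1], by the midpoint rule."""
    total = 0.0
    for i in range(steps):
        rho = (i + 0.5) / steps
        log_p = log_beta_density(rho, alpha, beta)
        total += math.exp(log_p) * log_p
    return total / steps


print(weight_of_evidence(1, 1))    # uniform prior: zero
print(weight_of_evidence(2, 3))    # after 1 X and 2 not X: about 0.235
print(weight_of_evidence(11, 21))  # after 10 X and 20 not X: larger still
```

The uniform distribution gives zero, and the value grows as samples accumulate and the distribution narrows.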

The principle of maximum entropy leaves us with the question of what
counts as evidence.  To apply it, we need to take into account *all*
evidence, and everything in the universe has some relevance.

Thus to answer the question “what proportion of men are mortal” the
principle of maximum entropy, naively applied, leads to the conclusion
that we cannot be sure that all men are mortal until we have first checked
all men.  If, however, we include amongst our priors the fact that
all men are kin, then the proposition that all men are X, or that no men
are X, has to have a considerably higher prior weighting than the
proposition that fifty percent of men are X.

The Beta distribution is mathematically convenient, but
unrealistic.  That the universe exists, and we can observe it,
already gives us more information than the uniform distribution, thus the
principle of maximum entropy is not easy to apply.

Further, in networks, we usually care about the current state of the
network, which is apt to change, thus we frequently need to apply a decay
factor, so that what was once known with extremely high probability, is now
only known with reasonably high probability.  There is always some
unknown, but finite, substantial, and growing, probability of a large
change in the state of the network, rendering past evidence
irrelevant.
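
The text does not say what form the decay should take.  One simple possibility, offered purely as an illustration, is to shrink the accumulated counts of a Beta distribution back toward the prior by a constant factor each time period.  The function `decay_counts` and its parameters are assumptions for this sketch.

```python
def decay_counts(alpha, beta, factor, prior_alpha=1.0, prior_beta=1.0):
    """Shrink accumulated evidence toward the prior.

    factor = 1 keeps all past evidence, factor = 0 forgets everything.
    This particular rule is an assumption, not taken from the text.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    return (prior_alpha + factor * (alpha - prior_alpha),
            prior_beta + factor * (beta - prior_beta))


# 1000 samples, all X, on top of a uniform prior.
alpha, beta = 1001.0, 1.0
print(alpha / (alpha + beta))  # extremely high probability, about 0.999

# After heavy decay the same evidence counts for much less.
alpha, beta = decay_counts(alpha, beta, factor=0.01)
print(alpha / (alpha + beta))  # still high, but only about 0.917
```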

Thus any adequately flexible representation of the state of the network
has to be complex, a fairly large body of data, more akin to a spam filter
than a boolean.

# A more realistic prior

Suppose that, before we take any samples from the urn, our prior for the proportion ρ of samples in the urn that are X is
$$\frac{1}{3}P_{11} (ρ) + \frac{1}{3}δ(ρ) + \frac{1}{3}δ(1-ρ)$$

We are allowing for a substantial likelihood of all X, or all not X.

If we draw out $m + n$ samples, and find that $m$ of them are X, and $n$ of
them are not X, then, provided neither $m$ nor $n$ is zero, the $δ$ terms
drop out, and our updated prior is, as usual, the Beta distribution
$$P_{m+1,n+1}(ρ) = \frac{ρ^m × (1-ρ)^n }{B(m+1,n+1)}$$

But suppose we draw out n samples, and all of them are X, or none of
them are X.

Without loss of generality, we may suppose all of them are X.

Then what is our prior after n samples, all of them X?

After one sample, n=1, our new estimate is

$$\frac{2}{3} × \bigg(\frac{ρ}{B(1,1)} + δ(1−ρ)\bigg)$$
$$=\frac{1}{3}\frac{ρ}{B(2,1)} + \frac{2}{3}δ(1−ρ)$$

We see the beta distributed part of the probability distribution keeps
getting smaller, and the delta distributed part of the probability keeps
getting higher.

And our estimate that the second sample will also be X is
$$\frac{8}{9}$$

After two samples, n=2, the beta distributed part carries probability
$\frac{1}{4}$, and our new estimate is

$$\frac{1}{4}\frac{ρ^2}{B(3,1)}+\frac{3}{4}δ(1−ρ)$$

where $B(3,1) = \frac{1}{3}$.

And our estimate that the third sample will also be X is $\frac{15}{16}$.

By induction, after n samples, all of them members of category X, our new
estimate for one more sample is
$$1-(n+2)^{-2}=\frac{(n+3)×(n+1)}{(n+2)^2}$$
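
The induction step can be spelled out.  Write $w_n$ (notation introduced here for convenience) for the probability carried by the beta distributed part after $n$ samples, all of them X.  Each sample multiplies the density by ρ, so the beta part, which started with weight $\frac{1}{3}$, shrinks in proportion to $\int_0^1 ρ^n dρ = \frac{1}{n+1}$, while the $δ(1−ρ)$ part keeps its weight of $\frac{1}{3}$ and the $δ(ρ)$ part vanishes.  After normalizing,
$$w_n = \frac{\frac{1}{3} × \frac{1}{n+1}}{\frac{1}{3} × \frac{1}{n+1} + \frac{1}{3}} = \frac{1}{n+2}$$

The beta part is the Beta distribution $n+1,1$, which predicts X with probability $\frac{n+1}{n+2}$, and the delta part predicts X with certainty, so the estimate for one more sample is
$$\frac{1}{n+2} × \frac{n+1}{n+2} + \frac{n+1}{n+2} × 1 = \frac{(n+3)×(n+1)}{(n+2)^2}$$

The weight left on the delta part is $1 - w_n = \frac{n+1}{n+2}$.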

Our estimate that the run will continue forever is
$$\frac{(n+1)}{n+2}$$

This corresponds to our intuition on the question “all men are mortal”.  If we find no immortals in one hundred men, we think it highly improbable that we will encounter any immortals in a billion men.

In contrast, if we assume the beta distribution, this implies that the likelihood of the run continuing forever is zero.