1
0
forked from cheng/wallet
wallet/docs/recognizing_categories_and_instances.md

260 lines
13 KiB
Markdown
Raw Normal View History

---
title: Recognizing categories, and recognizing particulars as forming a category
# katex
...
This is, of course, a deep unsolved problem in philosophy.
However, it seems to be soluble as computer algorithm. Programs that do
this, ought to look conscious.
There are a lot of programs solving things that I though were AI hard, for
example recognizing pornography, recognizing faces in images, predicting what
music, or what books, or what movies, a particular customer might like.
We have clustering algorithms that work in on points in spaces of reasonably
small dimension. However, instances are sparse vectors in space of
unenumerably large dimension.
Consider, for example, the problem of grouping like documents to like, for
spam filtering. Suppose the properties of the document are all substrings of
the document of twenty words or less and 200 characters or less. In that case,
there are as many dimensions as there are two hundred character strings.
# Dimensional reduction
The combinatorial explosion occurs because we have taken the wrong approach
to reducing problems that originate in the physical world of very large
dimension, large because each quality of the objects involved or potentially
involved is a dimension.
The cool magic trick that makes this manageable is dimensional reduction.
Johnson and Lindenstrauss discovered in the early 1980s that if one has
$O(2^n)$ points in a space of very large dimension, a random projection onto a
space of dimension $O(n)$ does not much affect distances and angles.
Achlioptas found that this is true for the not very random mapping wherein
elements of the matrix mapping the large space to the smaller space have the
form $1$, with probability $\frac{1}{6}$, $0$ with probability $\frac{4}{6}$,
$-1$ with probability $\frac{1}{6}$, though a sparse matrix is apt to
distort a sparse vector
There exists a set of $m$ points that needs dimension
$$\displaystyle{\LARGE\bigcirc\normalsize\frac{\ln(m)}{ε^2}}$$
in order to preserve the distances
between all pairs of points within a factor of$1±ε$
This is apt to be a lot. We might well have ten million points, and wish to
preserve distances within twenty five percent, in which case we need two
hundred and fifty six dimensions. So a dimensionally reduced point is not
necessarily reduced by a whole lot.
For spaces of dimension higher than fifteen or so, clever methods of
nearest neighbour search generally fail, and people generally wind up with
brute force search, comparing each point to each of the others, and then
they aggregate into groups by making near neighbours into groups, and
near groups into supergroups.
Wikipedia reports two open source methods, Locality Sensitive Hashing,
one of them used for exactly this problem, finding groups in emails.
The problem of finding near neighbours in social space, mapping the
[Kademlia]{#Kademlia} algorithm to social space, is similar but little different, since
every vertex already is more likely to have connection to neighbours, and
for an arbitrary vertex, whose connections we do not know, we want to
find a vertex among those we do know that is more likely to have a
connection to it, or to someone that has a connection to it, that is likely
to be nearer in terms of number of vertex traversals.
In which case everyone reports a value that reflects his neighbours, and
their neighbours, and their neighbours neighbours, with a neighbourhood
smell that grows more similar when we find a vertex likely to be nearest
neighbour of our target, and the problem is to construct this smell as a
moderately sized blob of data, that can be widely shared, so that each
vertex has unique smell of, say 256 or 512 bits, that reflects who it stably
has the information to quickly connect with, so you look at who you have
a connection with to find who is likely to have a connection to your target,
and he looks up those he is connected with to find someone more likely to
have a connection.
Locality sensitive hashing, LSH, including the open source email
algorithm, Nilsimsa Hash, attempts to distribute all points that are near
each other into the same bin.
So in a space of unenumerably large dimension, such as the set of substrings
of an email, or perhaps substrings of bounded length with bounds at spaces,
carriage returns, and punctuation, we deterministically hash each substring,
and use the hash to deterministically assign a mapping between the vector
corresponding to this substring, and a vector in the reduced space.
The optimal instance recognition algorithm, for normally distributed
attributes, and for already existent, already known categories, is Mahalanobis
distance
Is not the spam probability of an email just its $T.(S-G)$, where $T$ is
the vector of the email, and $S$ and $G$ are the average vectors of good
email and spam email?
Variance works, instead of probability Mahalanobis distance, but this is
most reasonable for things that have reasonable dimension, like attributing
skulls to races, while dimensional reduction is most useful in spaces of
unenumerably large dimension, where distributions are necessarily non
normal.
But variance is, approximately, the log of probability, so Mahalanobis is
more or less Bayes filtering, or at least one can be derived in terms of the other.
So we can reasonably reduce each email into twenty questions space, albeit in practice, a great deal more than twenty. Finding far from random
dimensions that reduce it to a mere twenty or so is an artificial intelligence
hard problem. If random dimensions, need $\bigcirc20\log{(n)}$ dimensions
where $n$ is the number of things. And $n$ is apt to be very large.
Finding interesting and relevant dimensions, and ignoring irrelevant and
uninteresting dimensions, is the big problem. It is the tie between
categorizing the world into natural kinds and seeing what matters in the
perceptual data while ignoring what is trivial and irrelevant. This requires
non trivial non local and non linear combinations of data, for example
adjusting the perceived colour of the apple for shadow and light colour, so
see the ample, rather than merely the light scattered by the apple into the eye.
We then, in the reduced space, find natural groupings, a natural grouping
being an elliptic region in high dimensional space where the density is
anomalously large, or rather a normal distribution in high dimensional space
such that assigning a particular email to a particular normal distribution
dramatically reduces the entropy.
We label each such natural grouping with the statistically improbable phrase
that best distinguishes members of the grouping from all other such groupings.
The end user can then issue rules that mails belonging to certain groupings
be given particular attention or lack of attention, such as being sent
direct to spam.
The success of face recognition, etc, suggests that this might be just a
problem of clever algorithms. Pile enough successful intelligence like
algorithms together, integrate them well, perhaps we will have sentience.
Analogously with the autonomous cars. They had no new algorithms, they just
made the old algorithms actually do something useful.
# Robot movement
Finding movement paths is full of singularities, looks to me that we force it
down to two and half dimensions, force the obstacles to stick figures, and
then find a path to the destination. Hence the mental limit on complex knot
problems.
Equivalently, we want to reduce the problem space to a collection of regions
in which pathfinding algorithms that assume continuity work, and then
construct graph of such regions where nodes correspond to such convex region
within which continuity works, and edges correspond an overlap between two
such convex regions. Since the space is enormous, drastic reduction is
needed.
In the case of robot movement we are likely to wind up with a very large
graph of such convex regions within which the assumption of singularity free
movement is correct, and because the graph is apt to be very large, finding
an efficient path through the graph is apt to be prohibitive, which is apt to
cause robot ground vehicles to crash because they cannot quickly figure out
the path to evade an unexpected object and makes it impractical for a robot
to take a can of beer from the fridge.
We therefore use the [sybil guard algorithm] to reduce the graph by treating
groups of highly connected vertices as a single vertex.
[sybil guard algorithm]:./sybil_attack.html
# Artificial Intelligence
[Gradient descent is not what makes Neural nets work] Comment by Bruce on
Echo State Networks.
[Gradient descent is not what makes Neural nets work]:https://scottlocklin.wordpress.com/2012/08/02/30-open-questions-in-physics-and-astronomy/
Echo state Network is your random neural network, which mixes a great pile of
randomness into your actual data, to expand it into a much larger pile of
data that implicitly contains all the uncorrupted original information in its
very great redundancy, albeit in massively mangled form. Then “You just fit
the output layer using linear regression. You can fit it with something more
complicated, but why bother; it doesnt help.”
A generalization of “fitting the output layer using linear regression” is
finding groupings, recognizing categories, in the space of very large dimension
that consists of the output of the output layer.
Fitting by linear regression assumes we already have a list of instances that
are known to be type specimens of the category, assumes that the category is
already defined and we want an efficient way of recognizing new instances as
members of this category. But living creatures can form new categories,
without having them handed to them on a platter. We want to be able to
discover that a group of instances belong together.
So we generate a random neural network, identify those outputs that provide
useful information identifying categories, and prune those elements of the
network that do not contribute useful information identifying useful categories.
That it does not help tells me you are doing a dimensional reduction on the
outputs of an echo state network.
You are generating vectors in a space of uncountably large dimension, which
vectors describe probabilities, and probabilities of probabilities (Bayesian
regress, probability of a probability of a frequency, to correct priors, and
priors about priors) so that if two vectors are distant in your space, one is
uncorrelated with the other, and if two things are close, they are
correlated.
Because the space is of uncountably large dimension, the vectors are
impossible to manipulate directly, so you are going to perform a random
dimensional reduction on a set of such vectors to a space of manageably large
dimension.
At a higher level you eventually need to distinguish the direction of
causation in order to get an action network, a network that envisages action
to bring the external world through a causal path to an intended state, which
state has a causal correlation to *desire*, a network whose output state is
*intent*, and whose goal is desire. When the action network selects one
intended state of the external world rather than another, that selection is
*purpose*. When the action network selects one causal path rather than
another, that selection is *will*.
The colour red is not a wavelength, a phenomenon, but is a qualia, an
estimate of the likelihood that an object has a reflectance spectrum in
visible light peaked in that wavelength, but which estimate of probability
can then be used as if it were a phenomena in forming concepts, such as
blood, which in turn can be used to form higher level concepts, as when the
Old Testament says of someone who needed killing “His blood is on his own
head”.
Concepts are Hegelian Neural Networks: “Neurons that fire together, wire
together”
This is related to random dimensional reduction. You have a collection of
vectors in space of uncountably large dimension. Documents, emails, what you
see when you look in a particular direction, what you experience at a
particular moment. You perform a random dimensional reduction to a space of
manageable dimension, but large enough to probabilistically preserve
distances and angles in the original space typically twenty or a hundred
dimensions.
By this means, you are able to calculate distances and angles in your
dimensionally reduced space which approximately reflect the distances and
angles in the original space, which was probably of dimension
$10^{100^{100^{100}}}$, the original space being phenomena that occurred
together, and collections of phenomena that occurred together that you have
some reason for bundling into a collection, and your randomly reduced space
having dimension of order that a child can count to in a reasonably short
time.
And now you have vectors such that you can calculate the inner product and
cross product on them, and perform matrix operations on them. This gives you
qualia. Higher level qualia are *awareness*
And, using this, you can restructure the original vectors, for example
structuring experiences into events, structuring things in the visual field
into objects, and then you can do the same process on collections of events,
and collections of objects that have something common.
Building a flying machine was very hard, until the Wright brothers said
“three axis control, pitch, yaw, and roll”
Now I have said the words “dimensional reduction of vectors in a space of
uncountably large dimension, desire, purpose, intent, and will”