---
title: Recognizing categories, and recognizing particulars as forming a category
# katex
...
This is, of course, a deep unsolved problem in philosophy.
However, it seems to be soluble as a computer algorithm. Programs that do
this ought to look conscious.
There are a lot of programs solving things that I thought were AI hard, for
example recognizing pornography, recognizing faces in images, and predicting
what music, books, or movies a particular customer might like.
We have clustering algorithms that work on points in spaces of reasonably
small dimension. However, instances are sparse vectors in spaces of
unenumerably large dimension.
Consider, for example, the problem of grouping like documents to like, for
spam filtering. Suppose the properties of a document are all substrings of
the document of twenty words or fewer and 200 characters or fewer. In that
case, there are as many dimensions as there are two-hundred-character strings.
# Dimensional reduction
The combinatorial explosion occurs because we have taken the wrong approach
to reducing problems that originate in the physical world of very large
dimension, large because each quality of the objects involved or potentially
involved is a dimension.
The cool magic trick that makes this manageable is dimensional reduction.
Johnson and Lindenstrauss discovered in the early 1980s that if one has
$O(2^n)$ points in a space of very large dimension, a random projection onto a
space of dimension $O(n)$ does not much affect distances and angles.
Achlioptas found that this is true for the not very random mapping wherein
elements of the matrix mapping the large space to the smaller space take the
value $1$ with probability $\frac{1}{6}$, $0$ with probability $\frac{4}{6}$,
and $-1$ with probability $\frac{1}{6}$, though a sparse matrix is apt to
distort a sparse vector.
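A minimal sketch of such a sparse random projection, assuming numpy; the dimensions and the $\sqrt{3/k}$ scaling are illustrative assumptions, not from the text:

```python
# Sketch of an Achlioptas-style sparse random projection (illustrative sizes).
import numpy as np

rng = np.random.default_rng(0)

d_high = 10_000   # original dimension
d_low = 40        # reduced dimension, the "forty questions" space

# Entries are +1 with probability 1/6, 0 with probability 4/6, -1 with
# probability 1/6, scaled by sqrt(3 / d_low) so expected lengths are preserved.
entries = rng.choice([1.0, 0.0, -1.0], size=(d_low, d_high), p=[1/6, 4/6, 1/6])
R = np.sqrt(3.0 / d_low) * entries

x = rng.standard_normal(d_high)
y = rng.standard_normal(d_high)

# Distances before and after projection are close with high probability.
print(np.linalg.norm(x - y), np.linalg.norm(R @ x - R @ y))
```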
There exists a set of points of size $m$ that needs dimension
$$O\left(\frac{\log(m)}{ε^2}\right)$$
in order to preserve the distances
between all pairs of points within a factor of $1±ε$.
The time to find the nearest neighbour is logarithmic in the number of points,
but exponential in the dimension of the space. So we do one pass with a rather
large epsilon to find a small number of candidate neighbours, and then a second
pass over just those candidates, using an algorithm whose cost is proportional
to the number of candidates times the dimensionality.
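A sketch of the two-pass search, under the assumption that the first pass is a crude random projection and the second pass computes exact distances over the shortlisted candidates only (all sizes illustrative):

```python
# Two-pass nearest neighbour sketch: coarse candidates in the reduced space,
# then exact distances in the original space over only those candidates.
import numpy as np

rng = np.random.default_rng(1)

n, d_high, d_low, n_candidates = 5_000, 2_000, 20, 50

points = rng.standard_normal((n, d_high))
query = rng.standard_normal(d_high)

# First pass: a crude random projection (large effective epsilon).
R = rng.standard_normal((d_low, d_high)) / np.sqrt(d_low)
low_points = points @ R.T
low_query = R @ query

coarse = np.linalg.norm(low_points - low_query, axis=1)
candidates = np.argsort(coarse)[:n_candidates]

# Second pass: exact distances, but only over the shortlisted candidates,
# so the cost is (number of candidates) x (original dimension).
exact = np.linalg.norm(points[candidates] - query, axis=1)
nearest = candidates[np.argmin(exact)]
print("approximate nearest neighbour index:", nearest)
```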
So in a space of unenumerably large dimension, such as the set of substrings
of an email, or perhaps substrings of bounded length with bounds at spaces,
carriage returns, and punctuation, we deterministically hash each substring,
and use the hash to assign a mapping between the vector corresponding to that
substring and a vector in the reduced space.
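A sketch of that hashing scheme, assuming word substrings bounded at whitespace and punctuation, a forty-dimensional reduced space, and a signed hash so the mapping behaves like a sparse $±1$ projection; the helper names and limits are mine, for illustration:

```python
# Feature-hashing sketch: each substring is hashed, and the hash
# deterministically chooses a coordinate and a sign in the reduced space.
import hashlib
import re
import numpy as np

D_LOW = 40  # reduced dimension (illustrative)

def substrings(text, max_words=20):
    """Word n-grams of up to max_words, split at spaces and punctuation."""
    words = [w for w in re.split(r"[\s\.,;:!?]+", text.lower()) if w]
    for i in range(len(words)):
        for j in range(i + 1, min(i + 1 + max_words, len(words) + 1)):
            yield " ".join(words[i:j])

def reduce_email(text, d_low=D_LOW):
    v = np.zeros(d_low)
    for s in substrings(text):
        h = int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")
        index = h % d_low                       # which reduced coordinate
        sign = 1.0 if (h >> 32) & 1 else -1.0   # which sign
        v[index] += sign
    return v

print(reduce_email("claim your free prize now, click here"))
```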
The optimal instance recognition algorithm, for normally distributed
attributes and for already existent, already known categories, is Mahalanobis
distance.
Is not the spam characteristic of an email just its $T\cdot(S-G)$, where $T$ is
the vector of the email, and $S$ and $G$ are the average vectors of spam
email and good email?
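A sketch of that score in the reduced space, assuming we already hold reduced vectors for known spam and known good email; the stand-in data is random, purely for illustration:

```python
# Spam score sketch: T . (S - G), where T is the reduced vector of the email,
# S the mean of known spam vectors, G the mean of known good vectors.
import numpy as np

def spam_score(email_vec, spam_vecs, good_vecs):
    S = np.mean(spam_vecs, axis=0)
    G = np.mean(good_vecs, axis=0)
    return float(email_vec @ (S - G))

# Illustrative use with random stand-ins for reduced email vectors.
rng = np.random.default_rng(2)
spam_vecs = rng.normal(1.0, 1.0, size=(100, 40))
good_vecs = rng.normal(-1.0, 1.0, size=(100, 40))
email_vec = rng.normal(1.0, 1.0, size=40)
print(spam_score(email_vec, spam_vecs, good_vecs))  # positive suggests spam
```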
Mahalanobis distance works with variance instead of probability, but it is
most reasonable for things of reasonable dimension, like attributing
skulls to races, while dimensional reduction is most useful in spaces of
unenumerably large dimension, where distributions are necessarily non
normal.
But variance is, approximately, the log of probability, so Mahalanobis
distance is more or less Bayes filtering.
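To make the link explicit: for a normal distribution with mean $\mu$, covariance $\Sigma$, and dimension $k$,
$$-\log p(x) = \tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu) + \tfrac{1}{2}\log\bigl((2\pi)^k\,|\Sigma|\bigr) = \tfrac{1}{2}\,d_M(x)^2 + \text{const},$$
so comparing Mahalanobis distances $d_M$ to the spam mean and to the good mean is, up to constants, comparing log likelihoods, which is the comparison a Bayes filter makes.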
So we can reasonably reduce each email into twenty questions space, or, just
to be on the safe side, forty questions space. (We will have to test
empirically how many dimensions retain angles and distances.)
We then, in the reduced space, find natural groupings, a natural grouping
being an elliptic region in high dimensional space where the density is
anomalously large, or rather a normal distribution in high dimensional space
such that assigning a particular email to a particular normal distribution
dramatically reduces the entropy.
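One way to realize this is to fit a mixture of Gaussians in the reduced space and keep only confident assignments. A sketch, assuming scikit-learn's GaussianMixture, an arbitrary component count, and synthetic stand-in data:

```python
# Sketch: fit a Gaussian mixture in the reduced 40-dimensional space and
# treat components that sharply reduce entropy as "natural groupings".
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Stand-in for reduced email vectors: two synthetic clusters.
emails = np.vstack([
    rng.normal(0.0, 1.0, size=(500, 40)),
    rng.normal(4.0, 1.0, size=(500, 40)),
])

gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(emails)

probs = gmm.predict_proba(emails)
# Entropy of the assignment for each email: near zero means the email
# belongs unambiguously to one grouping.
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
print("emails confidently assigned:", int((entropy < 0.1).sum()))
```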
We label each such natural grouping with the statistically improbable phrase
that best distinguishes members of the grouping from all other such groupings.
The end user can then issue rules that mails belonging to certain groupings
be given particular attention or lack of attention, such as being sent
direct to spam.
The success of face recognition, etc., suggests that this might be just a
problem of clever algorithms. Pile enough successful intelligence-like
algorithms together, integrate them well, and perhaps we will have sentience.
Analogously with autonomous cars: they had no new algorithms; they just
made the old algorithms actually do something useful.
# Robot movement
Finding movement paths is full of singularities, and it looks to me that we
force the problem down to two and a half dimensions, force the obstacles to
stick figures, and then find a path to the destination. Hence the mental
limit on complex knot problems.
Equivalently, we want to reduce the problem space to a collection of regions
in which pathfinding algorithms that assume continuity work, and then
construct a graph of such regions, where each node corresponds to a convex
region within which continuity works, and each edge corresponds to an overlap
between two such convex regions. Since the space is enormous, drastic
reduction is needed.
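A toy sketch of pathfinding over such a region graph, with hand-made regions and overlap costs standing in for the real geometry:

```python
# Sketch: path-find over a graph whose nodes are convex free regions and
# whose edges are overlaps between regions (toy regions and weights).
import heapq

# adjacency: region -> list of (neighbouring region, cost of crossing overlap)
adjacency = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("A", 1.0), ("C", 1.5), ("D", 5.0)],
    "C": [("A", 4.0), ("B", 1.5), ("D", 1.0)],
    "D": [("B", 5.0), ("C", 1.0)],
}

def shortest_region_path(start, goal):
    """Dijkstra over the region graph."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, region, path = heapq.heappop(frontier)
        if region == goal:
            return cost, path
        if region in visited:
            continue
        visited.add(region)
        for neighbour, edge_cost in adjacency[region]:
            if neighbour not in visited:
                heapq.heappush(frontier,
                               (cost + edge_cost, neighbour, path + [neighbour]))
    return None

print(shortest_region_path("A", "D"))  # (3.5, ['A', 'B', 'C', 'D'])
```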
In the case of robot movement we are likely to wind up with a very large
graph of convex regions within which the assumption of singularity-free
movement is correct. Because the graph is apt to be very large, finding an
efficient path through it is apt to be prohibitively expensive, which is apt
to cause robot ground vehicles to crash because they cannot quickly figure
out the path to evade an unexpected object, and makes it impractical for a
robot to take a can of beer from the fridge.
We therefore use the [sybil guard algorithm] to reduce the graph by treating
groups of highly connected vertices as a single vertex.
[sybil guard algorithm]:./sybil_attack.html
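The sketch below illustrates the contraction step only; it uses networkx community detection as a generic stand-in, not the sybil guard algorithm itself:

```python
# Sketch: shrink the region graph by contracting tightly connected groups of
# regions into single vertices. The text proposes the sybil guard algorithm
# for choosing the groups; this stand-in uses modularity communities instead.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.karate_club_graph()  # stand-in for a large region-adjacency graph

communities = greedy_modularity_communities(G)
# Contract each community into a single vertex; edges between communities
# survive, so coarse path planning happens on a much smaller graph.
coarse = nx.quotient_graph(G, [set(c) for c in communities])
print(G.number_of_nodes(), "->", coarse.number_of_nodes())
```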
# Artificial Intelligence
[Gradient descent is not what makes Neural nets work] Comment by Bruce on
Echo State Networks.
[Gradient descent is not what makes Neural nets work]:https://scottlocklin.wordpress.com/2012/08/02/30-open-questions-in-physics-and-astronomy/
An Echo State Network is your random neural network, which mixes a great pile
of randomness into your actual data, to expand it into a much larger pile of
data that implicitly contains all the uncorrupted original information in its
very great redundancy, albeit in massively mangled form. Then “You just fit
the output layer using linear regression. You can fit it with something more
complicated, but why bother; it doesn’t help.”
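A minimal echo state network sketch along those lines, assuming numpy, a toy one-dimensional signal, and a ridge regression readout; all sizes and the spectral radius are illustrative:

```python
# Minimal echo state network sketch: a fixed random reservoir expands the
# input, and only the readout is fit, by (ridge) linear regression.
import numpy as np

rng = np.random.default_rng(4)
n_reservoir, n_steps = 200, 2000

# Fixed random reservoir, rescaled to a spectral radius below 1 so the
# reservoir state is a fading memory of the input (an "echo").
W = rng.standard_normal((n_reservoir, n_reservoir))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.standard_normal(n_reservoir)

# Toy task: predict the next value of a sine-like signal.
u = np.sin(np.linspace(0, 60, n_steps + 1))
states = np.zeros((n_steps, n_reservoir))
x = np.zeros(n_reservoir)
for t in range(n_steps):
    x = np.tanh(W @ x + w_in * u[t])
    states[t] = x

# Fit only the output layer by ridge regression on the reservoir states.
targets = u[1:]
ridge = 1e-6 * np.eye(n_reservoir)
W_out = np.linalg.solve(states.T @ states + ridge, states.T @ targets)

pred = states @ W_out
print("readout mean squared error:", float(np.mean((pred - targets) ** 2)))
```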
A generalization of “fitting the output layer using linear regression” is
finding groupings, recognizing categories, in the space of very large dimension
that consists of the output of the output layer.
Fitting by linear regression assumes we already have a list of instances that
are known to be type specimens of the category, assumes that the category is
already defined and we want an efficient way of recognizing new instances as
members of this category. But living creatures can form new categories,
without having them handed to them on a platter. We want to be able to
discover that a group of instances belong together.
So we generate a random neural network, identify those outputs that provide
useful information identifying categories, and prune those elements of the
network that do not contribute useful information identifying useful categories.
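A hedged sketch of that proposal: cluster the outputs without labels, score each output dimension by how much information it carries about the discovered clusters, and prune the rest. The scoring method here (mutual information against k-means labels, via scikit-learn) is my assumption about how “useful information” might be measured:

```python
# Sketch of the pruning proposal: cluster the reservoir outputs without
# labels, score each output dimension by how much information it carries
# about the discovered clusters, and prune the uninformative ones.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(5)
# Stand-in for reservoir outputs: 1000 samples, 200 output dimensions,
# only the first 20 of which actually carry cluster structure.
informative = np.vstack([rng.normal(0, 1, (500, 20)),
                         rng.normal(3, 1, (500, 20))])
noise = rng.normal(0, 1, (1000, 180))
outputs = np.hstack([informative, noise])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(outputs)
scores = mutual_info_classif(outputs, labels, random_state=0)

keep = scores > scores.mean()  # crude threshold; a judgment call
print("kept", int(keep.sum()), "of", outputs.shape[1], "output dimensions")
```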
That it does not help tells me you are doing a dimensional reduction on the
outputs of an echo state network.
You are generating vectors in a space of uncountably large dimension, which
vectors describe probabilities, and probabilities of probabilities (Bayesian
regress, probability of a probability of a frequency, to correct priors, and
priors about priors) so that if two vectors are distant in your space, one is
uncorrelated with the other, and if two things are close, they are
correlated.
Because the space is of uncountably large dimension, the vectors are
impossible to manipulate directly, so you are going to perform a random
dimensional reduction on a set of such vectors to a space of manageably large
dimension.
At a higher level you eventually need to distinguish the direction of
causation in order to get an action network, a network that envisages action
to bring the external world through a causal path to an intended state, which
state has a causal correlation to *desire*, a network whose output state is
*intent*, and whose goal is desire. When the action network selects one
intended state of the external world rather than another, that selection is
*purpose*. When the action network selects one causal path rather than
another, that selection is *will*.
The colour red is not a wavelength, a phenomenon, but is a qualia, an
estimate of the likelihood that an object has a reflectance spectrum in
visible light peaked in that wavelength, but which estimate of probability
can then be used as if it were a phenomenon in forming concepts, such as
blood, which in turn can be used to form higher level concepts, as when the
Old Testament says of someone who needed killing “His blood is on his own
head”.
Concepts are Hebbian neural networks: “Neurons that fire together, wire
together.”
This is related to random dimensional reduction. You have a collection of
vectors in space of uncountably large dimension. Documents, emails, what you
see when you look in a particular direction, what you experience at a
particular moment. You perform a random dimensional reduction to a space of
manageable dimension, but large enough to probabilistically preserve
distances and angles in the original space, typically twenty or a hundred
dimensions.
By this means, you are able to calculate distances and angles in your
dimensionally reduced space which approximately reflect the distances and
angles in the original space, which was probably of dimension
$10^{100^{100^{100}}}$, the original space being phenomena that occurred
together, and collections of phenomena that occurred together that you have
some reason for bundling into a collection, and your randomly reduced space
having dimension of order that a child can count to in a reasonably short
time.
And now you have vectors such that you can calculate the inner product and
cross product on them, and perform matrix operations on them. This gives you
qualia. Higher level qualia are *awareness*.
And, using this, you can restructure the original vectors, for example
structuring experiences into events, structuring things in the visual field
into objects, and then you can do the same process on collections of events,
and collections of objects that have something common.
Building a flying machine was very hard, until the Wright brothers said
“three axis control, pitch, yaw, and roll”
Now I have said the words “dimensional reduction of vectors in a space of
uncountably large dimension, desire, purpose, intent, and will”.