<!DOCTYPE html>
<html lang="en">
<head>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
<style>
body {
max-width: 30em;
margin-left: 2em;
}
p.center {text-align:center;}
</style>
<link rel="shortcut icon" href="../rho.ico">
<title>Spam filtering</title>
</head>
<body>
<p><a href="./index.html"> To Home page</a> </p>
<h1>Spam Filtering</h1>
<p>Divide the text into natural units – type of text, headers, paragraphs, and standard parts. Divide the larger-scale units into smaller natural units by identifying word and sentence boundaries.</p>
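<p>As a rough illustration only – the text above does not pin down the boundary rules – a split into paragraphs, sentences, and words might look like this Python sketch:</p>

<pre>
# Illustrative only: split a message into paragraphs, sentences, and words.
# Real unit detection (headers, standard parts, text type) would need more rules.
import re

def split_units(message):
    paragraphs = [p for p in re.split(r"\n\s*\n", message) if p.strip()]
    units = []
    for para in paragraphs:
        sentences = [s.strip() for s in re.split(r"[.!?]+\s+", para) if s.strip()]
        units.append([re.findall(r"\w+", s.lower()) for s in sentences])
    return units   # paragraphs, each a list of sentences, each a list of words
</pre>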
<p>Perform a dimensional reduction to forty-plus dimensions on all possible substrings of the smallest natural units, up to a maximum of ten or so words. Store the dimensionally reduced vector of each substring in a hash table of a few gigabytes maximum, small enough to fit in RAM, preferentially keeping common substrings and dumping old and seldom-seen substrings, thereby developing a dictionary of common words and phrases. That certain substrings have been seen before implies that they can be useful in classification, while never-before-seen substrings are useful in classification only as a count of never-before-seen substrings. This is our lowest level of order detection.</p>
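<p>One way to read this step is feature hashing: each substring is mapped by a hash to a fixed forty-dimensional vector, and a bounded in-memory table keeps counts so that old and seldom-seen substrings can be evicted. The hash scheme, the eviction policy, and the table size below are assumptions of the sketch, not anything the text specifies:</p>

<pre>
import hashlib
import random
from collections import OrderedDict

DIMS = 40                # "forty-plus dimensions"
MAX_NGRAM = 10           # "a maximum of ten or so words"
MAX_ENTRIES = 1_000_000  # stand-in for "a few gigabytes, small enough to fit in RAM"

def reduced_vector(substring):
    """Deterministic pseudo-random DIMS-dimensional vector for a substring
    (a feature-hashing style dimensional reduction, assumed, not specified)."""
    seed = int.from_bytes(hashlib.sha256(substring.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIMS)]

class SubstringDictionary:
    """Bounded in-memory table of common substrings with usage counts."""
    def __init__(self):
        self.table = OrderedDict()  # substring -> (count, reduced vector)

    def observe(self, words):
        """Record every substring (word n-gram) of one smallest natural unit."""
        for n in range(1, MAX_NGRAM + 1):
            for i in range(len(words) - n + 1):
                key = " ".join(words[i:i + n])
                entry = self.table.get(key)
                if entry is None:
                    entry = (0, reduced_vector(key))
                count, vec = entry
                self.table[key] = (count + 1, vec)
                self.table.move_to_end(key)   # recently seen entries survive longest
        while len(self.table) > MAX_ENTRIES:  # dump old, seldom-seen substrings
            self.table.popitem(last=False)
</pre>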
<p>Label each larger unit by a vector which is the sum of the dimensionally reduced vectors of the previously seen substrings, plus the novelty level (random unintelligible gibberish would have a high novelty level and a small, and therefore uninformative, vector).</p>
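<p>Continuing the sketch above (and reusing DIMS, MAX_NGRAM, and the dictionary it defines), one plausible reading is that the unit's vector is the sum over substrings already in the dictionary, and the novelty level is simply the fraction of its substrings never seen before:</p>

<pre>
def label_unit(words, dictionary):
    """Return (vector, novelty) for one unit, e.g. a sentence.
    The vector sums the stored vectors of previously seen substrings;
    novelty is the fraction of substrings never seen before."""
    vector = [0.0] * DIMS
    seen = 0
    novel = 0
    for n in range(1, MAX_NGRAM + 1):
        for i in range(len(words) - n + 1):
            key = " ".join(words[i:i + n])
            entry = dictionary.table.get(key)
            if entry is None:
                novel += 1      # informative only as a count
            else:
                seen += 1
                vector = [a + b for a, b in zip(vector, entry[1])]
    total = seen + novel
    novelty = novel / total if total else 0.0  # gibberish: high novelty, small vector
    return vector, novelty
</pre>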
<p>Perform an entropy reduction (identification of natural groups within dimensionally reduced space) on the set of larger units.</p>
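<p>The text does not say which grouping method it has in mind; a conventional stand-in for identifying natural groups in the reduced space is a clustering pass, sketched here as a tiny k-means:</p>

<pre>
import random

def kmeans(vectors, k, iterations=20):
    """Group unit vectors into k natural clusters (plain k-means)."""
    centres = random.sample(vectors, k)
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # assign each vector to its nearest centre
        for idx, v in enumerate(vectors):
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centres]
            assignment[idx] = dists.index(min(dists))
        # move each centre to the mean of its members
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment
</pre>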
<p>Repeat recursively all the way up to the largest natural unit, to get the natural groupings of the largest units. A familiar sentence has familiar substrings. A familiar paragraph has familiar sentences. A familiar message has familiar paragraphs. Of course, we notice a problem in that messages obviously belong to multiple groups – all messages originating from this address, all messages with the same text style as messages originating from this address, all messages on this topic, all messages trying to sell something. The problem is not merely grouping, but rather extracting significant dimensions from random dimensional reduction. We are looking for a reduced set of high level dimensions that predict a whole lot of low level dimensions, and a massive brute force search for correlations between the lowest level entities (substrings) is going to turn up correlations which are merely an accidental and arbitrary result of forms of order that the algorithm is incapable of perceiving.</p>
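<p>Read naively, the recursive step just repeats the label-and-cluster operation at each level, a larger unit's vector being built from its parts' vectors. A sketch of that outer loop, tying together the earlier sketches (split_units, label_unit, kmeans, DIMS):</p>

<pre>
def label_level(part_vectors):
    """Vector for a larger unit: the sum of its parts' vectors
    (novelty is dropped here for brevity)."""
    total = [0.0] * DIMS
    for vec in part_vectors:
        total = [a + b for a, b in zip(total, vec)]
    return total

def group_messages(messages, dictionary, k=8):
    """Bottom-up recursion: familiar substrings -> familiar sentences ->
    familiar paragraphs -> familiar messages, then cluster the messages.
    Each element of `messages` is the output of split_units()."""
    message_vectors = []
    for paragraphs in messages:
        para_vectors = []
        for sentences in paragraphs:
            sent_vectors = [label_unit(words, dictionary)[0] for words in sentences]
            para_vectors.append(label_level(sent_vectors))
        message_vectors.append(label_level(para_vectors))
    return kmeans(message_vectors, k), message_vectors
</pre>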
<p>Learn which natural groupings of the largest units the end user identifies as spam, and label each such natural grouping by the most common large substring that identifies (that gives the most information about) the largest unit as a member of its class.</p>
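<p>A sketch of this last step, again only one possible reading: a cluster is marked spam when most of its user-judged members are spam, and it is named by the substring most characteristic of it – here a crude ratio of in-cluster to out-of-cluster frequency, weighted by length, stands in for "gives the most information":</p>

<pre>
from collections import Counter

def learn_spam_groups(assignment, user_marked_spam):
    """A cluster is treated as spam when most of its user-judged members are spam."""
    votes = {}
    for cluster, is_spam in zip(assignment, user_marked_spam):
        yes, total = votes.get(cluster, (0, 0))
        votes[cluster] = (yes + int(is_spam), total + 1)
    return {c: yes * 2 >= total for c, (yes, total) in votes.items()}

def name_spam_cluster(messages_words, assignment, cluster, dictionary):
    """Name a spam cluster by the substring most characteristic of it:
    frequent inside the cluster, rare outside, and preferably long.
    `messages_words` is assumed here to be a flat word list per message."""
    inside, outside = Counter(), Counter()
    for words, a in zip(messages_words, assignment):
        target = inside if a == cluster else outside
        for n in range(1, MAX_NGRAM + 1):
            for i in range(len(words) - n + 1):
                key = " ".join(words[i:i + n])
                if key in dictionary.table:
                    target[key] += 1
    def score(key):
        return inside[key] / (1 + outside[key]) * len(key.split())
    return max(inside, key=score) if inside else None
</pre>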
<p style="background-color : #ccffcc; font-size:80%">These documents are
licensed under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share Alike 3.0 License</a></p>
</body>
</html>