Spam sucks, or at least, it used to. In less than two years, filters have been developed and made available for free that work as nice as you please. I may never know the whole story, but I find this little part of it to be a nice tale of good triumphing over evil on the Web.
One day in 2002 or so, a guy named Paul Graham started playing with ways to recognize spam messages. Being a bright and inquisitive Harvard PhD, Paul eventually rediscovered that Bayesian classification works pretty well. The idea is this: make a list of all the words that appear in a large set of real spam messages and a set of real non-spam messages. For each word, compute (roughly) the odds that a message containing that word is in the spam set. That is, a probability near 1 means that a message containing the word is likely to be spam. A probability near 0 means that a message containing it is likely to be innocuous. When a message comes in, look up the probabilities of all its words and multiply them together. Bayes’ rule says (sort of) that this product, once it is weighed against the matching product of the non-spam probabilities, is the probability that the message is spam. Paul tuned this up a bit. For example, he gives safe words more credit, and he only multiplies the probabilities of the 15 or so most significant words (the ones with individual probabilities farthest from 0.5). Good math. Good programming tools. A good programmer to apply the one to the other and see if it works.
Well, spam messages all have to sell something (or what’s the point?), and there are certain words characteristic of selling. The things spammers like to sell are also from a pretty narrow set. So it turns out that most messages either have a very low probability of being spam or are clearly spam, with a probability greater than 0.9. There’s not much around 0.5. Golly, it works.
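For the curious, here is a minimal Python sketch of the recipe described above. It is my own illustration, not Paul's code. The function names (tokenize, train, spam_probability) are made up for this example, and several details are assumptions: counting each word once per message, doubling the non-spam counts as the way of giving safe words more credit, clamping every probability to the range 0.01 to 0.99, and scoring never-before-seen words at 0.4. It also assumes both training sets are non-empty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9'$-]+", text.lower()))


def train(spam_msgs, ham_msgs):
    """Return a dict mapping each word to a rough P(spam | word)."""
    # Count how many messages in each set contain each word.
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[word] / n_spam
        # Assumption: safe words get extra credit by counting double.
        ham_rate = min(1.0, 2 * ham_counts[word] / n_ham)
        p = spam_rate / (spam_rate + ham_rate)
        # Assumption: clamp so no single word is treated as certain.
        probs[word] = min(0.99, max(0.01, p))
    return probs


def spam_probability(message, probs, top_n=15, unknown=0.4):
    """Combine the top_n most significant word probabilities."""
    ps = [probs.get(w, unknown) for w in tokenize(message)]
    # Most significant means farthest from the neutral value 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    ps = ps[:top_n]
    spam_product = math.prod(ps)
    ham_product = math.prod(1 - p for p in ps)
    # Weigh the spam product against the matching non-spam product.
    return spam_product / (spam_product + ham_product)
```

To try it, call train with a list of spam messages and a list of legitimate ones, then pass any new message and the resulting table to spam_probability. A message made up of words seen only in the spam set scores very close to 1, and one made up of words seen only in legitimate mail scores very close to 0, which is the lopsided split described above.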
Paul wrote about his results on the Web and folks noticed. Of course, other smart people thought of this, too, and Paul’s Web site became a kind of focal point for people working on it. A free conference sprang up at MIT, and will repeat next month.
These kinds of filters have already been incorporated into newer mail readers, and spammers are already trying (and failing) to work around them. Pretty cool. If your mail reader has a “mark as junk” button, I think it probably uses Bayesian filtering.
Now, unfortunately for working folks like me, this is a case of a good project from an altruistic wealthy software developer. No startup in sight. Years ago, Paul had been a cowboy Lisp hacker and wrote a couple of excellent books on Lisp programming. As I understand it, one day in the ’90s, someone asked him why, if Lisp was so good, he wasn’t rich. So Paul and his friend Robert Morris (the Internet Worm guy) set out to create a profitable WWW technology company. The result was Viaweb, which they sold to Yahoo for several tens of millions of dollars (largely in ’90s Yahoo stock). Ironically, when you spend money at a Yahoo store, the software that handles your credit card information was written by Morris, who was convicted in federal court for his earlier hacking. Actually, I’ve never heard an ill word about either Morris or Graham from anyone who knows them. Morris now does research at MIT, while Graham devotes a great deal of time to speaking up for good software.