the good guys are winning

Spam sucks, or at least, it used to. In less than two years, filters that work as nicely as you please have been developed and made available for free. I may never know the whole story, but I find this little part of it to be a nice tale of good triumphing over evil on the Web.

One day in 2002 or so, a guy named Paul Graham started playing with ways to recognize spam messages. Being a bright and inquisitive Harvard PhD, Paul eventually rediscovered that Bayesian classification works pretty well. The idea is this: make a list of all the words that appear in a large set of real spam messages and a set of real non-spam messages. For each word, compute (roughly) the probability that a message containing that word is in the spam set. That is, a probability near 1 means that a message containing the word is likely to be spam. A probability near 0 means that a message containing it is likely to be innocuous. When a message comes in, find the probabilities of its words and multiply them together, then weigh that product against the matching product of the non-spam probabilities. Bayes' rule says (sort of) that the result is the probability that the message is spam. Paul tuned this up a bit. For example, he gives safe words more credit, and he only multiplies the probabilities of the 15 or so most significant words (the ones with individual probabilities farthest from 0.5). Good math. Good programming tools. A good programmer to apply the one to the other and see if it works.
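The idea described above can be sketched in a few lines of Python. This is only an illustration, not Paul's actual code: the function names, the tiny training sets in the usage note, the tokenizing rule, the 0.4 score for unseen words, and the 0.01/0.99 clamps are all assumptions chosen for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def train(spam_msgs, ham_msgs):
    """Return a dict mapping each word to its estimated spam probability."""
    # Count how many messages of each kind contain each word.
    spam_counts = Counter(w for m in spam_msgs for w in tokenize(m))
    ham_counts = Counter(w for m in ham_msgs for w in tokenize(m))
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        spam_freq = min(1.0, spam_counts[word] / n_spam)
        # Doubling the non-spam count gives safe words more credit.
        ham_freq = min(1.0, 2 * ham_counts[word] / n_ham)
        p = spam_freq / (spam_freq + ham_freq)
        # Clamp so that no single word is treated as absolute proof
        # (the 0.01 and 0.99 bounds are assumed values).
        probs[word] = min(0.99, max(0.01, p))
    return probs


def spam_probability(message, probs, top_n=15, unknown=0.4):
    """Combine the most significant word probabilities into one score."""
    # Words never seen in training get a mildly innocent score (assumed).
    scores = [probs.get(w, unknown) for w in tokenize(message)]
    # Keep the words whose probabilities are farthest from 0.5.
    scores.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in scores[:top_n]:
        spam_product *= p
        ham_product *= 1.0 - p
    # Weigh the spam product against the non-spam product.
    return spam_product / (spam_product + ham_product)
```

With a made-up training set such as `train(["buy cheap pills now", "cheap pills free offer"], ["lunch meeting tomorrow", "please review the lisp code"])`, a message like "cheap pills offer" scores very close to 1 and "lisp code review" scores very close to 0. A real filter would need thousands of messages of each kind before its numbers meant much.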

Well, spam messages all have to sell something (or what’s the point?), and there are certain words characteristic of selling. The things spammers like to sell also come from a pretty narrow set. So it turns out that most messages either have a very low probability of being spam or are clearly spam, with a probability greater than 0.9. There’s not much around 0.5. Golly, it works.

Paul wrote about his results on the Web and folks noticed. Of course, other smart people thought of this, too, and Paul’s Web site became a kind of focal point for people working on it. A free conference sprang up at MIT, and will repeat next month.

These kinds of filters have already been incorporated into newer mail readers, and spammers are already trying (and failing) to work around them. Pretty cool. If your mail reader has a “mark as junk” button, I think it probably uses Bayesian filtering.

Now, unfortunately for working folks like me, this is a case of a good project from an altruistic, wealthy software developer. No startup in sight. Years ago, Paul had been a cowboy Lisp hacker and wrote a couple of excellent books on Lisp programming. As I understand it, one day in the ’90s, someone asked him why, if Lisp was so good, wasn’t he rich? So Paul and his friend Robert Morris (the Internet Worm guy) set out to create a profitable WWW technology company. The result was Viaweb, which they sold to Yahoo for several tens of millions of dollars (largely in ’90s Yahoo stock). Ironically, when you spend money at a Yahoo store, the software that handles your credit card information was written by Morris, who was convicted in federal court for his earlier hacking. Actually, I’ve never heard an ill word about either Morris or Graham from anyone who knows them. Morris now does research at MIT, while Graham devotes a great deal of time to speaking up for good software.

About Stearns

Howard Stearns works at High Fidelity, Inc., creating the metaverse. Mr. Stearns has a quarter century of experience in systems engineering, applications consulting, and management of advanced software technologies. He was the technical lead of University of Wisconsin's Croquet project, an ambitious project convened by computing pioneer Alan Kay to transform collaboration through 3D graphics and real-time, persistent shared spaces. The CAD integration products Mr. Stearns created for expert system pioneer ICAD set the market standard through IPO and acquisition by Oracle. The embedded systems he wrote helped transform the industrial diamond market. In the early 2000s, Mr. Stearns was named Technology Strategist for Curl, the only startup founded by WWW pioneer Tim Berners-Lee. An expert on programming languages and operating systems, Mr. Stearns created the Eclipse commercial Common Lisp programming implementation. Mr. Stearns has two degrees from M.I.T., and has directed family businesses in early childhood education and publishing.


2 Comments

peg
December 14, 2003 at 12:44 pm

That explains why the spam sent to me in .html looks like this:

<font face=”verdana”size=”+3″>T<kwoxjrlielmaed> he o<kzribatcuttlud>nly<kvothosdvrs> so<kahbeqxbrtle>lut<ktlhcobdtlsj>ion to P<ksijxdewmgkgmcm>en<krymgjjchxbas>is E<khuvjmfmzpkzrco>nl<kfhfatmbrii>arge<ktbmyedjxkbex>me<kqzxlofdwic>nt</font>

John
December 14, 2003 at 9:48 pm

Wow, thanks. I had wondered where Bayesian filtering came from, and what its central idea was. It figures he was a lisper. I think I’ll check out that conference too. Thanks.
