Thursday, 02 May, 2002

Why Keyword Filtering Won't Stop Spam

It occurred to me this week, while I was evaluating anti-spam tools for Debra and setting up the email filtering rules for her system, that any type of spam filtering based on the sender's email address or the message subject line is doomed to fail.  Message content filtering based on key words or phrases is also doomed to fail.

Filtering based on the sender's email address is just silly.  It's trivial for me to generate a message that appears to come from any arbitrary email address.  Want a message that appears to come from Bill Gates?  No problem.  I can make the "from" line in an email message look any way I choose.  I could write a program that generates random email addresses, with a different "from" address for each message.  As long as the domain name (the part after the @ sign) is valid, the message will get through.  So go ahead and block that random address.  What do I care?
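To make the point concrete, here is a minimal sketch of a sender blocklist meeting messages whose "from" address changes every time.  It only builds messages in memory and sends nothing.  The blocklist entries, the example.com and example.org addresses, and the function names are all made up for illustration; they are not taken from any real filtering product.

```python
import random
import string
from email.message import EmailMessage

# Hypothetical blocklist of sender addresses collected from earlier spam.
BLOCKED_SENDERS = {"offers@example.com", "deals@example.com"}

def random_sender(rng, domain="example.com"):
    """Make up a new local part for every message; only the domain is fixed."""
    local = "".join(rng.choice(string.ascii_lowercase) for _ in range(12))
    return f"{local}@{domain}"

def build_message(sender, recipient, body):
    """The From header is ordinary text that the sending program fills in."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Hello"
    msg.set_content(body)
    return msg

def blocked_by_sender(msg):
    """The address filter under discussion: reject if From is on the blocklist."""
    return str(msg["From"]) in BLOCKED_SENDERS

rng = random.Random(2002)
messages = [
    build_message(random_sender(rng), "user@example.org", "Same pitch as before.")
    for _ in range(5)
]
caught = sum(blocked_by_sender(m) for m in messages)
print(f"{caught} of {len(messages)} messages blocked")
```

Every message carries a sender the blocklist has never seen, so none of the five is blocked, and adding those five addresses to the list would not help with the next five.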

There are two things to consider about key word and phrase filtering.  The first is that it only works until the spammers adapt.  They send mail that contains certain words and phrases, and we add those to our filters.  Be assured that the spammers subscribe to the anti-spam sites, and they know which words and phrases are being filtered.  So they rewrite their emails with new words or unique constructions of the old words and phrases.  We build a better mousetrap, they build better mice.  It's an escalation game in which the spammers have the upper hand.
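Here is a rough sketch of that escalation, assuming the simplest kind of filter: a case-insensitive substring match against a phrase list.  The phrases and the sample messages are invented for the example, and real filters can be smarter than this, but the cat-and-mouse pattern is the same.

```python
# Hypothetical phrase list of the kind an anti-spam site might publish.
BLOCKED_PHRASES = ["free money", "act now"]

def is_spam(text, blocked=BLOCKED_PHRASES):
    """Flag the message if any blocked phrase appears, ignoring case."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in blocked)

original = "FREE MONEY if you act now!"
# Same pitch, still readable by a person, respelled to dodge the list.
rewritten = "F.R.E.E M0NEY if you a c t  n o w!"

print(is_spam(original))   # True
print(is_spam(rewritten))  # False
```

The filter can be patched to strip punctuation and undo the digit swaps, and the spammer can then move on to a respelling that the patch doesn't cover.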

The bigger problem with word and phrase filtering, though, is the increasing possibility of rejecting a legitimate message as spam.  As you add more words and phrases to your rejection list, you increase the probability that a good message will contain a "bad" word and get shuttled off to the trash bin.  If you automatically delete suspected spam, then you've just lost a potentially important message.  And if you put suspected spam into a folder for later review, you've done nothing to reduce the workload.  Why spend time setting up message filters if you have to look through the messages by hand anyway?  The problem of legitimate information being lost during noise filtering is nothing new.  Audio engineers have dealt with it for decades.  That the Internet community seems to think it can ignore the problem leads me to wonder whether anyone has actually considered it thoroughly.
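A back-of-the-envelope sketch shows how quickly this adds up.  Assume, purely for illustration, that each blocked term independently has the same small chance of turning up in a legitimate message.  The 0.005 figure and the independence assumption are both made up, so the output shows the shape of the curve and is not a measurement of any real mailbox.

```python
def false_positive_rate(list_size, p_word=0.005):
    """Chance that a legitimate message trips at least one blocked term.

    Assumes each term independently appears in a legitimate message with
    probability p_word; both the independence and the 0.005 are made up.
    """
    return 1.0 - (1.0 - p_word) ** list_size

for n in (10, 50, 200, 1000):
    rate = false_positive_rate(n)
    print(f"{n:5d} blocked terms -> {rate:.1%} of good mail flagged")
```

Under these assumptions the fraction of good mail flagged can only go up as the list grows, which is the trade-off described above: every term added to catch more spam also catches more legitimate messages.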