Wednesday, 30 April, 2003

A homebrew spam filter?

If you also read my friend Jeff Duntemann's Web diary, you know that he's been working on ways to filter spam.  His idea?  Rather than filtering on senders, domains, or specific words, parse the URLs to which messages point.  That's the point of most spam, after all: to get you to click on a URL that's embedded in the message.  If a program can identify the target URL as a spam URL, then you have an (almost) foolproof filter.  Spammers already go to a lot of trouble to obfuscate those URLs, and they also place garbage HTML in the message to confuse HTML parsers.  Unfortunately, you can't just reject all badly formed HTML, because so many mail clients and other HTML tools do such a poor job of generating it.
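
To make the idea concrete, here's a rough sketch in Python of what URL-based filtering might look like.  This is not Jeff's design, just my illustration of the general approach.  The names SPAM_HOSTS, extract_hosts, and is_spam are made up for the example, the blocklist entries are placeholders, and the loose regular expression is my assumption about how to cope with sloppy HTML: scan for anything URL-shaped rather than trusting the markup to parse.

```python
import re
from urllib.parse import urlsplit

# Hypothetical blocklist for illustration; a real filter would presumably
# load known spam hosts from a database.
SPAM_HOSTS = {"cheap-pills.example", "win-big.example"}

# Loose pattern (an assumption): find anything that looks like a URL
# instead of parsing the HTML, since the markup may be deliberately broken.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(message_body):
    """Return the set of lowercase host names found in URLs in the message."""
    hosts = set()
    for url in URL_PATTERN.findall(message_body):
        try:
            # hostname drops any "user@" prefix and port, and lowercases.
            host = urlsplit(url).hostname
        except ValueError:
            # Malformed URL (e.g. unbalanced brackets); skip it.
            continue
        if host:
            hosts.add(host)
    return hosts


def is_spam(message_body):
    """Flag the message if any URL host, or a parent domain, is blocklisted."""
    for host in extract_hosts(message_body):
        labels = host.split(".")
        # Check the full host and each parent domain, stopping before
        # the bare top-level label.
        for i in range(len(labels) - 1):
            if ".".join(labels[i:]) in SPAM_HOSTS:
                return True
    return False
```

Calling is_spam on a message body returns True if any embedded URL points at a listed host or one of its subdomains.  A sketch like this does nothing about the cleverer obfuscation tricks, which is where the real work lies.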

Jeff and I have split the project along fairly logical lines.  I'm writing the communications infrastructure, and he's working on the filtering and database design.  I got the easy part.  Assuming that the Internet Direct (Indy) POP3 components work (a fair assumption, given my previous success with the Indy components), putting together the proxy won't be terribly difficult.  Testing it with the major mail clients may prove a little more interesting, and there's always the question of user interface.  It should be a fun project, and a welcome change from my day-to-day work writing .NET training presentations.