Spam wars: How to fight back.

Bayesian Filters

One of the most promising new weapons against spam is Bayesian filtering. Much of the interest in this statistical approach to spam filtering was inspired by Paul Graham's 2002 article 'A Plan for Spam'.

Thomas Bayes was an English clergyman who established the basis for probabilistic inference: the means of calculating, from the frequency with which an event has occurred in prior trials, the probability that it will occur in the future. Applied to email, this means that a filter which has examined past messages should be able to judge which future messages are spam.

When a new email arrives, it is split into tokens, which are delimited by white space in the email text. Each token is looked up in a database of known tokens from previous emails that have already been classified as spam or non-spam, and the 15 most interesting tokens, those whose spam probability lies furthest from a neutral 50%, are selected for analysis.
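
The tokenising and selection step can be sketched roughly as follows. This is a minimal illustration, not any product's actual code: the function names are hypothetical, "interesting" is taken to mean distance from 0.5 as in Graham's essay, and the 0.4 default for tokens never seen before is likewise borrowed from that essay as an assumption.

```python
def tokenize(text):
    """Split an email body into tokens on white space."""
    return text.split()


def most_interesting(tokens, spam_prob, n=15, unknown=0.4):
    """Return up to n distinct tokens whose spam probability is furthest from 0.5.

    spam_prob maps token -> probability (0..1) learned from earlier mail.
    Tokens missing from the mapping get the assumed `unknown` value.
    """
    scored = {t: spam_prob.get(t, unknown) for t in set(tokens)}
    # Sort by distance from neutral (largest first); break ties alphabetically
    # so the result is deterministic.
    ranked = sorted(scored, key=lambda t: (-abs(scored[t] - 0.5), t))
    return ranked[:n]
```

Calling `most_interesting(tokenize(body), spam_prob)` on a message body would yield the handful of tokens that carry the strongest evidence either way, whether for spam or for legitimate mail.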

Using Bayesian analysis, the spam probabilities of these tokens are combined to produce an overall probability that the email is spam. Administrators can then deal with the mail according to this probability score. A message with a 90% spam probability could be rejected immediately, whilst one scoring 70% could be placed in a quarantine folder.
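
As a rough sketch of how the combination and the resulting decision might look, the code below uses the combining formula from Graham's essay (the product of the token probabilities, divided by that product plus the product of their complements). The function names are hypothetical, and the 90% and 70% cut-offs are simply the example figures above, not recommended settings. It assumes every token probability lies strictly between 0 and 1.

```python
import math


def combine(probs):
    """Combine per-token spam probabilities into one overall spam probability.

    Assumes each value is strictly between 0 and 1, so the denominator
    can never be zero.
    """
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def disposition(score, reject_at=0.9, quarantine_at=0.7):
    """Map an overall spam probability to an action (illustrative thresholds)."""
    if score >= reject_at:
        return "reject"
    if score >= quarantine_at:
        return "quarantine"
    return "deliver"
```

Note that a few strongly legitimate tokens can pull the combined score down sharply even when some spam-like tokens are present, which is the behaviour that distinguishes this approach from a simple keyword list.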

Bayesian analysis has a number of advantages over conventional heuristic and content filters. Take Ann Summers as an example. Their legitimate email could well contain words such as 'lingerie' or 'erotic', which conventional filters may well flag as spam.

The key to Bayesian analysis, however, is that it looks at two sets of words: those that indicate spam and those that indicate legitimate email. In the case of Ann Summers, 'lingerie' will be marked as a token with a low spam probability because it has appeared in legitimate emails, and so will reduce the probability of the message being discarded as spam.
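
To illustrate how a token such as 'lingerie' could end up with a low spam probability, here is a simplified sketch that compares how often a token appears in spam with how often it appears in legitimate mail. The function name, the clamping to the range 0.01 to 0.99, and the 0.4 value for unseen tokens are assumptions modelled loosely on Graham's essay; real filters weight the counts in their own ways.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham, unknown=0.4):
    """Estimate the probability that a message containing this token is spam.

    spam_hits / ham_hits: number of spam / legitimate messages containing
    the token. n_spam / n_ham: total messages of each kind (both > 0).
    """
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return unknown  # token never seen before: assumed default
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single token is treated as absolute proof.
    return min(0.99, max(0.01, p))
```

For a site where 'lingerie' appears in 40 of 100 legitimate messages but only 10 of 100 spam messages, this gives a probability of about 0.2, so the token counts as evidence of legitimate mail rather than spam.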

For a short initial period, Bayesian filters may prove less effective, as the system needs to be trained by building up a database of categorised mails as they pass through your system. Once this period is over, however, a Bayesian filter does not require the constant monitoring and adjustment that traditional heuristic filters do, because it continues to learn from the mail it processes.
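
The training step might be sketched like this. The class and method names are hypothetical, and the sketch assumes a simple scheme in which each token is counted once per message; actual products may count and store tokens quite differently.

```python
from collections import Counter


class TokenDatabase:
    """Illustrative in-memory store of token counts from categorised mail."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam messages containing it
        self.ham_tokens = Counter()   # token -> number of legitimate messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        """Record one categorised message, counting each token once per message."""
        tokens = set(text.split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1
```

Every message that is classified, whether automatically or by a user's correction, can be fed back through `train`, which is how the filter keeps adapting without manual rule changes.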

Microsoft, Borderware and Spammunition (a freeware add-on to Outlook) have all done work on Bayesian filtering, and current implementations claim impressive filtration rates of over 99% combined with very low false positive rates.

The hope is that spammers will only be able to beat Bayesian filters by making spam virtually indistinguishable from legitimate email. The resulting dilution of the spammer's message should reduce response rates, hopefully making it uneconomic to continue spamming.

At the moment, dealing with spam is much like an arms race. As new methods to detect spam come along, spammers come up with new tricks to fool them. A huge amount is at stake for spammers and companies alike, but what is clear is that sitting on the sidelines is no longer an option.