Saturday, May 06, 2006

Spam prevention mechanism

One of the best-known spam prevention mechanisms is Bayesian filtering.
Bayesian filtering can classify almost any kind of data, but it is most widely applied in the email world, to classify a mail as spam.

The algorithm operates on the classic Bayes' theorem.
Probability that A occurs given B has occurred = Probability that B occurs given A has occurred X Probability that A occurs, divided by Probability that B occurs

P(A|B) = P(B|A) x P(A) / P(B)


In the email context, this means deciding whether a mail is spam based on the words in it.

Probability that a mail is spam given its words = Probability that these words occur in a spam mail X Probability that a given mail is spam, normalised by the probability that these words occur in any mail.
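
As a rough illustration, the calculation can be sketched in Python. The function name, its parameters, and the numbers are made up for this example; the denominator is expanded over spam and non-spam mail, which assumes every mail is one or the other.

```python
def spam_posterior(p_words_given_spam, p_words_given_ham, p_spam):
    """Return P(spam | words) using Bayes' theorem.

    All three inputs are illustrative probabilities between 0 and 1.
    "ham" here simply means mail that is not spam.
    """
    p_ham = 1.0 - p_spam
    # Denominator: probability of seeing these words in any mail at all.
    p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
    if p_words == 0:
        raise ValueError("these words never occur, so the posterior is undefined")
    return p_words_given_spam * p_spam / p_words


# Hypothetical numbers: the words show up in 80% of spam and 10% of other
# mail, and half of all incoming mail is spam.
posterior = spam_posterior(0.8, 0.1, 0.5)
print(round(posterior, 3))
```

A real filter would estimate these probabilities from the user's own mail rather than take them as fixed inputs.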

The probability that a given mail is spam is a user-specific factor. If the user rejects a lot of mail as spam, either because he is picky or because he really does get a lot of spam, this probability will be high for that user.
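
To see how much this user-specific factor matters, here is a small Python sketch. It assumes the prior is simply the fraction of mail the user has marked as spam; the helper names and the counts are hypothetical.

```python
def estimate_prior(marked_spam, total_mail):
    """Assumed estimate of P(spam): the share of mail this user marked as spam."""
    if total_mail <= 0:
        raise ValueError("need at least one mail to estimate the prior")
    return marked_spam / total_mail


def posterior(p_words_spam, p_words_ham, p_spam):
    """P(spam | words), with the denominator expanded over spam and non-spam."""
    p_words = p_words_spam * p_spam + p_words_ham * (1.0 - p_spam)
    return p_words_spam * p_spam / p_words


# Two hypothetical users who receive the same mail with the same words.
light_prior = estimate_prior(5, 100)    # marks few mails as spam
heavy_prior = estimate_prior(60, 100)   # marks most mails as spam

p_light = posterior(0.6, 0.2, light_prior)
p_heavy = posterior(0.6, 0.2, heavy_prior)
print(round(p_light, 2), round(p_heavy, 2))
```

With identical word evidence, the user who marks more mail as spam ends up with a much higher spam probability for the same message.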

But suppose the user marks very few mails as spam, and whatever he does mark happens to contain these words. That alone does not mean a new mail containing them is spam, because the same words can also appear in mail that is not spam. The probability of these words occurring in any mail (the denominator) accounts for this: words that are commonly used in all mail are not given high weightage in rejecting a mail as spam.
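
The effect of the denominator can be checked with a short Python sketch. The word probabilities below are invented for illustration, and the prior is fixed at 0.5 so that only the words make a difference.

```python
def word_posterior(p_word_spam, p_word_ham, p_spam=0.5):
    """P(spam | word), normalised by how often the word occurs in any mail."""
    # Denominator: probability that the word occurs in any mail.
    p_word = p_word_spam * p_spam + p_word_ham * (1.0 - p_spam)
    return p_word_spam * p_spam / p_word


# A common word (say "the") appears in nearly all mail, spam or not.
common = word_posterior(0.9, 0.9)

# A hypothetical spam-heavy word appears in 30% of spam but only 1% of other mail.
spammy = word_posterior(0.3, 0.01)

print(round(common, 2), round(spammy, 2))
```

The common word leaves the spam probability at the prior, even though it appears in most spam, while the rarer word that seldom shows up in legitimate mail pushes the probability close to 1.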
