Banging Oprah

Recently, I came across a blog posting about the SpamAssassin rule BANG_OPRAH, which looks for a word boundary followed by 'oprah' (case-insensitive) and an exclamation mark (bang):

body BANG_OPRAH /\boprah!/i
describe BANG_OPRAH Talks about Oprah with an exclamation!
score BANG_OPRAH 4.300

In terms of fidelity to the regular expression that defines it, the rule really ought to be called OPRAH_BANG, but clearly someone with a sense of humor decided on the less accurate but more entertaining moniker.
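
To see what the pattern does and doesn't match, here is a small sketch that runs the same regular expression in R. The sample strings are made up for illustration and are not from my corpus, and R's Perl-compatible mode is only a stand-in for SpamAssassin's own matching against the message body.

```r
# The rule's pattern, with the backslash doubled for an R string literal.
pattern <- "\\boprah!"

# Hypothetical sample lines, invented for this illustration.
samples <- c(
  "As seen on Oprah!",     # matches
  "OPRAH!!! Lose weight",  # matches: the /i flag ignores case
  "Oprah says hello",      # no exclamation mark
  "Oprah !",               # space between the name and the bang
  "hoprah!"                # no word boundary before the 'o'
)

# perl = TRUE uses PCRE; ignore.case = TRUE plays the role of /i.
hits <- grepl(pattern, samples, ignore.case = TRUE, perl = TRUE)
print(data.frame(text = samples, matches = hits))
```

The bang has to follow the name immediately, which is why the regular expression reads as "Oprah, then bang" rather than the other way round.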

I wondered how common this type of spam is in my corpus of filtered spam emails, so I used R to look at how often the BANG_OPRAH rule fires.

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")
> sum(data$bang_oprah)
[1] 3
> length(data$bang_oprah)
[1] 27775
> sum(data$bang_oprah) / length(data$bang_oprah) * 100
[1] 0.01080108

So the BANG_OPRAH rule fires only 3 times in a corpus of 27,775 spam emails, or about 0.011 percent of messages. But what about other rules? I thought I'd compare it to some of the rules for the stock alert spam that is so common nowadays.

> sum(data$stock_alert)
[1] 17
> sum(data$stock_alert) / length(data$bang_oprah) * 100
[1] 0.06120612
> sum(data$stock_pick)
[1] 10
> sum(data$stock_pick) / length(data$bang_oprah) * 100
[1] 0.0360036
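
The percentages so far all divide a hit count by the same corpus size, so the calculation can be wrapped in one helper. This is a minimal sketch, assuming each rule column holds 0/1 indicators (which is consistent with sum() returning hit counts). Since spam-all.tab isn't available here, the data frame `toy` is a stand-in rebuilt from the counts reported above, and `fire_rate` is a name made up for this example.

```r
# Corpus size and hit counts as reported in the console output above.
n <- 27775

# Stand-in for the real data: 0/1 columns with the reported number of hits.
toy <- data.frame(
  bang_oprah  = rep(c(1, 0), times = c(3,  n - 3)),
  stock_alert = rep(c(1, 0), times = c(17, n - 17)),
  stock_pick  = rep(c(1, 0), times = c(10, n - 10))
)

# Percentage of messages on which a rule fired.
# For a 0/1 column, mean() equals hits divided by the number of rows.
fire_rate <- function(df, rule) 100 * mean(df[[rule]])

rates <- sapply(names(toy), function(r) fire_rate(toy, r))
print(rates)
```

With the real table loaded as `data`, the same helper would be called as fire_rate(data, "stock_alert"), which also avoids borrowing the length of an unrelated column for the denominator.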

I'm surprised that the stock rules aren't firing more often. Clearly I need to understand better what these rules do. In the meantime, here's something to think about. There is a lot of discussion about the pros and cons of raising the spam score for various rules, such as BAYES_99 (which means that, according to the Bayes filter, an email has a 99 to 100 percent likelihood of being spam). But how often does BAYES_99 fire relative to the other Bayes probability rules? Here are the percentages according to R:

> sum(data$bayes_00)/length(data$score)*100
[1] 10.69307
> sum(data$bayes_05)/length(data$score)*100
[1] 2.027003
> sum(data$bayes_20)/length(data$score)*100
[1] 2.394239
> sum(data$bayes_40)/length(data$score)*100
[1] 3.189919
> sum(data$bayes_50)/length(data$score)*100
[1] 12.97210
> sum(data$bayes_60)/length(data$score)*100
[1] 4.792079
> sum(data$bayes_80)/length(data$score)*100
[1] 5.569757
> sum(data$bayes_95)/length(data$score)*100
[1] 4.630063
> sum(data$bayes_99)/length(data$score)*100
[1] 45.0369
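
As a quick arithmetic check on the figures above, the percentages can be turned back into message counts and added up. This is a sketch under two assumptions: the counts are back-calculated from the printed percentages and the corpus size of 27,775 rather than read from the data, and each message is taken to hit at most one BAYES_* rule.

```r
# Corpus size and the Bayes percentages exactly as printed above.
n <- 27775
pct <- c(
  bayes_00 = 10.69307, bayes_05 = 2.027003, bayes_20 = 2.394239,
  bayes_40 = 3.189919, bayes_50 = 12.97210, bayes_60 = 4.792079,
  bayes_80 = 5.569757, bayes_95 = 4.630063, bayes_99 = 45.0369
)

# Back-calculate hit counts; rounding removes the print truncation.
counts <- round(pct / 100 * n)
print(counts)

# Total messages with some Bayes rule, and the share of the corpus.
total_hits <- sum(counts)
print(total_hits)
print(100 * total_hits / n)
```

Under those assumptions the nine rules cover about 91.3 percent of the corpus, so roughly 8.7 percent of the messages did not trigger any of the Bayes rules listed here, and BAYES_99 alone accounts for nearly half of the hits.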

About

This page contains a single entry from the blog posted on September 3, 2007 9:27 AM.
