Recently, I came across a blog posting about the SpamAssassin rule BANG_OPRAH, which looks for a word boundary followed by 'oprah' (case-insensitive) and an exclamation mark (bang):
body BANG_OPRAH /\boprah!/i
describe BANG_OPRAH Talks about Oprah with an exclamation!
score BANG_OPRAH 4.300
In terms of fidelity to the regular expression that defines it, the rule really ought to be called OPRAH_BANG, but clearly someone with a sense of humor decided on the less accurate but more entertaining moniker.
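To make the pattern concrete, here is a small R sketch of how the regular expression behaves on a few strings. The sample strings are invented for illustration and are not messages from the corpus, and R's grepl with perl = TRUE only stands in for SpamAssassin's own Perl regex matching, so treat it as an approximation of what the rule catches.

```r
# Hypothetical sample lines (invented for illustration, not from the corpus)
samples <- c("Oprah! reveals her secret",
             "I watched Oprah yesterday",
             "as seen on OPRAH!!!",
             "Soprah! is not a word")

# \b requires a word boundary before "oprah"; ignore.case mirrors the /i flag
hits <- grepl("\\boprah!", samples, ignore.case = TRUE, perl = TRUE)
print(hits)
```

Only the first and third strings match: the second has no exclamation mark directly after the name, and in the fourth the letter before "oprah" means there is no word boundary.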
I wondered how common this type of spam is in my corpus of filtered spam emails, so I looked at how often the BANG_OPRAH rule fires.
> data <- read.table("spam-all.tab",
+ header=TRUE,
+ sep="\t",
+ quote="")
> sum(data$bang_oprah)
[1] 3
> length(data$bang_oprah)
[1] 27775
> sum(data$bang_oprah) / length(data$bang_oprah) * 100
[1] 0.01080108
So the BANG_OPRAH rule fires only 3 times in a corpus of 27,775 spam emails, or about 0.01 percent. But what about other rules? I thought I'd compare it to the rules for the stock alert spam that is so common nowadays.
> sum(data$stock_alert)
[1] 17
> sum(data$stock_alert) / length(data$bang_oprah) * 100
[1] 0.06120612
> sum(data$stock_pick)
[1] 10
> sum(data$stock_pick) / length(data$bang_oprah) * 100
[1] 0.0360036
I'm surprised that the stock rules aren't firing more often. Clearly I need to understand better what these rules do, but here's something to think about. There is a lot of discussion of the pros and cons of the spam score increase for various rules, such as the rule BAYES_99 (which means that an email has a 99 to 100 percent likelihood of being spam according to the Bayes filter). But how common is BAYES_99 relative to the other Bayes probability rules? Here are the percentages according to R:
> sum(data$bayes_00)/length(data$score)*100
[1] 10.69307
> sum(data$bayes_05)/length(data$score)*100
[1] 2.027003
> sum(data$bayes_20)/length(data$score)*100
[1] 2.394239
> sum(data$bayes_40)/length(data$score)*100
[1] 3.189919
> sum(data$bayes_50)/length(data$score)*100
[1] 12.97210
> sum(data$bayes_60)/length(data$score)*100
[1] 4.792079
> sum(data$bayes_80)/length(data$score)*100
[1] 5.569757
> sum(data$bayes_95)/length(data$score)*100
[1] 4.630063
> sum(data$bayes_99)/length(data$score)*100
[1] 45.0369
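The repeated sum/length lines can be collapsed into a single call. The sketch below assumes, as the commands above suggest, that each rule column holds 0 or 1 per message and has no missing values. The tiny data frame toy is invented so that the example runs on its own; it does not reproduce the real corpus.

```r
# Invented stand-in for the real data: one row per message, 0/1 per rule
toy <- data.frame(score    = c(5.1, 7.3, 12.0, 3.2),
                  bayes_00 = c(1, 0, 0, 0),
                  bayes_50 = c(0, 1, 0, 0),
                  bayes_99 = c(0, 0, 1, 1))

# Pick every column whose name starts with "bayes_"
bayes_cols <- grep("^bayes_", names(toy), value = TRUE)

# The mean of a 0/1 column is its firing rate; multiply by 100 for a percentage
bayes_pct <- 100 * colMeans(toy[bayes_cols])
print(bayes_pct)
```

Swapping toy for the data frame read from spam-all.tab should give all nine Bayes percentages in one named vector, provided the column names follow the bayes_NN pattern used above.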