With a little help from fellow Powersetter Lukas Biewald, I figured out how commonly various SpamAssassin rules occur in my collection of filtered spam email. I extracted the Top 10 with the following R code:
> data <- read.table("spam-all.tab",
+ header=TRUE,
+ sep="\t",
+ quote="")
> sorted_data <- sort(apply(data, 2, sum), decreasing=TRUE)
> sorted_data[1:10]
              bayes_99            rcvd_in_xbl         uribl_sc_surbl
                 12509                   8651                   7924
        uribl_ob_surbl           html_message         uribl_ws_surbl
                  7618                   7501                   6665
             uribl_sbl      rcvd_in_sorbs_dul         uribl_ab_surbl
                  6560                   6359                   5313
rcvd_in_bl_spamcop_net
                  5295
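The input file itself isn't shown, but the column sums only make sense as counts if each row is one message and each column is one rule, with a 1 where the rule fired and a 0 otherwise. Here is a minimal sketch under that assumption, using a made-up toy table (the `toy` values are invented, not taken from my spam collection). It also shows `colSums()`, which gives the same result as `apply(data, 2, sum)` for a numeric table, and turns the counts into shares of all messages.

```r
# Toy stand-in for spam-all.tab (made-up values, assumed layout):
# one row per message, one column per rule, 1 if the rule fired, else 0.
toy <- data.frame(
  bayes_99     = c(1, 1, 1, 0),
  html_message = c(1, 0, 1, 0),
  rcvd_in_xbl  = c(0, 1, 0, 0)
)

# colSums() is equivalent to apply(toy, 2, sum) for a numeric table.
rule_counts <- sort(colSums(toy), decreasing = TRUE)

# Share of messages on which each rule fired.
rule_share <- rule_counts / nrow(toy)

print(rule_counts)
print(rule_share)
```

Dividing by `nrow()` makes the numbers easier to compare across collections of different sizes.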
I looked up these rules here:
| bayes_99 | Bayesian spam probability is 99 to 100% |
| html_message | HTML included in message |
| rcvd_in_bl_spamcop_net | Received via a relay in bl.spamcop.net |
| rcvd_in_sorbs_dul | SORBS: sent directly from dynamic IP address |
| rcvd_in_xbl | Received via a relay in Spamhaus XBL |
| uribl_ab_surbl | Contains an URL listed in the AB SURBL blocklist |
| uribl_ob_surbl | Contains an URL listed in the OB SURBL blocklist |
| uribl_sc_surbl | Contains an URL listed in the SC SURBL blocklist |
| uribl_ws_surbl | Contains an URL listed in the WS SURBL blocklist |
| uribl_sbl | Contains an URL listed in the SBL blocklist |
A number of these rules refer to SURBL blocklists. The phrase "SURBL blocklist" is a bit funny, given that SURBL stands for "Spam URI Realtime Blocklists": expanded, it reads "SURBL (Spam URI Realtime Blocklists) blocklist", so it's a "blocklist blocklist". The phrases "PIN (Personal Identification Number) number" and "ATM (Automated Teller Machine) machine" are redundant in the same way. I'm sure there are plenty of others, and someone has probably listed them somewhere (maybe on Language Log).
According to http://www.surbl.org/,
"SURBLs differ from most other RBLs in that they're used to detect spam based on message body URIs (usually web sites). Unlike most other RBLs, SURBLs are not meant to identify spam senders by their message headers or connection IP addresses. Instead they allow you to identify messages by the spam sites mentioned in their message bodies."
The various SURBLs mentioned in SpamAssassin rules are the following:
- ab.surbl.org - AbuseButler spamvertised sites
- ob.surbl.org - Outblaze spamvertised sites
- sc.surbl.org - SpamCop message-body URI domains
- ws.surbl.org - sa-blacklist domains as a SURBL
And SBL is, of course, the famous Spamhaus Block List. Go, Spamhaus!
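To make the body-URI idea concrete, here is a rough sketch in R of how such a check might build its lookups: pull host names out of the message body, reduce each to a base domain, and prepend it to each of the SURBL zones listed above. As I understand it, that is the usual DNS blocklist convention, but treat the details as assumptions. The helper names (`extract_hosts`, `base_domain`, `surbl_queries`) and the sample message are made up for illustration, and taking the last two labels as the base domain is a deliberate oversimplification (it gets names like example.co.uk wrong). The sketch only builds the DNS names; it does not resolve them.

```r
# The four SURBL zones named in the list above.
zones <- c("ab.surbl.org", "ob.surbl.org", "sc.surbl.org", "ws.surbl.org")

# Hypothetical helper: extract lower-cased host names from http(s) URLs.
extract_hosts <- function(body) {
  urls <- regmatches(body, gregexpr("https?://[^[:space:]\"'<>]+", body))[[1]]
  hosts <- sub("^https?://([^/:?#]+).*$", "\\1", urls)
  hosts <- sub("\\.+$", "", hosts)  # drop trailing sentence punctuation
  unique(tolower(hosts))
}

# Hypothetical helper: naive base domain, keeping only the last two labels.
# A real check would need a proper list of top-level domains.
base_domain <- function(hosts) {
  vapply(strsplit(hosts, ".", fixed = TRUE),
         function(labels) paste(tail(labels, 2), collapse = "."),
         character(1))
}

# Hypothetical helper: one DNS query name per (domain, zone) pair.
surbl_queries <- function(domains, zones) {
  as.vector(outer(domains, zones, paste, sep = "."))
}

# Invented sample message body.
body <- "Cheap pills at http://pills.example.com/buy now, or see https://Shop.Example.net/x?y=1"
domains <- unique(base_domain(extract_hosts(body)))
queries <- surbl_queries(domains, zones)
print(queries)
```

Each resulting name, such as `example.com.sc.surbl.org`, would then be looked up in DNS, and a listing is signalled by the name resolving at all.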