Spamhaus Rules

With a little help from fellow Powersetter Lukas Biewald, I figured out how often each SpamAssassin rule fires across my collection of filtered spam email. I extracted the top 10 with the following R code:

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")
> sorted_data <- sort(colSums(data), decreasing=TRUE)
> sorted_data[1:10]
              bayes_99            rcvd_in_xbl         uribl_sc_surbl
                 12509                   8651                   7924
        uribl_ob_surbl           html_message         uribl_ws_surbl
                  7618                   7501                   6665
             uribl_sbl      rcvd_in_sorbs_dul         uribl_ab_surbl
                  6560                   6359                   5313
rcvd_in_bl_spamcop_net
                  5295
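
To make the counting step concrete, here is a minimal sketch on made-up data. It assumes the file has one row per message and one 0/1 column per rule, which is what summing the columns implies; the `toy` data frame and its values are hypothetical, not taken from my spam collection.

```r
# Hypothetical toy data: one row per message, one 0/1 column per rule.
toy <- data.frame(
  bayes_99     = c(1, 1, 1, 0),
  rcvd_in_xbl  = c(1, 0, 1, 0),
  html_message = c(0, 1, 0, 0)
)

# Column sums give the number of messages each rule fired on.
hits <- sort(colSums(toy), decreasing = TRUE)

# Column means give the same ranking as a fraction of all messages.
share <- sort(colMeans(toy), decreasing = TRUE)

print(hits)
print(round(share, 2))
```

The fraction is often easier to read than the raw count, since it does not depend on how many messages happen to be in the collection.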

I looked up these rules here:

bayes_99 Bayesian spam probability is 99 to 100%
html_message HTML included in message
rcvd_in_sorbs_dul SORBS: sent directly from dynamic IP address
rcvd_in_xbl Received via a relay in Spamhaus XBL
uribl_ab_surbl Contains an URL listed in the AB SURBL blocklist
uribl_ob_surbl Contains an URL listed in the OB SURBL blocklist
uribl_sc_surbl Contains an URL listed in the SC SURBL blocklist
uribl_ws_surbl Contains an URL listed in the WS SURBL blocklist
uribl_sbl Contains an URL listed in the SBL blocklist
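
Judging by the descriptions above, the rule names seem to follow a prefix convention: `uribl_` for rules that check URLs in the message body and `rcvd_in_` for rules that check where the message came from. The following rough sketch groups the top 10 on that assumption. The counts are copied from the output above; the grouping labels are mine.

```r
# Counts copied from the top-10 output above.
top10 <- c(bayes_99 = 12509, rcvd_in_xbl = 8651, uribl_sc_surbl = 7924,
           uribl_ob_surbl = 7618, html_message = 7501, uribl_ws_surbl = 6665,
           uribl_sbl = 6560, rcvd_in_sorbs_dul = 6359, uribl_ab_surbl = 5313,
           rcvd_in_bl_spamcop_net = 5295)

# Assumed convention: "uribl_" rules look at URLs in the body,
# "rcvd_in_" rules look at the relay or sending IP address.
kind <- ifelse(grepl("^uribl_", names(top10)), "body URL",
        ifelse(grepl("^rcvd_in_", names(top10)), "relay", "other"))

# How many of the top 10 fall into each group.
print(table(kind))

# The rules that refer specifically to a SURBL list.
surbl_rules <- names(top10)[grepl("_surbl$", names(top10))]
print(surbl_rules)
```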

A number of these rules refer to SURBL blocklists. The phrase "SURBL blocklist" is a bit funny, given that SURBL stands for "Spam URI Realtime Blocklists": spelled out, it reads "SURBL (Spam URI Realtime Blocklists) blocklist", so it's a "blocklist blocklist". The phrases "PIN (Personal Identification Number) number" and "ATM (Automated Teller Machine) machine" are redundant in the same way. I'm sure others could be enumerated, and probably have been somewhere (maybe on Language Log).

According to http://www.surbl.org/,

"SURBLs differ from most other RBLs in that they're used to detect spam based on message body URIs (usually web sites). Unlike most other RBLs, SURBLs are not meant to identify spam senders by their message headers or connection IP addresses. Instead they allow you to identify messages by the spam sites mentioned in their message bodies."
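
To illustrate the idea of checking message-body URIs rather than sender addresses, here is a simplified sketch that pulls the hosts out of a message body and builds the names one might look up. The message body is made up, the zone name `multi.surbl.org` is my assumption about how a lookup is addressed, and keeping only the last two labels of the host is a crude stand-in for proper registered-domain handling. No DNS query is actually sent.

```r
# Hypothetical message body, for illustration only.
body <- "Cheap pills at http://www.example.com/buy now! Also see https://spam.example.net/x?id=1"

# Pull out http/https URIs with a deliberately simple pattern.
uris <- regmatches(body, gregexpr("https?://[^[:space:]\"<>]+", body))[[1]]

# Keep only the host part of each URI.
hosts <- sub("^https?://([^/:?#]+).*$", "\\1", uris)

# Crude approximation of the registered domain: the last two labels.
last_two <- function(h) {
  parts <- strsplit(h, ".", fixed = TRUE)[[1]]
  paste(tail(parts, 2), collapse = ".")
}
domains <- unique(vapply(hosts, last_two, character(1), USE.NAMES = FALSE))

# Assumption: a lookup is a DNS query for "<domain>.<zone>".
queries <- paste0(domains, ".multi.surbl.org")
print(queries)
```

Note that nothing in this sketch looks at the message headers or the connecting IP address, which is the distinction the quoted passage draws.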

The various SURBLs mentioned in SpamAssassin rules are the following:

And SBL is, of course, the famous Spamhaus Block List. Go, Spamhaus!

About

This page contains a single entry from the blog posted on September 8, 2007 11:36 AM.