
September 2007 Archives

September 2, 2007

Back to the Pacific

I was in Monterey yesterday for a family BBQ, which I briefly snuck away from to visit some of my old haunts from high school: the various used book stores downtown, not far from the public library. I found Book Haven, and was delighted to discover their South Pacific collection. That's a subject area that usually isn't well represented, even in bigger used book stores. And when it is represented, it's usually in the form of WWII history. But I managed to find a few gems about the Melanesian Pacific there: Gardens of War, Cousteau's Papua New Guinea Journey, and Lightning Over Bougainville.

September 3, 2007

SpamAssassin Rules

Now that I have a better sense of the number of rules being triggered by my filtered spam (see previous postings), I'd like to know what kinds of patterns exist in the rules. Which rules occur the most? What are the most common co-occurrences of rules? In order to answer these questions, I had to find a way to look at individual rules. So I wrote a little Perl script using SpamAssassin to extract from each email the 'X-Spam-Score' header. For example, this one comes from the 9th spam email in my corpus:

X-Spam-Score: 7.293 (*******) BAYES_99,BIZ_TLD,DATE_IN_FUTURE_12_24,HOT_NASTY

This says the score assigned to the email by SpamAssassin is 7.293 and that it triggered 4 rules, which are listed by name.
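
As a rough illustration of what pulling the score and rule names out of such a header involves, here is a minimal Ruby sketch. It is not the Perl script mentioned above; the method name `parse_spam_score` is hypothetical, and the header layout (score, a run of asterisks in parentheses, then a comma-separated rule list) is assumed from the single example shown.

```ruby
# Parse an X-Spam-Score header line into a numeric score and a list of rule names.
# Assumption: the layout is "X-Spam-Score: <score> (<stars>) RULE_A,RULE_B,...",
# inferred from the one example header above.
def parse_spam_score(header)
  pattern = /\AX-Spam-Score:\s+(-?\d+(?:\.\d+)?)\s+\(\**\)\s*(\S*)\s*\z/
  match = header.match(pattern)
  return nil unless match # not an X-Spam-Score header in the assumed layout

  { score: match[1].to_f, rules: match[2].split(",") }
end

header = "X-Spam-Score: 7.293 (*******) BAYES_99,BIZ_TLD,DATE_IN_FUTURE_12_24,HOT_NASTY"
parsed = parse_spam_score(header)

puts parsed[:score]                           # the numeric score
puts parsed[:rules].length                    # how many rules were triggered
puts parsed[:rules].map(&:downcase).join(" ") # lowercase names, as in the R column names
```

Lowercasing the rule names at this stage would line them up with the column names used in the R session below.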

So far, so good. But is this email unusual? How common are the particular rules it triggered? Is there a lot of hot and nasty spam in the corpus? To answer that, I analyzed in R the data from my Perl header extraction script.

My first step was to load all of the data into R.

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")

There's a header row, with the spam score labelled as 'score' and all other columns consisting of the names of spam rules (in lowercase). We can get the full list of spam rules that way and see how many different rules there are:

> colnames(data)
  [1] "score"                  "act_now_caps"           "address_in_subject"
  [4] "addr_free"              "addr_nums_at_bigsite"   "all_natural"
  [7] "all_trusted"            "amateur_porn"           "amazing_stuff"
 [10] "as_seen_on"             "awl"                    "bang_exercise"
 [13] "bang_oprah"             "bang_quote"             "bayes_00"
 [16] "bayes_05"               "bayes_20"               "bayes_40"
 [19] "bayes_50"               "bayes_60"               "bayes_80"
 [22] "bayes_95"               "bayes_99"               "been_turned_down"
 [25] "biz_tld"                "blank_lines_70_80"      "blank_lines_90_100"
 [28] "body_enhancement"       "body_enhancement2"      "buy_direct"
 [31] "charset_faraway"        "charset_faraway_header" "click_below_caps"
 [34] "click_to_remove_1"      "compete"                "completely_free"
 [37] "confidential_order"     "confirmed_forged"       "consolidate_debt"
 [40] "credit_card"            "cum_shot"               "date_in_future_03_06"
 [43] "date_in_future_06_12"   "date_in_future_12_24"   "date_in_future_24_48"
 [46] "date_in_future_48_96"   "date_in_future_96_xx"   "date_in_past_12_24"
 [49] "date_in_past_24_48"     "date_in_past_48_96"     "date_in_past_96_xx"
 [52] "dear_friend"            "dear_something"         "diet_1"
 [55] "diet_2"                 "disguise_porn"          "dns_from_ahbl_rhsbl"
 [58] "dns_from_rfc_abuse"     "dns_from_rfc_bogusmx"   "dns_from_rfc_post"
 [61] "dns_from_rfc_whois"     "domain_4u2"             "domain_ratio"
 [64] "drugs_anxiety"          "drugs_anxiety_erec"     "drugs_anxiety_obfu"
 [67] "drugs_diet"             "drugs_diet_obfu"        "drugs_erectile"
 [70] "drugs_erectile_obfu"    "drugs_manykinds"        "drugs_muscle"
 [73] "drugs_pain"             "drugs_sleep"            "drugs_sleep_erec"
 [76] "drug_dosage"            "drug_ed_caps"           "drug_ed_combo"
 [79] "drug_ed_generic"        "drug_ed_online"         "drug_ed_sild"
 [82] "earn_per_week"          "entity_dec_alphanum"    "excuse_1"
 [85] "excuse_3"               "excuse_7"               "excuse_remove"
 [88] "extra_cash"             "extra_mpart_type"       "fake_helo_excite"
 [91] "fake_helo_lycos"        "fake_helo_mail_com_dom" "fake_helo_msn"
 [94] "fake_helo_yahoo_ca"     "fake_outblaze_rcvd"     "fin_free"
 [97] "forged_eudoramail_rcvd" "forged_hotmail_rcvd"    "forged_hotmail_rcvd2"
[100] "forged_ims_html"        "forged_ims_tags"        "forged_juno_rcvd"
[103] "forged_mua_aol_from"    "forged_mua_ims"         "forged_mua_mozilla"
[106] "forged_mua_oimo"        "forged_mua_outlook"     "forged_mua_thebat_boun"
[109] "forged_mua_thebat_cs"   "forged_outlook_html"    "forged_outlook_tags"
[112] "forged_rcvd_helo"       "forged_telesp_rcvd"     "forged_thebat_html"
[115] "forged_yahoo_rcvd"      "forward_looking"        "free_membership"
[118] "free_preview"           "free_sample"            "from_all_nums"
[121] "from_ends_in_nums"      "from_has_mixed_nums"    "from_has_mixed_nums3"
[124] "from_has_uline_nums"    "from_illegal_chars"     "from_no_lower"
[127] "from_no_user"           "from_num_at_webmail"    "from_offers"
[130] "frontpage"              "full_refund"            "gappy_subject"
[133] "get_paid"               "hardcore_porn"          "hash.0x80b5c28."
[136] "hdr_order_trimrs"       "head_illegal_chars"     "helo_dynamic_adelphia"
[139] "helo_dynamic_chello_nl" "helo_dynamic_comcast"   "helo_dynamic_dhcp"
[142] "helo_dynamic_dialin"    "helo_dynamic_hcc"       "helo_dynamic_hexip"
[145] "helo_dynamic_home_nl"   "helo_dynamic_ipaddr"    "helo_dynamic_ipaddr2"
[148] "helo_dynamic_ntl"       "helo_dynamic_ool"       "helo_dynamic_rogers"
[151] "helo_dynamic_split_ip"  "helo_dynamic_yahoobb"   "hg_hormone"
[154] "hide_win_status"        "hot_nasty"              "html_00_10"
[157] "html_10_20"             "html_20_30"             "html_30_40"
[160] "html_40_50"             "html_50_60"             "html_60_70"
[163] "html_80_90"             "html_90_100"            "html_attr_bad"
[166] "html_backhair_4"        "html_backhair_8"        "html_badtag_20_30"
[169] "html_badtag_30_40"      "html_badtag_40_50"      "html_badtag_50_60"
[172] "html_badtag_60_70"      "html_badtag_70_80"      "html_badtag_80_90"
[175] "html_charset_faraway"   "html_comment_saved_url" "html_event_unsafe"
[178] "html_font_big"          "html_font_face_bad"     "html_font_face_caps"
[181] "html_font_invisible"    "html_font_low_contrast" "html_font_size_huge"
[184] "html_font_size_large"   "html_font_size_none"    "html_font_size_tiny"
[187] "html_font_tiny"         "html_image_only_04"     "html_image_only_08"
[190] "html_image_only_12"     "html_image_only_16"     "html_image_only_20"
[193] "html_image_only_24"     "html_image_ratio_02"    "html_image_ratio_04"
[196] "html_image_ratio_06"    "html_image_ratio_08"    "html_link_push_here"
[199] "html_message"           "html_mime_no_html_tag"  "html_missing_ctype"
[202] "html_nonelement_00_10"  "html_nonelement_30_40"  "html_nonelement_60_70"
[205] "html_nonelement_90_100" "html_obfuscate_05_10"   "html_obfuscate_10_20"
[208] "html_obfuscate_30_40"   "html_obfuscate_50_60"   "html_obfuscate_80_90"
[211] "html_short_center"      "html_short_comment"     "html_short_length"
[214] "html_shouting3"         "html_shouting5"         "html_tag_exist_tbody"
[217] "html_text_after_body"   "html_text_after_html"   "html_title_empty"
[220] "html_web_bugs"          "http_ctrl_chars_host"   "http_escaped_host"
[223] "http_excessive_escapes" "impotence"              "info_tld"
[226] "initial_invest"         "invalid_date"           "invalid_date_tz_absurd"
[229] "invalid_msgid"          "invalid_tz_cst"         "invalid_tz_est"
[232] "invalid_tz_gmt"         "ip_link_plus"           "its_legal"
[235] "join_millions"          "longwords"              "lots_of_stuff"
[238] "mailto_to_remove"       "many_exclamations"      "marketing_partners"
[241] "meet_singles"           "million_usd"            "mime_base64_blanks"
[244] "mime_base64_text"       "mime_bound_dd_digits"   "mime_bound_many_hex"
[247] "mime_bound_nextpart"    "mime_charset_faraway"   "mime_header_ctype_only"
[250] "mime_html_mostly"       "mime_html_only"         "mime_html_only_multi"
[253] "mime_qp_long_line"      "mime_suspect_name"      "missing_date"
[256] "missing_headers"        "missing_mimeole"        "missing_subject"
[259] "money_back"             "more_sex"               "mortgage_best"
[262] "mortgage_rates"         "mpart_alt_diff"         "msgid_dollars"
[265] "msgid_from_mta_header"  "msgid_from_mta_id"      "msgid_outlook_invalid"
[268] "msgid_randy"            "msgid_spam_99x9xx99"    "msgid_spam_caps"
[271] "msgid_spam_letters"     "msgid_yahoo_caps"       "nasty_girls"
[274] "na_dollars"             "nigerian_body1"         "nigerian_body2"
[277] "nigerian_body3"         "nigerian_body4"         "nigerian_subject2"
[280] "normal_http_to_ip"      "not_advisor"            "no_disappointment"
[283] "no_dns_for_from"        "no_forms"               "no_obligation"
[286] "no_rdns_dotcom_helo"    "no_real_name"           "numeric_http_addr"
[289] "obfuscating_comment"    "obscured_email"         "offshore_scam"
[292] "one_time"               "opting_out"             "pay_site"
[295] "percent_random"         "pling_pling"            "pling_query"
[298] "porn_15"                "porn_16"                "porn_celebrity"
[301] "porn_url_misc"          "porn_url_sex"           "porn_url_slut"
[304] "prest_non_accredited"   "priority_no_name"       "ratware_egroups"
[307] "ratware_gecko_build"    "ratware_hash_2"         "ratware_hash_2_v2"
[310] "ratware_moz_malformed"  "ratware_rcvd_pf"        "rcvd_am_pm"
[313] "rcvd_by_ip"             "rcvd_double_ip_loose"   "rcvd_double_ip_spam"
[316] "rcvd_fake_helo_dotcom"  "rcvd_helo_ip_mismatch"  "rcvd_illegal_ip"
[319] "rcvd_in_bl_spamcop_net" "rcvd_in_bsp_other"      "rcvd_in_dsbl"
[322] "rcvd_in_njabl_dul"      "rcvd_in_njabl_proxy"    "rcvd_in_njabl_relay"
[325] "rcvd_in_njabl_spam"     "rcvd_in_sbl"            "rcvd_in_sorbs_dul"
[328] "rcvd_in_sorbs_http"     "rcvd_in_sorbs_misc"     "rcvd_in_sorbs_smtp"
[331] "rcvd_in_sorbs_web"      "rcvd_in_xbl"            "rcvd_numeric_helo"
[334] "refinance_your_home"    "remove_page"            "risk_free"
[337] "round_the_world"        "round_the_world_local"  "satis_guar"
[340] "save_thousands"         "seduction"              "see_for_yourself"
[343] "something_for_adults"   "some_breakthrough"      "stock_alert"
[346] "stock_pick"             "strong_buy"             "subject_diet"
[349] "subject_drug_gap_c"     "subject_drug_gap_l"     "subject_drug_gap_p"
[352] "subject_drug_gap_s"     "subject_drug_gap_va"    "subject_drug_gap_via"
[355] "subject_drug_gap_x"     "subject_sexual"         "subj_all_caps"
[358] "subj_buy"               "subj_dollars"           "subj_for_only"
[361] "subj_guaranteed"        "subj_has_spaces"        "subj_has_uniq_id"
[364] "subj_illegal_chars"     "subj_your_debt"         "sub_hello"
[367] "suspicious_recips"      "to_address_eq_real"     "to_empty"
[370] "to_malformed"           "tracker_id"             "unclaimed_money"
[373] "undisc_recips"          "unique_words"           "unresolved_template"
[376] "uppercase_25_50"        "uppercase_50_75"        "uppercase_75_100"
[379] "urg_biz"                "uribl_ab_surbl"         "uribl_ob_surbl"
[382] "uribl_ph_surbl"         "uribl_sbl"              "uribl_sc_surbl"
[385] "uribl_ws_surbl"         "uri_4you"               "uri_offers"
[388] "uri_redirector"         "userpass"               "us_dollars_3"
[391] "via_gap_gra"            "weird_port"             "weird_quoting"
[394] "why_pay_more"           "why_wait"               "with_lc_smtp"
[397] "work_at_home"           "x_auth_warn_faked"      "x_library"
[400] "x_message_info"         "your_income"            "you_won"

We can then see how many spam emails triggered each of the 4 rules by using the rule's name to access its associated column, like so:

> sum(data$bayes_99)
[1] 12509
> sum(data$biz_tld)
[1] 32
> sum(data$date_in_future_12_24)
[1] 94
> sum(data$hot_nasty)
[1] 461

And to put this in terms of percentages, which makes it a bit easier to get a sense of proportion:

> sum(data$bayes_99)/length(data$score)*100
[1] 45.0369
> sum(data$biz_tld)/length(data$score)*100
[1] 0.1152115
> sum(data$date_in_future_12_24)/length(data$score)*100
[1] 0.3384338
> sum(data$hot_nasty)/length(data$score)*100
[1] 1.659766

It will be interesting to explore the rule space of the spam corpus. It consists of 401 unique rules, which can be freely combined in a large number of ways. The number of possible combinations of the rules in the corpus should be the size of the power set of the 401 rules, 2^401 (minus one, for the empty set, since our corpus only contains emails that triggered at least one rule). I'm not sure how many of those possibilities are attested, but I'm sure it's only a small fraction of the space, because the power set of 401 rules is big. If you don't believe me, just ask Ruby's IRB:

[001] > puts sprintf("%9.2e", 2.power!(401))
5.16e+120
        nil
[002] > puts 2.power!(401).to_s.length
121
        nil
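
The `power!` call in the session above comes from older Ruby releases and, as far as I know, is not available in more recent ones. A minimal sketch of the same calculation with the `**` operator, this time also subtracting the empty set (the variable names are just for illustration):

```ruby
# Size of the rule space: every subset of the 401 rules except the empty set.
rule_count = 401
combinations = 2**rule_count - 1 # Ruby integers are arbitrary-precision

puts format("%.2e", combinations) # scientific notation, as in the IRB session
puts combinations.to_s.length     # number of decimal digits
```

Subtracting one makes no visible difference at this scale: the number still has the same count of digits and the same leading figures.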

Banging Oprah

Recently, I came across a blog posting about the SpamAssassin rule BANG_OPRAH, which looks for a word boundary followed by 'oprah' (case-insensitive) and an exclamation mark (bang):

body BANG_OPRAH /\boprah!/i
describe BANG_OPRAH
  Talks about Oprah with an exclamation!
score BANG_OPRAH 4.300
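
To get a feel for what the pattern does and doesn't catch, here is a small Ruby sketch that applies the same regular expression to a few made-up sample lines. The samples are invented for illustration, and SpamAssassin itself applies body rules to its own rendering of the message text, so this only approximates the rule's behavior.

```ruby
# The pattern from the BANG_OPRAH rule definition, written as a Ruby regex.
BANG_OPRAH = /\boprah!/i

# Invented sample lines, not taken from the corpus.
samples = [
  "As seen on Oprah! Lose weight now", # word boundary, "Oprah", then a bang
  "OPRAH!!! recommends this",          # matching is case-insensitive
  "Oprah recommends this",             # no bang directly after the name
  "Noprah! is not a word"              # no word boundary before "oprah"
]

samples.each do |text|
  verdict = BANG_OPRAH.match?(text) ? "fires" : "quiet"
  puts "#{verdict}: #{text}"
end
```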

In terms of fidelity to the regular expression that defines it, the rule really ought to be called OPRAH_BANG, but clearly someone with a sense of humor decided on the less accurate but more entertaining moniker.

I wondered how common this type of spam is in my corpus of filtered spam emails, so I looked at the frequency of the BANG_OPRAH rule.

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")
> sum(data$bang_oprah)
[1] 3
> length(data$bang_oprah)
[1] 27775
> sum(data$bang_oprah) / length(data$bang_oprah) * 100
[1] 0.01080108

So the BANG_OPRAH rule fires only 3 times in a corpus of 27,775 spam emails. That's about a hundredth of a percent. But what about other rules? I thought I'd compare it to some of the stock alert spam that is so common nowadays.

> sum(data$stock_alert)
[1] 17
> sum(data$stock_alert) / length(data$bang_oprah) * 100
[1] 0.06120612
> sum(data$stock_pick)
[1] 10
> sum(data$stock_pick) / length(data$bang_oprah) * 100
[1] 0.0360036

I'm surprised that the stock rules aren't firing more often. Clearly I need to understand better what these rules do, but here's something to think about. There is a lot of discussion of the pros and cons of the spam score increase for various rules, such as the rule BAYES_99 (which means that an email has a 99 to 100 percent likelihood of being spam according to the Bayes filter). But how common is it relative to the other Bayes probabilities? Here are the percentages according to R:

> sum(data$bayes_00)/length(data$score)*100
[1] 10.69307
> sum(data$bayes_05)/length(data$score)*100
[1] 2.027003
> sum(data$bayes_20)/length(data$score)*100
[1] 2.394239
> sum(data$bayes_40)/length(data$score)*100
[1] 3.189919
> sum(data$bayes_50)/length(data$score)*100
[1] 12.97210
> sum(data$bayes_60)/length(data$score)*100
[1] 4.792079
> sum(data$bayes_80)/length(data$score)*100
[1] 5.569757
> sum(data$bayes_95)/length(data$score)*100
[1] 4.630063
> sum(data$bayes_99)/length(data$score)*100
[1] 45.0369

Scripting, Old School


My first programming language was Icon. I'm not sure how many readers will be acquainted with the language. Probably not many, but maybe more than its predecessor, SNOBOL.

After Icon, I learned a handful of other languages. One of these was Perl. Although I've come to prefer Ruby (the language of choice at Powerset), I've been writing some Perl code lately to take advantage of the SpamAssassin Perl module. There are some features of the language that feel clumsy now. But I'm enjoying getting re-acquainted with it, warts and all. Which got me thinking... What ever happened to Perl 6?

Perl 6 has been in the works for years. I just saw that there's an upcoming book on the topic, suggesting that a milestone release is imminent. I also had a peek at perl.com and found the article "Everyday Perl 6", which says Perl 6 will be here soon. Let's hope it's true.


September 5, 2007

All Strings Considered

I've been reading about finite state machines lately, in order to better understand how lexical transducers work. It's gotten me thinking about string algorithms and the nitty-gritty details of regular expressions (regexes). Since I've been doing some Perl hacking lately, I picked up a copy of Mastering Algorithms with Perl and started reading the chapter on string algorithms. It's got some practical advice about optimizing regular expressions:

* consider anchoring matches if applicable
* avoid alternation with the vertical bar
* avoid needless repetition quantifiers
* if you use alternation, combine the zero-width positive lookahead assertion with a character class
* leading or trailing wildcards are unnecessary

Wise words, indeed. But to understand the "why" behind these tips, you've got to understand how regexes are implemented. Basically, regexes are grounded in the concept of a finite state machine.

A finite state machine (FSM) has a finite number of states, including a start and end state, with transitions from one to another. An FSM performs matches by walking through a string one character at a time and changing its state as it goes along. If you look for a pattern with an anchor (say, at the beginning of a line), the machine will discover almost immediately on each non-matching line that it cannot transition through its states, instead of having to go through nearly all of the characters. If the lines in a text are short, the unanchored regex is not much of a problem, but if they're long, performance will suffer.
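
The difference anchoring makes can be sketched with a toy machine. The Ruby below is a hand-written FSM for the pattern "ab*c" (an "a", any number of "b"s, then a "c"), plus a naive unanchored search that restarts the machine at every position. All names here are hypothetical, and real regex engines use optimizations this sketch ignores, so the character counts are only illustrative.

```ruby
# Transition table for the pattern "ab*c": state -> { character -> next state }.
TRANSITIONS = {
  start: { "a" => :seen_a },
  seen_a: { "b" => :seen_a, "c" => :accept }
}.freeze

# Run the machine from position `from`.
# Returns [matched?, number of characters examined].
def run_machine(text, from)
  state = :start
  examined = 0
  text[from..-1].each_char do |char|
    examined += 1
    state = TRANSITIONS.dig(state, char)
    return [false, examined] if state.nil?     # no transition: dead end
    return [true, examined] if state == :accept
  end
  [false, examined]                            # ran out of input
end

# Anchored search: only try at the start of the line.
def anchored_steps(line)
  run_machine(line, 0)
end

# Unanchored search: retry from each position until something matches.
def unanchored_steps(line)
  total = 0
  (0...line.length).each do |from|
    matched, examined = run_machine(line, from)
    total += examined
    return [true, total] if matched
  end
  [false, total]
end

long_line = "x" * 50
p anchored_steps(long_line)   # gives up after the first character
p unanchored_steps(long_line) # examines a character at every position
```

On a line with no match, the anchored search stops after one character, while the unanchored one has to touch every position before giving up.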

But here's where things get interesting. Most regex engines go beyond strict finite state machines in their implementation, as the authors of the Perl algorithms book observe:

"Perl's regular expressions aren't, strictly speaking, regular. They're "superregular": they include tricks that can't be implemented with the theoretical basis of regular expressions, a deterministic finite automaton [...] One of these tricks is backreferences: \1, \2. Strict regular expressions would not know how to refer back to what has been matched; they have no memory of what they have seen."

Okay, but why are these backreferences necessary? One linguistic example springs to mind: reduplication ("a morphological process by which the root or stem of a word, or only part of it, is repeated"). You can try searching for examples of reduplication in the Mayan language Tzeltal (also spelled 'Tseltal') using Perl regexes. With the permission of the author, Brian Stross, I put up a regular expression search interface to two text collections: Demons and Monsters and Love in the Armpit.

Let's say you wanted to search one of the texts for all of the CVC words (words of the pattern consonant-vowel-consonant). You could try this regex (where the pattern occurs between word boundaries, indicated by \b, and vowels and consonants are treated as character classes, with consonants approximated as any word character that is not a vowel):

\b([^aeiou\W][aeiou][^aeiou\W])\b

You'll get matches for words like puy 'snail' or sok 'with'. But let's say you wanted to find words that have a repeated syllable (e.g., jimjim 'whirling'). You could try using this regex (which uses the backreference \1 to refer back to whatever match is found for the part of the regex between parentheses):

\b([^aeiou\W][aeiou][^aeiou\W])\1\b
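
Both patterns can be tried out without the search interface. The Ruby sketch below runs them over the example words from the text, plus one invented non-example ("jimjam", which is not claimed to be a Tzeltal word) to show the backreference rejecting a syllable that is not repeated exactly.

```ruby
# The two patterns from the text, unchanged.
CVC          = /\b([^aeiou\W][aeiou][^aeiou\W])\b/
REDUPLICATED = /\b([^aeiou\W][aeiou][^aeiou\W])\1\b/

# "puy", "sok", and "jimjim" come from the text; "jimjam" is a made-up contrast case.
words = %w[puy sok jimjim jimjam]

words.each do |word|
  kinds = []
  kinds << "CVC" if CVC.match?(word)
  kinds << "reduplicated (#{word[REDUPLICATED, 1]} + #{word[REDUPLICATED, 1]})" if REDUPLICATED.match?(word)
  kinds << "no match" if kinds.empty?
  puts "#{word}: #{kinds.join(', ')}"
end
```

Note that "jimjim" does not count as a CVC word here, because the trailing \b in the first pattern requires the word to end after the third letter.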

Of course, as an alternative, you could try finding reduplication in texts the old-fashioned way, by hand, toiling for hours. But regexes are a whole lot easier.

September 8, 2007

Spamhaus Rules

With a little help from fellow Powersetter Lukas Biewald, I figured out how commonly various SpamAssassin rules occur in my collection of filtered spam email. I extracted the Top 10 with the following R code:

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")
> sorted_data <- sort(apply(data, 2, sum), decreasing=T)
> sorted_data[1:10]
              bayes_99            rcvd_in_xbl         uribl_sc_surbl
                 12509                   8651                   7924
        uribl_ob_surbl           html_message         uribl_ws_surbl
                  7618                   7501                   6665
             uribl_sbl      rcvd_in_sorbs_dul         uribl_ab_surbl
                  6560                   6359                   5313
rcvd_in_bl_spamcop_net
                  5295

I looked up these rules here:

bayes_99 Bayesian spam probability is 99 to 100%
html_message HTML included in message
rcvd_in_sorbs_dul SORBS: sent directly from dynamic IP address
rcvd_in_xbl Received via a relay in Spamhaus XBL
uribl_ab_surbl Contains an URL listed in the AB SURBL blocklist
uribl_ob_surbl Contains an URL listed in the OB SURBL blocklist
uribl_sc_surbl Contains an URL listed in the SC SURBL blocklist
uribl_ws_surbl Contains an URL listed in the WS SURBL blocklist
uribl_sbl Contains an URL listed in the SBL blocklist

A number of these rules refer to SURBL blocklists. The phrase "SURBL blocklist" is a bit funny, given that SURBL stands for "Spam URI Realtime Blocklists": "SURBL (Spam URI Realtime Blocklists) blocklist". So it's a "blocklist blocklist". The phrases "PIN (Personal Identification Number) number" or "ATM (Automated Teller Machine) machine" show similar funniness. I'm sure others could be enumerated, and probably have been somewhere (maybe on Language Log).

According to http://www.surbl.org/,

"SURBLs differ from most other RBLs in that they're used to detect spam based on message body URIs (usually web sites). Unlike most other RBLs, SURBLs are not meant to identify spam senders by their message headers or connection IP addresses. Instead they allow you to identify messages by the spam sites mentioned in their message bodies."

The various SURBLs mentioned in SpamAssassin rules are the following:

And SBL is, of course, the famous Spamhaus Block List. Go, Spamhaus!

September 17, 2007

Fallout 3

It seems like next year is going to be a good year for video games. Not only is Spore supposed to be released, but so is Fallout 3, the latest installment in the Fallout series. I never played the original, but I spent untold hours on Fallout 2 and Fallout Tactics. I can hardly wait...

About September 2007

This page contains all entries posted to Nerd Industries: Stuart Robinson's blog in September 2007. They are listed from oldest to newest.
