SpamAssassin Rules

Now that I have a better sense of the number of rules being triggered by my filtered spam (see previous postings), I'd like to know what kinds of patterns exist in the rules. Which rules occur the most? What are the most common co-occurrences of rules? In order to answer these questions, I had to find a way to look at individual rules. So I wrote a little Perl script using SpamAssassin to extract from each email the 'X-Spam-Score' header. For example, this one comes from the 9th spam email in my corpus:

X-Spam-Score: 7.293 (*******) BAYES_99,BIZ_TLD,DATE_IN_FUTURE_12_24,HOT_NASTY

This says the score assigned to the email by SpamAssassin is 7.293 and that it triggered 4 rules, which are listed by name.
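
For illustration, here is a minimal Ruby sketch of how such a header line could be split into its score and rule names. It is not the Perl script mentioned above; the names `HEADER_PATTERN` and `parse_spam_score` are made up for this example, and it assumes the header always has the layout shown (score, a parenthesised run of stars, then comma-separated rule names on a single unfolded line).

```ruby
# Hypothetical helper: parse one X-Spam-Score header line.
# Assumed layout: "X-Spam-Score: <score> (<stars>) RULE_A,RULE_B,..."
HEADER_PATTERN = /\AX-Spam-Score:\s+(-?\d+(?:\.\d+)?)\s+\(\**\)\s*(.*)\z/

def parse_spam_score(line)
  match = HEADER_PATTERN.match(line.strip)
  return nil unless match # not an X-Spam-Score header in the assumed layout

  {
    score: match[1].to_f,
    # Lowercase the names so they line up with lowercase column names.
    rules: match[2].split(",").map { |name| name.strip.downcase }
  }
end

parsed = parse_spam_score(
  "X-Spam-Score: 7.293 (*******) BAYES_99,BIZ_TLD,DATE_IN_FUTURE_12_24,HOT_NASTY"
)
puts parsed[:score]         # prints 7.293
puts parsed[:rules].inspect # prints ["bayes_99", "biz_tld", "date_in_future_12_24", "hot_nasty"]
```

Lines that do not match the assumed layout come back as `nil`, so a caller can skip them rather than guess.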

So far, so good. But is this email unusual? How common are the particular rules it triggered? Is there a lot of hot and nasty spam in the corpus? To answer that, I analyzed in R the data from my Perl header extraction script.

My first step was to load all of the data into R.

> data <- read.table("spam-all.tab",
+                    header=TRUE,
+                    sep="\t",
+                    quote="")

There's a header row, with the spam score labelled as 'score' and all other columns consisting of the names of spam rules (in lowercase). We can get the full list of spam rules that way and see how many different rules there are:

> colnames(data)
  [1] "score"                  "act_now_caps"           "address_in_subject"
  [4] "addr_free"              "addr_nums_at_bigsite"   "all_natural"
  [7] "all_trusted"            "amateur_porn"           "amazing_stuff"
 [10] "as_seen_on"             "awl"                    "bang_exercise"
 [13] "bang_oprah"             "bang_quote"             "bayes_00"
 [16] "bayes_05"               "bayes_20"               "bayes_40"
 [19] "bayes_50"               "bayes_60"               "bayes_80"
 [22] "bayes_95"               "bayes_99"               "been_turned_down"
 [25] "biz_tld"                "blank_lines_70_80"      "blank_lines_90_100"
 [28] "body_enhancement"       "body_enhancement2"      "buy_direct"
 [31] "charset_faraway"        "charset_faraway_header" "click_below_caps"
 [34] "click_to_remove_1"      "compete"                "completely_free"
 [37] "confidential_order"     "confirmed_forged"       "consolidate_debt"
 [40] "credit_card"            "cum_shot"               "date_in_future_03_06"
 [43] "date_in_future_06_12"   "date_in_future_12_24"   "date_in_future_24_48"
 [46] "date_in_future_48_96"   "date_in_future_96_xx"   "date_in_past_12_24"
 [49] "date_in_past_24_48"     "date_in_past_48_96"     "date_in_past_96_xx"
 [52] "dear_friend"            "dear_something"         "diet_1"
 [55] "diet_2"                 "disguise_porn"          "dns_from_ahbl_rhsbl"
 [58] "dns_from_rfc_abuse"     "dns_from_rfc_bogusmx"   "dns_from_rfc_post"
 [61] "dns_from_rfc_whois"     "domain_4u2"             "domain_ratio"
 [64] "drugs_anxiety"          "drugs_anxiety_erec"     "drugs_anxiety_obfu"
 [67] "drugs_diet"             "drugs_diet_obfu"        "drugs_erectile"
 [70] "drugs_erectile_obfu"    "drugs_manykinds"        "drugs_muscle"
 [73] "drugs_pain"             "drugs_sleep"            "drugs_sleep_erec"
 [76] "drug_dosage"            "drug_ed_caps"           "drug_ed_combo"
 [79] "drug_ed_generic"        "drug_ed_online"         "drug_ed_sild"
 [82] "earn_per_week"          "entity_dec_alphanum"    "excuse_1"
 [85] "excuse_3"               "excuse_7"               "excuse_remove"
 [88] "extra_cash"             "extra_mpart_type"       "fake_helo_excite"
 [91] "fake_helo_lycos"        "fake_helo_mail_com_dom" "fake_helo_msn"
 [94] "fake_helo_yahoo_ca"     "fake_outblaze_rcvd"     "fin_free"
 [97] "forged_eudoramail_rcvd" "forged_hotmail_rcvd"    "forged_hotmail_rcvd2"
[100] "forged_ims_html"        "forged_ims_tags"        "forged_juno_rcvd"
[103] "forged_mua_aol_from"    "forged_mua_ims"         "forged_mua_mozilla"
[106] "forged_mua_oimo"        "forged_mua_outlook"     "forged_mua_thebat_boun"
[109] "forged_mua_thebat_cs"   "forged_outlook_html"    "forged_outlook_tags"
[112] "forged_rcvd_helo"       "forged_telesp_rcvd"     "forged_thebat_html"
[115] "forged_yahoo_rcvd"      "forward_looking"        "free_membership"
[118] "free_preview"           "free_sample"            "from_all_nums"
[121] "from_ends_in_nums"      "from_has_mixed_nums"    "from_has_mixed_nums3"
[124] "from_has_uline_nums"    "from_illegal_chars"     "from_no_lower"
[127] "from_no_user"           "from_num_at_webmail"    "from_offers"
[130] "frontpage"              "full_refund"            "gappy_subject"
[133] "get_paid"               "hardcore_porn"          "hash.0x80b5c28."
[136] "hdr_order_trimrs"       "head_illegal_chars"     "helo_dynamic_adelphia"
[139] "helo_dynamic_chello_nl" "helo_dynamic_comcast"   "helo_dynamic_dhcp"
[142] "helo_dynamic_dialin"    "helo_dynamic_hcc"       "helo_dynamic_hexip"
[145] "helo_dynamic_home_nl"   "helo_dynamic_ipaddr"    "helo_dynamic_ipaddr2"
[148] "helo_dynamic_ntl"       "helo_dynamic_ool"       "helo_dynamic_rogers"
[151] "helo_dynamic_split_ip"  "helo_dynamic_yahoobb"   "hg_hormone"
[154] "hide_win_status"        "hot_nasty"              "html_00_10"
[157] "html_10_20"             "html_20_30"             "html_30_40"
[160] "html_40_50"             "html_50_60"             "html_60_70"
[163] "html_80_90"             "html_90_100"            "html_attr_bad"
[166] "html_backhair_4"        "html_backhair_8"        "html_badtag_20_30"
[169] "html_badtag_30_40"      "html_badtag_40_50"      "html_badtag_50_60"
[172] "html_badtag_60_70"      "html_badtag_70_80"      "html_badtag_80_90"
[175] "html_charset_faraway"   "html_comment_saved_url" "html_event_unsafe"
[178] "html_font_big"          "html_font_face_bad"     "html_font_face_caps"
[181] "html_font_invisible"    "html_font_low_contrast" "html_font_size_huge"
[184] "html_font_size_large"   "html_font_size_none"    "html_font_size_tiny"
[187] "html_font_tiny"         "html_image_only_04"     "html_image_only_08"
[190] "html_image_only_12"     "html_image_only_16"     "html_image_only_20"
[193] "html_image_only_24"     "html_image_ratio_02"    "html_image_ratio_04"
[196] "html_image_ratio_06"    "html_image_ratio_08"    "html_link_push_here"
[199] "html_message"           "html_mime_no_html_tag"  "html_missing_ctype"
[202] "html_nonelement_00_10"  "html_nonelement_30_40"  "html_nonelement_60_70"
[205] "html_nonelement_90_100" "html_obfuscate_05_10"   "html_obfuscate_10_20"
[208] "html_obfuscate_30_40"   "html_obfuscate_50_60"   "html_obfuscate_80_90"
[211] "html_short_center"      "html_short_comment"     "html_short_length"
[214] "html_shouting3"         "html_shouting5"         "html_tag_exist_tbody"
[217] "html_text_after_body"   "html_text_after_html"   "html_title_empty"
[220] "html_web_bugs"          "http_ctrl_chars_host"   "http_escaped_host"
[223] "http_excessive_escapes" "impotence"              "info_tld"
[226] "initial_invest"         "invalid_date"           "invalid_date_tz_absurd"
[229] "invalid_msgid"          "invalid_tz_cst"         "invalid_tz_est"
[232] "invalid_tz_gmt"         "ip_link_plus"           "its_legal"
[235] "join_millions"          "longwords"              "lots_of_stuff"
[238] "mailto_to_remove"       "many_exclamations"      "marketing_partners"
[241] "meet_singles"           "million_usd"            "mime_base64_blanks"
[244] "mime_base64_text"       "mime_bound_dd_digits"   "mime_bound_many_hex"
[247] "mime_bound_nextpart"    "mime_charset_faraway"   "mime_header_ctype_only"
[250] "mime_html_mostly"       "mime_html_only"         "mime_html_only_multi"
[253] "mime_qp_long_line"      "mime_suspect_name"      "missing_date"
[256] "missing_headers"        "missing_mimeole"        "missing_subject"
[259] "money_back"             "more_sex"               "mortgage_best"
[262] "mortgage_rates"         "mpart_alt_diff"         "msgid_dollars"
[265] "msgid_from_mta_header"  "msgid_from_mta_id"      "msgid_outlook_invalid"
[268] "msgid_randy"            "msgid_spam_99x9xx99"    "msgid_spam_caps"
[271] "msgid_spam_letters"     "msgid_yahoo_caps"       "nasty_girls"
[274] "na_dollars"             "nigerian_body1"         "nigerian_body2"
[277] "nigerian_body3"         "nigerian_body4"         "nigerian_subject2"
[280] "normal_http_to_ip"      "not_advisor"            "no_disappointment"
[283] "no_dns_for_from"        "no_forms"               "no_obligation"
[286] "no_rdns_dotcom_helo"    "no_real_name"           "numeric_http_addr"
[289] "obfuscating_comment"    "obscured_email"         "offshore_scam"
[292] "one_time"               "opting_out"             "pay_site"
[295] "percent_random"         "pling_pling"            "pling_query"
[298] "porn_15"                "porn_16"                "porn_celebrity"
[301] "porn_url_misc"          "porn_url_sex"           "porn_url_slut"
[304] "prest_non_accredited"   "priority_no_name"       "ratware_egroups"
[307] "ratware_gecko_build"    "ratware_hash_2"         "ratware_hash_2_v2"
[310] "ratware_moz_malformed"  "ratware_rcvd_pf"        "rcvd_am_pm"
[313] "rcvd_by_ip"             "rcvd_double_ip_loose"   "rcvd_double_ip_spam"
[316] "rcvd_fake_helo_dotcom"  "rcvd_helo_ip_mismatch"  "rcvd_illegal_ip"
[319] "rcvd_in_bl_spamcop_net" "rcvd_in_bsp_other"      "rcvd_in_dsbl"
[322] "rcvd_in_njabl_dul"      "rcvd_in_njabl_proxy"    "rcvd_in_njabl_relay"
[325] "rcvd_in_njabl_spam"     "rcvd_in_sbl"            "rcvd_in_sorbs_dul"
[328] "rcvd_in_sorbs_http"     "rcvd_in_sorbs_misc"     "rcvd_in_sorbs_smtp"
[331] "rcvd_in_sorbs_web"      "rcvd_in_xbl"            "rcvd_numeric_helo"
[334] "refinance_your_home"    "remove_page"            "risk_free"
[337] "round_the_world"        "round_the_world_local"  "satis_guar"
[340] "save_thousands"         "seduction"              "see_for_yourself"
[343] "something_for_adults"   "some_breakthrough"      "stock_alert"
[346] "stock_pick"             "strong_buy"             "subject_diet"
[349] "subject_drug_gap_c"     "subject_drug_gap_l"     "subject_drug_gap_p"
[352] "subject_drug_gap_s"     "subject_drug_gap_va"    "subject_drug_gap_via"
[355] "subject_drug_gap_x"     "subject_sexual"         "subj_all_caps"
[358] "subj_buy"               "subj_dollars"           "subj_for_only"
[361] "subj_guaranteed"        "subj_has_spaces"        "subj_has_uniq_id"
[364] "subj_illegal_chars"     "subj_your_debt"         "sub_hello"
[367] "suspicious_recips"      "to_address_eq_real"     "to_empty"
[370] "to_malformed"           "tracker_id"             "unclaimed_money"
[373] "undisc_recips"          "unique_words"           "unresolved_template"
[376] "uppercase_25_50"        "uppercase_50_75"        "uppercase_75_100"
[379] "urg_biz"                "uribl_ab_surbl"         "uribl_ob_surbl"
[382] "uribl_ph_surbl"         "uribl_sbl"              "uribl_sc_surbl"
[385] "uribl_ws_surbl"         "uri_4you"               "uri_offers"
[388] "uri_redirector"         "userpass"               "us_dollars_3"
[391] "via_gap_gra"            "weird_port"             "weird_quoting"
[394] "why_pay_more"           "why_wait"               "with_lc_smtp"
[397] "work_at_home"           "x_auth_warn_faked"      "x_library"
[400] "x_message_info"         "your_income"            "you_won"

We can then see how many spam emails triggered each of the 4 rules by using the rule's name to access its associated column, like so:

> sum(data$bayes_99)
[1] 12509
> sum(data$biz_tld)
[1] 32
> sum(data$date_in_future_12_24)
[1] 94
> sum(data$hot_nasty)
[1] 461

And to put this in terms of percentages, which makes it a bit easier to get a sense of proportion:

> sum(data$bayes_99)/length(data$score)*100
[1] 45.0369
> sum(data$biz_tld)/length(data$score)*100
[1] 0.1152115
> sum(data$date_in_future_12_24)/length(data$score)*100
[1] 0.3384338
> sum(data$hot_nasty)/length(data$score)*100
[1] 1.659766
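
The same count-and-percentage arithmetic can be applied to every rule column at once, which is one way to ask which rules occur the most. Below is a rough Ruby sketch of that idea. It assumes each rule column holds a 0/1 flag per email, and `SAMPLE` is made-up data for illustration, not rows from the real corpus; `rule_percentages` is a hypothetical helper name.

```ruby
# Made-up tab-separated sample: a score column followed by 0/1 rule flags.
SAMPLE = <<~TSV
  score\tbayes_99\tbiz_tld\thot_nasty
  7.293\t1\t1\t1
  5.100\t1\t0\t0
  3.200\t0\t0\t0
  9.900\t1\t0\t1
TSV

# Hypothetical helper: per-rule trigger count and percentage of all emails.
def rule_percentages(tsv)
  header, *rows = tsv.lines.map { |line| line.chomp.split("\t") }
  rule_names = header.drop(1) # the first column is the score, not a rule

  counts = Hash.new(0)
  rows.each do |row|
    rule_names.zip(row.drop(1)).each do |name, flag|
      counts[name] += flag.to_i
    end
  end

  rule_names.to_h do |name|
    [name, { count: counts[name], percent: 100.0 * counts[name] / rows.length }]
  end
end

# List the rules from most to least frequently triggered.
rule_percentages(SAMPLE)
  .sort_by { |_name, stat| -stat[:count] }
  .each { |name, stat| puts format("%-10s %3d  %6.2f%%", name, stat[:count], stat[:percent]) }
```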

It will be interesting to explore the rule space of the spam corpus. It consists of 401 unique rules, which can be freely combined in a large number of ways. The number of possible rule combinations is the size of the power set of those 401 rules, that is 2^401 (minus one for the empty set, since our corpus only contains emails that triggered at least one rule). I'm not sure how many of those possibilities are attested, but I'm sure it's only a small fraction of the space: each email contributes at most one combination, and the power set of a 401-element set is big. If you don't believe me, just ask Ruby's IRB:

[001] > puts sprintf("%9.2e", 2**401)
5.16e+120
        nil
[002] > puts (2**401).to_s.length
121
        nil
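
As a rough sketch of how the attested combinations could be counted and set against that number: treat each email's triggered rules as a set and count the distinct sets. The helper names and `sample_emails` below are made up for illustration; the real input would be one rule list per email in the corpus.

```ruby
RULE_COUNT = 401 # number of rule columns in the table above

# Non-empty subsets of a set with n elements: 2**n - 1.
def possible_combinations(rule_count)
  2**rule_count - 1
end

# Count distinct rule sets; the order of names within an email is ignored.
def attested_combinations(emails)
  emails.map { |rules| rules.uniq.sort }.uniq.length
end

# Made-up sample: four emails, two of which triggered the same set of rules.
sample_emails = [
  %w[bayes_99 biz_tld],
  %w[biz_tld bayes_99],
  %w[bayes_99],
  %w[hot_nasty bayes_99]
]

puts possible_combinations(RULE_COUNT).to_s.length # prints 121
puts attested_combinations(sample_emails)          # prints 3
```

Since the number of attested combinations can never exceed the number of emails, the ratio between the two figures is bound to be tiny for any corpus that fits on a disk.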

About

This page contains a single entry from the blog posted on September 3, 2007 12:17 AM.
