Spam Exploration with R

I've started using R to analyze the properties of the spam that winds up in my inbox. I was reading Ending Spam and struggling with Harald Baayen's Word Frequency Distributions around the same time, and it got me wondering: what does all of the spam filtered out by SpamAssassin look like?

Fortunately, I save all of my filtered spam. So I wrote a little Perl script that uses SpamAssassin modules to parse individual spam emails from my private reserve and summarize a few basic statistics (number of tokens, rules triggered, spam score, etc.).
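The Perl script itself isn't shown here. As a rough illustration of the kind of summary it produces, here is a minimal R sketch that pulls the score and the number of triggered rules out of the X-Spam-Status header that SpamAssassin adds to a message. This is a swapped-in shortcut, not the SpamAssassin-module approach described above; the header line is made up, the function name parse_spam_status is hypothetical, and the header is assumed to be already unfolded onto a single line.

```r
# Made-up example header; the rule names and numbers are illustrative only.
header <- paste("X-Spam-Status: Yes, score=14.2 required=5.0",
                "tests=BAYES_99,HTML_MESSAGE,MIME_HTML_ONLY",
                "autolearn=no version=3.2.3")

# Hypothetical helper: extract the score and rule count from one header line.
parse_spam_status <- function(line) {
  # The numeric value that follows "score=".
  score <- as.numeric(sub(".*score=(-?[0-9.]+).*", "\\1", line))
  # The comma-separated rule names that follow "tests=".
  tests <- sub(".*tests=([^ ]*).*", "\\1", line)
  rules <- strsplit(tests, ",", fixed = TRUE)[[1]]
  data.frame(score = score, n_rules = length(rules))
}

parse_spam_status(header)
```

Applying a function like this to every saved message and stacking the rows with rbind would give one table with a row per email, which is a convenient shape for plotting in R.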

The first thing I wondered was what the spam scores of these emails were. I also wondered how many rules were triggered to achieve each score. (SpamAssassin has rules that increase an email's spam score when they fire, which usually means matching some particular pattern, such as profanity in the subject header.) Here's a scatterplot with the number of rules triggered on the x-axis and the spam score assigned by SpamAssassin on the y-axis.
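A scatterplot like this takes a single call to plot() in base R. The sketch below is only an illustration: the data are simulated rather than taken from the real spam archive, the column names n_rules and score are assumptions, and the weight range is made up.

```r
set.seed(1)  # make the toy data reproducible

# Toy stand-in for the real summary table: 200 simulated emails.
n_rules <- sample(1:15, 200, replace = TRUE)
# Each score is the sum of that email's rule weights, drawn at random here.
score <- sapply(n_rules, function(k) sum(runif(k, min = 0.1, max = 4)))
spam <- data.frame(n_rules = n_rules, score = score)

# Rules triggered on the x-axis, spam score on the y-axis.
plot(spam$n_rules, spam$score,
     xlab = "Rules triggered",
     ylab = "Spam score",
     main = "Spam score vs. rules triggered")
```

With real data, the data.frame would come from read.csv() or a similar import of the script's output instead of the simulation.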

The relationship isn't a straight line because individual rules carry different weights, so two emails that trigger the same number of rules can end up with different scores. The next thing to look at is how much the spam score varies. A histogram with an overlaid density plot should answer that question...
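In base R, a histogram with a density curve on top can be sketched with hist() and density(). The scores below are placeholder values drawn from an arbitrary distribution so that the example runs on its own; their shape says nothing about what real spam scores look like.

```r
set.seed(2)  # make the placeholder data reproducible

# Placeholder scores; in practice this would be the score column
# of the per-email summary table.
score <- rgamma(500, shape = 4, rate = 0.3)

# freq = FALSE puts the histogram on the density scale,
# so the kernel density curve lines up with the bars.
hist(score, freq = FALSE, breaks = 30,
     xlab = "Spam score",
     main = "Distribution of spam scores")
lines(density(score), lwd = 2)
```

Without freq = FALSE the bars would show raw counts and the density curve would sit almost flat along the bottom of the plot.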

About

This page contains a single entry from the blog posted on August 25, 2007 9:46 AM.

The previous post in this blog was Forgetting the Keywords.

The next post in this blog is It's Spam, but Is it Normal?.
