I've started using R to analyze the properties of the spam that winds up in my inbox. I was reading Ending Spam and struggling with Harald Baayen's Word Frequency Distributions around the same time, and it got me wondering: what does all of the spam filtered out by SpamAssassin look like?
Fortunately, I save all of my filtered spam. So I wrote a little Perl script that uses SpamAssassin modules to parse individual spam emails from my private reserve and summarize a few basic statistics (number of tokens, rules triggered, spam score, etc.).
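For anyone who wants to try something similar without the SpamAssassin modules, here's a rough Python sketch. It is not the Perl script described above. It assumes each saved message already carries an `X-Spam-Status` header of the usual `score=... tests=RULE,RULE,...` form, and the function name `summarize`, the rule names, and the sample message are all made up for illustration.

```python
import re
from email import message_from_string

# Patterns for the score and the comma-separated rule list in the header value.
SCORE_RE = re.compile(r"score=(-?\d+(?:\.\d+)?)")
TESTS_RE = re.compile(r"tests=(.*?)(?=\s+\w+=|$)")


def summarize(raw_message):
    """Return the spam score and triggered rules for one raw email, or None."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return None
    # Folded headers keep their line breaks, so collapse all whitespace first.
    status = " ".join(str(status).split())
    score = SCORE_RE.search(status)
    tests = TESTS_RE.search(status)
    if score is None or tests is None:
        return None
    rules = [name.strip() for name in tests.group(1).split(",")]
    # Assumption: an empty rule list is written as the literal word "none".
    rules = [name for name in rules if name and name != "none"]
    return {"score": float(score.group(1)), "n_rules": len(rules), "rules": rules}


# A made-up message with hypothetical rule names and a folded header line.
example = (
    "Subject: hypothetical test message\n"
    "X-Spam-Status: Yes, score=12.3 required=5.0 tests=RULE_A,RULE_B,\n"
    "\tRULE_C autolearn=no\n"
    "\n"
    "body text\n"
)
print(summarize(example))
```

Running this over every file in a spam folder and writing one row per message would give a table that R can read directly.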
The first thing I wondered was what the spam scores of these emails were, and how many rules were triggered to reach each score. (SpamAssassin has rules that increase an email's spam score when they fire, which usually means matching some particular pattern, such as profanity in the subject header.) Here's a scatterplot with the number of rules triggered on the x-axis and the spam score assigned by SpamAssassin on the y-axis.

The relationship isn't a straight line because the weights of individual rules vary, so two emails that trigger the same number of rules can end up with different scores. The next thing to look at is how much the spam score varies. A histogram with an overlaid density plot should answer that question...
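A quick aside on the scatterplot before moving on: the point about rule weights is easy to see with a toy example. The Python sketch below assumes the score is simply the sum of the weights of the rules that fired. The rule names and weights are invented for illustration and are not real SpamAssassin values.

```python
# Hypothetical rule names and weights, made up for illustration only.
WEIGHTS = {"RULE_A": 0.5, "RULE_B": 1.5, "RULE_C": 3.0, "RULE_D": 4.0}


def spam_score(triggered):
    """Sum the weights of the rules that fired."""
    return sum(WEIGHTS[name] for name in triggered)


# Both toy emails trigger two rules, but the rules carry different weights.
email_1 = ["RULE_A", "RULE_B"]
email_2 = ["RULE_C", "RULE_D"]
print(len(email_1), spam_score(email_1))
print(len(email_2), spam_score(email_2))
```

Both toy emails sit at the same x-value on a plot like the one above but at different heights, which is why the points spread out instead of following a line.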