August 2007 Archives

August 21, 2007

BioShock

Today BioShock hit retail shelves. I dutifully hiked from the Powerset office in SOMA to the Virgin Megastore and picked up a copy. I managed to get in an hour or so of play after work. The verdict? Two thumbs up. It's a good game, with great visuals (love the Art Deco). But there is seriously way too much hype surrounding it. Sure, you can genetically mutate yourself and shoot electricity from your hands, which is sweet. And there are other cool mutations, like the ability to send a swarm of bees shooting out of your hands. But is that revolutionary? Uh, not exactly. Try just about any fantasy game and you have this nifty thing called "magic", which is pretty much what these mutated abilities are (modulo a thin sci-fi veneer). BioShock may be a cut above your average first-person shooter, but revolutionary? Please.

August 23, 2007

R

I'm no math whiz, but I dig probability and statistics. I've been using R to do some data analysis at work and I'm reminded of what a nice tool it is. I wrote a tutorial about graphing in R on freshmeat.net, which describes a few of its graphing capabilities. It really only scratches the surface. If you want a more in-depth introduction, the must-have Christmas stocking stuffer is Harald Baayen's forthcoming book Analyzing Linguistic Data: A Practical Introduction to Statistics using R. For a good illustration of the fun you can have with R, check out my Powerset pal Brendan O'Connor's analysis of country names (with graphs produced in R). It got a shout-out on Language Log!

August 24, 2007

Forgetting the Keywords

I just came across a nice quote from the Taoist philosopher Zhuangzi: "Words exist because of meaning; once you've gotten the meaning, you can forget the words." With a minor twist, it sums up the mantra of semantic search: "Words exist because of meaning; once you've gotten the meaning, you can forget the keywords."

August 25, 2007

Spam Exploration with R

I've started using R to analyze the properties of the spam that winds up in my inbox. I was reading Ending Spam and struggling with Harald Baayen's Word Frequency Distributions around the same time, and it got me wondering. What does all of the spam filtered out by SpamAssassin look like?

Fortunately, I save all of my filtered spam. So I wrote a little Perl script that uses SpamAssassin modules to parse individual spam emails from my private reserve and summarize a few basic statistics (number of tokens, rules triggered, spam score, etc.).
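The original script was Perl built on SpamAssassin's own modules and isn't reproduced here. As a rough stand-in, here is a Python sketch that pulls the same kind of summary out of the `X-Spam-Status` header SpamAssassin adds to scanned mail. The sample message, the function name `summarize`, and the whitespace-based token count are all assumptions for illustration. Older SpamAssassin versions write `hits=` where newer ones write `score=`, so the pattern accepts both.

```python
import email
import re

# Hypothetical sample message. The header follows SpamAssassin's usual
# "Yes, score=... required=... tests=RULE_A,RULE_B ..." layout, folded
# onto a second line the way long headers often are.
SAMPLE = """\
From: someone@example.com
Subject: Cheap pills
X-Spam-Status: Yes, score=12.4 required=5.0 tests=BAYES_99,HTML_MESSAGE,
\tSUBJ_ALL_CAPS autolearn=no version=3.2.3

Buy now and save big
"""

def summarize(raw_message):
    """Return spam score, rules triggered, and a crude token count."""
    msg = email.message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # Undo header folding: drop any whitespace that follows a comma.
    status = re.sub(r",\s+", ",", status)

    score_match = re.search(r"(?:score|hits)=(-?\d+(?:\.\d+)?)", status)
    tests_match = re.search(r"tests=(\S*)", status)

    score = float(score_match.group(1)) if score_match else None
    rules = [r for r in tests_match.group(1).split(",") if r] if tests_match else []

    # Assumption: single-part plain-text body, tokens split on whitespace.
    body = "" if msg.is_multipart() else msg.get_payload()
    return {
        "score": score,
        "n_rules": len(rules),
        "rules": rules,
        "n_tokens": len(body.split()),
    }

print(summarize(SAMPLE))
```

Running this over a directory of saved spam and writing one row per email would give a table that R can read directly.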

The first thing I wondered is what the spam scores of these emails were. I also wondered how many rules were triggered to achieve that spam score. (SpamAssassin has rules that increase an email's spam score when they fire, which usually means matching some particular pattern, such as profanity in the subject header.) Here's a scatterplot with the number of rules triggered on the x-axis and the spam score assigned by SpamAssassin on the y-axis.

The relationship isn't a straight line because the weight of individual rules varies. The next thing to look at is how much the spam score varies. A histogram with an overlaid density plot should answer that question...
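On the point about rule weights: an email's score is the sum of the scores of the rules that fired, so two emails can trigger the same number of rules and still land far apart. A minimal Python sketch follows. The rule names look like SpamAssassin's, but the weights are invented for illustration.

```python
# Hypothetical rule weights (made-up numbers, not SpamAssassin's real scores).
WEIGHTS = {
    "BAYES_99": 3.5,
    "HTML_MESSAGE": 0.1,
    "SUBJ_ALL_CAPS": 1.5,
    "MISSING_DATE": 1.0,
}

def spam_score(rules_fired):
    """Total score is the sum of the weights of the rules that fired."""
    return sum(WEIGHTS[rule] for rule in rules_fired)

email_a = ["BAYES_99", "SUBJ_ALL_CAPS"]     # two heavy rules
email_b = ["HTML_MESSAGE", "MISSING_DATE"]  # two light rules

# Same x value (rules triggered), very different y value (score).
print(len(email_a), spam_score(email_a))
print(len(email_b), spam_score(email_b))
```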

August 29, 2007

It's Spam, but Is it Normal?

After looking at the relationship between spam rules and spam scores in my private spam collection, I decided it was time to look at how spam rules and spam scores are distributed.

The first thing I did was create a histogram of spam scores. It looks a bit like the normal distribution, but it's hard to tell: only emails that score above the spam threshold end up in the collection, so the lower part of the distribution is missing.
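The effect of the threshold can be simulated. The Python sketch below draws made-up scores from a normal distribution, keeps only those at or above a cutoff, and prints a text histogram of what survives. The cutoff of 5.0, the mean, the spread, and the sample size are arbitrary assumptions, not figures from the spam collection.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the run is repeatable

THRESHOLD = 5.0  # assumed spam cutoff for this illustration

# Made-up scores: normal with mean 8 and standard deviation 4.
all_scores = [random.gauss(8.0, 4.0) for _ in range(5000)]
# A spam folder only ever holds the part at or above the threshold.
kept = [s for s in all_scores if s >= THRESHOLD]

def text_histogram(values, width=2.0):
    """Count values per bin of the given width, keyed by the bin's lower edge."""
    return Counter(width * (v // width) for v in values)

hist = text_histogram(kept)
for edge in sorted(hist):
    # One '#' per 25 emails in the bin.
    print(f"{edge:5.1f} {'#' * (hist[edge] // 25)}")
```

The left tail never shows up in the output, which is why a bell shape is hard to confirm or rule out from the filtered emails alone.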

But what about the number of rules triggered? Does it conform to the normal distribution? Again, it's close, but it is somewhat skewed to the right.

But how much does it depart from the normal distribution? And just what kind of distribution is it, anyhow?
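One common way to put a number on that departure is skewness, the third standardized moment: roughly zero for symmetric data such as a normal sample, positive when the long tail is on the right. Here is a small Python sketch. The rule counts are invented for illustration and are not taken from the spam collection.

```python
def skewness(values):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical rules-triggered counts.
symmetric = [3, 4, 5, 6, 7]
right_skewed = [3, 4, 4, 5, 5, 5, 6, 9, 14]

print(skewness(symmetric))     # symmetric data: skewness of zero
print(skewness(right_skewed))  # long right tail: positive skewness
```

Skewness alone doesn't name the distribution, but it does say how lopsided the counts are compared with a bell curve.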

About August 2007

This page contains all entries posted to Nerd Industries: Stuart Robinson's blog in August 2007. They are listed from oldest to newest.
