
The Power of Parsing: Why a Bag of Words is Nice but a Tree of Words is Nicer

When Powerset was a media darling and getting lots of attention in the mainstream press, I set up a standing search for the keyword "Powerset" on Google News. Since I'm a bit of a news junkie, I regularly check the results. The original results were pretty good, but as Powerset has fallen from the limelight, I've noticed a gradual decrease in quality owing to the expansion of "Powerset" into "power set". What's interesting is that the keyword results for "power set" are typically false positives in which "set" is in fact a verb, as can be seen in the screenshot below.

This is an excellent example of where semantic search offers an improvement in precision over simple keywordese (which treats text as a bag or a list of words). A decent parser ought to be able to detect that "set" in these examples functions as a verb. Assuming that the query is analyzed as a noun (and that this analysis is respected when the query is expanded to "power set"), results where "set" is a verb should either fail to match or at least be ranked significantly lower. A bag of words is nice, but a tree of words is even nicer.
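To make the contrast concrete, here is a minimal Python sketch. It is not how Powerset or Google News actually works; the headlines, the part-of-speech tags, and the function names are all made up for illustration, and the tags are written by hand where a real system would get them from a parser. It also only compares part-of-speech tags, which is a much weaker signal than a full parse tree.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hand-tagged toy headlines (hypothetical data; a parser would supply the tags).
DOCS: Dict[str, List[Token]] = {
    "noun_use": [("the", "DET"), ("power", "NOUN"), ("set", "NOUN"),
                 ("of", "ADP"), ("a", "DET"), ("set", "NOUN")],
    "verb_use": [("nuclear", "ADJ"), ("power", "NOUN"), ("set", "VERB"),
                 ("to", "PART"), ("expand", "VERB")],
}

# The query "Powerset" expanded to "power set", with both words analyzed as nouns.
QUERY: List[Token] = [("power", "NOUN"), ("set", "NOUN")]


def bag_of_words_match(query: List[Token], doc: List[Token]) -> bool:
    """Match on words alone, ignoring tags and word order."""
    doc_words = {word.lower() for word, _ in doc}
    return all(word.lower() in doc_words for word, _ in query)


def tagged_match(query: List[Token], doc: List[Token]) -> bool:
    """Match only when each query word occurs with the same tag in the document."""
    doc_tokens = {(word.lower(), tag) for word, tag in doc}
    return all((word.lower(), tag) in doc_tokens for word, tag in query)


if __name__ == "__main__":
    for name, doc in DOCS.items():
        print(name, bag_of_words_match(QUERY, doc), tagged_match(QUERY, doc))
```

The bag-of-words matcher accepts both headlines, while the tag-aware matcher rejects the one where "set" is a verb. A ranking system could just as well demote such results as drop them outright.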

About

This page contains a single entry from the blog posted on January 10, 2009 2:55 PM.
