
Symbolic versus Statistical Approaches Toward Semantic Search

On Google Watch, Clint Boulton notes that Eric Schmidt was recently quoted as saying:

"Wouldn't it be nice if Google understood the meaning of your phrase rather than just the words that are in the phrase? We've made a lot of discoveries in that area that are going to roll out in the next little while."

From this comment (and some recently noted changes in search results), Boulton infers that Google will be embracing semantic search:

"Schmidt is not talking about universal search, which draws together all Web elements -- text, blogs, video, etc. -- and renders them on a page. Schmidt is alluding to smarter search -- semantic search, which uses XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results."

I think Boulton may be overinterpreting Schmidt's comment (as are others following suit, for example, Search Engine Land or About.com's Web Trends). It seems to me that Schmidt is really just acknowledging the obvious: that in order to provide better search results, Google needs to go beyond simple keyword analysis. Google already does this to some extent, by using n-grams, for example, to analyze keyword co-occurrences. (It's not a terribly sophisticated language model, but it's a language model nonetheless.)
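
To make the n-gram idea concrete, here is a minimal sketch of counting bigram (two-word) co-occurrences in Python. The toy corpus and the function name bigram_counts are my own inventions for illustration; I'm not suggesting this is how Google does it.

```python
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent word pairs (bigrams) across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair each word with the word that follows it.
        counts.update(zip(words, words[1:]))
    return counts

# Toy corpus, invented for illustration only.
corpus = [
    "jaguar speed in the wild",
    "jaguar car dealer",
    "used jaguar car prices",
]
counts = bigram_counts(corpus)
```

Even in a corpus this small, "jaguar car" turns up more often than "jaguar speed", which is the kind of signal a simple n-gram model has to work with.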

The real issue isn't whether Google is going to embrace semantic search, but how. I don't think there's any question that search results would be improved by knowing which meaning of a particular term or phrase is intended by the user when he or she types a query. But it's a leap to conclude that interest in semantics amounts to an endorsement of semantic search along the lines envisioned by Tim Berners-Lee, which I assume is what Boulton had in mind when he mentioned the use of "XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results".

There are other ways of getting to semantics. One big issue is whether it should be done with statistical analysis of co-occurrence frequencies in large text collections (the n-gram route) or by hard-wiring knowledge of semantics using a symbolic system (the traditional AI route). In other words, the $64,000 question is whether the path to better analysis of semantics will be more statistical or symbolic. I say "more" of one or the other, because it isn't an all-or-none affair. Hybrid systems are the norm nowadays.
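
Here is a rough sketch of the contrast, using the task of guessing which sense of "jaguar" a query intends. Everything in it is hypothetical and deliberately tiny: the hand-written cue words in SENSE_RULES stand in for a symbolic knowledge base, the labeled examples in LABELED stand in for a large text collection, and the hybrid function shows just one of many possible ways to combine the two.

```python
from collections import Counter

# Symbolic route: hand-written knowledge linking cue words to senses.
# These rules are made up for illustration.
SENSE_RULES = {
    "animal": {"wild", "habitat", "prey"},
    "car": {"dealer", "engine", "prices"},
}

def symbolic_sense(query):
    """Return the first sense whose cue words appear in the query."""
    words = set(query.lower().split())
    for sense, cues in SENSE_RULES.items():
        if words & cues:
            return sense
    return None  # no rule fired

# Statistical route: learn word counts per sense from labeled text.
# The examples are invented; a real system would use far more data.
LABELED = [
    ("animal", "the jaguar hunts prey in the rainforest"),
    ("animal", "a jaguar is a large cat"),
    ("car", "the jaguar dealer sells a fast car"),
    ("car", "jaguar engine repair costs"),
]

def train(labeled):
    """Build a word-frequency table for each sense."""
    model = {}
    for sense, text in labeled:
        model.setdefault(sense, Counter()).update(text.lower().split())
    return model

def statistical_sense(query, model):
    """Pick the sense whose training text shares the most words with the query."""
    words = query.lower().split()
    scores = {sense: sum(c[w] for w in words) for sense, c in model.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def hybrid_sense(query, model):
    """Trust a hand-written rule when one fires; otherwise fall back to counts."""
    return symbolic_sense(query) or statistical_sense(query, model)

model = train(LABELED)
```

With these toy inputs, the symbolic rules have nothing to say about "jaguar cat" because nobody wrote a rule for "cat", while the counts still lean toward the animal sense. That gap is one reason hybrid designs are attractive.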

The debate over the relative merits of statistical and symbolic systems for NLP has been going on for a while, so it's surprising that there isn't wider recognition of the issue among bloggers writing about semantic search. For example, the following book on the topic was published 13 years ago: The Balancing Act: Combining Symbolic and Statistical Approaches to Language. It's been on my shelves for years, but I only recently realized that my colleague at Powerset, Ron Kaplan, wrote the endorsement of the book that appears on its MIT Press homepage.


About

This page contains a single entry from the blog posted on January 24, 2009 1:32 AM.
