January 2009 Archives

January 7, 2009

Google's foray into semantic search

Thanks to a blog post on ReadWriteWeb, there has recently been some scuttlebutt about Google dipping into the waters of semantic search. Marshall Kirkpatrick writes:

"In what appears to us to be a new addition to many Google search results pages, queries about birth dates, family connections and other information are now being responded to with explicitly semantic structured information."

Following up on this blog post, Daniel Nations notes some pretty interesting results: Google claims Jesus born in 1963. Nations goes on to say:

"[...] it seems we are seeing a good example of the downside of the semantic web. Whether it is an algorithm trying to pull the information out of an unoptimized web page or semantic engine pulling the information from a categorized page, we run into the same basic problem that we run into with many Web 2.0 sites: Can we trust the information?"

While the issue of information trustworthiness is interesting, it's not a problem unique to semantic search. Similar issues arise in keyword search. How can you trust the keywords on a page? Webmasters have been inserting misleading keywords into their web pages to "fool" keyword search engines for years. In fact, much of search engine optimization is the art of gaming the system. The reason that the problem seems more acute in the case of semantic search is that we have higher standards and expect authoritative answers to our questions.

But that expectation is questionable. Semantic search isn't synonymous with question answering. In fact, semantic search doesn't have to answer questions at all. Why? Because a semantic search engine doesn't have to tell us definitively the answer to our questions. It only needs to find the web pages that most directly address the question asked, and then leave it to the user to decide whether the information provided is "true". This is where the social web meets the semantic web. Web 2.0, meet Web 3.0.

January 8, 2009

Tezuka's Buddha

I recently picked up the second volume of Osamu Tezuka's epic manga retelling of the life of Buddha. I knew of Tezuka's work from my discovery of Astro Boy in high school. But I had no idea he had written something this ambitious. Given that there are 8 volumes in total, I should have plenty of reading material whenever I need a break from dissertation writing during my upcoming month-long hiatus from work.

January 10, 2009

The Power of Parsing: Why a Bag of Words is Nice but a Tree of Words is Nicer

When Powerset was a media darling and getting lots of attention in the mainstream press, I set up a standing search for the keyword "Powerset" on Google News. Since I'm a bit of a news junkie, I regularly check the results. The original results were pretty good, but as Powerset has fallen from the limelight, I've noticed a gradual decrease in quality owing to an expansion of "Powerset" as "power set". What's interesting is that the keyword results for "power set" typically are false positives where "set" is in fact a verb, as can be seen in the screenshot below.

This is an excellent example of where semantic search offers an improvement in precision over simple keywordese (which treats text as a bag or a list of words). A decent parser ought to be able to detect that "set" in these examples functions as a verb. Assuming that the query is analyzed as a noun (and that the analysis is respected in its expansion as "power set"), results where "set" is a verb should either fail to match or at least be ranked significantly lower. A bag of words is nice, but a tree of words is even nicer.
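To make the contrast concrete, here is a minimal Python sketch. It assumes a parser has already analyzed each text; a real parser would produce a full tree, and the part-of-speech tag on each word is the one piece of that analysis used here. The tags are written by hand for illustration, and the two example texts are invented rather than taken from the actual search results.

```python
# Toy illustration: bag-of-words matching vs. matching that respects part of speech.
# The (word, tag) pairs are hand-written stand-ins for real parser output.

QUERY = [("power", "NOUN"), ("set", "NOUN")]

DOCS = {
    # "set" is a noun here, as in the mathematical sense of "power set".
    "math": [("the", "DET"), ("power", "NOUN"), ("set", "NOUN"), ("of", "ADP"),
             ("a", "DET"), ("finite", "ADJ"), ("set", "NOUN")],
    # "set" is a verb here, as in a headline like "wind power set to expand".
    "news": [("wind", "NOUN"), ("power", "NOUN"), ("set", "VERB"), ("to", "PART"),
             ("expand", "VERB")],
}


def bag_of_words_match(query, doc):
    """True if every query word occurs somewhere in the document, ignoring tags."""
    doc_words = {word for word, _ in doc}
    return all(word in doc_words for word, _ in query)


def tagged_phrase_match(query, doc):
    """True if the query occurs as a contiguous phrase with identical tags."""
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))


bow_hits = [name for name, doc in DOCS.items() if bag_of_words_match(QUERY, doc)]
tagged_hits = [name for name, doc in DOCS.items() if tagged_phrase_match(QUERY, doc)]

print("bag of words:", bow_hits)
print("tag-aware:   ", tagged_hits)
```

The bag-of-words matcher accepts both texts, while the tag-aware matcher rejects the one where "set" is a verb. A ranking system could demote such a result instead of dropping it outright.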

January 24, 2009

Symbolic versus Statistical Approaches Toward Semantic Search

On Google Watch, Clint Boulton notes that Eric Schmidt was recently quoted as saying:

"Wouldn't it be nice if Google understood the meaning of your phrase rather than just the words that are in the phrase? We've made a lot of discoveries in that area that are going to roll out in the next little while."

From this comment (and some recently noted changes in search results), Boulton infers that Google will be embracing semantic search:

"Schmidt is not talking about universal search, which draws together all Web elements -- text, blogs, video, etc. -- and renders them on a page. Schmidt is alluding to smarter search -- semantic search, which uses XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results."

I think Boulton may be overinterpreting Schmidt's comment (as are others following suit, for example, Search Engine Land or About.com's Web Trends). It seems to me that Schmidt is really just acknowledging the obvious: that in order to provide better search results, Google needs to go beyond simple keyword analysis. Google already does this to some extent, by using n-grams, for example, to analyze keyword co-occurrences. (It's not a terribly sophisticated language model, but it's a language model nonetheless.)
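For readers who haven't met n-grams, the idea can be sketched in a few lines of Python. This is only an illustration of counting adjacent word pairs (bigrams); the three-sentence corpus is invented, and nothing here is meant to describe how Google actually does it.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count how often each pair of adjacent words occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # zip pairs each token with its right-hand neighbor.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Invented toy corpus, for illustration only.
corpus = [
    "new york is a big city",
    "she moved to new york",
    "a new car",
]

counts = bigram_counts(corpus)
print(counts[("new", "york")], counts[("new", "car")])
```

Even on this tiny corpus, "new" is followed by "york" more often than by "car", which is the kind of co-occurrence evidence a statistical system relies on.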

The real issue isn't whether Google is going to embrace semantic search, but how. I don't think there's any question that search results would be improved by knowing which meaning of a particular term or phrase is intended by the user when he or she types a query. But it's a leap to conclude that interest in semantics amounts to an endorsement of semantic search along the lines envisioned by Tim Berners-Lee, which I assume is what Boulton had in mind when he mentioned the use of "XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results".

There are other ways of getting to semantics. One big issue is whether it should be done with statistical analysis of co-occurrence frequencies in large text collections (the n-gram route) or by hard-wiring knowledge of semantics using a symbolic system (the traditional AI route). In other words, the $64,000 question is whether the path to better analysis of semantics will be more statistical or symbolic. I say "more" of one or the other, because it isn't an all-or-none affair. Hybrid systems are the norm nowadays.
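As a rough picture of what "hybrid" can mean, here is a small Python sketch of word sense disambiguation for "bank". It is one hypothetical arrangement among many: hand-written rules (the symbolic part) get the first say, and co-occurrence counts (the statistical part) decide when no rule applies. The rules, sense labels, and counts are all invented for the example.

```python
from collections import Counter

# Symbolic component: hand-written rules mapping a cue word to a sense of "bank".
RULES = {"river": "bank/shore", "loan": "bank/finance"}

# Statistical component: how often each context word was seen with each sense.
# These numbers are invented; a real system would estimate them from a corpus.
CONTEXT_COUNTS = {
    "bank/shore": Counter({"water": 8, "fishing": 5, "money": 1}),
    "bank/finance": Counter({"money": 9, "account": 7, "water": 1}),
}


def disambiguate(context_words):
    """Return (sense, method) for "bank" given the words around it."""
    # 1. Symbolic pass: the first context word covered by a rule decides.
    for word in context_words:
        if word in RULES:
            return RULES[word], "symbolic"
    # 2. Statistical fallback: pick the sense with the highest total count.
    #    Counter returns 0 for words it has never seen.
    scores = {
        sense: sum(counts[word] for word in context_words)
        for sense, counts in CONTEXT_COUNTS.items()
    }
    return max(scores, key=scores.get), "statistical"


print(disambiguate(["river", "money"]))
print(disambiguate(["money", "account"]))
```

Which component gets priority is itself a design decision; a system leaning "more statistical" might instead treat the rules as one more feature to be weighted.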

The debate over the relative merits of statistical and symbolic systems for NLP has been going on for a while, so it's surprising that there isn't wider recognition of the issue among bloggers writing about semantic search. For example, the following book on the topic was published 13 years ago: The Balancing Act: Combining Symbolic and Statistical Approaches to Language. It's been on my shelves for years, but I only recently realized that my colleague at Powerset, Ron Kaplan, wrote the endorsement of the book that appears on its MIT Press homepage.

About January 2009

This page contains all entries posted to Nerd Industries: Stuart Robinson's blog in January 2009. They are listed from oldest to newest.
