Wolfram|Alpha: A New Kind of Search Engine

In case you haven't heard about it, Wolfram|Alpha is a service that recently launched amid a great deal of "Google killer" hype (e.g., Nova Spivack breathlessly claiming that it could be as important as Google). Since I work for Powerset (now part of LiveSearch), another search engine that was briefly in the limelight as a Google killer, I know how desperate the media is to tell a good David and Goliath story, and the degree to which such a slant can distort the coverage. So I tried to approach Wolfram|Alpha with an open mind.

One thing that intrigued me when I read some of the initial press was that Stephen Wolfram (after whom the system is named) considered the natural language capabilities of the system to be one of its chief innovations. In a pre-announcement blog posting, Wolfram wrote: "But I’m happy to say that with a mixture of many clever algorithms and heuristics, lots of linguistic discovery and linguistic curation, and what probably amount to some serious theoretical breakthroughs, we’re actually managing to make it work." I'm not sure if Wolfram meant to claim that those theoretical breakthroughs were in natural language processing (NLP), but it certainly appeared so. However, some of the claims made concerning the NLP capabilities of the system are baffling. For example, on the Wolfram|Alpha blog, it says: "As of now, Wolfram|Alpha contains 10+ trillion of pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains." I have no idea what "linguistic capabilities for 1000+ domains" means.

Once I was able to give the system a test run, I found little evidence of any serious NLP breakthroughs. The NLP is brittle and the coverage is spotty, which is the norm given the state of the art. To give an example, if you want the top speed of something (say, a cheetah), you can use the query "how fast is a cheetah?" and get back results that clarify the interpretation of the query and provide information about speed, in various units of measurement, with various comparisons (like the cutesy comparison to the speed of the DeLorean in Back to the Future).

A robust natural language processing system should be able to handle minor linguistic variation in synonymous queries. But while some expected variants provide the same results ("cheetah speed", "cheetah top speed"), others produce the "I'm sorry, Dave" response ("speed of cheetahs", "how fast does a cheetah run?"). And the variant "top cheetah speed" gets interpreted strangely, with a ham-fisted query refinement mechanism.

The natural language interface is only one part of the system, though. How does it fare otherwise? As various bloggers have pointed out, the system seems to have a case of Asperger's Syndrome. It knows a lot about technical minutiae but doesn't seem to know anything about the social universe. Even in the realm of science and technology, some gaps are surprising. As Daniel Tunkelang pointed out on Twitter, Wolfram|Alpha will give you the largest known prime number, but it won't give you the smallest. I'm sure that other examples could be adduced, but that probably amounts to kvetching (not unprecedented in the blogosphere, but not really fair). The truth is, I've enjoyed poking around in the system, discovering the many domains that it does cover. But the real issue isn't how much data it contains now, but how much it will eventually contain, and how it will get there.

As Doug Lenat (one of the people behind Cyc) pointed out in his write-up of a sneak peek of the system, the data is all hand-curated: "In a small number of cases, he also connects via API to third party information, but mostly for realtime data such as a current stock price or current temperature. Rather than connecting to and relying on the current or future Semantic Web, Alpha computes its answers primarily from his own curated data to the extent possible; he sees Alpha as the home for almost all the information it needs, and will use to answer users' queries."

The fact that nearly all of the data in Wolfram|Alpha is hand-curated is a major weakness of the system. In fact, I'd say it is its proverbial Achilles' heel, given that it will be very difficult for the system to scale. If the system has a long-term future, it will need to update its contents as the world changes. (Larger prime numbers will no doubt be discovered.) And to gain traction it will probably need to expand its coverage to topics of greater popular interest (movies, sports results, famous people). But if expanding the knowledge base requires data entry by domain experts, progress will be slow or prohibitively expensive. This is a well-known problem, which has been attacked from different angles. Probably the most ambitious is Freebase, which provides a huge database of sundry facts that can be modified or augmented by users, making it the Wikipedia of structured data.

But the holy grail here isn't hand curation of data (by a team of salaried employees or a distributed community of users). It's automatic extraction of structured data from the web via text mining, and we're still a long way from achieving a robust solution. As far as I can tell, Wolfram|Alpha hasn't contributed to that goal at all, though there are companies that are working on the goal—for example, Evri mines the web and builds up a structured database of named entities (mostly people, but also some places and things).
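To make the idea of mining structured data from text a little more concrete, here is a toy sketch in Python of the simplest possible approach: a hand-written pattern that turns a sentence into an (entity, attribute, value, unit) record. Everything in it is invented for illustration, including the `Fact` type, the `TOP_SPEED` pattern, the `extract_top_speed` function, and the example sentences; it does not reflect how Evri, Freebase, or Wolfram|Alpha actually work.

```python
import re
from typing import NamedTuple, Optional


class Fact(NamedTuple):
    """A single structured record (hypothetical schema for this sketch)."""
    entity: str
    attribute: str
    value: float
    unit: str


# Hypothetical hand-written pattern. It only recognizes sentences of the form
# "<article> <entity> <speed phrase> <number> <unit>".
TOP_SPEED = re.compile(
    r"\b(?:The|A|An)\s+(?P<entity>[a-z][a-z ]*?)\s+"
    r"(?:has a top speed of|can reach speeds of up to|can run at up to)\s+"
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mph|km/h)",
    re.IGNORECASE,
)


def extract_top_speed(sentence: str) -> Optional[Fact]:
    """Return a Fact if the sentence matches the pattern, otherwise None."""
    match = TOP_SPEED.search(sentence)
    if match is None:
        return None
    return Fact(
        entity=match.group("entity").lower(),
        attribute="top speed",
        value=float(match.group("value")),
        unit=match.group("unit").lower(),
    )


if __name__ == "__main__":
    # Made-up example sentences, not sourced data.
    sentences = [
        "The cheetah has a top speed of 70 mph.",
        "A cheetah can run at up to 112 km/h.",
        "Cheetahs top out at around 70 mph.",  # paraphrase the pattern misses
    ]
    for sentence in sentences:
        print(sentence, "->", extract_top_speed(sentence))
```

The third sentence says the same thing as the first but yields no record, because nothing in the pattern anticipates that phrasing. Covering real-world text this way means writing and maintaining patterns indefinitely, which is hand curation by another name. A robust solution would have to generalize across such paraphrases.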

Despite my various criticisms, I think Wolfram|Alpha is worthwhile. It doesn't live up to the hype, but it's rare that anything actually does. And even if it is more of a geeky toy that flatters Wolfram's ego than a commercially viable service, I do think it represents an inflection point in web search. We're moving away from the paradigm that has dominated search for more than a decade: the list of blue links, a SERP consisting of a list of documents that match the keywords in your query, in order of descending relevance. Increasingly, users want more than just a web page. They want a fact, or an answer, or a graph, which means they need a new kind of search engine or, to use Wolfram's terminology, a "computational knowledge engine".

About

This page contains a single entry from the blog posted on May 19, 2009 9:25 AM.
