February 17, 2010

Fighting the Philosophical Impulse

For over a year now the book Ontological Semantics has been sitting on my bookshelf. I finally started reading it recently after coming across a review by John F. Sowa (originally published in Computational Linguistics). Unfortunately, I wasn't very far into the book when I came across the following sentence:

"the science of language has been largely ignored by philosophers"

This statement blew my mind. It simply beggars belief. Incorrect statements about factual matters are one thing. But this is something different. It is a sweepingly broad statement that seems completely unaware of its own overreach. The authors of the book seem to acknowledge, at least tacitly, that the statement may need to be qualified, since they do footnote the sentence, noting:

Of course, language itself has not been ignored by philosophy: much of the philosophy of the twentieth century has been interested precisely in language. Under different historical circumstances, the philosophy of language would have probably come to be known as a strain of linguistics and/or logic rather than a movement in philosophy.

But it's hard to reconcile the strength of their claim with their footnote, since the footnote basically acknowledges that there is an extensive literature on the philosophy of language and on the philosophy of science, but seems to hold that there is little in the intersection of the two areas. This is simply untrue.

In the heyday of generative grammar, philosophers paid considerable attention to language and the science of meaning. For example, one of Chomsky's most famous works is a philosophically oriented review of Skinner's book Verbal Behavior (available here). And John Searle wrote about Chomsky's linguistic revolution in the New York Review of Books in 1972 (available here). There is no shortage of ink spilled on the topic of scientific revolutions and the development of generative grammar. For example, in 1976 Language published an analysis of the history of linguistics in Kuhnian terms by W. Keith Percival (available here).

Not too surprisingly, because the authors are unaware of this literature, they mistakenly imagine that they are making original contributions to the field, as John F. Sowa points out in his review. Concerning the four components of a scientific theory that they propose, he writes:

Under various names and with varying definitions, similar components are present in most theories about theories. The authors' claims of novelty in proposing them "surprisingly, for the first time in the philosophy of science" are overstated.

Paul Feyerabend, one of the more famous philosophers of science (if fame is a word that can be rightfully applied to academic philosophers), has written extensively on the relationship between science and philosophy, often taking a rather dim view of the contributions of the latter. In "Materialism and the Mind-Body Problem" (a locus classicus for eliminative materialism), Feyerabend advises against letting philosophical objections curtail the development of detailed scientific theories concerning phenomena of interest:

It occurs only too often that attempts to arrive at a coherent picture of the world are held up by philosophical bickering and are perhaps even given up before they can show their merits. It seems to me those who originate such attempts ought to be a little less afraid of difficulties; that they ought to look through the arguments which are presented against them; and that they ought to recognise their irrelevance. Having disregarded irrelevant objections they ought then to proceed to the much more rewarding task of developing their point of view in detail, to examine its fruitfulness and thereby to get fresh insight, not only into some generalities, but into very concrete and detailed processes. To encourage such development from the abstract to the concrete, to contribute to the invention of further ideas, this is the proper task of a philosophy which aspires to be more than a hindrance to progress.

The authors of Ontological Semantics would do well to heed Feyerabend's advice. Although their excursion into philosophy is unrewarding, it is also fairly harmless. It can be largely ignored without materially affecting the remainder of the book. My advice to the authors is to fight the philosophical impulse and stick to the engineering. To again quote Sowa's review:

Despite the historical and philosophical inaccuracies, this is a valuable textbook on computational linguistics. Its greatest strength is its engineering contribution, and its greatest weakness is the constant bickering with linguists and logicians who study different aspects of the rich and complex subject of language. Humans and machines require both logical and lexical processing for language understanding, and the authors could better inform students by showing what their approach does best than by trying to limit the range of topics linguists are allowed to explore.

January 29, 2010

NLP and the Semantic Web

Last night Powerset hosted the SF Semantic Web Meetup, organized by Marco Neumann, where we gave a talk about NLP and the Semantic Web. I presented some work in progress on instant answers (slides) and my colleague Scott Waterman presented ongoing work by our group (Text Processing for Semantic Applications) on triples extraction using our natural language pipeline. In addition, Bill Flitter from Dlvr.it talked about the publishing industry in the emerging world of the realtime web.

We enjoyed giving our presentations and appreciated the feedback that we received. One of the questions that came up during the question time after my talk was one that I frequently encounter, which is why finite state transducers aren't just regular expressions. What I normally say, but perhaps failed to say clearly enough on this occasion, is that finite state transducers are regular expressions, but they do a lot more. As my former colleague Brendan O'Connor used to say, "Finite state transducers are regular expressions on steroids!" I think that's right, and at some point I should write up a more detailed and technical explanation of what that means.
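In the meantime, here's a minimal sketch of the distinction (toy code, not a real FST library): a regular expression can only accept or reject a string, while a finite state transducer walks the same kind of state machine but also emits an output symbol on each transition, so it rewrites strings in addition to recognizing them.

```python
import re

# A regex can only accept or reject a string.
assert re.fullmatch(r"[ab]+", "abba") is not None

# A toy finite state transducer: transitions map
# (state, input symbol) -> (next state, output symbol).
TRANSITIONS = {
    ("S", "a"): ("S", "b"),   # rewrite every 'a' as 'b'
    ("S", "b"): ("S", "b"),   # pass 'b' through unchanged
}
FINAL_STATES = {"S"}

def transduce(s, start="S"):
    state, out = start, []
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return None        # reject: no transition defined
        state, emitted = TRANSITIONS[(state, ch)]
        out.append(emitted)
    return "".join(out) if state in FINAL_STATES else None

print(transduce("abba"))  # -> "bbbb"
print(transduce("abc"))   # -> None (rejected, just like a failed regex match)
```

The acceptance behavior is exactly that of the corresponding regex; the output tape is what the regex doesn't have.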

January 2, 2010

Google and Net Neutrality

It's been a while (roughly 6 months) since I last posted something on this blog. In fact, I've been so remiss that I made it one of my new year's resolutions to post a bit more regularly. Towards that end, I thought I'd comment on an op-ed about Google that recently appeared in the New York Times: "Search, but You May Not Find".

It's written by Adam Raff, who apparently has an axe to grind with Google because his company (Foundem) developed a vertical search engine that didn't show up highly enough in Google search results to be sufficiently visible to compete with Google. I'm not making this up. Here are his own words:

"One way that Google exploits this control is by imposing covert “penalties” that can strike legitimate and useful Web sites, removing them entirely from its search results or placing them so far down the rankings that they will in all likelihood never be found. For three years, my company’s vertical search and price-comparison site, Foundem, was effectively “disappeared” from the Internet in this way."

It beggars belief that this op-ed piece was even published, given its relentless self-advertising based on unverifiable claims--e.g., "Google’s treatment of Foundem stifled our growth and constrained the development of our innovative search technology." The inherent bias in the article is pretty hard to miss.

But let's ignore Raff's obvious bias for a moment and consider his argument on its merits. His argument is basically that net neutrality doesn't go far enough in regulating the internet. Search engines should be regulated as much as internet service providers are, because both are the gateway to the web.

The flaws in his argument are legion. The biggest one is the unanalyzed assumption that "net neutrality" is a good thing. (That's hotly debated.) He doesn't even say what he understands the term to mean (it's not self-explanatory) or argue for its merits. Then there's the lumping together of internet service providers and search engines. That's also a whopper. (Without the former, you're locked out of the internet. Without the latter, you'll just have a hard time finding what you're looking for.) And then there's the failure to recognize that there's an alternative to regulation--namely, healthy competition. And that's not going to come from government regulation. (I'm not a Libertarian, and I don't believe in the unfettered free market, but that doesn't mean that I don't see the merits of competition over regulation.)

I expect better from The New York Times. (Apparently, I'm not alone, judging from other blog responses.)

July 24, 2009

Cyberling 2009

Last week I attended Cyberling 2009, "a workshop exploring how computational methods can enhance traditional linguistic inquiry". The workshop was organized around panels on different topics, ranging from tools to funding models. I co-chaired (along with Mary Beckman) the panel on "Annotation Standards". Overall, it was fun, if for no other reason than that I had the opportunity to meet a few people whom I knew by reputation but not by personal acquaintance (e.g., Mark Liberman). But I fear that it might have been a case of preaching to the converted. I know from personal experience that a lot will have to change within the culture of academic linguistics before we can expect computational tools to be fully integrated into working practice as a matter of course. (Among linguists who do fieldwork, for example, a misguided neo-luddite machismo is depressingly prevalent.) But at least there are linguists pushing the field in that direction.

June 29, 2009

Natural Language Processing with Python

Yesterday I received in the mail from O'Reilly my copy of Natural Language Processing with Python, which should hit bookshelves soon. (From what I hear, it's not available on Amazon yet.) The book is an introduction to natural language processing using the Natural Language Toolkit, an open source code library written in Python.

I've been eagerly awaiting the publication of this book for a couple of reasons. First, I think it's a significant event for the field when a major tech publisher like O'Reilly publishes a book devoted to an NLP project. Second, I've been tangentially involved in the NLTK for some time now, having contributed some code to the module that provides functionality for processing files in the format used by Shoebox/Toolbox: toolbox.py. Finally, I've written about the NLTK for freshmeat.net ("Processing Corpora with Python and the NLTK") and for The Journal of Language Conservation and Documentation ("Managing Fieldwork Data with Toolbox and the Natural Language Toolkit").

I love the cover, by the way, and I think that whales are a pretty good animal for an O'Reilly book on NLP. Apparently, the ones on the cover are right whales, which are endangered. As long as we're on the topic, it's worth pointing out that the international commercial whaling moratorium is not as effective as it could and should be, given that there are still a handful of countries that work around it by exploiting technicalities (Japan) or simply ignore it (Norway, Iceland). The issue is in the public mind again, thanks to the TV show Whale Wars. Hopefully it will help galvanize public opinion against whale hunting.

May 19, 2009

Wolfram|Alpha: A New Kind of Search Engine

In case you haven't heard about it, Wolfram|Alpha is a service that recently launched amid a great deal of "Google killer" hype (e.g., Nova Spivack breathlessly claiming that it could be as important as Google). Since I work for Powerset (now part of LiveSearch), another search engine that was briefly in the limelight as a Google killer, I know how desperate the media is to tell a good David and Goliath story, and the degree to which such a slant can distort the coverage. So I tried to approach Wolfram|Alpha with an open mind.

One thing that intrigued me when I read some of the initial press was that Stephen Wolfram (after whom the system is named) considered the natural language capabilities of the system to be one of its chief innovations. In a pre-announcement blog posting, Wolfram wrote: "But I’m happy to say that with a mixture of many clever algorithms and heuristics, lots of linguistic discovery and linguistic curation, and what probably amount to some serious theoretical breakthroughs, we’re actually managing to make it work." I'm not sure if Wolfram meant to claim that those theoretical breakthroughs were in natural language processing (NLP), but it certainly appeared so. However, some of the claims made concerning the NLP capabilities of the system are baffling. For example, on the Wolfram|Alpha blog, it says: "As of now, Wolfram|Alpha contains 10+ trillion pieces of data, 50,000+ types of algorithms and models, and linguistic capabilities for 1000+ domains." I have no idea what linguistic capabilities for "1000+" domains means.

Once I was able to give the system a test run, I found little evidence of any serious NLP breakthroughs. The NLP is brittle and the coverage is spotty, which is the norm given the state of the art. To give an example, if you want the top speed of something—say, a cheetah—you can use the query how fast is a cheetah? and get back results that clarify the interpretation of the query and provide information about speed, in various units of measurement, with various comparisons (like the cutesy comparison to the speed of the DeLorean in Back to the Future).

A robust natural language processing system should be able to handle minor linguistic variation in synonymous queries. But some expected variants provide the same results (cheetah speed, cheetah top speed), while others produce the "I'm sorry, Dave" response (speed of cheetahs, how fast does a cheetah run?). And the variant top cheetah speed gets interpreted strangely by a ham-fisted query refinement mechanism.

The natural language interface is only one part of the system, though. How does it fare otherwise? As various bloggers have pointed out, the system seems to have a case of Asperger's Syndrome. It knows a lot about technical minutiae but doesn't seem to know anything about the social universe. Even in the realm of science and technology, some gaps are surprising. As Daniel Tunkelang pointed out on Twitter, Wolfram|Alpha will give you the largest known prime number, but it won't give you the smallest. I'm sure that other examples could be adduced, but that probably amounts to kvetching (not unprecedented in the blogosphere, but not really fair). The truth is, I've enjoyed poking around in the system, discovering the many domains that it does cover. But the real issue isn't how much data it contains, but how much it will eventually contain, and how it will get there.

As Doug Lenat (one of the people behind Cyc) pointed out in his write-up of a sneak peek of the system, the data is all hand-curated: "In a small number of cases, he also connects via API to third party information, but mostly for realtime data such as a current stock price or current temperature. Rather than connecting to and relying on the current or future Semantic Web, Alpha computes its answers primarily from his own curated data to the extent possible; he sees Alpha as the home for almost all the information it needs, and will use to answer users' queries."

The fact that nearly all of the data in Wolfram Alpha is hand curated is a major weakness of the system. In fact, I'd say it is its proverbial Achilles' heel, given that it will be very difficult for the system to scale. If the system has a long-term future, it will need to update its contents as the world changes. (Larger prime numbers will no doubt be discovered.) And to gain traction it will probably need to expand its coverage in order to cover topics of greater popular interest (movies, sports results, famous people). But if expanding the knowledge base requires data entry by domain experts, progress will be slow, or prohibitively expensive. This is a well-known problem, which has been attacked from different angles. Probably the most ambitious is Freebase, which provides a huge database of sundry facts that can be modified or augmented by users, making it the Wikipedia of structured data.

But the holy grail here isn't hand curation of data (by a team of salaried employees or a distributed community of users). It's automatic extraction of structured data from the web via text mining, and we're still a long way from achieving a robust solution. As far as I can tell, Wolfram|Alpha hasn't contributed to that goal at all, though there are companies that are working on the goal—for example, Evri mines the web and builds up a structured database of named entities (mostly people, but also some places and things).

Despite my various criticisms, I think Wolfram Alpha is worthwhile. It doesn't live up to the hype, but it's rare that anything actually does. And even if it is more of a geeky toy that flatters Wolfram's ego than a commercially viable service, I do think it represents an inflection point in web search. We're moving away from the paradigm that has dominated search for more than a decade, which is the list of blue links, a SERP that consists of a list of documents that match the keywords in your query in order of descending relevance. Increasingly, users want more than just a web page. They want a fact, or an answer, or a graph, which means they need a new kind of search engine—to use Wolfram's terminology, a "computational knowledge engine".

February 14, 2009

Evri.com: Another Step Towards a Semantic Web

Another step towards a real semantic web has been taken recently with the incorporation of Evri.com's technology into articles from The Washington Post. (One small step for the semantic web, one giant leap for Evri.)

I originally learned about evri.com on Twitter. A colleague at Powerset, Will Fitzgerald, tweeted:

Evri (http::www.evri.com) looks mighty. (A company formed around what I do for my day job)

Yesterday another Tweet alerted me to the partnership with The Washington Post:

Evri is now showing up all over the washingtonpost.com on all articles published today. Like this one: http://is.gd/jshy

If you follow the link in the tweet, you'll find a Washington Post article that has a handy little widget providing information about some of the entities discussed in it.

You can poke around on the widget and start exploring. Want to know more about the House Appropriation Committee? It's only a click away.

Maybe you want to know about The White House. Again, just another click.

(Not all of the relationships are of the same semantic type. The last time I checked, Stevie Wonder wasn't part of Obama's cabinet. The relationships seem to be clusters, not a more narrowly defined relation such as cabinet membership.)

But the best collection of named entities isn't going to provide very valuable information if there isn't a way of aligning its contents with relevant entities in the free text of newspaper articles. A good illustration of this was revealed a little while ago when Marshall Kirkpatrick (from ReadWriteWeb) commented on Google's exposure of semantic data, pointing out that it sometimes got things badly wrong, like stating that Jesus was born in 1963. The problem wasn't that Jesus wasn't born in 1963. The problem was that it wasn't the right Jesus. The query was about Jesus Christ, but they returned data for a different Jesus. (It reminds me of a joke I once heard about prison inmates finding religion. The punch line went something like: "That's not Jesus, our Lord and Savior. That's Jesús [Spanish pronunciation], your cellmate.")

The point is that you need some reasonably sophisticated technology in order to identify entities in a document and then figure out which entities they correspond to in your database of known entities. The Jesus example shows that relying on first name alone isn't going to do the trick. Figuring out what will is going to be one of the major tasks that researchers in natural language processing will tackle in the coming years. It's a good time to be a computational linguist.
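To see why name matching alone fails, here's a toy sketch of the disambiguation problem (the code and the entity records are made up for illustration, not anyone's actual method): a naive linker that looks up candidates by surface name can't tell the two Jesuses apart, but scoring each candidate by its overlap with the surrounding context words can.

```python
# Made-up entity records: each candidate pairs an ID with words that
# typically occur near mentions of that entity.
KNOWN_ENTITIES = {
    "jesus": [
        {"id": "Jesus_Christ", "context": {"christ", "nazareth", "bible"}},
        {"id": "Jesus_Alou",   "context": {"baseball", "giants", "outfielder"}},
    ],
}

def link(mention, context_words):
    candidates = KNOWN_ENTITIES.get(mention.lower(), [])
    # Score each candidate by how many context words it shares with the text.
    scored = [(len(c["context"] & context_words), c["id"]) for c in candidates]
    return max(scored)[1] if scored else None

print(link("Jesus", {"bible", "born", "nazareth"}))   # -> "Jesus_Christ"
print(link("Jesus", {"baseball", "hit", "giants"}))   # -> "Jesus_Alou"
```

Real systems use far richer evidence than bag-of-words overlap, but the shape of the problem is the same: name lookup gives you candidates, and context picks the winner.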

February 13, 2009

The New York Times APIs

The New York Times isn't just a good newspaper. It's also a smart media player that understands what it takes to stay in the game as technology changes the newspaper business. (That's why they're "elite".)

As pointed out by ReadWriteWeb (among others), part of their strategy for staying ahead of the curve is to remain an indispensable content provider, and providing APIs is one way to do it.

As they announced on their blog a little more than a week ago, The New York Times has released an article search API that goes back 28 years (to 1981, if you don't feel like doing the math). It exposes a ton of article metadata: title, byline, publication date, descriptive terms, to name a few (go here for more info). What the API doesn't give you, however, is the body of the article. I point that out because the blog post doesn't say that explicitly, and I only figured it out after poking around a bit and reading some of the comments (including one from my former co-worker, Brendan O'Connor).

Not having the full article is a bummer, but you can always get it from the LDC's NY Times corpus. And there are other APIs if that doesn't feed your data hunger: Congress, bestsellers, campaign finance, or movie reviews.

So much data, so little time...

January 24, 2009

Symbolic versus Statistical Approaches Toward Semantic Search

On Google Watch, Clint Boulton notes that Eric Schmidt was recently quoted as saying:

"Wouldn't it be nice if Google understood the meaning of your phrase rather than just the words that are in the phrase? We've made a lot of discoveries in that area that are going to roll out in the next little while."

From this comment (and some recently noted changes in search results), Boulton infers that Google will be embracing semantic search:

"Schmidt is not talking about universal search, which draws together all Web elements -- text, blogs, video, etc. -- and renders them on a page. Schmidt is alluding to smarter search -- semantic search, which uses XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results."

I think Boulton may be overinterpreting Schmidt's comment (as are others following suit, for example, Search Engine Land or About.com's Web Trends). It seems to me that Schmidt is really just acknowledging the obvious: that in order to provide better search results, Google needs to go beyond simple keyword analysis. Google already does this to some extent, by using n-grams, for example, to analyze keyword co-occurrences. (It's not a terribly sophisticated language model, but it's a language model nonetheless.)
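To make the n-gram idea concrete, here's a minimal sketch (toy corpus, obviously not Google's actual model): counting adjacent word pairs gives you a crude picture of which words co-occur, and normalizing those counts by unigram counts yields conditional probabilities over next words.

```python
from collections import Counter

corpus = "to be or not to be that is the question".split()

# Count adjacent word pairs (bigrams) and single words (unigrams).
bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

# Maximum-likelihood estimate of P(next word | current word).
def p_next(current, nxt):
    return bigram_counts[(current, nxt)] / unigram_counts[current]

print(bigram_counts[("to", "be")])  # -> 2
print(p_next("to", "be"))           # -> 1.0  ("to" is always followed by "be" here)
```

That's the whole trick: no grammar, no ontology, just counts over a big text collection. The sophistication comes from the scale of the data, not the model.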

The real issue isn't whether Google is going to embrace semantic search, but how. I don't think there's any question that search results would be improved by knowing which meaning of a particular term or phrase is intended by the user when he or she types a query. But it's a leap to conclude that interest in semantics amounts to an endorsement of semantic search along the lines envisioned by Tim Berners-Lee, which I assume is what Boulton had in mind when he mentioned the use of "XML and RDF data from semantic networks to disambiguate search queries and Web text to improve search results".

There are other ways of getting to semantics. One big issue is whether it should be done with statistical analysis of co-occurrence frequencies in large text collections (the n-gram route) or by hard-wiring knowledge of semantics using a symbolic system (the traditional AI route). In other words, the $64,000 question is whether the path to better analysis of semantics will be more statistical or symbolic. I say "more" of one or the other, because it isn't an all-or-none affair. Hybrid systems are the norm nowadays.

The debate over the relative merits of statistical and symbolic systems for NLP has been going on for a while, so it's surprising that there isn't wider recognition of the issue among bloggers writing about semantic search. For example, the following book on the topic was published 13 years ago: The Balancing Act: Combining Symbolic and Statistical Approaches to Language. It's been on my shelves for years, but I only recently realized that my colleague at Powerset, Ron Kaplan, wrote the endorsement of the book that appears on its MIT Press homepage.

January 10, 2009

The Power of Parsing: Why a Bag of Words is Nice but a Tree of Words is Nicer

When Powerset was a media darling and getting lots of attention in the mainstream press, I set up a standing search for the keyword "Powerset" on Google news. Since I'm a bit of a news junkie, I regularly check the results. The original results were pretty good but as Powerset has fallen from the limelight, I've noticed a gradual decrease in quality owing to an expansion of "Powerset" as "power set". What's interesting is that the keyword results for "power set" typically are false positives where "set" is in fact a verb, as can be seen in the screenshot below.

This is an excellent example of where semantic search offers an improvement in precision over simple keywordese (which treats text as a bag or a list of words). A decent parser ought to be able to detect that "set" in these examples functions as a verb. Assuming that the query is analyzed as a noun (and that the analysis is respected in its expansion as "power set"), results where "set" is a verb should either fail to match or at least be ranked significantly lower. A bag of words is nice, but a tree of words is even nicer.
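Here's a minimal sketch of that kind of filtering (the part-of-speech tags below are hand-supplied for illustration; in a real system they would come from a tagger or parser): keep a "power set" hit only when "set" gets a nominal tag.

```python
# Hand-tagged example hits, using Penn Treebank-style tags
# (NN = singular noun, VBN = past participle verb, etc.).
hits = [
    ("the power set of a finite set",
     [("the", "DT"), ("power", "NN"), ("set", "NN"),
      ("of", "IN"), ("a", "DT"), ("finite", "JJ"), ("set", "NN")]),
    ("officials had the power set aside",
     [("officials", "NNS"), ("had", "VBD"), ("the", "DT"),
      ("power", "NN"), ("set", "VBN"), ("aside", "RB")]),
]

def noun_reading(tagged):
    # Keep the hit only if "power" is immediately followed by a nominal "set".
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if w1 == "power" and w2 == "set" and t2.startswith("NN"):
            return True
    return False

kept = [text for text, tagged in hits if noun_reading(tagged)]
print(kept)  # -> ['the power set of a finite set']
```

A plain keyword match treats both hits identically; one part-of-speech distinction is enough to throw out the false positive where "set" is a verb.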