February 2009 Archives

February 13, 2009

The New York Times APIs

The New York Times isn't just a good newspaper. It's also a smart media player that understands what it takes to stay in the game as technology changes the newspaper business. (That's why they're "elite".)

As ReadWriteWeb (among others) has pointed out, part of the Times's strategy for staying ahead of the curve is to remain an indispensable content provider, and offering APIs is one way to do that.

As announced on its blog a little more than a week ago, The New York Times has released an article search API that goes back 28 years (to 1981, if you don't feel like doing the math). It exposes a ton of article metadata: title, byline, publication date, and descriptive terms, to name a few fields. What the API doesn't give you, however, is the body of the article. I point that out because the blog post doesn't say so explicitly, and I only figured it out after poking around a bit and reading some of the comments (including one from my former co-worker, Brendan O'Connor).
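To make the "metadata but no body" point concrete, here is a minimal Python sketch of how a client might model one search result. The JSON field names (`results`, `title`, `byline`, `date`, `terms`) and the sample payload are assumptions made up for illustration, not the API's documented response format, so check the official documentation before relying on any of them.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ArticleMetadata:
    """Metadata for one article. Note that there is no body field."""
    title: str
    byline: str
    publication_date: str
    descriptive_terms: List[str]


def parse_results(payload: str) -> List[ArticleMetadata]:
    """Turn a JSON search response into metadata records.

    The key names used here are hypothetical placeholders.
    """
    data = json.loads(payload)
    return [
        ArticleMetadata(
            title=item.get("title", ""),
            byline=item.get("byline", ""),
            publication_date=item.get("date", ""),
            descriptive_terms=list(item.get("terms", [])),
        )
        for item in data.get("results", [])
    ]


# Invented sample response, for illustration only.
SAMPLE_PAYLOAD = """
{"results": [{"title": "Placeholder headline",
              "byline": "By A. Reporter",
              "date": "19810101",
              "terms": ["PLACEHOLDER TOPIC"]}]}
"""

articles = parse_results(SAMPLE_PAYLOAD)
```

Anything that needs the full text (say, training a language model) would have to join these records against another source, such as the corpus mentioned below.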

Not having the full article is a bummer, but you can always get it from the LDC's NY Times corpus. And there are other APIs if that doesn't feed your data hunger: Congress, bestsellers, campaign finance, and movie reviews.

So much data, so little time...

February 14, 2009

Evri.com: Another Step Towards a Semantic Web

Another step towards a real semantic web has been taken recently with the incorporation of Evri.com's technology into articles from The Washington Post. (One small step for the semantic web, one giant leap for Evri.)

I originally learned about evri.com on Twitter. A colleague at Powerset, Will Fitzgerald, tweeted:

Evri (http://www.evri.com) looks mighty. (A company formed around what I do for my day job)

Yesterday another tweet alerted me to the partnership with The Washington Post:

Evri is now showing up all over the washingtonpost.com on all articles published today. Like this one: http://is.gd/jshy

If you follow the link in the tweet, you'll find a Washington Post article that has a handy little widget providing information about some of the entities discussed in it.

You can poke around on the widget and start exploring. Want to know more about the House Appropriations Committee? It's only a click away.

Maybe you want to know about The White House. Again, just another click.

(Not all of the relationships are of the same semantic type. The last time I checked, Stevie Wonder wasn't part of Obama's cabinet. The relationships seem to be loose clusters of associated entities rather than a narrowly defined relation such as cabinet membership.)

But even the best collection of named entities isn't going to provide very valuable information if there isn't a way of aligning its contents with the relevant entities in the free text of newspaper articles. A good illustration of this came a little while ago when Marshall Kirkpatrick (from ReadWriteWeb) commented on Google's exposure of semantic data, pointing out that it sometimes got things badly wrong, like stating that Jesus was born in 1963. The problem wasn't that Jesus wasn't born in 1963. The problem was that it wasn't the right Jesus. The query was about Jesus Christ, but Google returned data for a different Jesus. (It reminds me of a joke I once heard about prison inmates finding religion. The punch line went something like: "That's not Jesus, our Lord and Savior. That's Jesús [Spanish pronunciation], your cellmates.")

The point is that you need some reasonably sophisticated technology to identify entity mentions in a document and then figure out which entries they correspond to in your database of known entities. The Jesus example shows that relying on a first name alone isn't going to do the trick. Figuring out what will is going to be one of the major tasks that researchers in natural language processing tackle in the coming years. It's a good time to be a computational linguist.
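To illustrate why a first name alone falls short, here is a toy Python sketch of one simple alternative: pick the candidate entity whose associated terms overlap most with the words in the surrounding document. This is not how Evri or Google does it. The knowledge base, the entity names, and the context terms below are all made up for illustration, and real systems use far richer signals.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Entity:
    entity_id: str
    name: str
    context_terms: FrozenSet[str]  # words that tend to appear near this entity


# Toy knowledge base; the second entry is a hypothetical person.
KNOWLEDGE_BASE: List[Entity] = [
    Entity("e1", "Jesus Christ",
           frozenset({"nazareth", "gospel", "christianity", "disciples"})),
    Entity("e2", "Jesus Placeholder",
           frozenset({"baseball", "pitcher", "league", "season"})),
]


def tokenize(text: str) -> set:
    """Lowercase the text and strip surrounding punctuation from each word."""
    return {w.strip(".,;:!?()\"'").lower() for w in text.split()} - {""}


def link_entity(mention: str, document_text: str,
                kb: List[Entity] = KNOWLEDGE_BASE) -> Optional[Entity]:
    """Return the name-matching entity with the largest context overlap."""
    mention_tokens = tokenize(mention)
    doc_tokens = tokenize(document_text)
    # Step 1: candidates are entities sharing at least one name token.
    candidates = [e for e in kb if mention_tokens & tokenize(e.name)]
    if not candidates:
        return None
    # Step 2: rank candidates by overlap with the document's words.
    # Ties go to the first candidate in knowledge-base order.
    return max(candidates, key=lambda e: len(e.context_terms & doc_tokens))


religion = link_entity(
    "Jesus", "The gospel describes Jesus and his disciples in Nazareth.")
sports = link_entity(
    "Jesus", "Jesus had a strong season as a pitcher.")
```

With name matching alone, both documents would yield the same two candidates and no way to choose between them. The context overlap is what separates `religion` (entity `e1`) from `sports` (entity `e2`).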

About February 2009

This page contains all entries posted to Nerd Industries: Stuart Robinson's blog in February 2009. They are listed from oldest to newest.
