26.3.08

Word Sense Disambiguation & Thesaurus Expansion @ Google

Lexical ambiguity is a source of errors in search engines and other Information Retrieval tools. In the query "bank", are we talking about financial institutions, places to sit on, or sand banks in rivers? Due to this kind of language variability, search engines often show wrong hits. Word Sense Disambiguation (WSD) is the task of selecting the suitable sense for a word in a given context. For instance, guessing that "car" means "automobile" when used in "I drove my car", and that it means "railcar" when used in "three cars had jumped the rails". See Lesk's paper [1], a seminal work on WSD.
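To make the idea concrete, here is a minimal sketch of a simplified, Lesk-style disambiguator: it picks the sense whose dictionary gloss shares the most words with the context. This is not Lesk's original procedure (which compares the glosses of neighbouring words with each other). The tiny sense inventory, the glosses, the stopword list and the function names are all made up for this example.

```python
import re

# Hypothetical toy sense inventory; the glosses were written for this
# example and do not come from any real dictionary.
SENSES = {
    "car": {
        "automobile": "a motor vehicle with four wheels that a person can drive on a road",
        "railcar": "a wheeled vehicle that runs on rails as part of a train",
    }
}

# Small, ad hoc stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "on", "in", "that", "with",
             "as", "can", "my", "i", "had", "is", "to"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Return the sense whose gloss overlaps most with the context.

    Returns None when no gloss shares any word with the context.
    """
    ctx = tokens(context) - {word}
    best_sense, best_overlap = None, 0
    for sense, gloss in SENSES[word].items():
        overlap = len(ctx & tokens(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("car", "I drove my car down the road"))
print(simplified_lesk("car", "three cars had jumped the rails"))
print(simplified_lesk("car", "I drove my car"))
```

With these toy glosses, the first call picks "automobile" (shared word: "road"), the second picks "railcar" (shared word: "rails"), and the third returns None: without stemming, "drove" does not match "drive", so a very short context gives the method nothing to work with.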

It is hard to guess the right sense even when the word appears in a good, informative context. In the case of a query, it is even worse: there is almost no context! (Queries are rarely longer than two words.) But Google researchers are exploiting our data; see:

One of the most important uses of data at Google is building language models. (...) One place we use these models is to find alternatives for words used in searches. For example, for both English and French users, "GM" often means the company "General Motors," but our language model understands that in French searches like seconde GM, it means "Guerre Mondiale" (World War), whereas in STI GM it means "Génie Mécanique" (Mechanical Engineering). Another meaning in English is "genetically modified," which our language model understands in GM corn. We've learned this based on the documents we've seen on the web and by observing that users will use both "genetically modified" and "GM" in the same set of searches.
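The quote does not say how the model is built, so the following is only a rough sketch of the general idea: count which expansions show up in the same search session as the abbreviation, keyed by the other words in the query, and let those counts vote. The sessions, the list of expansions and the function names are invented for illustration; this is not Google's method.

```python
from collections import Counter, defaultdict

ABBREV = "gm"

# Candidate expansions (taken from the quote above).
EXPANSIONS = ["general motors", "genetically modified",
              "guerre mondiale", "génie mécanique"]

# Invented toy search sessions: each list is one user's queries.
SESSIONS = [
    ["gm corn", "genetically modified corn"],
    ["gm corn safety", "genetically modified corn safety"],
    ["seconde gm", "seconde guerre mondiale"],
    ["sti gm", "sti génie mécanique"],
    ["gm stock", "general motors stock"],
]


def learn(sessions):
    """Map each context word to counts of expansions seen in the same session."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Words that appear next to the abbreviation in this session.
        contexts = set()
        for query in session:
            words = query.split()
            if ABBREV in words:
                contexts.update(w for w in words if w != ABBREV)
        # Credit every expansion used in the session to those context words.
        for query in session:
            for expansion in EXPANSIONS:
                if expansion in query:
                    for word in contexts:
                        counts[word][expansion] += 1
    return counts


def expand(query, counts):
    """Pick the expansion with the most votes from the query's context words."""
    votes = Counter()
    for word in query.split():
        if word != ABBREV:
            votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else None


counts = learn(SESSIONS)
print(expand("gm corn", counts))
print(expand("seconde gm", counts))
print(expand("gm", counts))
```

On this toy data, "gm corn" expands to "genetically modified" and "seconde gm" to "guerre mondiale", while the bare query "gm" returns None because there is no context word to vote.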

Note that Google is using implicit feedback (among other things), that is, storing the links users click on in their results pages. Explicit feedback is nearly impossible to collect on the real, hard Web.

You can also get a list of recent work by Google researchers on Natural Language Processing. Browse it, it is great!

[1] Mike Lesk, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone, Proceedings of the 5th Annual International Conference on Systems Documentation (ACM SIGDOC), pp. 24-26, 1986.

UPDATE: You can experiment with Google's data via their APIs, and with other data sets such as their N-grams.
