Nihil Obstat: JRC-Names - A freely available, highly multilingual named entity resource

JRC-Names is a highly multilingual named entity resource for person and organisation names ('entities'). It consists of large lists of names and their many spelling variants (up to hundreds for a single person), including variants across scripts (Latin, Greek, Arabic, Cyrillic, Japanese, Chinese, etc.). The resource file with the list of spelling variants is accompanied by demonstrator software, implemented in Java, that (a) produces a list of known spelling variants for any input name, and (b) analyses UTF-8-encoded text files to find mentions of known entities, returning the name variant found, the preferred display name for that entity, the unique identifier for that name, the position of the entity name in the text, and its length in characters.

For examples, visit any of the more than one million entity pages on EMM-NewsExplorer (e.g. the page for Muammar Gaddafi at http://emm.newsexplorer.eu/NewsExplorer/entities/en/262.html), each of which shows the list of spelling variants automatically collected for that entity.
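To make the two demonstrator functions more concrete, here is a minimal Python sketch of the same idea. It is not the JRC software (which is written in Java) and it does not read the real resource file: the in-memory `SAMPLE_VARIANTS` list, the function names, and the rule that the first variant listed for an id is the display name are all assumptions made for illustration. The entity id 262 is borrowed from the NewsExplorer URL above, and the spellings are illustrative only.

```python
from typing import NamedTuple

# Hypothetical sample data, NOT the real JRC-Names file format:
# (entity id, spelling variant). Assumption: the first variant listed
# for an id serves as the preferred display name.
SAMPLE_VARIANTS = [
    (262, "Muammar Gaddafi"),
    (262, "Moammar Gadhafi"),
    (262, "Muammar Qaddafi"),
]


class Mention(NamedTuple):
    variant: str       # spelling found in the text
    display_name: str  # preferred display name of the entity
    entity_id: int     # unique identifier
    position: int      # character offset of the match
    length: int        # length of the match in characters


def build_index(rows):
    """Return (variant -> id, id -> display name) lookup tables."""
    variant_to_id, display = {}, {}
    for entity_id, variant in rows:
        display.setdefault(entity_id, variant)  # first variant wins
        variant_to_id[variant] = entity_id
    return variant_to_id, display


def known_variants(name, variant_to_id):
    """(a) List every known spelling of the entity that `name` belongs to."""
    entity_id = variant_to_id.get(name)
    if entity_id is None:
        return []
    return [v for v, i in variant_to_id.items() if i == entity_id]


def find_mentions(text, variant_to_id, display):
    """(b) Find exact occurrences of known variants in `text`."""
    mentions = []
    for variant, entity_id in variant_to_id.items():
        start = text.find(variant)
        while start != -1:
            mentions.append(
                Mention(variant, display[entity_id], entity_id, start, len(variant))
            )
            start = text.find(variant, start + 1)
    return sorted(mentions, key=lambda m: m.position)


variant_to_id, display = build_index(SAMPLE_VARIANTS)
for mention in find_mentions(
    "Reports on Moammar Gadhafi and Muammar Qaddafi.", variant_to_id, display
):
    print(mention)
```

Scanning the text once per variant is fine for a toy list, but a resource with this many names would need a more efficient multi-pattern matcher.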

JRC-Names is a /technical/ resource that can be used to find names even if they are spelled differently and to normalise name spellings in databases or other repositories. It is also a useful ingredient for IT systems that process text, e.g. for text mining, machine translation, social network generation, and other applications involving named entities.
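As a rough illustration of normalising name spellings in a database, the following Python sketch rewrites a name field to the preferred spelling whenever the stored value is a known variant. The lookup tables, the record layout, and the `normalise_records` helper are hypothetical, and matching on case-folded, whitespace-collapsed keys is an assumption of this sketch rather than documented JRC-Names behaviour.

```python
def normalise_key(name):
    """Collapse whitespace and ignore case (an assumption of this sketch)."""
    return " ".join(name.split()).casefold()


# Hypothetical lookup tables; a real application would build them
# from the JRC-Names resource file.
PREFERRED = {262: "Muammar Gaddafi"}
VARIANT_TO_ID = {
    normalise_key(variant): 262
    for variant in ("Muammar Gaddafi", "Moammar Gadhafi", "Muammar Qaddafi")
}


def normalise_records(records, field="name"):
    """Return copies of `records` with known variants replaced by the
    preferred spelling and tagged with the entity id."""
    result = []
    for record in records:
        updated = dict(record)  # leave the caller's data untouched
        entity_id = VARIANT_TO_ID.get(normalise_key(record[field]))
        if entity_id is not None:
            updated[field] = PREFERRED[entity_id]
            updated["entity_id"] = entity_id
        result.append(updated)
    return result


rows = [
    {"name": "moammar  gadhafi", "source": "feed A"},
    {"name": "Jane Doe", "source": "feed B"},
]
for row in normalise_records(rows):
    print(row)
```

Names that are not in the lookup table pass through unchanged, so the step can be applied to a whole table without losing records.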

JRC-Names is a by-product of the analysis of about 100,000 news reports per day by the *Europe Media Monitor* (EMM) family of applications (freely accessible at http://emm.newsbrief.eu/overview.html). It was mostly compiled automatically, by analysing hundreds of millions of news articles since the year 2004 in up to twenty languages, identifying names of entities (mostly persons, but also organisations, event names, and more), and detecting which of these newly found names are variant spellings of each other. Most name variants in JRC-Names are thus spellings that were found in real-life text (including frequent spelling mistakes). Additionally, for a subset of the collection of entities, software automatically extracted spelling variants in many further languages (e.g. Chinese, Thai, Japanese, ...) from the cross-lingual links in Wikipedia. For highly frequent or otherwise important names, the named entity resource was additionally manually verified. As JRC-Names was mostly produced automatically, it will contain some errors.
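The announcement does not say how EMM decides that two newly found names are variants of each other. Purely to illustrate the general idea, the Python sketch below groups names by character-level similarity using the standard library's `difflib.SequenceMatcher`. This is a swapped-in technique, not the EMM method, and the 0.8 threshold and the greedy grouping strategy are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def are_likely_variants(a, b, threshold=0.8):
    """Assumed rule: names at or above the threshold are treated as variants."""
    return similarity(a, b) >= threshold


def group_variants(names, threshold=0.8):
    """Greedy grouping: each name joins the first group whose first
    member it resembles, otherwise it starts a new group."""
    groups = []
    for name in names:
        for group in groups:
            if are_likely_variants(name, group[0], threshold):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


names = ["Muammar Gaddafi", "Angela Merkel", "Moammar Gadhafi", "Angela Merkell"]
for group in group_variants(names):
    print(group)
```

A plain string-similarity measure like this cannot link spellings across scripts (e.g. Latin and Arabic), which a resource covering many scripts would need to handle by other means.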

At http://langtech.jrc.ec.europa.eu/, you will find more information on the JRC's multilingual language technology activity, a download link for JRC-Names, a reference paper explaining the named entity resource, and a page pointing to other multilingual resources.

Via MAVIR and Elsnet lists.

Nihil Obstat

13.10.11
