27.9.10

MAVIR Talk: Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish

Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish
By: Luz Rello (U. of Wolverhampton / U. Pompeu Fabra)
Date: Oct. 1st, 2010 - 12:00
Place: Sala 2.24 Fac. de Psicología, UNED c/ Juan del Rosal, 10 28040 Madrid

This seminar will present Elliphant, a machine learning system for classifying Spanish subject ellipsis as either referential or non-referential. Linguistically motivated features are incorporated in a system which performs a ternary classification: verbs with explicit subjects, verbs with omitted but referential subjects (zero pronouns), and verbs with no subject (impersonal constructions). To the best of our knowledge, this is the first attempt to automatically identify non-referential ellipsis in Spanish.

To enable a memory-based strategy, the ESZIC Corpus was created and manually annotated. The corpus is composed of Spanish legal and health texts and contains more than 6,800 annotated instances. A set of 14 features was defined, and a separate training file was created containing the instances represented as vectors of feature values. The training data was used with the Weka package, and a set of optimization experiments was carried out to determine the best machine learning algorithm, its parameter settings, the most effective combinations of features, the number of instances needed to train the classifier, and the best settings for classifying instances from different genres. A comparative evaluation of Elliphant against Connexor's Machinese Syntax parser shows that our system performs better. The overall accuracy of the system is 86.9%.
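To make the idea of instances as feature vectors in a memory-based classifier more concrete, here is a minimal sketch in plain Python. It is not Elliphant's implementation and does not use Weka: the three features, the toy training instances, and the names `TRAIN`, `overlap_distance` and `classify` are all hypothetical, chosen only to illustrate a k-nearest-neighbour vote over the three classes described above.

```python
from collections import Counter

# The three classes named in the abstract.
EXPLICIT = "explicit_subject"
ZERO_PRONOUN = "zero_pronoun"
IMPERSONAL = "impersonal"

# Hypothetical toy training set: (feature vector, class).
# The features are illustrative assumptions, NOT Elliphant's 14 features:
#   (verb person/number, clause has the particle "se", noun phrase precedes verb)
TRAIN = [
    (("3sg", "no", "yes"), EXPLICIT),
    (("3pl", "no", "yes"), EXPLICIT),
    (("1sg", "no", "no"), ZERO_PRONOUN),
    (("3sg", "no", "no"), ZERO_PRONOUN),
    (("3sg", "yes", "no"), IMPERSONAL),
    (("3sg", "yes", "no"), IMPERSONAL),
]


def overlap_distance(a, b):
    """Count the positions at which two categorical feature vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, train=TRAIN, k=3):
    """Label an instance by majority vote among its k nearest stored examples."""
    neighbours = sorted(train, key=lambda ex: overlap_distance(instance, ex[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify(("3sg", "yes", "no")))
    print(classify(("1sg", "no", "no")))
```

Memory-based learners store the training instances and defer all generalization to classification time, which is why the choice of features and the number of stored instances, both mentioned among the optimization experiments, directly affect the result.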

Because subjects are elided fairly frequently in Spanish, this system is useful: classifying elliptic subjects as referential or non-referential can improve the accuracy of Natural Language Processing tasks that require zero anaphora resolution, including information extraction, machine translation, automatic summarization and text categorization.