27.9.10

MAVIR Talk: Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish

Elliphant: A Machine Learning Method for Identifying Subject Ellipsis and Impersonal Constructions in Spanish
By: Luz Rello (U. of Wolverhampton / U. Pompeu Fabra)
Date: Oct. 1st, 2010 - 12:00
Place: Sala 2.24 Fac. de Psicología, UNED c/ Juan del Rosal, 10 28040 Madrid

This seminar will present Elliphant, a machine learning system for classifying Spanish subject ellipsis as either referential or non-referential. Linguistically motivated features are incorporated in a system which performs a ternary classification: verbs with explicit subjects, verbs with omitted but referential subjects (zero pronouns), and verbs with no subject (impersonal constructions). To the best of our knowledge, this is the first attempt to automatically identify non-referential ellipsis in Spanish.

To enable a memory-based strategy, the ESZIC Corpus was created and manually annotated. The corpus is composed of Spanish legal and health texts and contains more than 6,800 annotated instances. A set of 14 features was defined, and a separate training file was created containing the instances represented as vectors of feature values. The training data was used with the Weka package, and a set of optimization experiments was carried out to determine the best machine learning algorithm, the best parameter settings, the most effective combinations of features, the number of instances needed to train the classifier, and the best settings for classifying instances from different genres. A comparative evaluation against Connexor's Machinese Syntax parser shows that our system performs better. The overall accuracy of the system is 86.9%.
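To make the idea of a memory-based ternary classifier over feature vectors more concrete, here is a minimal sketch in Python. It is only an illustration: the abstract does not say which algorithm Elliphant ends up using, so a plain k-nearest-neighbour vote stands in for "memory-based" here. The three features (verb person, presence of "se", presence of a preceding noun phrase), the toy instances and the function names are all hypothetical and are not the 14 features or the ESZIC data.

```python
from collections import Counter

# Hypothetical feature vectors: (verb person, has "se" particle, preceding noun phrase).
# Features, values and labels are illustrative assumptions, not Elliphant's real data.
TRAINING = [
    (("3sg", "no", "yes"), "explicit"),
    (("3pl", "no", "yes"), "explicit"),
    (("1sg", "no", "no"), "zero_pronoun"),
    (("3pl", "no", "no"), "zero_pronoun"),
    (("3sg", "yes", "no"), "impersonal"),
    (("3pl", "yes", "no"), "impersonal"),
]


def overlap_distance(a, b):
    """Count the feature positions where two vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(instance, training=TRAINING, k=3):
    """Label an instance by majority vote among its k nearest stored examples."""
    neighbours = sorted(
        training, key=lambda example: overlap_distance(instance, example[0])
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify(("3sg", "yes", "no")))
    print(classify(("1pl", "no", "no")))
```

The classifier keeps every training instance in memory and defers all work to classification time, which is what distinguishes memory-based learning from methods that build an explicit model first.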

Because subjects are elided fairly frequently in Spanish, this system is useful: classifying elliptic subjects as referential or non-referential can improve the accuracy of Natural Language Processing tasks that require zero anaphora resolution, such as information extraction, machine translation, automatic summarization and text categorization.

24.9.10

NIPS 2010 Workshop: Machine Learning in Online ADvertising (MLOAD 2010)

Online advertising, a form of advertising that uses the Internet and World Wide Web to deliver marketing messages and attract customers, has seen exponential growth since its inception over 15 years ago, resulting in a $65 billion market worldwide in 2008; it has been pivotal to the success of the World Wide Web. This success has arisen largely from the transformation of the advertising industry from a low-tech, human-intensive, "Mad Men" (a reference to the AMC TV series) way of doing work, which was commonplace for much of the 20th century and the early days of online advertising, to highly optimized, mathematical, machine learning-centric processes (some of which have been adapted from Wall Street) that form the backbone of many current online advertising systems.

The dramatic growth of online advertising poses great challenges to the machine learning research community and calls for new technologies to be developed. Online advertising is a complex problem, especially from a machine learning point of view. It involves multiple parties (i.e., advertisers, users, publishers, and ad platforms such as ad exchanges), which interact with each other harmoniously but exhibit a conflict of interest when it comes to risk and revenue objectives. It is highly dynamic in terms of the rapid change of user information needs, the non-stationary bids of advertisers, and the frequent modifications of ad campaigns. It is very large scale, with billions of keywords, tens of millions of ads, billions of users and millions of advertisers, where events such as clicks and actions can be extremely rare. In addition, the field lies at the intersection of machine learning, economics, optimization, distributed systems and information science, all very advanced and complex fields in their own right. For such a complex problem, conventional machine learning technologies and evaluation methodologies are not sufficient, and the development of new algorithms and theories is sorely needed.

The goal of this workshop is to survey the state of the art in online advertising, and to discuss future directions and challenges in research and development, from a machine learning point of view. The organizers expect the workshop to help develop a community of researchers who are interested in this area, and to yield future collaboration and exchanges.

Possible topics include:

  • Dynamic/non-stationary/online learning algorithms for online advertising
  • Large scale machine learning for online advertising
  • Learning theory for online advertising
  • Learning to rank for ads display
  • Auction mechanism design for paid search, social network advertising and microblog advertising
  • System modeling for ad platforms
  • Traffic and click-through rate prediction
  • Bid optimization
  • Metrics and evaluation
  • Yield optimization
  • Behavioral targeting modeling
  • Click fraud detection
  • Privacy in advertising
  • Crowdsourcing and inference
  • Mobile advertising and social advertising
  • Public dataset creation for research on online advertising

The above list is not exhaustive, and the organizers also welcome submissions on closely related topics.

Important Dates

  • Submission deadline: Oct. 23, 2010
  • Notification of Acceptance: Nov. 11, 2010
  • Camera ready: Nov. 22, 2010
  • Workshop Date: Dec. 10, 2010

22.9.10

Interesting Text Mining Workshop

The next of EBI's text mining (TM) training workshops has been announced. It takes place on Oct. 27-29, 2010 at the EBI, Hinxton, Cambridge, U.K. It is scheduled right after the SMBM 2010 conference (http://www.smbm.eu/), which also takes place at the EBI.

The registration site is now open and the preliminary program is available. The workshop is free, but a contribution to the workshop dinner is requested. For more details, please go to http://www.ebi.ac.uk/Rebholz-srv/tmhandson_oct2010.html.

The training workshop features contributions from:

  • Prof. Sophia Ananiadou, University of Manchester, Director of NaCTeM
  • Olivier Bodenreider, National Library of Medicine, Bethesda, Md
  • Fabio Rinaldi, University of Zurich
  • Dietrich Rebholz-Schuhmann, Ian Lewin, Anika Oellrich, Senay Kafkas, Rebholz Group, EMBL-EBI

10.9.10

VirusTotal Advanced Tools

VirusTotal, a service provided by Hispasec Sistemas that scans a file online with over 20 different antivirus programs, has an extremely appealing section on advanced tools.

As alternatives to uploading a file through a traditional Web form to have it scanned, this section offers a number of additional options, including:

  • An email interface that lets you send the file by email and receive a (possibly XML-formatted) report by email as well.
  • A standalone client for Windows that adds contextual menus for sending files to VirusTotal, analyzing programs that are currently running, etc.
  • (NEW) The VTZilla browser add-on, which lets you check files and URLs in Web pages.
  • An API that can be used as a Web service.

Additionally, VirusTotal has been placing more and more emphasis on a relatively new service for checking URL reputation.

Very, very interesting. This service is also available through the VTZilla add-on.

8.9.10

Skype Nigerian Scam

The (in)famous Nigerian Scam, now available for you to enjoy on Skype!