23.4.09

Third Workshop on Analytics for Noisy Unstructured Text Data

23-24 July 2009, Barcelona, Spain
In conjunction with the Tenth International Conference on Document Analysis and Recognition

Noisy unstructured text data is ubiquitous in real-world communications. Text produced by processing signals intended for human use, such as printed or handwritten documents, spontaneous speech, and camera-captured images, is a prime example. ICR/OCR error rates on paper documents can range widely, from 2-3% for clean inputs to 50% or higher, depending on the quality of the page image, the complexity of the layout, aspects of the typography, etc. Individual variability in handwriting makes this a particularly difficult form of input, and error rates here are often substantially higher than for machine-printed text. Telephonic conversations between call center agents and customers often see 30-40% word error rates, even using state-of-the-art ASR techniques. Despite the tremendous challenges such data presents, it is pervasive in applications of interest to corporations and government organizations.

Recognition errors are not the sole source of noise; natural language and the creative ways that humans use it can create problems for computational techniques. Electronic text from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs and Web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages) is often noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information, and pause-filling words such as "um" and "uh" in the case of spoken conversations.

The Third Workshop on Analytics for Noisy Unstructured Text Data (AND-09) is devoted to issues arising from the need to contend with noisy inputs, the impact noise can have on downstream applications, and the demands it places on document analysis.

Topics of interest include, but are not limited to:

  • Noise induced by document analysis techniques and its impact on downstream applications
  • Formal models for noise, including characterization and classification of noise
  • Treatment of noisy data in specific application areas, including historical texts, multilingual documents, blogs, chat/SMS logs, social network analysis, patent search, and machine translation
  • Data sets, benchmarks, and evaluation techniques for analysis of noisy text
  • All other topics arising from noise and its effects on textual data

Dates

  • Submission of papers: May 4, 2009
  • Notification of acceptance: May 20, 2009
  • Camera-ready papers due: June 20, 2009