30.4.09

The Quick Reference Site

A very useful site for programmers is the Quick Reference Site, maintained by Tim Sinaeve. This site features:

  • Quick Reference Cards: short sheets containing the main commands for utilities such as Subversion or vi, and for programming languages. These are the most valuable resource, although some cards run to 8 pages!
  • E-books, ranging from manuals to learning and reference books.
  • Papers and tutorials (self-explanatory).
I hope you find it as useful as I do!

29.4.09

Could NLP help José Manuel Lúcia?

To my surprise, when I checked El Mundo I saw that the front page of its television section leads with José Manuel Lúcia's incredible performance on Pasapalabra. It is well worth watching.

After this impressive feat, which earned José Manuel the far from negligible sum of 396,000 euros in prize money, my post from yesterday comes to mind, in which IBM sets out, as the next challenge for Artificial Intelligence, to build a system able to compete at human level on the quiz show Jeopardy!

Jeopardy! is a popular American quiz show built on recall questions, which demands not only a large storage capacity (by human standards) but also very fast retrieval. IBM plans to compete with its Question Answering (QA) technology, called Watson. Although it may look easy to a layperson, QA is one of the most active fields, both for its practical interest and for its difficulty.

Would Pasapalabra be a challenge of the same magnitude for a Natural Language Processing system? In my opinion, certainly not. While in Jeopardy! the difficulty lies in interpreting the question, in Pasapalabra only a basic interpretation is needed (detecting "Contiene la X" [contains the X] vs. "Con la C" [starts with C]), followed by retrieving a word given its definition. This second part can be done with basic Information Retrieval techniques over a previously indexed, exhaustive dictionary. Or can it?
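As a rough illustration of that two-step idea (parse the clue, then retrieve a headword from its definition), here is a minimal Python sketch. Everything in it is an assumption made for the example: the five-entry dictionary, the English clue wording ("With V:" / "Contains V:") and the TF-IDF-style overlap score. A real system would index a complete Spanish dictionary and would need to handle synonyms and inflection.

```python
import math
import re
from collections import Counter

# Toy dictionary: headword -> definition. Entries are invented for illustration.
DICTIONARY = {
    "vote": "formal expression of a choice made in an election",
    "valley": "low area of land between hills or mountains",
    "violin": "string instrument played with a bow",
    "cave": "natural hollow space under the ground",
    "oven": "enclosed compartment used to bake food",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(dictionary):
    """Return per-headword term counts and smoothed inverse document frequencies."""
    docs = {word: Counter(tokenize(defn)) for word, defn in dictionary.items()}
    df = Counter(term for counts in docs.values() for term in counts)
    n = len(docs)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return docs, idf


def parse_clue(clue):
    """Split a clue such as 'With V: ...' or 'Contains V: ...' into its parts."""
    m = re.match(r"\s*(with|contains)\s+([a-z])\s*:\s*(.+)", clue, re.IGNORECASE)
    if not m:
        raise ValueError("unrecognised clue format")
    return m.group(1).lower(), m.group(2).lower(), m.group(3)


def answer(clue, docs, idf):
    """Return the best-scoring headword that satisfies the letter constraint."""
    mode, letter, definition = parse_clue(clue)
    query = Counter(tokenize(definition))
    best, best_score = None, 0.0
    for word, counts in docs.items():
        allowed = word.startswith(letter) if mode == "with" else letter in word
        if not allowed:
            continue
        # TF-IDF weighted overlap between the clue and the stored definition.
        score = sum(query[t] * counts[t] * idf.get(t, 0.0) ** 2 for t in query)
        if score > best_score:
            best, best_score = word, score
    return best


docs, idf = build_index(DICTIONARY)
print(answer("With V: string instrument that is played with a bow", docs, idf))
print(answer("Contains V: compartment used to bake food", docs, idf))
```

The letter constraint is applied as a hard filter before scoring, so the retrieval step only has to rank the few entries that can legally fill that slot.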

Food for thought: what about parts of speech? Take this improvised definition: "Con la V: Ejercicio del derecho de elección de los representantes políticos en los organismos gubernamentales" (With V: exercise of the right to choose political representatives in government bodies). Is the answer "Votar" (to vote) or "Voto" (vote)?

Another application of Natural Language Processing related to Pasapalabra is choosing the words that make up the "rosco" (the ring of letters). The trivial approach is random choice, but then the difficulty of the rosco would vary widely and would sometimes be outrageously high. The solution is to rate words by difficulty. And how to compute it? I think the most reasonable approach is to take a representative corpus of Spanish and assign each dictionary word a difficulty according to its frequency or popularity, paying attention to lemmatization: morphosyntactic variation must be handled so that, for example, inflected verb forms are reduced to the infinitive that appears in a dictionary. Selecting the words so that they keep a level of difficulty that balances the rosco is an optimization problem that can be solved with standard algorithms (I understand that in this case a greedy approach, or at most dynamic programming, could do).
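To make the idea concrete, here is a minimal Python sketch of frequency-based difficulty plus a greedy selection. The lemma counts are invented numbers, the corpus is assumed to be lemmatized already, and using the negative log of relative frequency as "difficulty" is just one plausible choice, not an established measure.

```python
import math
from collections import Counter

# Hypothetical lemma frequencies from a lemmatized corpus (invented numbers).
LEMMA_COUNTS = Counter({
    "agua": 6000, "abulia": 5, "arena": 1200,
    "barco": 800, "beodo": 3, "beber": 2000,
    "casa": 5000, "cenit": 12, "comer": 3000,
    "dedo": 900, "diatriba": 8, "dormir": 2500,
})


def difficulty(lemma, counts):
    """Rarer lemmas are harder: negative log of the relative frequency."""
    total = sum(counts.values())
    return -math.log(counts[lemma] / total)


def build_rosco(counts, target_mean):
    """Greedy selection: one lemma per initial letter, keeping the
    running mean difficulty as close as possible to target_mean."""
    by_letter = {}
    for lemma in counts:
        by_letter.setdefault(lemma[0], []).append(lemma)

    chosen, total = {}, 0.0
    for i, letter in enumerate(sorted(by_letter), start=1):
        # Pick the candidate that brings the running total closest to i * target_mean.
        best = min(
            by_letter[letter],
            key=lambda w: abs(total + difficulty(w, counts) - i * target_mean),
        )
        chosen[letter] = best
        total += difficulty(best, counts)
    return chosen, total / len(chosen)


rosco, mean_difficulty = build_rosco(LEMMA_COUNTS, target_mean=4.0)
for letter, word in rosco.items():
    print(letter, word, round(difficulty(word, LEMMA_COUNTS), 2))
print("mean difficulty:", round(mean_difficulty, 2))
```

Because each choice compensates for the ones made before it, a hard word for one letter tends to be followed by an easier one, which is the balancing effect described above. A greedy pass does not guarantee the best possible balance; dynamic programming over the letters would.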

To finish, congratulations to José Manuel on his feat, and thanks for inspiring this post and for the wonderful time I had watching the video.

28.4.09

AIML bot with a child's personality

AIML has become almost a standard for developing chat bots, that is, programs able to converse with people as real persons would (well, that is the goal; they hardly pass the Turing Test, although they may fool even an expert for a while).

I am in the process of building an AIML-based bot that resembles a Spanish-speaking child. I have tried two main approaches, namely:

  • Learning AIML patterns from real children's chats, using some variations of the algorithms proposed by Bayan Abu Shawar (see references below). The problem with this approach is that the learning system is designed to provide chat access to a corpus, not to resemble a human being. I have tried some alternatives that use small pieces of linguistic knowledge (e.g. Freeling), without success: the bot still speaks very unnaturally.
  • Building the AIML files by hand (the usual way of doing things in this field). Too time-consuming!

Pablo Gervás suggested that, to avoid starting from scratch, I use an English-speaking bot. Fine, if I can find one! So my question is:

Have you ever come across an English-speaking, child-like bot built on AIML?

References

Abu Shawar, Bayan; Atwell, Eric. A chatbot system as a tool to animate a corpus. ICAME Journal, vol. 29, pp. 5-24. 2005.

Abu Shawar, Bayan; Atwell, Eric. Using corpora in machine-learning chatbot systems. International Journal of Corpus Linguistics, vol. 10, pp. 489-516. 2005.

Question Answering as the new challenge for real AI: Jeopardy!

Question Answering is an increasingly popular Natural Language Processing task, defined in Wikipedia as "the task of automatically answering a question posed in natural language. To find the answer to a question, a QA computer program may use either a pre-structured database or a collection of natural language documents (a text corpus such as the World Wide Web or some local collection)." In other words, instead of returning a list of references or documents as an Information Retrieval system or search engine does, a QA system cleverly processes the question, digs into the documents, and returns an appropriate answer.

Question Answering is a very active research topic among NLPers. There are competitions like the TREC and CLEF tracks on Q&A, and some systems have been deployed in real applications (from search engines like Ask.com to domain-specific systems like EAGLi for biomedicine). But now IBM is gaining attention for its system Watson by aiming at something close to a Turing Test: beating humans at Jeopardy!

Jeopardy is a game demanding knowledge and quick recall, covering a broad range of topics, such as history, literature, politics, film, pop culture, and science. IBM believes that this is the final challenge for its system Watson, one of the leaders in this area. Do you remember Deep Blue? Well, this time the goal is not (only) to beat a person at specific skills, like reasoning and hypothesizing, but also at dialog! This is very close to a real Turing Test!

If Watson is able to beat the winner of Jeopardy, will we be closer to a real AI? Are HAL or Skynet waiting for us behind that door?

AIML bot resembling a child?

AIML has become nearly a standard for writing chat bots, that is, programs able to chat with people as a person would (well, that is the goal; they hardly pass the Turing Test, although they may fool even an expert for a while).

I am in the process of building an AIML-based bot resembling a Spanish-speaking child. I have tested two main approaches, namely:

  • Trying to learn the AIML patterns from actual children's chats, by using several variations of the algorithms proposed by Bayan Abu Shawar (see references below). The problem with this approach is that the learning system is intended to provide access by chat to a corpus, not to resemble a human being. I have tested some alternatives using bits of linguistic knowledge (e.g. Freeling) with no success: the bot still speaks very unnaturally.
  • Building the AIML files by hand (the usual way these things are done). Too time-consuming!
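For readers who have not seen AIML, the hand-written knowledge is a list of categories, each pairing an input pattern with a response template. The Python sketch below mimics that idea only loosely: the four categories are invented, the matcher is a toy that takes the first matching pattern in list order, and it is not an AIML interpreter (real ones use their own matching priorities and many more tags).

```python
import re

# Hypothetical categories in the spirit of AIML <pattern>/<template> pairs.
# "*" in a pattern matches one or more words; "<star/>" in a template echoes them.
CATEGORIES = [
    ("HELLO", "Hi! Do you want to play?"),
    ("I AM * YEARS OLD", "Wow, <star/>! I am seven."),
    ("DO YOU LIKE *", "I love <star/>! Do you?"),
    ("*", "I don't know. Ask my mum."),
]


def normalize(text):
    """Uppercase the input and drop punctuation before matching."""
    return " ".join(re.findall(r"[A-Z0-9']+", text.upper()))


def pattern_to_regex(pattern):
    """Turn a pattern with '*' wildcards into an anchored regular expression."""
    parts = [r"(.+)" if tok == "*" else re.escape(tok) for tok in pattern.split()]
    return re.compile("^" + " ".join(parts) + "$")


def respond(text, categories=CATEGORIES):
    """Return the template of the first category whose pattern matches."""
    sentence = normalize(text)
    for pattern, template in categories:
        m = pattern_to_regex(pattern).match(sentence)
        if m:
            star = m.group(1).lower() if m.groups() else ""
            return template.replace("<star/>", star)
    return ""


print(respond("Hello!"))
print(respond("Do you like ice cream?"))
print(respond("What is AIML?"))
```

Even this toy shows why writing files by hand is slow: every way a child might phrase something needs its own pattern, and the catch-all "*" category is what the bot falls back on whenever none has been written.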

Pablo Gervás suggested that I avoid starting from scratch by using an English-speaking bot. Nice, if I can find one! So my question is:

Have you ever found an English-speaking chat bot resembling a child, based on AIML?

References

Abu Shawar, Bayan; Atwell, Eric. A chatbot system as a tool to animate a corpus. ICAME Journal, vol. 29, pp. 5-24. 2005.

Abu Shawar, Bayan; Atwell, Eric. Using corpora in machine-learning chatbot systems. International Journal of Corpus Linguistics, vol. 10, pp. 489-516. 2005.

27.4.09

Sixteenth International Symposium on String Processing and Information Retrieval

25-27 August 2009
Saariselka, Finland

SPIRE 2009 covers research in all aspects of string processing, information retrieval, computational biology, pattern matching, semi-structured data, and related applications. Typical topics of interest include (but are not limited to):

  • String Processing: Dictionary algorithms, Text searching, Pattern matching, Text and sequence compression, Automata based string processing.
  • Information Retrieval: Information retrieval models, Indexing, Ranking and filtering, Interface design, Visualization, Benchmarking.
  • Natural language processing: Text analysis, Text mining, Machine learning, Information extraction, Language models (both structural and semantic), Knowledge representation.
  • Search applications and usage: Cross-lingual information access systems, Multimedia information access, Digital libraries, Collaborative retrieval and Web related applications, Semi-structured data retrieval, Evaluation.
  • Interaction of biology and computation: DNA sequencing and applications in molecular biology, Evolution and phylogenetics, Recognition of genes and regulatory elements, Sequence driven protein structure prediction.

Deadline for papers: May 1, 2009

Counter eCrime Operations Summit III

A very interesting meeting, organized by the Anti-Phishing Working Group (APWG). The press release:


Electronic crime responders, investigators and counter-electronic crime technologists will join law enforcement and public policy officials from across the globe in Barcelona in May for the APWG's Counter-eCrime Operations Summit, uniting thought-leaders worldwide to plan the next stage in the global confrontation against electronic crime.

The third annual APWG operations conference (CeCOS III), to be held on May 12-14 in Barcelona, Spain will engage questions of operational challenges and the development of common resources for the first responders and forensic professionals who protect consumers and enterprises from electronic crime threats every day.

CeCOS III will present: informative case studies by electronic crime responders and security specialists; examinations of technologies developed and used by electronic crime gangs to exploit Internet infrastructure and user's PCs and client devices; discussions about the technologies and techniques of educating and protecting consumers; and presentations about the development of shared resources like common data formats for ecrime reporting, alerting and coordinating mechanisms.

APWG Chairman David Jevans said, "National governments, international treaty organizations, law enforcement agencies and industry associations the world over are looking at coordinating data exchange for electronic crime. CeCOS III will work to build bridges between these constituencies that engage the threats that electronic crime poses against consumers and enterprises everywhere every day."

CeCOS III is an open conference for members of the electronic-crime fighting community, hosted by the APWG and sponsored by LaCaixa, Telefonica, S21sec, GMV, MarkMonitor, EMC's RSA security division, Deloitte España and Ecija. Although sponsorship is principally from industry, the CeCOS programs are considered the most vital events to investigators and managers of electronic crime from across private and public sectors.

In Tokyo last year at CeCOS II, some 250 delegates attended from law enforcement agencies, technology companies, financial services firms, security services firms, government agencies, consumer advocacy groups and research centers around the globe, bringing together some of the most advanced counter-electronic crime thought leaders from East Asia, Europe, South America and North America.

Parties interested in proposing presentations or participating in panel discussion for CeCOS III can send email to: proposals@antiphishing.org. Parties interested in sponsoring some part of the event can contact Deputy-Secretary General Foy Shiver at fshiver@antiphishing.org.

A preliminary working agenda can be found at the APWG Web page for the Summit.


Final call for participation INEX 2009

INEX 2009 is examining focused (sub-document) retrieval using the new Wikipedia 2009 Collection. The collection size is over 50 gigabytes, with 2.5 million articles in semantically marked-up XML format. For the main ad hoc retrieval track the collection will be made available in plain text as well as XML so that non-XML based passage retrieval systems can also be used.

Other tracks are: Book Searching (using a collection of 50,000 full text books in XML), Efficiency, Entity Ranking, Interactive (iTrack), Question Answering (QA@INEX), Link-the-Wiki, and XML-Mining. All INEX Participants are expected to participate in topic creation, and assessment.

  • Ad hoc Track - The main track of INEX 2009 will investigate the effectiveness of XML-IR and Passage Retrieval for three ad hoc retrieval tasks (Focused, Relevant in Context, Best in Context).
  • Book Track - Investigating techniques to support users in reading, searching, and navigating full texts of digitized books.
  • Efficiency Track - Investigating the trade-off between effectiveness and efficiency of XML ranked retrieval approaches on real data and real queries.
  • Entity Ranking Track - Investigating entity retrieval rather than text retrieval: 1) Entity Ranking, 2) Entity List Completion.
  • Interactive Track (iTrack) - Investigating the behavior of users when interacting with XML documents, as well as developing retrieval approaches which are effective in user-based environments.
  • Question Answering (QA@INEX) Track - Investigating how technology for accessing semi-structured data can be used to address real-world focused information needs formulated as natural language questions.
  • Link-the-Wiki Track - Investigating link discovery between Wikipedia documents, both at the file level and at the element level.
  • XML-Mining Track - Investigating structured document mining, especially the classification and clustering of semi-structured documents.

The schedule of the main ad hoc track is as follows:

  • 27/Apr/2009 Release of Topic Creation Guidelines
  • 18/May/2009 Submission deadline for candidate topics
  • 1/Jun/2009 Release of final set of topics
  • 1/Jun/2009 Release of Result Submission Specification
  • 6/Jul/2009 Submission deadline for ad hoc search results
  • 27/Jul/2009 Release of assessment pools
  • 14/Sep/2009 Submission deadline for relevance assessments
  • 2/Nov/2009 Release of ad hoc evaluation results
  • 23/Nov/2009 Submission deadline for papers for pre-proceedings (all tracks)
  • 30/Nov/2009 Release of workshop pre-proceedings
  • 6-10/Dec/2009 INEX Workshop in Brisbane, Australia

Other tracks will follow variants of this schedule. Relevance assessments will be provided by the participating groups using the INEX assessment system. Each participating organization will judge about 3 topics. Please note that the assessment of each topic may take one person 1 to 2 days to complete!

23.4.09

VideoCLEF 2009: Video Analysis and Retrieval Benchmark Evaluation

VideoCLEF 2009 is a track of the CLEF benchmark campaign dedicated to developing and evaluating tasks involving access to video content in a multilingual environment. In 2009, organizers offer four video analysis and retrieval tasks, which will be carried out on Dutch television documentaries. Participants can approach these tasks using their own choice of methods and features. The provided video data will include speech recognition transcripts, shot boundaries, shot-level keyframes and archival metadata.

  1. The Subject Classification task involves automatic tagging of videos with subject labels such as 'Music', 'History', 'Politics', and 'Museums'. This task is related to video genre classification--the subject theme labels the task uses are semantically more fine grained than genres, however. The Subject Classification task ran successfully during the VideoCLEF pilot in 2008. In 2009, this task will run on the TRECVid 2007 and 2008 collections from the Netherlands Institute for Sound and Vision. Participants are encouraged to use features derived from both the speech and visual channels.
  2. The goal of the Affect and Appeal task is to move beyond the thematic content of the video and to analyze video with respect to characteristics that are important for viewers, but not related to the video topic. This task will use content from a collection of short form documentaries taken from the "Beeldenstorm" series. These are described as having "hilarious" and "moving" moments. This task comprises two subtasks, the narrative peak detection subtask, in which participants are asked to find the three funniest, most moving moments in each short documentary video, and the classification subtask, in which participants are asked to classify videos as either "popular" or "not-popular".
  3. Semantic Keyframe Extraction: Keyframes or keyframe sets allow users to preview video content without playing the video. In this task, participants carry out keyframe selection using video and speech/audio features. Selected keyframes should represent the semantic content of the video, e.g., an episode of a documentary. This task will also use the "Beeldenstorm" dataset. This task builds upon the 2008 keyframe extraction task.
  4. Finding Related Resources Across Languages: Given a short documentary in Dutch, participants are asked to identify English-language resources to support viewer comprehension for non-Dutch speaking viewers. This task is new in 2009. Participants will be given a number of selected time points for each short documentary and asked to link each time point to a relevant article from the English language Wikipedia. This task will also use the "Beeldenstorm" dataset.

VideoCLEF 2009 takes place on the following schedule:

  • April 2009 Release of training set
  • May 2009 Release of test set
  • June 2009 Submission of runs
  • July 2009 Evaluation
  • August 2009 Working notes paper
  • September 30 - October 2, 2009 CLEF Workshop

Third Workshop on Analytics for Noisy Unstructured Text Data

23-24 July 2009, Barcelona, Spain
In conjunction with the Tenth International Conference on Document Analysis and Recognition

Noisy unstructured text data is ubiquitous in real-world communications. Text produced by processing signals intended for human use such as printed/handwritten documents, spontaneous speech, and camera-captured images, are prime examples. ICR/OCR error rates on paper documents can range widely from 2-3% for clean inputs to 50% or higher depending on the quality of the page image, the complexity of the layout, aspects of the typography, etc. Individual variability in handwriting makes this a particularly difficult form of input and error rates here are often substantially higher than for machine print text. Telephonic conversations between call center agents and customers often see 30-40% word error rates, even using state-of-the-art ASR techniques. In spite of the tremendous challenges such data presents, it is pervasive in applications of interest to corporations and government organizations.

Recognition errors are not the sole source of noise; natural language and the creative ways that humans use it can create problems for computational techniques. Electronic text from the Internet (emails, message boards, newsgroups, blogs, wikis, chat logs and Web pages), contact centers (customer complaints, emails, call transcriptions, message summaries), and mobile phones (text messages) is often noisy, containing spelling errors, abbreviations, non-standard words, false starts, repetitions, missing punctuation, missing case information, and pause-filling words such as "um" and "uh" in the case of spoken conversations.

The Third Workshop on Analytics for Noisy Unstructured Text Data (AND-09) is devoted to issues arising from the need to contend with noisy inputs, the impact noise can have on downstream applications, and the demands it places on document analysis. Topics of interest include (but are not limited to):

  • Noise induced by document analysis techniques and its impact on downstream applications
  • Formal models for noise, including characterization and classification of noise
  • Treatment of noisy data in specific application areas, including historical texts, multilingual documents, blogs, chat / SMS logs, social network analysis, patent search, and machine translation
  • Data sets, benchmarks, and evaluation techniques for analysis of noisy text
  • All other topics arising from noise and its effects on textual data

Dates

  • Submission of papers: May 4, 2009
  • Notification of Acceptance: May 20, 2009
  • Camera-Ready papers due: June 20, 2009

The Open Health Natural Language Processing Consortium

The goal of the Open Health Natural Language Processing Consortium is to establish an open source consortium to promote past and current development efforts and to encourage participation in advancing future efforts. The purpose of this consortium is to facilitate and encourage new annotator and pipeline development, exchange insights and collaborate on novel biomedical natural language processing systems and develop gold-standard corpora for development and testing. The Consortium promotes the open source UIMA framework and SDK as the basis for biomedical NLP systems. Applications created within UIMA consist of software components (referred to as annotators) and their associated configuration files and external resources. Within the framework, one can also create complete pipelines composed of a sequence of annotators and the data flow between them.

Via the BioNLP list.

MAVIR on Twitter

The MAVIR consortium now has a presence on Twitter! Great!

Data resources for Biological / Chemical Natural Language Processing

The mailing lists I subscribe to announce new data collections from time to time. These resources are extremely valuable for Bio-NLP research; we should disseminate them and show strong appreciation for their builders' work:

A number of tools are available at the Bio-NLP Resources page compiled by Martin Krallinger and his group.

22.4.09

Call for reviewers: NDT, ICADIWT, Scientific Research and Essays

The following conferences and journals need reviewers. I think it is a very good idea to support them (it does not take much time, it matters, and it builds your CV).

  • NDT: Networked Digital Technologies. Currently, a number of institutions across the countries are working to evolve better models to provide collaborative technology services for scholarship by creating shared cyberspace through expert collaboration, but this is a challenge for the institutions for a number of reasons. In the last few years, the landscape of digital technology applications projects for the various disciplines in humanities, social sciences, and sciences appears induced by many initiatives. For the creation of research clusters, the research community has thousands of databases, websites, local computing clusters, and web-based tools around individual themes, interests and projects. In most cases, these tools and resources are and were created to meet the specific needs of a particular community. In many cases, the funding and support for these critical initiatives is fragile and temporary, and directed in piecemeal fashion. There is a need to provide concerted efforts in building federated digital technologies that will enable the formation of a network of digital technologies.
  • ICADIWT: International Conference on the Applications of Digital Information and Web Technologies. A forum for scientists, engineers, and practitioners to present their latest research results, ideas, developments and applications in the areas of Computational Intelligence, Artificial Intelligence, Networking, Neural networks, Network security, Biometrics Technologies and Applications, Pattern Recognition and Biometrics Security, Bioinformatics and IT Applications in the above themes.
  • Scientific Research and Essays (SRE) publishes high-quality articles in English, in all areas of science, medicine, agriculture and engineering. All papers published by SRE are peer reviewed. SRE is a very rapid response journal with an issue published every month.

MAVIR: Satoshi Sekine & Andrew Borthwick

MAVIR SEMINAR
Wednesday, April 22, 2009, at UNED

With Satoshi Sekine and Andrew Borthwick in Madrid for the WWW2009 conference, we have organized two seminars for Wednesday, April 22, starting at 16:00, to be held at the ETSI Industriales of UNED.

TITLE: Recent Advances on Minimally Supervised Knowledge Discovery
SPEAKER: Satoshi Sekine (NYU)
TIME: Wednesday 22/04/2009 at 16:00

TITLE: Challenges on Web People Search: the Spock experience
SPEAKER: Andrew Borthwick (Spock Networks)
TIME: Wednesday 22/04/2009 at 17:00

VENUE: Salón de Grados, ETSI Industriales, UNED
c/ Juan del Rosal, 12 28040 Madrid

MAVIR: Semantically Enhanced Information Retrieval: an Ontology-based Approach

MAVIR SEMINAR
Thursday, April 23, 2009, at UNED

TITLE: Semantically Enhanced Information Retrieval: an Ontology-based Approach
SPEAKER: Míriam Fernández (UAM)

TIME: Thursday 23/04/2009 at 12:00
VENUE: Room 2.24 (second floor), Facultad de Psicología, UNED
c/ Juan del Rosal, 10 Ciudad Universitaria 28040 Madrid

21.4.09

WWW09 Best paper nomination and papers

The papers for the forthcoming World Wide Web Conference 2009 in Madrid have been released. They can be accessed at the ePrint archive set up for that purpose. I have not been able to forget this conference since Brin & Page's hit at WWW98, where they presented PageRank and the data structures behind Google.

Among published papers, there is a selection of those nominated for the best paper award:

My favourite is Visual Diversification of Image Search Results! But I cannot vote :-)

14.4.09

Human Computation Workshop (HCOMP 2009)

KDD-09 Workshop, Paris France
June 28, 2009

The organizers invite you to participate in the first annual Human Computation Workshop (HCOMP 2009), to be held on June 28th in conjunction with the KDD-09 conference in Paris, France.

Human computation is a new research area that studies the process of channeling the vast internet population to perform tasks or provide data towards solving difficult problems that no known efficient computer algorithms can yet solve. The goal of this half-day workshop is to bring together academic and industry researchers in a stimulating discussion of existing human computation applications (e.g. games, CAPTCHAs, Mechanical Turk) and future directions of this new subject area.

The organizers solicit papers related to various aspects of both general human computation techniques and specific applications, e.g. general design principles; implementation; cost-benefit analysis; theoretical approaches; privacy and security concerns; and incorporation of machine learning / artificial intelligence techniques.

An integral part of this workshop will be a demo session where participants can showcase their human computation applications. Detailed information about the workshop and submission procedures can be found at the website.

The workshop proceedings will be included in the ACM Digital Library. Deadline for submission is April 18, 2009 8pm Eastern Time.

Parsed MEDLINE(R) data download service

Jin-Dong Kim, from the Tsujii Laboratory, University of Tokyo, has announced the start of the Parsed MEDLINE(R) data download service.

This service provides access to syntactically parsed MEDLINE abstracts. The abstracts were parsed with a wide-coverage HPSG parser, the Enju parser (version 2.2). The original data is the 2009 baseline release of the MEDLINE database, which includes approximately 18 million records.

For details, refer to the usage web page.

Opinion Mining and Sentiment Analysis on Social Media

Themos Kalafatis has an interesting blog about Practical Applications of Data Mining, Text Mining and Information Extraction entitled Life Analytics, which has a number of examples of using Twitter for Sentiment Analysis, and a recent post about ScoutLabs, a company that offers Social Media Monitoring for corporations.

My opinion is that there is a great opportunity right now for Opinion Mining and Sentiment Analysis on Social Media, as collecting and classifying opinions and mentions of brands, trademarks, services and products, executives and other staff, etc., can be very valuable for:

  • Brand and product analysis and protection: a corporation learns about the opinions (and possible image attacks) concerning its products, services, brand, etc.
  • Competitive analysis: a corporation learns what people think about its competitors' products, services, etc.
  • Technology watch: a corporation learns about new products, services or technologies that can be relevant to its production processes, technologies, commercial offering, etc.
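As a rough idea of what "collecting and classifying mentions" could look like at its very simplest, here is a Python sketch that counts positive, negative and neutral posts per brand with a hand-made word list. The brand names, the lexicon and the posts are all invented for the example, and a word-counting classifier like this ignores negation, irony and context, so it is a starting point rather than a usable monitor.

```python
import re
from collections import defaultdict

# Tiny hand-made polarity lexicon and brand list (both invented for illustration).
POSITIVE = {"love", "great", "fast", "reliable"}
NEGATIVE = {"hate", "slow", "broken", "awful"}
BRANDS = {"acme", "globex"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text):
    """Classify a post by counting lexicon hits (no negation handling)."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def brand_report(posts):
    """Count post polarities for every brand mentioned in each post."""
    report = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for post in posts:
        mentioned = BRANDS & set(tokenize(post))
        for brand in mentioned:
            report[brand][polarity(post)] += 1
    return dict(report)


POSTS = [
    "I love my new Acme phone, great battery",
    "Acme support is awful and slow",
    "Globex delivery was fast",
    "Just saw a Globex ad",
]

for brand, counts in sorted(brand_report(POSTS).items()):
    print(brand, counts)
```

The same report covers the first two uses listed above: run it with your own brand in the list for brand analysis, and with competitors' names for competitive analysis.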

Social Networking sites on Data Mining

In KDNuggets, there is a list of Social Networking groups about Data Mining, Analytics, Web Mining, and Text Mining:


7.4.09

Third International Workshop on Data Mining and Audience Intelligence for Advertising

Third International Workshop on Data Mining and Audience Intelligence for Advertising (ADKDD'09)
June 28, 2009, Paris, France
In conjunction with The 15th International Conference on Knowledge Discovery and Data Mining (SIGKDD'09)

Advertising, especially online advertising, is growing rapidly and brings about large volumes of data along with challenging data mining problems. Following on the success of ADKDD 2007 and 2008, ADKDD 2009 is to be held in Paris, France, in conjunction with KDD 2009, to provide a high-level international forum for the academic community and the industry to present the state of the art of algorithms and applications of advertising.

Topics

Papers on all aspects of data mining and audience intelligence for advertising are solicited. Areas of interest include, but are not limited to:

  • Mining for Ad Relevance and Ranking
    • Ad relevance measurement
    • Ad ranking algorithms
    • Ad text creation and evaluation
  • Audience Intelligence & User Modeling
    • Understanding user intent from search, browsing & social network activities
    • Behavioral targeting - modeling online user behaviors for targeted advertisement
    • User segmentation
    • Demographics & location prediction
    • Personalized advertising
  • Content Understanding
    • Content-targeted advertising
    • Opinion/sentiment mining
    • Mining social networks and blogs
    • Web scale information extraction for online advertisement
    • Text mining techniques such as named entity extraction, query classification, keyword extraction, and other topics
    • Understanding multimedia content for online advertisement
  • Search Engine Marketing, Optimization (SEMs, SEOs)
  • Other Topics in Advertising
    • Advertising on new channels such as mobile devices
    • Tracking effectiveness of advertisement campaigns
    • Consumer privacy and data use policy
    • Privacy preserving data mining approaches
    • Fraud and spam detection & prevention in online advertisements

Dates

  • May 1, 2009: Electronic submission of full papers
  • May 15, 2009: Author notification
  • May 22, 2009: Submission of Camera-ready papers
  • June 28, 2009: Workshop in Paris, France

2.4.09

25th SEPLN Conference

25TH SEPLN CONFERENCE
September 8-10, 2009
Palacio Miramar, Donostia - San Sebastián

The 25th Annual Conference of the Spanish Society for Natural Language Processing (Sociedad Española para el Procesamiento del Lenguaje Natural, SEPLN) will be held in Donostia-San Sebastián on September 8, 9 and 10, 2009, at the Palacio de Miramar.

The vast amount of information available in digital form, and in the many languages we speak, makes it essential to have systems that provide increasingly structured access to the enormous library that is the Internet.

In this same setting, there is renewed interest in solving the problems of access to information and of improving its exploitation in multilingual environments. Many of the formal foundations needed to address these needs properly have been, and continue to be, established within natural language processing and its many branches: information extraction and retrieval, question answering systems, machine translation, automatic analysis of textual content, automatic summarization, text generation, and speech recognition and synthesis.

The main goal of the conference is to offer a forum for presenting the latest research and development in Natural Language Processing (NLP) to both the scientific community and companies in the sector. It also aims to show the real possibilities for application and to learn about new R&D projects in this field.

In addition, as in previous editions, the conference seeks to identify future directions for basic research and for the applications envisaged by professionals, in order to compare them with the real needs of the market. Finally, the conference aims to be a suitable setting for introducing other interested people to this area of knowledge.

Groups, researchers and companies are encouraged to submit papers, project summaries or demonstrations in the field of language technologies in any of the following topic areas:

  • Linguistic, mathematical and psycholinguistic models of language.
  • Corpus linguistics.
  • Development of linguistic resources and tools.
  • Grammars and formalisms for morphological and syntactic analysis.
  • Semantics, pragmatics and discourse.
  • Lexical ambiguity resolution.
  • Machine learning in NLP.
  • Monolingual and multilingual text generation.
  • Machine translation.
  • Speech recognition and synthesis.
  • Monolingual and multilingual information extraction and retrieval.
  • Question answering systems.
  • Automatic analysis of textual content.
  • Automatic summarization.
  • NLP for the generation of educational resources.
  • NLP for languages with limited resources.
  • Industrial applications of NLP.

Important dates

  • April 24, 2009: Deadline for submission of papers, projects and demonstrations.
  • May 25, 2009: Notification of acceptance.
  • June 19, 2009: Deadline for submission of the final version.
  • July 15, 2009: Deadline for reduced-fee registration.
  • September 7, 2009: Workshops
  • September 8, 9 and 10, 2009: 25th SEPLN Conference

First International Conference on Networked Digital Technologies

I am a reviewer for this conference:

First International Conference on Networked Digital Technologies
Ostrava, Czech Republic, July 28-31, 2009

The proposed conference aims to enable researchers to build connections between different digital applications. Currently a number of institutions across countries are working to evolve better models to provide collaborative technology services for scholarship by creating shared cyberspace through expert collaboration, but this is a challenge for the institutions for a number of reasons. In the last few years, the landscape of digital technology application projects for the various disciplines in the humanities, social sciences, and sciences appears to have been driven by many initiatives. For the creation of research clusters, the research community has thousands of databases, websites, local computing clusters, and web-based tools around individual themes, interests and projects. In most cases, these tools and resources were created to meet the specific needs of a particular community. In many cases, the funding and support for these critical initiatives is fragile and temporary, and directed in piecemeal fashion. There is a need for concerted efforts in building federated digital technologies that will enable the formation of a network of digital technologies.

Topics include, but are not limited to:

  • Information and Data Management
  • Data and Network mining
  • Intelligent agent-based systems, cognitive and reactive distributed AI systems
  • Internet Modeling User Interfaces, Visualization and modeling
  • XML-based languages
  • Security and Access Control
  • Trust models for social networks
  • Information Content Security
  • Mobile, Ad Hoc and Sensor Network Management
  • Web Services Architecture, Modeling and Design
  • New architectures for web-based social networks
  • Semantic Web, Ontologies (creation, merging, linking and reconciliation)
  • Web Services Security
  • Quality of Service, Scalability and Performance
  • Self-Organizing Networks and Networked Systems
  • Data management in mobile peer-to-peer networks
  • Data stream processing in mobile/sensor networks
  • Indexing and query processing for moving objects
  • User interfaces and usability issues for mobile applications
  • Mobile social networks
  • Peer-to-peer social networks
  • Sensor networks and social sensing
  • Social search
  • Social networking inspired collaborative computing
  • Information propagation on social networks
  • Resource and knowledge discovery using social networks
  • Measurement studies of actual social networks
  • Simulation models for social networks

Important dates

  • Submission date: April 10, 2009
  • Notification of acceptance: May 1, 2009
  • Camera-ready: June 15, 2009
  • Registration: June 15, 2009
  • Conference dates: July 28-31, 2009

Opinion Mining tutorial by Bing Liu at WWW 2008

While browsing around the topic of opinion mining and sentiment analysis, I discovered that Bing Liu, a leading reference in the field, has posted his tutorial from the World Wide Web Conference 2008 on his home page. While the survey by Bo Pang and Lillian Lee for Foundations and Trends in Information Retrieval is a must, the tutorial is (obviously) easier to read. My opinion :-D is to read the tutorial slides first, then the survey.

A good thing about Bing Liu's material is that he has posted data on the page of his Opinion Mining, Sentiment Analysis, and Opinion Spam Detection project. The data consists of (short) customer reviews of several products such as cameras, routers, cell phones and so on. This is an extract from the Linksys router opinions:

router[+2]##This router does everything that it is supposed to do, so i dont really know how to talk that bad about it.
setup[+2], installation[+2] ##It was a very quick setup and installation, in fact the disc that it comes with pretty much makes sure you cant mess it up.
install[+3]##By no means do you have to be a tech junkie to be able to install it, just be able to put a CD in the computer and it tells you what to do.
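
To make the format of the extract above concrete, here is a minimal Python sketch of how such lines might be parsed. It assumes only what the three sample lines show: a comma-separated list of feature[+n] or feature[-n] annotations, then "##", then the review sentence. Other tags that the full data set may contain are not handled, and the names parse_review_line and sentence_polarity are hypothetical helpers of mine, not part of Bing Liu's distribution.

```python
import re

# One annotation such as "setup[+2]" or "battery life[-1]".
# Assumption: only signed integer scores appear inside the brackets.
ANNOTATION = re.compile(r"\s*([^\[\],]+?)\s*\[([+-]\d+)\]")


def parse_review_line(line):
    """Split 'feature[+n], feature[-m]##sentence' into (annotations, sentence)."""
    left, sep, sentence = line.partition("##")
    if not sep:
        raise ValueError("expected '##' between annotations and sentence")
    annotations = [(name, int(score)) for name, score in ANNOTATION.findall(left)]
    return annotations, sentence.strip()


def sentence_polarity(annotations):
    """Collapse the feature scores into one coarse label (a simplification)."""
    total = sum(score for _, score in annotations)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    sample = "router[+2]##This router does everything that it is supposed to do."
    annotations, sentence = parse_review_line(sample)
    print(annotations, sentence_polarity(annotations))
    print(sentence)
```

Summing the scores into a single label per sentence is my own shortcut; it throws away the per-feature information, but it gives the kind of sentence-level class that a baseline classifier needs.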

Besides, Bing has worked on opinion spam!

As soon as I find the time, I will try to write a tutorial on how to train and test a baseline opinion classifier with WEKA.

1.4.09

Article on privacy in Flickr for Linux+

A new article, this time on privacy in Flickr, for Linux+:

Gómez Hidalgo, J.M. Privacidad en Flickr. Linux+ (ISSN: 1732-7121), Issue 53, April 2009.