30.1.09

Ninth IEEE International Conference on Data Mining

December 6-9, 2009 Miami, U.S.A.

The IEEE International Conference on Data Mining (ICDM) has established itself as the world's premier research conference in data mining. The 2009 edition of ICDM provides a leading forum for presentation of original research results, as well as exchange and dissemination of innovative, practical development experiences.

The conference covers all aspects of data mining, including algorithms, software and systems, and applications. In addition, ICDM draws researchers and application developers from a wide range of data mining related areas such as statistics, machine learning, pattern recognition, databases and data warehousing, data visualization, knowledge-based systems, and high performance computing.

By promoting novel, high quality research findings, and innovative solutions to challenging data mining problems, the conference seeks to continuously advance the state-of-the-art in data mining.

Besides the technical program, the conference will feature workshops, tutorials, panels, and the ICDM data mining contest.

Topics of Interest

  • Data mining foundations
    • Novel data mining algorithms in traditional areas (such as classification, regression, clustering, probabilistic modeling, pattern discovery, and association analysis)
    • Models and algorithms for new, structured data types, such as those arising in chemistry, biology, the environment, and other scientific domains
    • Developing a unifying theory of data mining
    • Mining sequences and sequential data
    • Mining spatial and temporal datasets
    • Mining textual and unstructured datasets
    • Distributed data mining
    • High performance implementations of data mining algorithms
    • Privacy and anonymity-preserving data analysis
  • Mining in emerging domains
    • Stream data mining
    • Mining moving object data, RFID data, and data from sensor networks
    • Ubiquitous knowledge discovery
    • Mining multi-agent data
    • Mining and link analysis in networked settings: web, social and computer networks, and online communities
    • Mining the semantic web
    • Data mining in electronic commerce, such as recommendation, sponsored web search, advertising, and marketing tasks
  • Methodological aspects and the KDD process
    • Data pre-processing, data reduction, feature selection, and feature transformation
    • Quality assessment, interestingness analysis, and post-processing
    • Statistical foundations for robust and scalable data mining
    • Handling imbalanced data
    • Automating the mining process and other process related issues
    • Dealing with cost sensitive data and loss models
    • Human-machine interaction and visual data mining
    • Integration of data warehousing, OLAP and data mining
    • Data mining query languages
    • Security and data integrity
  • Integrated KDD applications, systems, and experiences
    • Bioinformatics, computational chemistry, ecoinformatics
    • Computational finance, online trading, and analysis of markets
    • Intrusion detection, fraud prevention, and surveillance
    • Healthcare, epidemic modeling, and clinical research
    • Customer relationship management
    • Telecommunications, network and systems management
    • Sustainable mobility and intelligent transportation systems

Important Dates

  • April 13, 2009 - Deadline for workshop proposals
  • June 26, 2009 - Deadline for paper submission, tutorial submission, and panel proposals
  • September 4, 2009 - Notification to authors
  • September 28, 2009 - Deadline for camera-ready copies
  • December 6-9, 2009 - Conference

29.1.09

Workshop on Web Search Result Summarization and Presentation

Co-located with the 18th World Wide Web Conference
April 20th, 2009, Madrid, Spain

Goals

Providing a satisfying web search experience can be a challenging task for a search engine. Numerous disciplines -- search, summarization, user interface design, usability, metrics, machine learning and modeling -- all have to come together in order to deliver the final experience. Effective summarization is part of the challenge. In this workshop we will focus on various aspects of web summarization, presentation, and user satisfaction metrics and models. The kinds of questions and issues we would like to address are:

  • What makes an effective web search result summary? What is a summary for?
  • Technological challenges and opportunities for innovation in how summaries are generated
  • What should be optimized during summarization?
  • Defining and measuring effectiveness of summarization
  • What real-time (e.g. click logs) and offline (e.g. human rater judgments) measurements can be used as surrogates for user satisfaction models?
  • How does one model the differences in human scanning and reading behaviors?
  • How can eye tracking technology be utilized to help understand the balance between scanning and reading behaviors?
  • What useful qualitative insights can we gauge from usability and field studies with respect to how users utilize summaries in their search for information?
  • What are good scalable metrics for summarization?
  • Can one learn layout and presentation using appropriate machine learning techniques and targets?
  • How do we optimize SERP UI to support user workflow?
  • What's good and not good about presenting ranked results linearly?
  • Do new presentation strategies overcome the limitations?
  • Future trends and directions in search results presentation

Topics

Main topics of interest include but are not limited to:

  • Web summarization and related natural language processing
  • Information presentation, exploration, and design
  • Usability and eye tracking studies of web search results presentation
  • Machine learning for summarization and presentation
  • User models: learning from clicks and human rater judgments
  • Metrics for individual pieces and the final experience

Dates

Submission deadline: 20th February 2009

28.1.09

JWKTL – Java Wiktionary Library

JWKTL - Java Wiktionary Library, Version 0.1

Wiktionary is a multilingual, web-based, freely available dictionary, thesaurus and phrase book, designed as the lexical companion to Wikipedia. Lately, it has been recognized as a promising lexical semantic resource for natural language processing applications.

JWKTL is a Java-based API that enables efficient programmatic access to the information contained in the English and German language editions of Wiktionary:

  • glosses
  • part of speech
  • etymology
  • examples
  • quotations
  • references
  • word language
  • translations
  • internal and external links
  • categories
  • related words
  • antonyms, holonyms, hypernyms, hyponyms, meronyms, synonyms, troponyms, "see also" terms, characteristic word combinations, coordinate terms, derived terms, descendants, etymologically related terms

JWKTL is freely available for non-profit and non-commercial use from the following website:
http://www.ukp.tu-darmstadt.de/software/JWKTL. It was developed by the Ubiquitous Knowledge Processing Lab at the Darmstadt University of Technology, Germany.

Reference publication:

Torsten Zesch, Christof Müller, Iryna Gurevych: Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary. In Proceedings of the Conference on Language Resources and Evaluation (LREC). European Language Resources Association, 2008.

Seen in the SIG-IR List.

Repeatability Guideline at KDD 2009

It was a (very positive) surprise for me to read the repeatability guideline in the CFP for the 2009 edition of the Knowledge Discovery in Databases conference (the most respected conference in Machine Learning, IMHO):

Repeatability is a cornerstone of any scientific endeavor. To ensure the long term viability of the research output of the SIGKDD community, we require open-source/public distribution of the code and the datasets. In those cases where this is not possible due to proprietary considerations, every effort should be made to provide the binary executable. If proprietary datasets are used, every effort should also be made to apply the approach to similar publicly available datasets. Furthermore, the description of experimental results in submitted papers should be accompanied by all relevant implementation details and exact parameter specifications.

It is a great guideline that will be considered in the evaluation of the submitted papers. I suppose that most of us have made the same mistake sometimes (or always): failing to provide enough details to make our experiments reproducible. We have also seen this in hundreds of others' papers. If this is more than a one-off event, if it becomes a trend, I believe that the field will greatly improve its methods. And not only this field, but others!
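
One lightweight way to follow the "exact parameter specifications" part of the guideline is to store the parameters and the random seed next to the results. The following is a minimal sketch under my own assumptions: the experiment is a toy, and the parameter names and the `run_experiment` function are hypothetical placeholders, not anything required by the KDD CFP.

```python
import json
import random


def run_experiment(params: dict) -> dict:
    """Toy stand-in for an experiment: results depend only on params."""
    rng = random.Random(params["seed"])  # local RNG, so the run is repeatable
    scores = [rng.random() for _ in range(params["n_trials"])]
    return {"mean_score": sum(scores) / len(scores)}


# Hypothetical parameter set; "learner" is only recorded, not used by the toy.
params = {"seed": 42, "n_trials": 100, "learner": "naive-bayes"}
record = {"params": params, "results": run_experiment(params)}

# Keeping this JSON with the paper lets others rerun with identical settings.
report = json.dumps(record, indent=2, sort_keys=True)
print(report)
```

Running the script twice with the same `params` gives the same report, which is the property a reader needs in order to repeat the experiment.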

Information Systems Frontiers - Special Issue on Terrorism Informatics

Since September 11th, the multidisciplinary field of terrorism informatics has experienced tremendous growth, and research communities as well as local, state, and national governments are facing increasingly more complex and challenging issues. The challenges facing the intelligence and national security communities worldwide include accurately and efficiently monitoring, analyzing, predicting and preventing terrorist activities. The development and use of advanced information technologies, including methodologies, models and algorithms, infrastructure, systems, and tools for national/international and homeland security related applications have provided promising new directions for study.

Terrorism informatics has been defined as the application of advanced methodologies, information fusion and analysis techniques to acquire, integrate, process, analyze, and manage the diversity of terrorism-related information for international and homeland security-related applications. It is a highly interdisciplinary and comprehensive field. The wide variety of methods used in terrorism informatics is derived from Computer Science, Informatics, Statistics, Mathematics, Linguistics, Social Sciences, and Public Policy, and these methods are involved in the collection of huge amounts of many types of multi-lingual information from varied and multiple sources. Information fusion and information technology analysis techniques, which include data mining, data integration, language translation technologies, and image and video processing, play central roles in the prevention, detection, and remediation of terrorism.

The purpose of this special issue is to bring together international researchers, engineers, policy makers, and practitioners working on terrorism informatics as well as related fields such as the organizational and social sciences. This special issue will outline the major challenges in supporting terrorism prevention, detection and response worldwide, as well as future perspectives on counterterrorism research in the information age.

Topics

The special issue will cover the scope of research relevant to terrorism informatics, including but not limited to the following topics:

  • Terrorism knowledge portals and databases
  • Terrorist incident chronology databases
  • Terrorism social network analysis, visualization and simulation
  • Terrorism analytical tools and methodologies
  • Terrorism data mining and text mining
  • Terrorism root cause analysis
  • Bioterrorism
  • Cyber terrorism
  • Forecasting terrorism
  • Countering terrorism
  • Impact of terrorism on society
  • National and international security and webmetrics
  • Web-based intelligence terrorism monitoring and event detection

Important Dates

  • Submission deadline: July 31, 2009
  • Notification of first round reviews: October 31, 2009
  • Revised manuscripts due: December 31, 2009
  • Final acceptance notification: March 31, 2010
  • Submission of final paper: May 31, 2010
  • Publication date: Fall 2010

My first (received) spam in Twitter

Great, today I received my very first spam message through Twitter:

Hi, jmgomez (jmgomez).

Discount Blackberry (urgenttttsss) is now following your updates on Twitter.

Check out Discount Blackberry's profile here:

http://twitter.com/urgenttttsss

You may follow Discount Blackberry as well by clicking on the "follow" button.

Best,
Twitter

A pity :_(

27.1.09

Technology addiction through "Intermittent Variable Reinforcement"

Jesús Encinar, CEO of Idealista.com, offers a very interesting reflection in his article "El email, como las tragaperras, funciona como un Refuerzo Variable Intermitente" ("Email, like slot machines, works as an Intermittent Variable Reinforcement").

This reflection is based on Skinner's behaviorist theories, according to which a reinforcement delivered unpredictably is more addictive than one delivered constantly and consistently:

Curiously, this effect of "variable reward -> constant hook" is the same effect that makes us refresh the page on Twitter or Facebook constantly, or that makes us check our email dozens of times a day.

A few personal questions: How many times a day do you check your email? Do you feel dissatisfied if no new mail arrives? And what about Twitter, Facebook, LinkedIn, etc.?


New version of the MPQA Opinion Corpus

A new version (2.0) of the MPQA Opinion Corpus, which contains news articles and other text documents manually annotated for opinions and other private states (i.e., beliefs, emotions, sentiments, speculations, etc.), is available for download (http://www.cs.pitt.edu/mpqa/databaserelease/).

The main changes in this version of the MPQA Opinion Corpus are:

  • The extension of the MPQA annotation scheme to include two new types of annotations: attitude annotations (e.g., positive and negative sentiment & positive and negative arguing) and target annotations (i.e., the objects of the opinions).
  • The addition of 157 new annotated documents (in the full MPQA annotation scheme, including attitudes and targets) growing the size of the corpus to 692 documents.
  • The inclusion of annotations for answers to a set of fact and opinion questions for the OpQA subset of the corpus.
  • The refinement of some annotations.

Seen in SentimentAI - Sentiment & Affect in Text Yahoo! Group.

Opensource Text Analytics article by Seth Grimes

Seth Grimes has written an interesting (and brief) article on Opensource Text Analytics (AKA Text Mining), which quickly reviews four tools: Gate, NLTK, R and RapidMiner. Apart from the pointers to the tools themselves, and his own experience with them, the article contains several links to other articles that are worth reading.

Layoffs at technology companies

Seen on Jose Carlos Cortizo's blog, a list of technology companies that are laying off staff at full speed:

At the beginning of December I was talking about how the crisis was affecting technology, but things have kept moving. Layoffs, layoffs and more layoffs could be the overall summary of the start of 2009 as far as technology companies are concerned (and non-technology ones too :P). The fact that a 70% drop in profits at companies like Google is received with real ovations by investors shows the seriousness of the economic moment and how delicate the situation is. Below is a summary of the "most talked-about" layoffs in technology, ordered by number of people "shown the door": HP, Sony, IBM, AT&T, Sun, Ericsson, Pioneer, Microsoft, Intel, BT, Motorola, AMD, Lenovo, EMC, Yahoo!, Nokia, Telefónica, Adobe, Google, Digg, ...

It is definitely not a good time for anyone :-(

26.1.09

Internet population reaches 1 billion

According to a recent press release by comScore, the Internet population has reached the magic 1 billion people. Perhaps unexpectedly (;-D), the country that contributes most to this number is China, while the US is second. My headlines are:

  • Spain is the 13th country according to these statistics, adding 17,893,000 net citizens, 1.8% of the total number. If the Spanish population is about 45 million, this is 39.76% of it. Good!
  • Google sites are the most visited, by 77% of the net population, and other popular sites are those of Microsoft (64%) and Yahoo! (56%).
  • Wikimedia sites reach a very respectable 272,998,000 visitors, which is 27% of the total population. Great!
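
The percentages above can be checked with a few lines of arithmetic. This is only a sketch: it assumes a total of exactly 1,000,000,000 Internet users (the rounded figure quoted above, not the exact comScore total) and a Spanish population of 45 million; the constant and function names are my own.

```python
# Rough check of the shares quoted above.
# Assumption: the total is the rounded "1 billion", not the exact comScore count.
TOTAL_INTERNET_USERS = 1_000_000_000
SPAIN_INTERNET_USERS = 17_893_000
SPAIN_POPULATION = 45_000_000  # approximate figure used in the post
WIKIMEDIA_VISITORS = 272_998_000


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)


spain_share_of_net = share(SPAIN_INTERNET_USERS, TOTAL_INTERNET_USERS)
spain_penetration = share(SPAIN_INTERNET_USERS, SPAIN_POPULATION)
wikimedia_reach = share(WIKIMEDIA_VISITORS, TOTAL_INTERNET_USERS)

print(spain_share_of_net, spain_penetration, wikimedia_reach)
```

Under these assumptions the three values come out close to the 1.8%, 39.76% and 27% mentioned in the list.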

The difference between Information Access and Information Retrieval

José David López, a software engineer at one of the biggest Spanish consultancy/software firms, recently asked me about the difference between Information Retrieval and Information Access. The difference that I have often stated in my lectures is based on the opinions of the great researcher Marti Hearst. However, scanning her writings can lead to an unsatisfactory answer:

In her paper "Untangling Text Data Mining", she states:

It is important to differentiate between text data mining and information access (or information retrieval, as it is more widely known). The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack - the problem isn't so much that the desired information is not known, but rather that the desired information coexists with many other valid pieces of information.

According to this, Information Access and Information Retrieval are synonyms. However, in her lectures on "Current Topics in Information Access", she defines:

Information Access is the process by which users use information technology to seek, organize and understand information.

Information Retrieval is to retrieve documents that users are likely to find relevant to their queries.

Consequently, Information Access subsumes Information Retrieval as a subtask. Other subtasks of Information Access are Question Answering, Text Summarization, Text Clustering, etc. Let us see several examples of applications that involve organization and understanding of information, and not just search:

  • For instance, when a user builds an automatic filter in his/her email client (e.g. Thunderbird) in order to organize the messages he/she receives, he/she is performing an Information Access operation: organization (in particular, Text Categorization or Text Filtering).
  • Also, when a user takes a long document in OpenOffice and selects the option to generate a summary or an abstract, he/she is performing an Information Access operation: understanding (in particular, Text Summarization).
  • Adversarial Text Classification tasks like spam filtering or Web content filtering (e.g. pornography blocking on the Web) can be seen as organization tasks (in particular, Text Categorization or Negative Text Filtering).
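
The email filter in the first example can be pictured as a tiny rule-based text categorizer. The sketch below is purely illustrative: the rules, the folder names and the `categorize` function are hypothetical, and this is not how Thunderbird (or any real spam filter) is implemented.

```python
import re
from typing import Dict, List

# Hypothetical rules: folder name -> keywords that route a message there.
RULES: Dict[str, List[str]] = {
    "conferences": ["cfp", "deadline", "workshop"],
    "spam": ["discount", "viagra", "lottery"],
}
DEFAULT_FOLDER = "inbox"


def categorize(message: str) -> str:
    """Return the first folder whose keywords appear in the message."""
    tokens = set(re.findall(r"[a-z0-9]+", message.lower()))
    for folder, keywords in RULES.items():
        if tokens.intersection(keywords):
            return folder
    return DEFAULT_FOLDER


print(categorize("CFP: Workshop on Web Search Result Summarization"))
print(categorize("Discount Blackberry is now following you"))
print(categorize("Lunch tomorrow?"))
```

Real filters usually learn such rules from labeled messages instead of relying on a fixed keyword list, but the operation is the same: assigning each text to a category.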

Dave Lewis, perhaps the master of Text Categorization, presented in his thesis "Representation and Learning in Information Retrieval" a description of a large number of operations that can be seen as Information Access operations, including:

  • Text Categorization
  • Document Clustering
  • Text Routing
  • Term Categorization
  • Term Clustering
  • Latent Semantic Indexing

In fact, I review and organize a number of text classification tasks in my tutorial on Text Mining:

Gómez Hidalgo, J.M. Tutorial on Text Mining and Internet Content Filtering. 13th European Conference on Machine Learning (ECML'02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki, Finland, 19-23 August 2002.

Moreover, given that users learn during the search process, Marti Hearst states in her chapter about "User Interfaces and Evaluation" in the book Modern Information Retrieval by Ricardo Baeza-Yates et al.:

Bates proposes the `berry-picking' model of information seeking, which has two main points. The first is that, as a result of reading and learning from the information encountered throughout the search process, the users' information needs, and consequently their queries, continually shift. Information encountered at one point in a search may lead in a new, unanticipated direction. The original goal may become partly fulfilled, thus lowering the priority of one goal in favor of another. This is posed in contrast to the assumption of 'standard' information retrieval that the user's information need remains the same throughout the search process. The second point is that users' information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way. This is in contrast to the assumption that the main goal of the search process is to hone down the set of retrieved documents into a perfect match of the original information need.

In other words, the standard query-then-retrieve cycle is just one part of a more general process, Information Access, which drops historical assumptions like those stated above.

I hope that this discussion helps to clarify the difference between the two concepts.

25.1.09

Second International Conference on the Theory of Information Retrieval

September 10-12, 2009
Cambridge, UK

The International Conference on the Theory of Information Retrieval (ICTIR) aims to provide a forum for discussion and interaction among those with theoretical and applicative research interests in mathematical/formal aspects of Information Retrieval (IR), including, e.g., foundational issues, description or integration of models, retrieval applications, mathematical/formal techniques, existing and/or new theories and theoretical aspects.

ICTIR has grown out of the Mathematical/Formal Methods workshops held annually at SIGIR between 2000 and 2005. These workshops demonstrated that the mathematical/formal results achieved in IR could be organized into a coherent theoretical framework, bringing new knowledge to IR, and that mathematical/formal research can stand as a specialized research area of IR. The first ICTIR conference was held in 2007 in Budapest, Hungary, with the overall aim to explore the multi-valued meaning of IR, combining areas like mathematics and linguistics.

The second ICTIR conference in 2009 aims to continue in the same spirit, promoting research in the wider contexts of IR. Reflecting this, in addition to the established fields and approaches in IR, research papers on new approaches inspired from sociology, mathematics, physics, linguistics, biology, philosophy, and other areas are sought. Papers that demonstrate a high level of research adventure or which break out of the traditional IR paradigms are particularly welcome. Experimental and/or practical results from new paradigms are also of interest.

Topics

They seek high-quality and original research papers and posters that have not been previously published and are not under review for another conference or journal. Submissions will be reviewed by experts on the basis of the originality of the work, the validity of the chosen methodology and their results, quality of writing and the overall contribution to the field of IR. Topics of interest include, but are not limited to, the theories and formal models appropriate to the following areas:

  • Foundations
    • Mathematical foundations of IR
    • Probabilistic, logical, language, and social IR models, and quantum mechanics based models
    • Information, meaning, entropy
    • Properties and structures in IR
    • IR architectures: peer-to-peer, distributed IR, grid
    • Content representation and indexing
    • Algorithms, complexity
    • New models, frameworks and approaches to IR
  • Techniques
    • Evaluation methodologies, test collections, metrics
    • User modelling and user interactions
    • Context issues
    • Browsing, semantic search, meta-search
    • Bibliometrics for IR and citation analysis
    • Social networks and media, on-line community analysis, social tagging
    • Classification, categorization, and clustering
    • Machine learning
    • Visualisation
  • Applications
    • Web IR
    • Enterprise search
    • Expert search
    • Interactive IR
    • Text mining
    • Digital libraries
    • XML retrieval
    • Multimedia retrieval
    • Domain-specific IR (blog, legal, biomedical, book, etc.)
    • Recommender systems
    • Filtering
    • Semantic Web
    • Mobile IR
  • Wider context
    • Philosophy of IR
    • Sociology of IR
    • Pedagogy of IR
    • Linguistics of IR

Important dates and submissions

Authors are invited to submit research papers of up to 12 pages representing original and previously unpublished work, on or before the 17th of April 2009. Posters of up to 4 pages can be submitted until the 1st of May 2009.

22.1.09

Advances in Computers 76: Web Content Filtering

A survey on Web Content Filtering, written by the team that worked on the Spanish Filter at the project POESIA (Francisco Carrero, Enrique Puertas, Manuel de Buenaga and me), is included in the upcoming volume of the Series Advances In Computers. The full reference should be:

Gómez Hidalgo, J.M., Puertas Sánz, E., Carrero García, F., Buenaga Rodríguez, M. de. Web Content Filtering. In Marvin Zelkowitz (Ed.) Advances In Computers, 76, ISBN-13: 978-0-12-374811-9, Elsevier Academic Press, in press (expected Jun. 2009).

MAVIR Seminar: Automated Learning by Reading (Ed Hovy)

MAVIR SEMINAR
February 17, 2009 at UNED

TITLE: Automated Learning by Reading
SPEAKER: Ed Hovy (Information Sciences Institute, USC)
TIME: Tuesday, February 17 at 10:00

VENUE:
Room 1.26 (first floor)
Facultad de Psicología, UNED
c/ Juan del Rosal, 10
Ciudad Universitaria
28040 Madrid

Named Entities Workshop

Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing

Named Entities (NEs) play a critical role in Natural Language Processing (NLP) and Information Retrieval (IR) tasks, such as search, machine translation, document clustering, summarization, information extraction, etc. While identifying and analyzing NEs in a given natural language is a challenging research problem by itself, the phenomenal growth in the Internet user population, especially among the non-English speaking parts of the world, has extended this problem to the cross-language arena, making the handling of NEs in multiple languages critically important. The purpose of this workshop is to bring together researchers interested in various aspects of NEs in natural language text. In addition, the NEWS workshop will feature a shared task on Machine Transliteration of NEs.

This workshop invites original research contributions on all aspects of Named Entities (NEs), including identification, analysis, extraction, mining, transformation and applications to NLP and IR systems. The topics of interest include, but are not limited to the following:

  • NE Analysis
    • Distributional characteristics of NEs in mono- & multi-lingual corpora
    • Orthographic/phonetic characteristics of NEs
    • NE origin/genre recognition
    • Social network analysis and entity resolution
  • NE extraction
    • Language-independent monolingual NE extraction
    • Cross-language NE extraction
      • General techniques
      • Specific datasets (such as Wikipedia, news, etc.)
    • Unsupervised and semi-supervised methods for NE extraction
    • Complex NEs, domain-specific term extraction
    • NE set expansion
    • Creation of annotated data
  • Machine Transliteration
    • Computational phonology, including modeling of phonological rules, structure, behavior, etc.
    • Transliteration modeling
      • Phonetic, phonetic-semantic transliteration, grapheme > phoneme and phoneme > grapheme conversions
      • Statistical and machine learning based approaches, transliteration unit alignment
      • Forward and backward transliterations
      • Learning transliteration from comparable corpora, transliteration lexicon construction
      • Romanization of Asian languages
    • Transliteration evaluation metrics
  • Applications
    • Monolingual and Cross-Language IR
    • Machine Translation
    • Information Extraction and Management
    • Question Answering
    • Computational Journalism

Dates

  • Research Paper Submission Deadline: 1-May-2009
  • Acceptance Notification: 1-Jun-2009
  • Camera-Ready Copy Deadline: 7-Jun-2009
  • Workshop Date: 7-Aug-2009

20.1.09

ImgSeek for detecting pornography, reposted

Some time ago I described how to use ImgSeek as an image classifier for pornography detection, along with the results of some small experiments on it. As a result, I wrote an article for the Linux+ magazine and temporarily withdrew the posts because of my commitment to the magazine. Since quite some time has passed, and in accordance with that commitment, the posts have been restored.

19.1.09

Juan Salom: cybercrime investigation

The Commander in Chief of the Guardia Civil's Grupo de Delitos Telemáticos (Telematic Crimes Group), Juan Salom, gave an interesting talk on the investigation of technological crimes at the Tercer Día Internacional de la Seguridad de la Información (DISI 2008) at the UPM, available on YouTube:

Although this group achieves excellent results given the material and human resources at its disposal, the outlook is bleak. The most worrying thing, in my opinion, is the legal uncertainty of the whole investigation process, since there are no certified tools, nor are there well-defined rules on how evidence is obtained (e.g. that it must be obtained in the presence of a court clerk or a notary, etc.), and trust is placed in third parties (the service providers) whose internal processes are unknown and possibly unreliable.

In short, any day now an appeal against a technology-related conviction will succeed, and some pedophile will set a precedent to everyone's detriment.

Recent Yahoo! Research works in data mining for online advertising

As a large part of Yahoo!'s revenue comes from online advertising, it is not surprising that Yahoo! Research devotes significant effort to optimizing its models in order to get more and better clicks on the ads it serves. They have an ongoing project named "Squeeze Every Drop of Meaning from Data", with several goals including: What are the most appropriate advertisements to maximize click-through rates on a particular web page?

Also, they have recently changed their user data retention policy in order to keep the minimum data they need to effectively improve their models. As stated in the news story:

"(They) would retain individual user data for only three months, down from 13 months. Google keeps individualized search data of its users for nine months and Microsoft for 18 months."

Here is a list of recent papers (2008-) from Yahoo! Research dealing with online advertising:

Sixth International Workshop on Text-Based Information Retrieval

Linz, Austria, August 31 - September 4

Intelligent algorithms for mining and retrieval are the key technology to cope with the information need challenges in our media-centered society. Methods for text-based information retrieval receive special attention, which results from the important role of written text, from the high availability of the Internet, and from the enormous importance of Web communities.

Advanced information retrieval and extraction uses methods from different areas: machine learning, computer linguistics and psychology, user interaction and modeling, information visualization, Web engineering, artificial intelligence, or distributed systems. The development of intelligent retrieval tools requires the understanding and combination of the achievements in these areas, and in this sense the workshop provides a common platform for presenting and discussing new solutions.

The following list organizes classic and ongoing topics from the field of text-based IR for which contributions are welcome:

  • Theory. Retrieval models, language models, similarity measures, formal analysis
  • Mining and Classification. Category formation, clustering, entity resolution, document classification, learning methods for ranking
  • Web. Community mining, social network analysis, structured retrieval from XML documents
  • NLP. Text summarization, keyword extraction, topic identification
  • User Interface. Paradigms and algorithms for information visualization, personalization, privacy issues
  • User Context. Context models for IR, context analysis from user behaviour and from social networks
  • Multilinguality. Cross-language retrieval, multilingual retrieval, machine translation for IR
  • Evaluation. Corpus construction, experiment design, conception of user studies
  • Semantic Web. Meta data analysis and tagging, knowledge extraction, inference, and maintenance
  • Software Engineering. Frameworks and architectures for retrieval technology, distributed IR

The workshop is being held for the sixth time. In the past, it has been characterized by a stimulating atmosphere and has attracted high-quality contributions from all over the world. In particular, the organizers encourage participants to present research prototypes and demonstration tools of their research ideas.

Dates

  • April 1, 2009 Deadline for paper submission
  • April 20, 2009 Notification to authors
  • May 15, 2009 Camera-ready copy due
  • August 31, 2009 Workshop opens

Contributions will be peer-reviewed by at least two experts from the related field. Accepted papers will be published as IEEE proceedings by IEEE CS Press.

16.1.09

Cisco 2008 Annual Security Report

The Cisco Annual Security Report provides a comprehensive overview of the combined security intelligence of the entire Cisco organization. Encompassing threat and trends information collected between January and October 2008, this document provides a snapshot of the state of security for that period. The report also provides recommendations from Cisco security experts and predictions of how identified trends will continue to unfold in 2009.

This year's report reveals that online and data security threats continue to increase in number and sophistication. They propagate faster and are more difficult to detect. Key report findings include:

  • Spam accounts for nearly 200 billion messages each day, which is approximately 90 percent of email sent worldwide.
  • The overall number of disclosed vulnerabilities grew by 11.5 percent over 2007.
  • Vulnerabilities in virtualization products tripled to 103 in 2008 from 35 in 2007, as more organizations embraced virtualization technologies to increase cost-efficiency and productivity.
  • Over the course of 2008, Cisco saw a 90 percent growth rate in threats originating from legitimate domains, nearly double what the company saw in 2007.
  • Spam due to email reputation hijacking from the top three webmail providers accounted for just under 1 percent of all spam worldwide, but constituted 7.6 percent of all these providers' mail.

Fortunately, responses to these threats and trends are improving. Advances in attack response stem from the increased collaboration between vendors and security researchers to review, identify, and combat vulnerabilities.

You must register to gain access to the report.

15.1.09

Handbook of Research on Web Log Analysis

Handbook of Research on Web Log Analysis
ISBN: 978-1-59904-974-8; 628 pp; September 2008
Published under the imprint Information Science Reference (formerly Idea Group Reference)

Edited by: Bernard J. Jansen, The Pennsylvania State University, USA; Amanda Spink, Queensland University of Technology, Australia; and Isak Taksa, Baruch College, City University of New York.

DESCRIPTION

Whether searching, shopping, or socializing, Web users leave behind a great deal of data revealing their information needs, mindset, and approaches used, creating vast opportunities for Web service providers as well as a host of security and privacy concerns for consumers. The Handbook of Research on Web Log Analysis reflects on the multifaceted themes of Web use and presents various approaches to log analysis. This expansive collection reviews the history of Web log analysis and examines new trends including the issues of privacy, social interaction and community building. Over 20 research contributions from 44 international experts comprehensively cover the latest user-behavior analytic and log analysis methodologies, and consider new research directions and novel applications. An essential holding for library reference collections, this Handbook of Research will benefit academics, researchers, and students in a variety of fields, as well as technology professionals interested in the opportunities and challenges presented by the massive collection of Web usage data.

TOPICS

  • Adaptive dialogue-driven search
  • Connector Web site
  • Dynamic Web pages customization
  • Interaction design
  • Machine learning approach
  • Query log analysis
  • Search log analysis
  • Search query classification
  • Search query logs
  • Transaction log analysis
  • Very large-scale conversation
  • Web analytics
  • Web information seeking behavior
  • Web log analysis
  • Web log privacy
  • Web logging data
  • Web sites
  • Web usage studies
  • Web-traffic measurement

For more information about Handbook of Research on Web Log Analysis, you can view the title information sheet.

CFP: Hypertext 2009

Hypertext 2009, The Twentieth ACM Conference on Hypertext and Hypermedia
Torino, Italy, June 29 - July 1, 2009

The ACM Hypertext Conference is the main venue for high quality peer-reviewed research on "linking." The Web, the Semantic Web, the Web 2.0, and Social Networks are all manifestations of the success of the link. The Hypertext Conference provides the forum for all research concerning links: their semantics, their presentation, the applications, as well as the knowledge that can be derived from their analysis and their effects on society.

Dates

  • Technical tracks paper submission deadline: February 2nd, 2009
  • Notification to authors: March 16th, 2009
  • Camera-ready (final papers to ACM): April 6th, 2009

Andrew Clegg Thesis on Biomedical Text Mining

Andrew Clegg has made his thesis, entitled "Computational-Linguistic Approaches to Biological Text Mining" and supervised by Dr. Adrian Shepherd, available at his web site.

Apart from his own contributions to parsing biomedical text, generating dependency graphs and mining these graphs for Information Extraction, the introductory chapter presents an excellent survey of the state of the art in Natural Language Processing of biological texts.

The thesis is available at: http://biotext.org.uk/static/thesis.pdf.

14.1.09

Seventh Workshop on Intelligent Techniques for Web Personalization & Recommender Systems

Seventh Workshop on Intelligent Techniques for Web Personalization & Recommender Systems
In conjunction with IJCAI 2009
July 11-17, 2009 - Pasadena, California, USA
Submission Deadline: March 6, 2009

Web Personalization can be defined as any set of actions that can tailor the Web experience to a particular user or set of users. The experience can be something as casual as browsing a Web site or as (economically) significant as trading stocks or purchasing a car. The actions can range from simply making the presentation more pleasing to anticipating the needs of a user and providing customized and relevant information. To achieve effective personalization, organizations must rely on all available data, including the usage and click-stream data (reflecting user behaviour), the site content, the site structure, domain knowledge, as well as user demographics and profiles. Efficient and intelligent techniques are needed to mine this data for actionable knowledge, and to effectively use the discovered knowledge to enhance the users' Web experience. These techniques must address important challenges emanating from the size of the data, the fact that they are heterogeneous and very personal in nature, as well as the dynamic nature of user interactions with the Web. These challenges include the scalability of the personalization solutions, data integration, and successful integration of techniques from machine learning, information retrieval and filtering, databases, agent architectures, knowledge representation, data mining, text mining, statistics, information security and privacy, user modelling and human-computer interaction.

Recommender systems represent one special and prominent class of such personalized Web applications, which particularly focus on the user-dependent filtering and selection of relevant information and - in an e-Commerce context - aim to support online users in the decision-making and buying process. Recommender Systems have been a subject of extensive research in AI over the last decade, but with today's increasing number of e-commerce environments on the Web, the demand for new approaches to intelligent product recommendation is higher than ever. There are more online users, more online channels, more vendors, more products and, most importantly, increasingly complex products and services. These recent developments in the area of recommender systems generated new demands, in particular with respect to interactivity, adaptivity, and user preference elicitation. These challenges, however, are also in the focus of general Web Personalization research.

In the face of this increasing overlap of the two research areas, the aim of this workshop is to bring together researchers and practitioners of both fields, to foster an exchange of information and ideas, and to facilitate a discussion of current and emerging topics related to "Web Intelligence". Organizers invite original contributions in a variety of areas related to Web personalization and Recommender Systems, including Data Modeling and Integration; Systems and Architectures; Enabling Technologies; and Evaluation Methodologies, Metrics, and Case Studies.

Important dates

  • March 6, 2009: Deadline for electronic submission
  • April 17, 2009: Author Notification
  • May 8, 2009: Submission of camera-ready
  • July 11-13, 2009: IJCAI-09 Workshop Program

SIGIR Digital Museum of Information Retrieval Research

ACM SIGIR presents the first results of a project to digitize the older literature in the information retrieval field. So far, 14 of the old reports, such as the Cranfield reports and the SMART reports, have been scanned, along with Karen Sparck Jones's Information Retrieval Experiment book. The PDF versions of these are available from the SIGIR Digital Museum of Information Retrieval Research. The museum provides room for exhibits, and allows searching of the material using the PF/Tijah XML search system.

The complete library is available for download on request. Requests can be directed to the SIGIR Information Director by sending an email to infodir_sigir@acm.org. See also:

Donna Harman and Djoerd Hiemstra. "Saving and Accessing the Old IR Literature". SIGIR Forum 42(2), pages 19-24, December 2008.

In my humble opinion, this is a beautiful resource. You can even read the Rocchio thesis and you will be able to cite his relevance feedback algorithm having read it from the very source!

Very nice, add it to your bookmarks (next to the ACL Anthology), and send it to your students!

13.1.09

Third Edition of the Novática Award

The 3rd Edition of the Novática Award has been presented to the best article published in 2007 by Novática, journal of the Spanish CEPIS society ATI (Asociación de Técnicos de Informática), publisher of UPGRADE on behalf of CEPIS.

From a final shortlist of five articles, the Jury, comprising the 49 editors of Novática's various technical sections, selected the winning one: "Adversarial Information Retrieval in the Web", authored by Ricardo Baeza-Yates (Head of the Research Labs of Yahoo! in Santiago, Chile, and Barcelona, Spain), Paolo Boldi (Professor in the Dept. of Information Systems at the University of Milan, Italy) and José-María Gómez-Hidalgo (R&D Director at Optenet, Spain). This article was published in Novática issue no. 185 (January-February 2007), and in UPGRADE, Vol. VIII, issue no. 1 (February 2007).

You can access the English version by clicking here, and the Spanish one by clicking here.

Detailed information about the award, including references to the other four articles that reached the final phase, can be found, in Spanish, by clicking here.

For further information, please contact Llorenç Pagés-Casas, Chief Editor of Novática and UPGRADE, at novatica AT ati DOT es

ECIR Workshop on Information Retrieval over Social Networks

ECIR Workshop on Information Retrieval over Social Networks
In conjunction with ECIR 2009, Toulouse (France), April 6th, 2009

Popular online communities and services such as Flickr, YouTube, Facebook or LinkedIn are spearheading an emerging type of information on the Web. This information is composed of classical textual and multimedia data, in concert with additional data (tags, annotations, comments, ratings). Perhaps most significantly, the information is overlaid on an explicit social network created by the participants of each of these communities. The result is a rich structure of inter-relationships between content items, participants and services. Although the size of such networks requires the use of advanced Information Retrieval techniques, classical IR models are not tailored for this type of content as they do not (in general) take advantage of the particular structure and unique aspects of this socially-driven content.

This workshop aims to report on the state of the art in this direction and to gather a relevant panel of researchers working in the field. We are looking for contributions on all aspects of Information Retrieval over Social Networks, including:

  • Applications of Information Retrieval over Social Networks
  • Adapted IR models for Social Networks
  • Mining Social Network data
  • Privacy issues in Social Network information retrieval
  • Trust and Reliability issues in Social Network information retrieval
  • Knowledge and Content Discovery in Social Networks
  • Information diffusion over Social Networks
  • Performance evaluation for the above (measures, test collections,...)

All submitted papers will be peer-reviewed by the workshop programme committee. The paper selection criteria are the same as those of the main conference. Submissions must be written in English following the ACM SIG Proceedings style, not exceeding 6 pages including references and figures. A Special Issue of an established IR journal is being considered as a follow-up to the workshop.

Important dates:

  • January 30th, 2009: Submission deadline
  • February 20th, 2009: Acceptance notification

12.1.09

ACM Multimedia 2009 Call for Papers

ACM Multimedia 2009 Call for Papers
Beijing, China, October 19-24, 2009
http://www.acmmm09.org

ACM Multimedia 2009 invites you to participate in the premier annual multimedia conference, covering all aspects of multimedia, from underlying technologies to applications, from theory to practice. ACM Multimedia 2009 will be held at the Beijing Hotel, October 19-24, 2009, Beijing, China.

The technical program will consist of the paper/poster sessions and talks with topics of interest in:

  • Multimedia content analysis, processing, and retrieval
  • Multimedia networking, sensor networks, and systems support
  • Multimedia tools, end-systems, and applications
  • Human-Centered Multimedia

Important Dates

  • April 10, 2009 Full Paper Registration (Abstract Submission) Deadline
  • April 17, 2009 Full Paper/Panel/Workshop Submission Deadline
  • May 8, 2009 Short Paper Submission Deadline
  • June 5, 2009 Video Program/Interactive Art Program/Open Source/Doctoral Program/Demo Proposal Submission Deadline
  • July 3, 2009 Notification of Acceptance for Full & Short Papers
  • July 10, 2009 Notification of Acceptance for Video Program/Interactive Art Program/Open Source/Doctoral Program/Demo Proposal
  • July 24, 2009 Camera-ready deadline for all papers

CFP: AIRWeb 2009

Fifth International Workshop on Adversarial Information Retrieval on the Web

Adversarial Information Retrieval addresses tasks such as gathering, indexing, filtering, retrieving and ranking information from collections wherein a subset has been manipulated maliciously. On the Web, the predominant form of such manipulation is "search engine spamming" (or "spamdexing"), i.e., malicious attempts to influence the outcome of ranking algorithms, aimed at getting an undeserved high ranking for some items in the collection.

The following types of submissions on any aspect of adversarial information retrieval on the Web are solicited:

  • Full papers, describing contributions to the field,
  • Short papers, presenting work in progress, and
  • Problem statements, explaining relevant, but unsolved or not adequately solved problems.

Particular areas of interest include, but are not limited to:

  • Link spam
  • Content spam
  • Cloaking
  • Blog/forum/wiki spam
  • Tag spam
  • Review and rating spam
  • Click fraud detection
  • Reverse engineering of ranking algorithms
  • Web content filtering
  • Online advertisement blocking
  • Stealth crawling

The proceedings of the workshop will be included in the ACM Digital Library. Full and short papers are limited to 8 and 4 pages, respectively; problem statements will be permitted 2 pages. Papers should be formatted using the WWW 2009 proceedings style and submitted via <http://www.easychair.org/conferences/?conf=airweb2009>.

Dates

  • 6 February 2009: Deadline (optional, but helpful) for abstract submissions
  • 13 February 2009: Deadline for paper submissions
  • 4 March 2009: Notification of paper acceptance
  • 15 March 2009: Camera-ready version due date
  • 20 or 21 April 2009: Date of the workshop