26.1.09

The difference between Information Access and Information Retrieval

Recently, José David López, a software engineer at one of the biggest Spanish consultancy/software firms, asked me about the difference between Information Retrieval and Information Access. The distinction I usually give in my lectures is based on the opinions of the great researcher Marti Hearst. However, scanning her writings can lead to an unsatisfactory answer:

In her paper "Untangling Text Data Mining", she states:

It is important to differentiate between text data mining and information access (or information retrieval, as it is more widely known). The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack - the problem isn't so much that the desired information is not known, but rather that the desired information coexists with many other valid pieces of information.

According to this, Information Access and Information Retrieval are synonyms. However, in her lectures on "Current Topics in Information Access", she defines:

Information Access is the process by which users use information technology to seek, organize and understand information.

Information Retrieval is to retrieve documents that users are likely to find relevant to their queries.

Consequently, Information Access subsumes Information Retrieval as a subtask. Other subtasks of Information Access are Question Answering, Text Summarization, Text Clustering, etc. Let us look at several examples of applications that involve organizing and understanding information, and not just search:

  • When a user builds an automatic filter in an email client (e.g. Thunderbird) to organize incoming messages, he or she is performing an Information Access operation: organization (in particular, Text Categorization or Text Filtering).
  • When a user takes a long document in OpenOffice and selects the option to generate a summary or an abstract, he or she is performing an Information Access operation: understanding (in particular, Text Summarization).
  • Adversarial Text Classification tasks like spam filtering or Web content filtering (e.g. pornography blocking on the Web) can be seen as organization tasks (in particular, Text Categorization or Negative Text Filtering).
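The mail filter example can be sketched in a few lines. This is only a minimal illustration, assuming hand-written keyword rules: the `RULES` table, the folder names, and the `categorize` function are hypothetical and do not reflect how Thunderbird actually implements its filters.

```python
# Hypothetical keyword rules: folder name -> words that trigger it.
# Rules are checked in insertion order; the first match wins.
RULES = {
    "Conferences": {"cfp", "deadline", "workshop"},
    "Spam": {"lottery", "winner", "viagra"},
}


def categorize(message: str, rules=RULES, default: str = "Inbox") -> str:
    """Assign a message to the first folder whose keywords it contains."""
    words = set(message.lower().split())
    for folder, keywords in rules.items():
        if words & keywords:  # at least one keyword appears in the message
            return folder
    return default  # no rule fired: leave the message where it is


print(categorize("Workshop deadline extended"))  # Conferences
print(categorize("You are a lottery winner"))    # Spam
print(categorize("Lunch tomorrow?"))             # Inbox
```

Note that there is no query here: the user organizes information that arrives on its own, which is why the task falls under Information Access but not under retrieval in the narrow sense.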

Dave Lewis, perhaps the master of Text Categorization, described in his thesis "Representation and Learning in Information Retrieval" a wide range of operations that can be seen as Information Access operations, including:

  • Text Categorization
  • Document Clustering
  • Text Routing
  • Term Categorization
  • Term Clustering
  • Latent Semantic Indexing

In fact, I review and organize a number of text classification tasks in my tutorial on Text Mining:

Gómez Hidalgo, J.M. Tutorial on Text Mining and Internet Content Filtering. 13th European Conference on Machine Learning (ECML'02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki, Finland, 19-23 August 2002.

Moreover, users learn during the search process. As Marti Hearst states in her chapter on "User Interfaces and Visualization" in the book Modern Information Retrieval by Ricardo Baeza-Yates et al.:

Bates proposes the `berry-picking' model of information seeking, which has two main points. The first is that, as a result of reading and learning from the information encountered throughout the search process, the users' information needs, and consequently their queries, continually shift. Information encountered at one point in a search may lead in a new, unanticipated direction. The original goal may become partly fulfilled, thus lowering the priority of one goal in favor of another. This is posed in contrast to the assumption of 'standard' information retrieval that the user's information need remains the same throughout the search process. The second point is that users' information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way. This is in contrast to the assumption that the main goal of the search process is to hone down the set of retrieved documents into a perfect match of the original information need.

In other words, the standard cycle of posing a query and retrieving documents is just one part of a more general process, Information Access, which drops the historical assumptions described in the quote above.
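The contrast can be sketched with a toy example. This is an illustration under strong assumptions, not a model taken from Bates or Hearst: `retrieve` is a hypothetical word-overlap ranker standing in for "standard" retrieval, and `berry_pick` is a hypothetical loop in which the query changes at every step and something is kept from each result list (here, simply every returned document).

```python
def retrieve(query, docs, k=2):
    """One-shot retrieval: rank docs by how many query words they share."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in docs]
    # Keep only documents that share at least one word, best score first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [d for _, d in ranked[:k]]


def berry_pick(queries, docs, k=2):
    """Run a sequence of shifting queries, collecting picks along the way."""
    basket = []
    for query in queries:  # the information need changes between steps
        for doc in retrieve(query, docs, k):
            if doc not in basket:
                basket.append(doc)
    return basket


DOCS = [
    "spam filtering with naive bayes",
    "text categorization for email",
    "web content filtering survey",
    "clustering search results",
]

# A single query returns one final result set.
print(retrieve("spam filtering", DOCS))
# A shifting sequence of queries accumulates results from every step.
print(berry_pick(["spam filtering", "email categorization"], DOCS))
```

The second call returns documents that no single query in the session would have retrieved together, which is the point of the berry-picking view.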

I hope that this discussion helps to clarify the difference between the two concepts.

6 comments:

JoSeK said...

Excellent post. The truth is that in many cases IA and IR are used as synonyms, or without much regard for what they really mean, and the terms need to be clarified.

Jose Maria Gomez Hidalgo said...

The sources themselves are not clear, although the distinction seems clear to me. In fact, I have not found the actual reference in which Lewis or Hearst says that IA is a wider concept... :-(

JoSeK said...

A couple of links, which you have earned:

* http://weblogs.madrimasd.org/sistemas_inteligentes/archive/2009/01/28/111902.aspx

* http://machine-learning.blogspot.com/2009/01/information-access-vs-information.html

Jose Maria Gomez Hidalgo said...

Thank you very much!!!

Jorge Serrano-Cobos said...

Then, reading these definitions, please take into account some further related concepts:

"Information seeking" would be related to "Information access", because "inf. seeking" would be the discipline that studies "inf. access" (as a process).

Also, "information retrieval" (if it implies documents that users are "likely to find relevant") would be related to info seeking, because bringing the user's point of view, intentions, field expertise, gender, age, etc. into the study of relevance introduces subjectivity into the subject, don't you think?

This is getting fuzzy... ;-)

Cheers,

Jose Maria Gomez Hidalgo said...

Dear Jorge

Yes, it gets more and more fuzzy...

OK, I will treat it as a question of what each verb means:

1. "Access" implies a general process that involves understanding and organizing information (apart from searching for it).

2. "Retrieval" is a short-term process: it implies posing a query and *getting* some kind of answer (usually, a set of document surrogates).

3. "Seeking" stresses the role of the user. It is like retrieval but pays more attention to the user's actions, expectations, etc. In my view, it is more general and user-focused than "retrieval", but it still does not cover as many things as "access". A middle ground...

Honestly, I do not feel I have the answer, and apart from quoting some authorities, I just find that most of the literature treats these terms as synonyms...

Perhaps I should send an email to Marti :-)

Thanks a lot for your comment!