Recently, José David López, a software engineer at one of the biggest Spanish consultancy and software firms, asked me about the difference between Information Retrieval and Information Access. The distinction I usually draw in my lectures is based on the views of the great researcher Marti Hearst. However, a quick scan of her writings can leave the question unresolved:
In her paper "Untangling Text Data Mining", she states:
It is important to differentiate between text data mining and information access (or information retrieval, as it is more widely known). The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needlestack - the problem isn't so much that the desired information is not known, but rather that the desired information coexists with many other valid pieces of information.
According to this passage, Information Access and Information Retrieval are synonyms. However, in her lectures on "Current Topics in Information Access", she defines the two terms separately:
Information Access is the process by which users use information technology to seek, organize and understand information.
Information Retrieval is to retrieve documents that users are likely to find relevant to their queries.
Consequently, Information Access subsumes Information Retrieval as a subtask. Other subtasks of Information Access include Question Answering, Text Summarization and Text Clustering. Here are several examples of applications that involve organizing and understanding information, and not just search:
- When a user builds an automatic filter in their email client (e.g. Thunderbird) to organize incoming messages, they are performing an Information Access operation: organization (in particular, Text Categorization or Text Filtering).
- When a user takes a long document in OpenOffice and selects the option to generate a summary or an abstract, they are performing an Information Access operation: understanding (in particular, Text Summarization).
- Adversarial Text Classification tasks like spam filtering or Web content filtering (e.g. pornography blocking on the Web) can be seen as organization tasks (in particular, Text Categorization or Negative Text Filtering).
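To make the email example concrete, here is a minimal sketch of a rule-based message filter, the simplest form of Text Categorization. Everything in it is an assumption made for illustration: the `Message` class, the folder names and the keyword lists are hypothetical, and no real email client is claimed to work this way.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    body: str


# Hypothetical user-defined rules: (target folder, trigger keywords).
# Rules are checked in order, so the first matching rule wins.
RULES = [
    ("Spam", ("lottery", "winner", "free money")),
    ("Conferences", ("call for papers", "deadline extended")),
]


def categorize(message, rules=RULES, default="Inbox"):
    """Return the folder of the first rule whose keyword occurs in the message."""
    text = f"{message.subject} {message.body}".lower()
    for folder, keywords in rules:
        if any(keyword in text for keyword in keywords):
            return folder
    return default


if __name__ == "__main__":
    msg = Message("pc@example.org", "ECML call for papers", "Submissions are open.")
    print(categorize(msg))
```

The "Spam" rule plays the role of negative filtering (messages the user wants to keep out), while the "Conferences" rule is plain organization. A learned classifier would replace the hand-written keyword lists, but the operation performed on each message is the same.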
Dave Lewis, perhaps the master of Text Categorization, describes in his thesis "Representation and Learning in Information Retrieval" a wide range of operations that can be seen as Information Access operations, including:
- Text Categorization
- Document Clustering
- Text Routing
- Term Categorization
- Term Clustering
- Latent Semantic Indexing
In fact, I review and organize a number of text classification tasks in my tutorial on Text Mining:
Gómez Hidalgo, J.M. Tutorial on Text Mining and Internet Content Filtering. 13th European Conference on Machine Learning (ECML'02) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'02), Helsinki, Finland, 19-23 August 2002.
Moreover, users learn during the search process. Marti Hearst makes this point in her chapter "User Interfaces and Visualization" in the book Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto:
Bates proposes the `berry-picking' model of information seeking, which has two main points. The first is that, as a result of reading and learning from the information encountered throughout the search process, the users' information needs, and consequently their queries, continually shift. Information encountered at one point in a search may lead in a new, unanticipated direction. The original goal may become partly fulfilled, thus lowering the priority of one goal in favor of another. This is posed in contrast to the assumption of 'standard' information retrieval that the user's information need remains the same throughout the search process. The second point is that users' information needs are not satisfied by a single, final retrieved set of documents, but rather by a series of selections and bits of information found along the way. This is in contrast to the assumption that the main goal of the search process is to hone down the set of retrieved documents into a perfect match of the original information need.
In other words, the standard query-and-retrieve cycle is just one part of a more general process, Information Access, which does away with historical assumptions like those described above.
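The contrast with berry-picking can be sketched in a few lines of code. This is only a toy illustration under stated assumptions: the four-document collection, the term-overlap `search` function and the `berry_picking_session` helper are all hypothetical, and they do not reproduce any system described by Bates or Hearst.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    "d1": "text categorization with machine learning",
    "d2": "spam filtering as adversarial text categorization",
    "d3": "web content filtering and pornography blocking",
    "d4": "text summarization of long documents",
}


def search(query, docs=DOCS):
    """'Standard' retrieval: one query in, one ranked list of document ids out.

    Documents are ranked by how many query terms they contain.
    """
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


def berry_picking_session(queries, picks_per_query=1):
    """Berry-picking: the query shifts over time and picks accumulate."""
    picked = []
    for query in queries:
        new_results = [d for d in search(query) if d not in picked]
        picked.extend(new_results[:picks_per_query])
    return picked


if __name__ == "__main__":
    print(search("text categorization"))
    print(berry_picking_session(
        ["text categorization", "spam filtering", "web content filtering"]
    ))
```

A single call to `search` corresponds to the "standard" assumption of one fixed information need answered by one result set. The session instead keeps something from each step while the query drifts from categorization to spam to Web filtering, so the final collection is not the answer to any one of the queries.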
I hope this discussion helps to clarify the difference between the two concepts.