IBM has released a very interesting tool for language researchers through its "early access to innovation" website, alphaWorks. The tool I focus on here is the Many Aspects Document Summarization Tool, but the site hosts many more tools covering a range of computer science topics (including grid computing, information management, Java, etc.).
IBM Many Aspects Document Summarization Tool is a document summarization system that ingests a text document and automatically highlights a set of sentences that are expected to cover the different aspects of the document's content. The user decides the number of sentences to be included in the summary. These sentences are picked using the following two criteria:
- Coverage: The sentences should span a large portion of the spectrum of the document's subject matter.
- Orthogonality: Each sentence should capture different aspects of the document's content. That is, the sentences in the summary should be as orthogonal to each other as possible.
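The description above does not say how the tool actually scores sentences, so the following is only a rough sketch of how the two criteria could be combined. It assumes bag-of-words vectors, cosine similarity, and a greedy selection loop; the names `summarize` and `redundancy_weight` are my own, not part of IBM's tool.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(sentences, k, redundancy_weight=0.5):
    """Greedily pick up to k sentences.

    Coverage (assumed): similarity of a sentence to the whole document.
    Orthogonality (assumed): penalty for similarity to sentences already picked.
    """
    vectors = [Counter(tokenize(s)) for s in sentences]
    document = Counter()
    for vector in vectors:
        document.update(vector)

    chosen = []
    while len(chosen) < min(k, len(sentences)):
        best_index, best_score = None, float("-inf")
        for i, vector in enumerate(vectors):
            if i in chosen:
                continue
            coverage = cosine(vector, document)
            overlap = max((cosine(vector, vectors[j]) for j in chosen), default=0.0)
            score = (1 - redundancy_weight) * coverage - redundancy_weight * overlap
            if score > best_score:
                best_index, best_score = i, score
        chosen.append(best_index)

    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

For example, given two identical comments about a camera and one about battery life, asking for two sentences returns one camera comment and the battery comment, because the duplicate is penalized for overlapping with a sentence that is already in the summary.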
The demand for a summary with high coverage and high orthogonality is amplified by today's Web 2.0 applications. For example, in the online comments and discussions that follow blogs, videos, and news articles, it is desirable to have a summary that highlights the different angles of these comments, because each often has a different focus. With IBM Many Aspects Document Summarization Tool, you can get a concise yet comprehensive overview of the document without having to spend lots of time drilling down into the details.
Since the tool is designed to cover a number of aspects of a document, I wonder: would a summary made with this tool provide enough information for good-quality text classification? Could a summary of a blog post and its comments reveal which of the comments are spam?
P.S. 1. The tool has a closed license.
P.S. 2. For language technologies, click this link.