Sample Code for Text Indexing with WEKA

Following the example in which I demonstrated how to develop your own classifier in Java based on WEKA, I propose an additional example on how to index a collection of texts in your Java code. This post is inspired and supported by the WEKA "Use WEKA in your Java code" wiki page. To index a text collection is to generate a mapping between documents and words (or other indexing units), as represented in the next graph:

The fundamental class for text indexing in WEKA is weka.filters.unsupervised.attribute.StringToWordVector. This class provides an impressive range of indexing options, including custom tokenizers, stemmers and stoplists, as well as binary, Term Frequency and TF.IDF weights. For some applications its default options may be enough; however, I recommend getting familiar with all of its options in order to take full advantage of it.
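To make the difference between binary, Term Frequency and TF.IDF weights concrete, here is a minimal sketch in plain Java. It does not use WEKA at all: the class TermWeightsSketch and its methods are hypothetical names invented for this illustration, and the TF.IDF formula is the textbook tf * ln(N / df), which is an assumption and not necessarily the exact computation performed by StringToWordVector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for illustration only; it is not part of WEKA.
public class TermWeightsSketch {

    /** Counts how often each lowercased token occurs in one document (raw term frequency). */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Number of documents in the index that contain the term at least once. */
    public static int documentFrequency(String term, List<Map<String, Integer>> index) {
        int df = 0;
        for (Map<String, Integer> doc : index) {
            if (doc.containsKey(term)) {
                df++;
            }
        }
        return df;
    }

    /** Textbook TF.IDF (assumed formula): tf * ln(N / df); 0 when the term is absent. */
    public static double tfIdf(String term, Map<String, Integer> doc,
                               List<Map<String, Integer>> index) {
        int tf = doc.getOrDefault(term, 0);
        if (tf == 0) {
            return 0.0;
        }
        return tf * Math.log((double) index.size() / documentFrequency(term, index));
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "free prize, call now", "call me later", "free free entry");
        List<Map<String, Integer>> index = new ArrayList<>();
        for (String d : docs) {
            index.add(termFrequencies(d));
        }
        // Every term present in a document has binary weight 1; tf and tfidf vary.
        for (int i = 0; i < index.size(); i++) {
            Map<String, Integer> doc = index.get(i);
            for (String term : doc.keySet()) {
                System.out.printf("doc %d  %-6s binary=1 tf=%d tfidf=%.3f%n",
                        i, term, doc.get(term), tfIdf(term, doc, index));
            }
        }
    }
}
```

In this toy collection the word "free" occurs twice in the third document and appears in two of the three documents, so its binary weight is 1, its term frequency is 2, and its TF.IDF weight is 2 * ln(3/2). A word that appeared in every document would get a TF.IDF weight of 0 under this formula.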

With the purpose of showing how to use StringToWordVector in your code, I have created a simple class named IndexTest.java, stored in my GitHub repository. Apart from the relatively simple methods for loading and storing Attribute-Relation File Format (ARFF) files, the core of the class is the method void index(), which creates and employs a StringToWordVector object. The first piece of the code is the following one:

// Set the tokenizer
NGramTokenizer tokenizer = new NGramTokenizer();

This snippet creates and configures a tokenizer, that is, the object responsible for breaking the original text into individual strings named tokens, which represent the indexing units (typically words). In this case I am using a weka.core.tokenizers.NGramTokenizer, which I find more useful than the usual weka.core.tokenizers.WordTokenizer, as I describe in the post about sentiment analysis with WEKA. This tokenizer is able to recognize n-grams, that is, sequences of tokens. Here I use the methods void setNGramMaxSize(int value) and void setNGramMinSize(int value) to set both the maximum and the minimum n-gram size to 1, so that only unigrams are produced.

Another interesting aspect of the tokenizer part is that we set up the regular expression "\\W" as the delimiter or separator. In Java, this regex matches any character that is not a letter, a digit or an underscore, so every such character is considered a delimiter. As a result, only strings of alphanumeric characters (plus underscores) will be considered tokens. For a detailed reference on regular expressions in Java, check the lesson on the topic in the Java Tutorial.
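To see what this kind of tokenization produces, the following plain-Java sketch splits a text on the "\\W" regex and then builds n-grams between a minimum and a maximum size. It is only an approximation of the behavior described above, not the code of NGramTokenizer: the class NGramSketch and its methods are hypothetical names, and I am assuming that n-grams are formed by joining consecutive tokens with a space.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only; it is not part of WEKA.
public class NGramSketch {

    /** Splits text on a regex delimiter and drops the empty strings split() may return. */
    public static List<String> words(String text, String delimiterRegex) {
        List<String> words = new ArrayList<>();
        for (String w : text.split(delimiterRegex)) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    /** Builds all n-grams with minSize <= n <= maxSize, joining words with a space. */
    public static List<String> nGrams(List<String> words, int minSize, int maxSize) {
        List<String> grams = new ArrayList<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int start = 0; start + n <= words.size(); start++) {
                grams.add(String.join(" ", words.subList(start, start + n)));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> words = words("Call 0800-123 now, it's FREE!", "\\W");
        // Min and max size 1: unigrams only.
        System.out.println(nGrams(words, 1, 1)); // [Call, 0800, 123, now, it, s, FREE]
        // Min size 1 and max size 2: unigrams followed by bigrams.
        System.out.println(nGrams(words, 1, 2));
    }
}
```

Note that digits are word characters for "\\W", so numbers survive as tokens. This is consistent with attributes such as 000 and 03 showing up in the indexed SMS dataset.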

The second code snippet is the following one:

// Set the filter
StringToWordVector filter = new StringToWordVector();

This second snippet creates and configures the StringToWordVector object, which is a subclass of the weka.filters.Filter class. Any filter has to make reference to a dataset, which is the inputInstances dataset in this case, as done with the filter.setInputFormat(inputInstances) call.

We set up the tokenizer and some other options as an example. Both DoNotOperateOnPerClassBasis and WordsToKeep should be standard in most text classifiers. The first one tells the filter to extract the tokens from all classes as a whole, instead of doing it class by class (the default option). I simply fail to understand why one would want different indexing tokens per class in a text classification problem. The second option sets the number of words to keep, and I recommend setting a large integer here in order to cover all possible tokens.
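The interaction between these two options can be sketched in plain Java. The code below is not WEKA's implementation: VocabularySketch and its methods are hypothetical names, and I am assuming that "per class" means selecting the most frequent words of each class separately and then merging the selections. Under that assumption, a small number of words to keep yields a different vocabulary depending on whether the selection is global or per class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class for illustration only; it is not part of WEKA.
public class VocabularySketch {

    /** Returns the `keep` most frequent tokens in the texts (ties broken alphabetically). */
    public static Set<String> topWords(List<String> texts, int keep) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            for (String token : text.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // The sort is stable, so words with equal counts stay in alphabetical order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Set<String> kept = new TreeSet<>();
        for (int i = 0; i < Math.min(keep, entries.size()); i++) {
            kept.add(entries.get(i).getKey());
        }
        return kept;
    }

    /** Per-class selection: take the top words of each class separately, then merge them. */
    public static Set<String> topWordsPerClass(Map<String, List<String>> textsByClass, int keep) {
        Set<String> kept = new TreeSet<>();
        for (List<String> texts : textsByClass.values()) {
            kept.addAll(topWords(texts, keep));
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> byClass = new LinkedHashMap<>();
        byClass.put("spam", Arrays.asList("free prize free entry", "free call now"));
        byClass.put("ham", Arrays.asList("call me later", "see you later"));

        List<String> all = new ArrayList<>();
        for (List<String> texts : byClass.values()) {
            all.addAll(texts);
        }
        System.out.println("global, keep 2:    " + topWords(all, 2));             // [call, free]
        System.out.println("per class, keep 2: " + topWordsPerClass(byClass, 2)); // [call, free, later]
    }
}
```

With a very large value for the number of words to keep, both strategies in this sketch retain every token, which is the effect I am after when I recommend a big integer.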

The third and last code snippet shows the invocation of the filter on the inputInstances reference:

// Filter the input instances into the output ones
outputInstances = Filter.useFilter(inputInstances,filter);

This is the standard method for applying a filter, according to the "Use WEKA in your Java code" wiki page. The output of running this class on a simple dataset such as smsspam.small.arff is the following:

$> javac IndexTest.java
$> java IndexTest
Usage: java IndexTest <fileInput> <fileOutput>
$> java IndexTest smsspam.small.arff result.arff
===== Loaded dataset: smsspam.small.arff =====
Started indexing at: 1371939800703
===== Filtering dataset done =====
Finished indexing at: 1371939800812
Total indexing time: 109
===== Saved dataset: result.arff =====
$> more result.arff
@relation 'sms_test-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000000-prune-rate-1.0-N0-L-stemmerweka.core.stemmers.NullStemmer-M1-O-tokenizerweka.core.tokenizers.NGramTokenizer -delimiters "\\W" -max 1 -min 1'

@attribute spamclass {spam,ham}
@attribute 000 numeric
@attribute 03 numeric
@attribute 07046744435 numeric
@attribute 07732584351 numeric

As a note, the name of the relation in the generated ARFF file (tag @relation) encodes the properties of the applied filter, including some default options I have not configured in it.

So that is all. More examples on these topics are coming in the next few weeks. And as always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!

5 comments:

Altleo said...
This comment has been removed by the author.
Altleo said...

Thank you very much for this simple and clear tutorial.

I was wondering if/how WEKA can be used to classify documents against a vocabulary of words/terms contained in a dictionary.

Eg. As a simple case we have a document:

Words             Class
----------------  -----
Marketing         MKT
Technical         TEC
Finance           FIN
Human Resources   HR

And we want to classify documents based on occurrence of words from the dictionary.

Some pointers on how to go about this will be most helpful.

Thank you for your posts.


Jose Maria Gomez Hidalgo said...

Dear AltLeo

Thank you for your encouraging comments.

Regarding your first comment, now deleted, I believe it was a matter of classpath and WEKA libraries.

Regarding your latest question, WEKA provides StringToWordVector (STWV) to build an index by using the text/words in sample documents. However you already have an index of classes, so WEKA does not fit this need very well.

It may be possible to simulate this by building an ARFF file with one document per class, including the words for that class in its document. It would be something like:

@relation testRelation

@attribute text string
@attribute class {MKT,TEC,FIN,HR}

@data
"Human Resources",HR

If you use the STWV filter, you get an index that you can use to train any classifier - all of them may be equally valid, as you have one document per class.

However, this approach assumes that classes are not overlapping; that is, one document is either in MKT or in TEC but never in both. For multi-label classification, in which a document may belong to several classes at once, you can use MEKA.

There is also another problem: what if a word occurs in several classes? With this approach the behavior may depend on the number of words per class (=document representing the class) and the training algorithm, leading to unexpected results.

Summarizing, if classes and vocabularies per class are not overlapping, you can use WEKA to generate an index using the former approach. In other cases, you may use MEKA or WEKA but with care in order to ensure you get the results you expect.

Regards, JM

Altleo said...

Dear JM,

Thank you for your prompt response.

Regarding my first comment, sorry, I was in a hurry trying to resolve problems in a newly installed, upgraded version of Eclipse where the classpath and build libraries were not updated. I deleted my comment as soon as I realized that. A minor distraction from the problem that occupies my mind.

The example I gave you about dictionary-based classification is very simplistic - to demonstrate a classification goal in text-processing at the first order.

In real life, it is a cluster of words/phrases associated with a document (along with semantics) that can determine the class to which a human may assign the document.
(And if we have a pre-defined thesaurus for that, it will help).

My question is how can we mimic this in machine learning?

I have looked at MEKA, but before I go to that, I wanted to study the case of a simple vocabulary using WEKA to see what can be achieved.

I have used WEKA on a set of pre-classified documents based on the simplified dictionary, and have observed success rates not more than 73%.

Hence, this problem domain captures my attention.

I like your posts.



Thank you very much for your help

I would like to add delimiters in order to avoid keeping symbols or numbers; in the GUI I use AlphabeticTokenizer.

I tried to add delimiters like


but in this way the results are a file with just a null.

What should I do to skip the numbers and symbols?

Best regards