Nihil Obstat: Performance Analysis of N-Gram Tokenizer in WEKA

The goal of this post is to analyze the WEKA class NGramTokenizer in terms of performance, as it depends on the complexity of the regular expression used during the tokenization step. There is a potential trade-off between more simple regex (which lead to more tokens) and more complex regexes (which take more time to be evaluated). This post intends to provide experimental insights on this trade-off, in order to save your time when using this extremely useful class with the WEKA indexer StringToWordVector.

Motivation

The WEKA weka.core.tokenizers.NGramTokenizer class is responsible for tokenizing a text into pieces, which depending on the configuration of its size, they can be token unigrams, bigrams and so on. This class relies on the method String[] split(String regex) for splitting a text string into tokens, which are further combined into ngrams.

This method, in turn, depends on the complexity of the regular expression used to split the text. For instance, let us examine this simple example:

public class TextSplitTest { public static void main(String[] args) { String delimiters = "\\W"; String s = "This is a text &$% string"; System.out.println(s); String[] tokens = s.split(delimiters); System.out.println(tokens.length); for (int i = 0; i < tokens.length; ++i) System.out.println("#"+tokens[i]+"#"); } }

In this call to the split() method, we are using the regex "\\W", which matches any non-alphanumeric character as a delimiter. The output of this class execution is:

$> java TextSplitTest This is a text &$% string 9 #This# #is# #a# #text# ## ## ## ## #string#

This is due that every individual non-alphanumeric character is a match, and we have five delimiters between "text" and "string". In consequence, we find four empty (but not null) strings among these five matches. If we use the regex "\\W+" as the delimiters string, which matches sequences of one or more non-alphanumeric characters, we get the following output:

$> java TextSplitTest This is a text &$% string 5 #This# #is# #a# #text# #string#

Which is closer to what we expected at the beginning.

When tokenizing a text, it seems wise to avoid computing empty strings as potential tokens, because we have to invest some time to discard them -- and we can have thousands of instances!. On the other side, it is clear that a more complex regular expression leads to more computation time. So there is a trade-off between using a one-character delimiter versus using a more sophisticated regex to avoid empty strings. To which extent does this trade-off impacts on the StringToWordVector/NGramTokenizer classes?

Experiment Setup

I run these experiments on my laptop, with: CPU - Intel Core2 Duo, P8700 @ 2.53GHz; RAM: 2.90GB (1.59 GHz). For some of the tests, specially those involving a big number of ngrams, I need to make use of the -Xmx option in order to increase the heap space.

I am using the class IndexText.java available at my GITHub repository. I have commented all the output to retain only the computation time for the method index(), which creates the tokenizer and the filter objects and performs the filtering process. This process actually indexes the documents, that is, it transforms the text strings in each instance into a dictionary-based representation -- each instance is an sparse list of pairs (token_number,weight) where the weight is binary-numeric. I have also modified the class to set lowercasing to false, in order to accumulate as many tokens as possible.

I have perfomed experiments using the two next collections:

The SMS Spam Collection, which is a dataset of 5,568 short messages classified as spam/ham (not spam).
The classical Reuters-21578 text collection (ModApte split), which is a dataset of 21,578 relatively short news stories, classified according a number of economic classes (acquisitions, earning reports, products like rubber, tin or sugar, etc.). I have downloaded it from the NLTK data directory.

I am comparing using the strings "\\W" and "\\W+" as delimiters in the NGramTokenizer instance of the index() method, for unigrams, uni-to-bigrams, and uni-to-trigrams. In the case of the SMS Spam Collection, I have divided the dataset into pieces of 20%, 40%, 60%, 80% and 100% in order to evaluate the effect of the collection size.

Finally, I have run the program 10 times per experiment, in order to average and get more stable results. All numbers are expressed in milliseconds.

Results and Analysis

We will examine the results on the SMS Spam Collection. The results obtained for unigrams are the next ones:

It is a bar diagram which shows the time in milliseconds for each collection size (20%, 40%, etc.). The results for the bigrams are:

And the results for trigrams on the SMS Spam Collection are the next ones:

So the times for unigrams, uni-to-bigrams and uni-to-trigrams are exponetially higher (as it can be expected). While on unigrams, using the simple regex "\\W" is more efficient, the more sophisticated regex "\\W+" is more efficient for bigrams and trigrams. There is one anomaly point (at 60% on trigrams), but I believe it is an outlier. So it seems that the cost of using a more sophiticated regex does not pay for unigrams, in which the cost of matching this regex is higher than discarding empty strings. However it is the opposite in the case of uni-to-bigrams and uni-to-trigrams, where the empty strings seem to hurt the algorithm for building the bi- and trigrams.

The results on the Reuters-21578 collection are the next ones:

These results are fully aligned with the results obtained on the SMS Spam Collection, with the advantage of increasing the difference in the case of uni-to-trigrams, as the number of different tokens on the Reuters-21578 test collection is much bigger (as there are more texts, and they are longer).

But all in all, the biggest increment in performance we get are 4.59% in the SMS Spam Collection (uni-to-trigrams, 40% sub-collection) and 4.15% on the Reuters-21578 collection, which I consider marginal. So all in all, there is not a big difference between using these two regexes after all.

Conclusions

In the potential trade-off between using simple regular expressions to recognize text tokens, and using a more sophisticated regular expression in the WEKA indexer classes for avoiding spurius tokens, my simple experiment shows that both approaches are more or less equivalent in terms of performance.

However, when using only unigrams, it is better to use simple regular expressions because the time to match tokens in a more sophisticated regex does not pay.

On the other side, the algorithm for building bi- and trigrams seems to be sensitive to the empty strings generated by a simple regex, and you can get around a 4% increase of performance when using more sophisticated regular expressions and avoiding those empty strings.

As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

Nihil Obstat

8.7.13

Performance Analysis of N-Gram Tokenizer in WEKA

No hay comentarios: