A note on WEKA limitations and big data

I have loved WEKA since my friend Enrique Puertas first introduced it to me back in 1999, when he used it to program a Usenet News client with spam filtering capabilities based on Machine Learning (what we now usually call a Bayesian spam filter). I was impressed by its flexibility and functionality, and by how easy it was to experiment with WEKA and use it in my Java programs. I quickly became familiar with it and used it for my very first experiments on spam filtering.

Over the years, WEKA has been updated, gaining more algorithms and making some tasks easier for text miners. For instance, the StringToWordVector filter gives you a Vector Space Model (or bag of words) representation of your problem texts, a task that I had to do manually (with my own programs or scripts) at the beginning. Another example: the Sparse ARFF format gives you a compact representation of your word vectors, instead of thousands of attribute values per instance, most of them being "0" or "no". Moreover, WEKA has attracted so much attention that other platforms have integrated it (e.g. GATE) or provided wrapper environments that augment its functionality (e.g. RapidMiner).
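To make the difference between dense and sparse word vectors concrete, here is a minimal sketch in plain Python. It is not WEKA code and it does not reproduce what StringToWordVector does internally; the function names and the two sample texts are made up for illustration, and tokenization is just lowercasing and splitting on whitespace. The only thing taken from WEKA is the shape of a sparse ARFF data row: index-value pairs in braces, with 0-based attribute indexes and zero values left out.

```python
from collections import Counter

def build_vocabulary(texts):
    """Map each distinct lowercase token to a 0-based attribute index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def to_dense_row(text, vocabulary):
    """One count per attribute, including all the zeros."""
    counts = Counter(text.lower().split())
    ordered = sorted(vocabulary, key=vocabulary.get)
    return ",".join(str(counts.get(tok, 0)) for tok in ordered)

def to_sparse_row(text, vocabulary):
    """Sparse ARFF style row: only non-zero attributes, as 'index value'."""
    counts = Counter(tok for tok in text.lower().split() if tok in vocabulary)
    pairs = sorted((vocabulary[tok], n) for tok, n in counts.items())
    return "{" + ", ".join(f"{i} {n}" for i, n in pairs) + "}"

# Hypothetical sample texts, only for illustration.
texts = ["cheap pills buy cheap", "meeting moved to friday"]
vocab = build_vocabulary(texts)
dense_rows = [to_dense_row(t, vocab) for t in texts]
sparse_rows = [to_sparse_row(t, vocab) for t in texts]

for dense, sparse in zip(dense_rows, sparse_rows):
    print(dense, "->", sparse)
```

With a seven-word vocabulary the saving is negligible, but with tens of thousands of attributes and a few dozen distinct words per message, the sparse rows stay short while the dense rows do not.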

However, our needs as researchers have evolved as well. One of the most important issues now is data size. In my early experiments, an average computer was enough, given the size of the standard collections (20 Newsgroups, Reuters-21578, LingSpam, etc., all on the order of tens of thousands of instances). Now that is nearly impossible: most of my experiments involve from hundreds of thousands to millions of instances. In those cases, WEKA can spend days on a single learn-and-test cycle, or it can simply run out of memory, and not just on an average machine but even on a really big server!

So, what now?

Before dealing with this question, I must say that I have been a heavy user of the WEKA command line and the Explorer GUI. However, I have never considered or used the WEKA Experimenter GUI. I know from friends and from skimming the documentation that the Experimenter lets you distribute experiments over a number of machines. However, if I am going to distribute my experiments, why not use newer technologies that are less ad hoc and less WEKA-dependent, and that are fully compatible with, or already implemented by, cloud providers? Why not take advantage of elastic cloud capabilities (grow and pay as you need)?

That said, and keeping up with the latest news and trends in data and text mining, I see two options:

  • Going for R. This language/platform has grown incredibly in recent years, and it has quickly become a standard language for data mining, present in many curricula and very often considered an absolute requirement in data science job offers. There are nice books about it as well, like "R in a Nutshell", and other key books recommend or use it (like "The Elements of Statistical Learning"). R supports map reduce algorithms over Hadoop for distributed experiments with tons of data. And R interfaces with Java as well.
  • Choosing Mahout (plus Lucene/SOLR). This platform is Java-based and tightly integrated with Hadoop, and it makes use of Lucene for text representation tasks (Lucene could now be considered a standard for deploying search engines). There are good books on Mahout and Lucene/SOLR as well ("Mahout in Action", "Lucene in Action", "Apache SOLR Cookbook").
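Both options lean on the map reduce pattern that Hadoop implements. As a rough illustration of the idea only, here is a single-machine sketch in plain Python. It uses no Hadoop, R or Mahout API; the shards, texts and function names are invented for the example. Each shard is counted independently (the map step), and the partial counts are then merged by summing (the reduce step).

```python
from collections import Counter
from functools import reduce

def map_shard(shard):
    """Map step: count tokens in one shard, independently of the others."""
    counts = Counter()
    for text in shard:
        counts.update(text.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial counts by summing per-token totals."""
    return left + right

# Hypothetical data split into two shards; in a real cluster each shard
# would live on, and be processed by, a different machine.
shards = [
    ["cheap pills", "buy cheap pills"],
    ["meeting on friday", "cheap flights on friday"],
]

partials = [map_shard(shard) for shard in shards]
totals = reduce(reduce_counts, partials, Counter())

print(totals.most_common(3))
```

Because the map step never looks outside its own shard, adding machines lets you process more data in parallel, which is the property that makes count-based text mining tasks a natural fit for this style of distribution.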

Still, I do not feel that either option is better than the other. Both are challenging and appealing, and I have not made a decision yet. And I would love to hear your opinion, of course.

2 comments:

Mark Hall said...


As of late last year, Weka has some support for distributed algorithms running in Hadoop. The distributedWekaHadoop package implements classification/regression model building, cross-validation, scoring and correlation analysis in Hadoop. In particular, naive Bayes (multinomial), least squares linear regression, two-class logistic regression, two-class linear SVMs and random forests are fully distributed. The remaining Weka classifiers/regressors are handled in the distributed context by building a dagged ensemble. There is more information on what is implemented in this blog post:



Jose Maria Gomez Hidalgo said...

Hi, Mark

Thanks for the information, Mark. I will give it a try.

Actually, I follow your blog and I tweeted your posts on the topic! See https://twitter.com/jmgomez/statuses/392162099600052224

Thanks again and regards