Nihil Obstat: Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters

11.2.13

Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters

Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N-fold Cross Validation (CV) on your data, you should not apply the StringToWordVector (STWV) filter to the full data collection and then perform the CV evaluation, because in each run you would be using words that are present in your test subset (but not in your training subset). Moreover, the STWV filter can extract and use simple statistics to filter out terms (e.g. a minimum number of occurrences), but those statistics computed over the full collection are not valid, because in each CV run you train on only a subset of it.
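To make the issue concrete, here is a minimal sketch in plain Java (no WEKA involved) that compares the vocabulary built from the training folds with the one built from the full collection. The class name, the whitespace tokenizer and the toy messages are assumptions made up for illustration; STWV itself does more than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class FoldVocabularySketch {

    /** Collects the distinct lowercase, whitespace-separated tokens of the given documents. */
    static Set<String> vocabulary(List<String> docs) {
        Set<String> vocab = new TreeSet<>();
        for (String doc : docs) {
            for (String token : doc.toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    vocab.add(token);
                }
            }
        }
        return vocab;
    }

    /** Returns the tokens of the test fold that never occur in the training folds. */
    static Set<String> leakedTokens(List<String> train, List<String> test) {
        Set<String> leaked = vocabulary(test);
        leaked.removeAll(vocabulary(train));
        return leaked;
    }

    public static void main(String[] args) {
        // Toy messages invented for illustration; not taken from the SMS Spam Collection.
        List<String> train = Arrays.asList("win a free prize now", "see you at lunch");
        List<String> test = Arrays.asList("free entry to win cash");

        List<String> all = new ArrayList<>(train);
        all.addAll(test);

        System.out.println("Vocabulary from training folds only: " + vocabulary(train));
        System.out.println("Vocabulary from the full collection: " + vocabulary(all));
        System.out.println("Test-only tokens that would leak:    " + leakedTokens(train, test));
    }
}
```

Any token reported by leakedTokens would become an attribute if the vectorization were done once over the whole collection, even though the learner never saw it during training in that run.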

Now I would like to deal with a more general setting in which you want to apply dimensionality reduction because, in general text classification tasks, the documents or examples are represented by hundreds (if not thousands) of tokens, which makes the classification problem very hard for many learners. In WEKA, this involves using the AttributeSelection filter along with the STWV one. Before applying dimensionality reduction, we should reflect a bit on it.

Dimensionality reduction is a typical step in many data mining problems, which involves transforming our data representation (the schema of our table, the list of current attributes) into a shorter, more compact, and hopefully, more predictive one. Basically, this can be done in two ways:

With feature reduction, which maps the original representation (list of attributes) onto a new and more compact one. The new attributes are synthetic, that is, they somehow combine the information from subsets of the original ones which share statistical properties. Typical feature reduction techniques include algebraic analysis methods like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). In text analysis, the most popular method is, by far, Latent Semantic Analysis, which obtains the principal components (or "buckets") of the sparse term-to-document matrix.
With feature selection, which just selects a subset of the original attributes according to some quality metric, such as Information Gain (from Information Theory) or X^2 (Chi-Square). This method can be far simpler and less time consuming than the previous one, as you only have to compute the value of the metric for each attribute and rank the attributes. Then you simply decide on a threshold for the metric (e.g. 0 for Information Gain) and keep the attributes with a value over it. Alternatively, you can choose a percentage of the number of original attributes (e.g. 1% and 10% are typical numbers in text classification), and just keep the top ranking ones. However, there are other, more time consuming alternatives, like exploring the predictive power of subsets of attributes using search algorithms.
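The "compute the metric for each attribute and rank" step can be sketched as follows. This is a plain Java illustration of Information Gain for binary (token present/absent) attributes, under the assumption that classes are encoded as integer indices; it is not WEKA's InfoGainAttributeEval code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

final class InfoGainRankingSketch {

    /** Shannon entropy (in bits) of a distribution given as raw counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Information Gain of a binary attribute: H(class) - H(class | attribute).
     * present[i] tells whether the token occurs in document i; labels[i] is its class index.
     */
    static double infoGain(boolean[] present, int[] labels, int numClasses) {
        int[] all = new int[numClasses];
        int[] with = new int[numClasses];
        int[] without = new int[numClasses];
        int nWith = 0;
        for (int i = 0; i < labels.length; i++) {
            all[labels[i]]++;
            if (present[i]) {
                with[labels[i]]++;
                nWith++;
            } else {
                without[labels[i]]++;
            }
        }
        double n = labels.length;
        double conditional = (nWith / n) * entropy(with) + ((n - nWith) / n) * entropy(without);
        return entropy(all) - conditional;
    }

    /** Returns attribute indices ordered from the highest score to the lowest. */
    static Integer[] rankDescending(double[] scores) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));
        return order;
    }

    public static void main(String[] args) {
        // Toy data invented for illustration: 4 documents, class 0 = ham, class 1 = spam.
        int[] labels = {0, 0, 1, 1};
        String[] tokens = {"the", "free", "call"};
        boolean[][] present = {
            {true, false, true, false},  // "the": equally frequent in both classes
            {false, false, true, true},  // "free": occurs only in spam
            {true, false, false, false}  // "call": occurs in a single ham message
        };
        double[] scores = new double[tokens.length];
        for (int a = 0; a < tokens.length; a++) {
            scores[a] = infoGain(present[a], labels, 2);
        }
        for (int a : rankDescending(scores)) {
            System.out.println(tokens[a] + "\t" + scores[a]);
        }
    }
}
```

Since the entropy is computed over an arbitrary number of class counts, the same function works unchanged when there are more than two classes.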

A major difference between both methods is that feature reduction leads to synthetic attributes, but feature selection just keeps some of the original ones. This may affect the ability of the data scientist to understand the results, as synthetic attributes can be statistically relevant but meaningless. Another difference is that feature reduction does not make use of the class information, while feature selection does. In consequence, the second method is very likely to lead to a more predictive subset of attributes than the original one. But beware, more theoretical predictive power does not always mean more effectiveness. I recommend reading the old (?) but always helpful paper by Yiming Yang & Jan Pedersen on the topic.

The WEKA package supports both methods, mainly through weka.attributeSelection.PrincipalComponents (feature reduction; an attribute evaluator that is also available as the filter weka.filters.unsupervised.attribute.PrincipalComponents) and the weka.filters.supervised.attribute.AttributeSelection filter (feature selection). But an important question is: do you really need to perform dimensionality reduction in text analysis? There are two clear arguments against it:

Some algorithms are not hurt by using all the features, even if there are very many of them and they are very sparse. For instance, Support Vector Machines excel in text classification problems exactly for that reason: they are able to deal with thousands of attributes, and they get better results when no reduction is performed. A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering.
If it is a matter of computing time, e.g. with symbolic learners like decision trees (C4.5) or rules (Ripper), then there is no need to worry. Big Data techniques come to help, as you can configure cheap and big clusters over e.g. Hadoop to perform your computations!

But having the algorithms in my favourite data analysis package, and knowing that sometimes they lead to effectiveness improvements, why not use them?

For the reasons above, I will focus on feature selection. In consequence, I will deal with the AttributeSelection filter, leaving the PrincipalComponents one for another post. Let us start with the same text collection that I used in my previous post about chaining filters and classifiers in WEKA. It is a small subset of the SMS Spam Collection, made with the first 200 messages for brevity and simplicity.

Our goal is to perform a 3-fold CV experiment with any algorithm in WEKA. But, in order to do it correctly, we know we must chain the STWV filter with the classifier by using the FilteredClassifier learner in WEKA. However, we want to perform feature selection as well, and the FilteredClassifier allows us to chain only a single filter and a single classifier. So, how do we combine both the STWV and the AttributeSelection filters into a single one?

Let us start doing it manually. After loading the dataset into the WEKA Explorer, applying the STWV filter with the default settings, and setting the class attribute to the "spamclass" one, we get something like this:

Now we can either go to the "Select attributes" tab, or just stay in the "Preprocess" tab and choose the AttributeSelection filter. I opt for the second way, so you can browse the filters folder by clicking on the "Choose" button at the "Filters" area. After selecting the "weka > filters > supervised > attribute > AttributeSelection", you can see the selected filter in the "Filters" area, as shown in the next picture:

In order to set up the filter, we can click on the name of the filter. The "weka.gui.GenericObjectEditor" window we get is a generic window that allows us to configure filters, classifiers, etc. according to a number of object-defined properties. In this case, it allows us to set up the AttributeSelection filter configuration options, which are:

The evaluator, which is the quality metric we use to evaluate the predictive properties of an attribute or a set of them. There you can choose from a wide range of metrics (depending on your WEKA version), most notably Chi Square (ChiSquaredAttributeEval), Information Gain (InfoGainAttributeEval), and Gain Ratio (GainRatioAttributeEval).
The search algorithm, which is the way we will select the remaining group of attributes. The options include very clever but time consuming group search algorithms, and my favourite one, the Ranker (weka.attributeSelection.Ranker). This one just ranks the attributes according to the chosen quality metric, and keeps those meeting some criterion (e.g. having a value over a predefined threshold).

In the next picture, you can see the AttributeSelection configuration window with the evaluator set up to Information Gain, and the search set up as Ranker, with the default options.

The Ranker search method has two main properties:

The numToSelect property, which defines the number of attributes to keep, an integer that is -1 (keep all) by default.
The threshold property, which defines the minimum value that an attribute has to get in the evaluator in order to be kept. The default value for this property is -1.7976931348623157E308, the most negative finite double in Java (-Double.MAX_VALUE), so by default no attribute is discarded.
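The effect of these two properties can be sketched in plain Java. This is only an illustration of the idea as described in this post (keep attributes scoring strictly over the threshold, best first, optionally truncated to a fixed number); it is not the Ranker source code, and applying both limits at once is an assumption of this sketch. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RankerSelectionSketch {

    /**
     * Keeps the indices of the attributes whose score is strictly above the threshold,
     * ordered best first, and truncated to numToSelect entries (-1 means "no limit").
     */
    static List<Integer> select(double[] scores, double threshold, int numToSelect) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > threshold) {
                kept.add(i);
            }
        }
        kept.sort((a, b) -> Double.compare(scores[b], scores[a]));
        if (numToSelect >= 0 && numToSelect < kept.size()) {
            return new ArrayList<>(kept.subList(0, numToSelect));
        }
        return kept;
    }

    public static void main(String[] args) {
        // Scores invented for illustration, e.g. Information Gain values of five tokens.
        double[] scores = {0.0, 0.3, 0.1, 0.0, 0.7};

        // Default-like settings: lowest possible threshold, no limit on the number of attributes.
        System.out.println(select(scores, -Double.MAX_VALUE, -1));
        // Threshold 0: attributes with a score of exactly 0 are dropped.
        System.out.println(select(scores, 0.0, -1));
        // Keep only the two best attributes among those scoring over 0.
        System.out.println(select(scores, 0.0, 2));
    }
}
```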

In consequence, if we want to keep those attributes scoring over 0, we just have to write that number in the threshold area of the window we get when we click on the Ranker in the previous window:

By clicking OK on all the previous windows, we get a configuration of the AttributeSelection filter which involves keeping those attributes with Information Gain score over 0. If we apply that filter to our current collection, we get the following result:

As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), but many of them are spam ones, so it is quite predictive for this particular class. We can see as well that we keep only 5.93% of the original attributes (82 out of 1382).

Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART") to be evaluated on the training collection itself ("Test options" area, "Use training set" option), getting the next result:

We get an accuracy of 95.5%, much better than the results I reported in my previous post. Of course, these results cannot be compared, because this quick experiment is a test on the training collection, not done with 3-fold CV and the FilteredClassifier. But if we want to run a CV experiment, how do we do it now that we have two filters instead of one in our set up?

What we need now is to start with the original text collection in ARFF format (no STWV yet), and to use the MultiFilter that WEKA provides for these situations. We start then with the original collection, and go to the "Classify" tab. If we try to choose any classic learner (J48 for the C4.5 decision tree learner, SMO for Support Vector Machines, etc.), it will be impossible because we have just one attribute (the text of the SMS messages) along with the class, but we can use the weka.classifiers.meta.FilteredClassifier. After selecting it, we will see something similar to the next picture:

If we click on the name of the classifier at the "Classifier" area and we select weka.classifiers.rules.PART as the classifier (with default options), we get the next set up in the FilteredClassifier editor window:

Then we can choose the weka.filters.MultiFilter in the filter area, which starts with a dummy AllFilter. Time to set up our filter combining STWV and AttributeSelection. We click on the filter name area and we get a new filter editing window with an area to define the filters to be applied. If we click on it, we get a new window that allows us to add, configure and delete filters. The selected filters will be applied in the order we add them, so we start by deleting the AllFilter and adding an STWV filter with the default options, getting something similar to the next picture:

Filters are added by clicking on the "Choose" button to select them, and clicking on the "Add" button to add them to the list. We can now add the AttributeSelection filter and configure it with the Information Gain evaluator and the Ranker search with threshold 0, by clicking on the "Edit" button with the AttributeSelection filter selected in the list. If you manually resize the window, you can see a set up similar to this one:

The set up is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" button at the MultiFilter and FilteredClassifier windows. In the "Classify" tab at the explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select the class attribute as "spamclass". Having done this, we can just click on the "Start" button to get the next result:

So we get an accuracy of 83.5%, which is worse than the one we got without using feature selection (which was 86.5%). Oh oh, all this clever (?) set up to get a drop of 3 points in accuracy! :-(

But what happens if, instead of using a relatively weak learner on text problems like PART, we turn to Support Vector Machines? WEKA includes the weka.classifiers.functions.SMO classifier, which implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. If we choose this classifier with default options, we get quite different results:

Using only the STWV filter, we get an accuracy of 90.5% with 18 spam messages classified as legitimate ("ham"), and 1 false positive.
Using the MultiFilter with AttributeSelection in the same setup, we get an accuracy of 91% with 16 spam messages classified as ham, and 2 false positives.

So we get an improvement in accuracy on a more accurate learner, which is nice. However, the difference is just 0.5% (1 message in our 200-instance collection), so it is modest. Moreover, we get one more false positive, which is bad for this particular problem. In spam filtering, it is much worse to make a false positive (sending a legitimate message to the spam folder) than the opposite, because the user risks missing an important message. Check my paper on cost sensitive evaluation of spam filtering at ACM SAC 2002.
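The arithmetic behind these figures, and the reason why one extra false positive matters, can be sketched in a few lines of plain Java. The cost weight of 9 for a false positive is a hypothetical value chosen only for illustration; it is not a figure reported in this post.

```java
final class SpamErrorCostSketch {

    /** Accuracy (in percent) from the total number of messages and both kinds of errors. */
    static double accuracy(int total, int falseNegatives, int falsePositives) {
        return 100.0 * (total - falseNegatives - falsePositives) / total;
    }

    /** Error count in which each false positive weighs as much as fpCost false negatives. */
    static int weightedErrors(int falseNegatives, int falsePositives, int fpCost) {
        return falseNegatives + fpCost * falsePositives;
    }

    public static void main(String[] args) {
        int total = 200;   // messages in the collection
        int fpCost = 9;    // hypothetical: one lost legitimate message counts as 9 missed spam

        // SMO with STWV only: 18 spam classified as ham, 1 false positive.
        System.out.println("STWV only:          accuracy " + accuracy(total, 18, 1)
                + "%, weighted errors " + weightedErrors(18, 1, fpCost));
        // SMO with STWV + AttributeSelection: 16 spam classified as ham, 2 false positives.
        System.out.println("STWV + selection:   accuracy " + accuracy(total, 16, 2)
                + "%, weighted errors " + weightedErrors(16, 2, fpCost));
    }
}
```

With any cost weight above 2, the second setup ends up with more weighted errors than the first, even though its plain accuracy is higher.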

But all in all, I hope this post shows the merits of feature selection in text classification problems, and how to do it with my favourite library, WEKA. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

29 comments:

Sandrage said...: Very nice and useful post. Thank you :); 11:17 p. m.
Jose Maria Gomez Hidalgo said...: Thanks for your comment, Sandra.

brs; 3:57 p. m.
Anonymous said...: Dear Jose,
Hello,
I've read your weblog and it was really helpful; it clarified and answered some of my problems.
I'm working on a text classification problem now and I have a problem in classifying data.
I have a set of data that is divided into 2 different folders for training and test subsets (I can't change them).
I use the STWV filter to get the attributes of the training and test sets separately, which is not correct,
because I should have the same attributes for both training and test sets.
It means that I should use the STWV filter in batch mode.
I know that there is some code to extract the attributes from both subsets in batch mode.
I was wondering if I can use WEKA, without coding, to apply batch mode to both subsets.
Would you please let me know your professional opinion?; 11:23 a. m.
Jose Maria Gomez Hidalgo said...: Hi, Anonymous

Yes, you can do it without coding, by making command-line calls to WEKA. That is exactly the topic of my post: http://jmgomezhidalgo.blogspot.com.es/2013/05/mapping-vocabulary-from-train-to-test.html

It can be done easily by using the option -b.

Please check it and tell me if you have any questions.

Regards; 1:26 p. m.
Harold Valdivia Garcia said...: Hi Jose.

Your post is really, really helpful. I've been reading your posts all night.

I am working with Soft Bug Repositories (e.g. Bugzilla, Jira) and one of my features is the bug's description.

I was looking for some method to convert or transform this feature into a numerical value. I thought PCA or LSA could help me, but they don't take into account the class info.

Do you know if there are other methods that use the class information?; 4:42 a. m.
Jose Maria Gomez Hidalgo said...: Dear Harold

I do not know exactly which kind of number you want to generate from the bug description text. Specifically, I do not understand why it should depend on the class of a bug.

So just guessing: if you need to transform the text in order to predict the class (which I do not know), the best strategy is to decompose the text into words, and then select the predictive words. So you should apply the StringToWordVector filter, then the AttributeSelection filter. You can do it programmatically as explained here: http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html, or you can do it in the command line like here: http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html

If you provide me with more details, I may give you better advice...

Regards; 9:06 a. m.
kashif said...: Hi. First, this post is very informative and definitely helped a lot. You mentioned above that you would cover the dimensionality reduction approach in another post; can you point me to that post (if it is written)?

Secondly, if I have more than two classes, let's say 20, would the way of applying the feature selection technique be the same as discussed in this post?

Thanks a lot in advance for your information.; 4:09 p. m.
Jose Maria Gomez Hidalgo said...: Dear Kashif

Regarding other posts dealing with dimensionality reduction, there are several of them; check the label WEKA in the blog: http://jmgomezhidalgo.blogspot.com.es/search/label/WEKA. Most of them are application-oriented.

Regarding your second question, the answer is yes; the same technique can be applied to multiclass problems. Entropy, and thus Information Gain, applies to multiclass problems as well, so the same commands should work in a multiclass problem with non-overlapping classes.

In case you have N overlapping classes, you may think about dealing with it by using N binary classifiers (one class against the rest), or using the library MULAN: http://mulan.sourceforge.net/starting.html.

Regards; 10:49 a. m.
kashif said...: Dear Jose Maria,
Thanks, but I realized I worded my question above badly and missed what I really wanted to ask.

I want to ask about feature reduction like PCA or LSI (latent semantic indexing). I searched in WEKA but couldn't find any filter which can be applied in the way we applied feature selection via Information Gain. Do you know of any filter or way with which I can apply LSI for text classification?

Thanks a lot for the valued information.; 11:17 a. m.
Jose Maria Gomez Hidalgo said...: Dear Kashif

Do not worry, perhaps I misunderstood the question.

WEKA supports PCA via another filter, which is weka.filters.unsupervised.attribute.PrincipalComponents. I will write another post on the topic in the near future, because it is a very interesting technique for Text Mining, but for the moment, here is a quick example:

java weka.filters.unsupervised.attribute.PrincipalComponents -i weather.numeric.arff

If you apply Naive Bayes to this dataset without PCA, you get an accuracy of 64.2857 %, however if you apply it after PCA, you get 85.7143 %.

Regards; 11:44 a. m.
kashif said...: Dear Jose Maria,
OK, great. Perhaps you can help me with a thought (whether it is right or not).

I am thinking of applying both in text classification:
i) Feature selection via Information Gain
ii) Feature reduction/transformation via LSI or PCA

When I apply the feature selection filter, it works very well; however, the feature reduction filter is too slow.

Do we need to apply both of them, or is only one of them enough for text classification?

The reason for asking is that I found different views, i.e. some prefer to apply both while others use only feature selection (IG or ChiSqr).

Thanks for the valued information.

Regards, Kashif; 2:27 p. m.
Anonymous said...: Dear Jose Maria,

I have a question concerning this approach. When combining the STWV filter and the IG filter through the MultiFilter in a FilteredClassifier for cross-validation, will the IG filter be applied to both the test and training sets?

I would think you want to apply the IG filter only to the training sets, since the IG filter is supervised and shouldn't be applied to the test data.

Best regards; 4:39 p. m.
Jose Maria Gomez Hidalgo said...: Hi

Yes, it will be applied to training and test datasets in CV, but correctly. I mean, for instance in 10-fold CV, WEKA automatically generates 10 subsets and it uses 9 for training and 1 for testing, round-robin. If you make use of FilteredClassifier with MultiFilter and AttributeSelection, WEKA will apply AttributeSelection with the training folds as input and map the result over the test fold. This way:

- Only information from the training folds will be used for selecting the attributes.
- The test fold will be represented according to the attributes selected using the training folds.

This is equivalent to having separate training and test datasets, and using AttributeSelection (and/or StringToWordVector) in batch mode as I demonstrate, for instance, in the following posts:

http://jmgomezhidalgo.blogspot.com.es/2013/05/mapping-vocabulary-from-train-to-test.html
http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html

Regards and thanks for reading!; 5:22 p. m.
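The "select on the training folds, then map over the test fold" idea can be sketched in plain Java as two separate steps. This is an illustration only, not WEKA code: the score used here (difference between the per-class document frequencies of a token, for two classes) is a deliberately simple stand-in for Information Gain, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TrainOnlySelectionSketch {

    /**
     * Learns which columns to keep using the training fold only. A column is kept when the
     * fraction of class-1 rows containing it differs from the fraction of class-0 rows
     * containing it by more than the threshold (a simple stand-in for Information Gain).
     */
    static int[] fit(int[][] trainRows, int[] trainLabels, double threshold) {
        int numCols = trainRows[0].length;
        int[] rowsPerClass = new int[2];
        int[][] present = new int[2][numCols];
        for (int r = 0; r < trainRows.length; r++) {
            rowsPerClass[trainLabels[r]]++;
            for (int c = 0; c < numCols; c++) {
                if (trainRows[r][c] > 0) {
                    present[trainLabels[r]][c]++;
                }
            }
        }
        List<Integer> kept = new ArrayList<>();
        for (int c = 0; c < numCols; c++) {
            double rate0 = rowsPerClass[0] == 0 ? 0.0 : (double) present[0][c] / rowsPerClass[0];
            double rate1 = rowsPerClass[1] == 0 ? 0.0 : (double) present[1][c] / rowsPerClass[1];
            if (Math.abs(rate1 - rate0) > threshold) {
                kept.add(c);
            }
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    /** Represents any row (training or test) using only the columns chosen by fit. */
    static int[] apply(int[] row, int[] keptColumns) {
        int[] projected = new int[keptColumns.length];
        for (int i = 0; i < keptColumns.length; i++) {
            projected[i] = row[keptColumns[i]];
        }
        return projected;
    }

    public static void main(String[] args) {
        // Toy word-presence vectors invented for illustration; class 0 = ham, class 1 = spam.
        int[][] trainRows = {{1, 1, 0}, {1, 0, 0}, {0, 1, 1}, {0, 0, 1}};
        int[] trainLabels = {0, 0, 1, 1};
        int[] testRow = {1, 1, 1};

        int[] kept = fit(trainRows, trainLabels, 0.5);  // the test row is never consulted here
        System.out.println("Columns kept: " + Arrays.toString(kept));
        System.out.println("Test row after mapping: " + Arrays.toString(apply(testRow, kept)));
    }
}
```

In a CV run, fit would be called once per run on the training folds, and apply on every instance of the corresponding test fold.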
Unknown said...: Dear Jose,

Your blog is very informative and helpful. I commend you for your efforts. We have developed a machine learning based method for information retrieval. For this, different attributes like TF.IDF, length, etc. are used to classify textual documents. Now I want to know which attributes are the most useful in the classification process. I think you use AttributeSelection with InfoGainAttributeEval as attribute evaluator and Ranker as search method. Using these classes in Java code, I would like to know how to get the different attributes with their corresponding information gain. Sample code would be appreciated.
Thanks in advance for any help.
Best regards,

khadim; 9:44 a. m.
Anonymous said...: Dear Jose,
Thanks for your informative sharing. Your posts focus on feature selection/reduction in text classification, which is a supervised problem. But when it comes to unsupervised problems, such as text clustering, what is your solution for feature selection? Thank you in advance for your reply!; 10:12 a. m.
Jose Maria Gomez Hidalgo said...: Very nice question. One of the techniques I have suggested in the article is unsupervised: Latent Semantic Indexing. This technique does not make use of the class information, and it is intended to reduce the dimensionality of the (otherwise sparse) text representation, along with capturing the semantic dimensions of the texts. It does fit clustering perfectly.

Other alternatives are the utilization of lexical databases or thesauri to map related terms or synonyms into single classes of semantically related words. For instance, using Roget's Thesaurus, we could map all occurrences of ball, bat, pitcher, etc. into a single class [baseball]. This reduces dimensionality as well, but it has the problem of Word Sense Disambiguation (which still has limited effectiveness). Thesauri can be constructed manually or automatically (see classic books by Salton or van Rijsbergen).

Most unsupervised dimensionality reduction techniques come from Information Retrieval, as text retrieval is an unsupervised task (in general; some subtasks are supervised: Relevance Feedback, Learning to Rank, etc.).

Of these alternatives, WEKA only supports LSI by SVD. It is not trivial, but you can use Babelnet or Wordnet as thesauri in Java; they both have APIs. Babelnet provides a very simple WSD algorithm as well, implemented in Java.; 10:32 a. m.
Anonymous said...: Hi!
First of all I would like to say that your blog posts are amazing! They really helped me resolve some issues for my research. However, now I wish to apply LSA to my dataset, which is divided into a train and a test set. Could you please give me some directions on how I can apply LSA to both the train and the test set?

thanks in advance!; 9:23 p. m.
Jose Maria Gomez Hidalgo said...: For applying LSA to a dataset split into train and test subsets with WEKA, you only need to make use of the LatentSemanticAnalysis evaluator with Ranker search in the AttributeSelection (AS) filter. Something like:

java weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.LatentSemanticAnalysis -R 0.95 -A 5" -S "weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1"

As AS is a filter, you can apply it in batch mode (-b option) to a training & a test subset at once, thus representing the test dataset according to the dimensions defined in the training set.; 10:36 p. m.
Sir_Kay said...: Dear Jose,

Your discussion is really marvelous. It has helped me in getting better accuracy for my classification task.
But I got an error message when I applied the MultiFilter concept. The error is "Problem evaluating classifier:weka.attributeSelection.InfoGainAttributeEval: Cannot handle string attribute"

Please, kindly help out.
Thanks; 5:48 p. m.
Rana said...: Hello,

Thank you for this post.
I'm facing problems with removing attributes from my sparse ARFF file. But first I would appreciate your help with a problem in the Explorer when I try using it with my sparse ARFF file. It hangs if I try to set my class attribute; the 'Edit' tab takes a long time to get the list of attributes; then, if I right click the attribute to set it as my class attribute, it just hangs and stops responding.
Can the Explorer handle a sparse file of around 22800 attributes (mostly unigrams generated out of around 6000 tweets)?

Thank you in advance

Rana; 1:53 p. m.
Jose Maria Gomez Hidalgo said...: Hi, Rana

Well, it depends on your PC and Operating System; for datasets like yours, I usually make use of the Explorer on a very small subset (e.g. 100 tweets) in order to set up the config I want (filters, classifiers, etc.), and then I write it in a script (bat or bash) and run the script on the full dataset.

There is some guidance on this in the WEKA page: http://www.cs.waikato.ac.nz/ml/weka/bigdata.html

Best regards,

JM; 9:16 a. m.
Jose Maria Gomez Hidalgo said...: Hi, Sir_Kay

I guess there is a problem in the script or the config at the Explorer, most likely related to setting up the class attribute. However, unless I see an example or more details, I cannot be sure :(

Regards,

JM; 9:17 a. m.
Rana said...: Dear Jose,

Inspired by your post :), I have used it to update your MyFilteredLearner.java such that a MultiFilter is used to chain the STWV and the AttributeSelection filters as shown below, but I'm getting a null pointer exception at the statement: filters[0] = filter1;

Here is the code in the try block of evaluate() method:
trainData.setClassIndex(0);
multiFilter = new MultiFilter();

filter1 = new StringToWordVector();
filter1.setAttributeIndices("last");
filters[0] = filter1;

filter2 = new AttributeSelection();
ASEvaluation attEvaluator = new InfoGainAttributeEval();
filter2.setEvaluator(attEvaluator);
Ranker ranker = new Ranker();
ranker.setThreshold(0); //<0 ignored
ASSearch asSearch = ranker;
filter2.setSearch(asSearch);
filters[1] = filter2;

//Apply the chained filters
classifier = new FilteredClassifier();
multiFilter.setFilters(filters);
classifier.setFilter(multiFilter);
classifier.setClassifier(new NaiveBayes());

Evaluation eval = new Evaluation(trainData);
eval.crossValidateModel(classifier, trainData, 4, new Random(1));

The rest is the same as yours.
Also, do I have to repeat the same setup in the learn() method, especially since applying the evaluation and search is expensive? Can't I just use the same multiFilter defined in evaluate()?

I would appreciate a reply at your earliest convenience.

Regards; 11:39 a. m.
Rana said...: With reference to my previous message, I forgot to mention that I had the following instance variables defined in addition to those already defined in the MyFilteredLearner class:

MultiFilter multiFilter;
Filter[] filters;
StringToWordVector filter1;
AttributeSelection filter2;

Thanks; 11:48 a. m.
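A likely cause of that null pointer exception, judging only from the code quoted above: the instance variable "Filter[] filters;" is declared, but none of the statements shown allocates the array, so it is still null when "filters[0] = filter1;" runs. The following plain Java sketch reproduces the situation with a stand-in Object[] field (no WEKA classes involved; all names are hypothetical). In the quoted code, the analogous fix would presumably be to add "filters = new Filter[2];" before the first assignment.

```java
final class ArrayFieldInitSketch {

    // Stand-in for the "Filter[] filters;" instance variable: object fields default to null.
    private Object[] filters;

    /** Mirrors the failing statement: writes into the array without allocating it first. */
    boolean assignWithoutAllocation() {
        try {
            filters[0] = "first filter";
            return true;
        } catch (NullPointerException e) {
            return false; // the array reference itself is null, so there is no slot to write to
        }
    }

    /** Allocates the array with room for both filters before filling it. */
    boolean assignAfterAllocation() {
        filters = new Object[2];
        filters[0] = "first filter";
        filters[1] = "second filter";
        return filters[0] != null && filters[1] != null;
    }

    public static void main(String[] args) {
        ArrayFieldInitSketch sketch = new ArrayFieldInitSketch();
        System.out.println("Assignment without allocation succeeded: " + sketch.assignWithoutAllocation());
        System.out.println("Assignment after allocation succeeded:   " + sketch.assignAfterAllocation());
    }
}
```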
Unknown said...: Hello! I found your post very useful and informative. I have a .arff file and, by using the same approach as yours, I have achieved an efficiency of 45% on a 10-fold cross validation using the rule based PART classifier. Is there a way I could send you my results? Would it be possible for you to give me suggestions to improve the efficiency to about 85%?

Thanks; 10:50 p. m.
Jose Maria Gomez Hidalgo said...: Hi, Abhishek

I am currently very busy: I am not able to do self-supported research, and I can take on only very limited freelancing. I am afraid I cannot help you, sorry for that.

Good luck; 8:47 a. m.
Unknown said...: Hi Jose!
Usually in the literature, LSA is a method related to unstructured data. I used LSA as an attribute selector for my structured data, increasing the accuracy of the classifier. Is it correct to use it on structured data? Thanks!; 12:15 p. m.
Rahul said...: Very informative posts to get one started with WEKA based ML. I have a very basic question. Assume that I have multiple TEXT attributes (bug report, summary, description, related conversation, etc.), and the class can have multiple values (instead of a yes/no or spam/not spam).
Let's say my ARFF looks like this:

@attribute Summary string
@attribute Desc string
@attribute @@PROBLEM-CLASS@@ {Severe,Moderate,Regular,NoFixNeeded}

@data

...
...

Now the questions:
Can I handle such cases in WEKA?
I have been trying to take one String attribute at a time and apply the StringToWordVector (with TF-IDF) transformation followed by SMO (I had to apply SMOTE to the class, as there were too few SEVERE records in the sample). My confusion comes from the fact that the class attribute is not of the 1/0 type; it has multiple values.
Could you kindly guide me with this?

Regards,; 8:51 a. m.
