Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N Fold Cross Validation (CV) in your data, you should not apply the StringToWordVector (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run. Moreover, the STWV filter can extract and use simple statistics to filter out the terms (e.g. minimum number of occurrences), but those statistics over the full collection are not valid because in each CV run you use only a subset of it.
Now I would like to deal with a more general setting in which you want to apply dimensionality reduction because, in general text classification tasks, the documents or examples are represented by hundreds (if not thousands) of tokens, what makes the classification problem very hard for many learners. In WEKA, this involves using the AttributeSelection filter along with the STWV one. Before thinking about dimensionality reduction, we must reflect a bit about it.
Dimensionality reduction is a typical step in many data mining problems, which involves transforming our data representation (the schema of our table, the list of current attributes) into a shorter, more compact, and hopefully, more predictive one. Basically, this can be done in two ways:
- With feature reduction, which maps the original representation (list of attributes) onto a new and more compact one. The new attributes are synthetic, that is, they somehow combine the information from subsets of the original ones which share statistical properties. Typical feature reduction techniques include algebraic analysis methods like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). In text analysis, the most popular method is, by far, Latent Semantic Analysis, which involves obtaining the principal components or buckets into the term-to-document sparse matrix.
- With feature selection, which just selects a subset of the original representation attributes, according to some Information Theory quality metric like Information Gain or X^2 (Chi-Square). This method can be far more simple and less time consuming than the previous one, as you only have to compute the value of the metric for each attribute, and rank the attributes. Then you simply decide a threshold in the metric (e.g. 0 for Information Gain) and keep the attributes with a value over it. Alternatively, you can choose a percentage of the number of original attributes (e.g. 1% and 10% are typical numbers in text classification), and just keep those top ranking ones. However, there are other more time consuming alternatives, like exploring the predictive power of subsets of attributes using search algorithms.
A major difference between both methods is that feature reduction leads to synthetic attributes, but feature selection just keeps some of the original ones. This may affect the ability of the data scientist to understand the results, as synthetic attributes can be statistically relevant but meaningless. Another difference is that feature reduction does not make use of the class information, while feature selection does. In consequence, the second method is very likely to lead to a more predictive subset of attributes than the original one. But beware, more theoretical predictive power does not always mean more effectiveness. I recommend to read the old (?) but always helpful paper by Yimming Yang & Jan Pedersen on the topic.
The WEKA package supports both methods, mainly with the weka.attributeSelection.PrincipalComponents (feature reduction) and weka.filters.supervised.attribute.AttributeSelection (feature selection) filters. But an important question is: Do you really need to make dimensionality reduction in text analysis? There are two clear arguments against it:
- Some algorithms get no hurt with using all the features, even if they are really many and very sparse. For instance, Support Vector Machines excel in text classification problems exactly for that: they are able to deal with thousands of attributes, and they get better results when no reduction is performed. A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering.
- If it is a matter of computing time, like e.g. in symbolic learners like decision trees (C4.5) or rules (Ripper), then there is no worry. Big Data techniques come to help, as you can configure cheap and big clusters over e.g. Hadoop to perform your computations!
But having the algorithms in my favourite data analysis package, and knowing that sometimes they lead to effectiveness improvements, why not using them?
Because of the reasons above, I will focus on feature selection. In consequence, I will deal with the AttributeSelection filter, leaving the PrincipalComponents one for another post. Let us start with the same text collection that I used in my previous post about chaining filters and classifiers in WEKA. It is an small subset of the SMS Spam Collection, made with the first 200 messages for brevity and simplicity.
Our goal is to perform a 3-fold CV experiment with any algorithm in WEKA. But, in order to do it correctly, we know we must chain the STWV filter with the classifier by using the FilteredClassifier learner in WEKA. However, we want to perform feature selection as well, and the FilteredClassifier allows us to chain a single filter and a single classifier. So, how to combine both the STWV and the AttributeSelection filters into a single one?
Let us start doing it manually. After loading the dataset into the WEKA Explorer, applying the STWV filter with the default settings, and setting the class attribute to the "spamclass" one, we get something like this:
Now we can either go to the "Select attributes" tab, or just stay in the "Preprocess" tab and choose the AttributeSelection filter. I opt for the second way, so you can browse the filters folder by clicking on the "Choose" button at the "Filters" area. After selecting the "weka > filters > supervised > attribute > AttributeSelection", you can see the selected filter in the "Filters" area, as shown in the next picture:
In order to set up the filter, we can click on the name of the filter. The "weka.gui.GenericObjectEditor" window we get is a generic window that allows to configure filters, classifiers, etc. according to a number of object-defined properties. In this case, it allows us to set up the AttributeSelection filter configuration options, which are:
- The evaluator, which is the quality metric we use to evaluate the predictive properties of an attribute or a set of them. There you can choose among a wide number of them (which depends on your WEKA version), including specially Chi Square (ChiSquaredAttributeEval), Information Gain (InfoGainAttributeEval), and Gain Ratio (GainRatioAttributeEval).
- The search algorithm, which is the way we will select the remaining group of attributes, and includes very clever but time consuming group search algorithms, and my favourite one, the Ranker (weka.attributeSelection.Ranker). This one just ranks the attributes according to the chosen quality metric, and keeps those meeting some criterion (like e.g. having a value over a predefined threshold).
In the next picture, you can see the AttributeSelection configuration window with the evaluator set up to Information Gain, and the search set up as Ranker, with the default options.
The Ranker evaluator has two main properties:
- The numToSelect property, which defines the number of attributes to keep, an Integer number that is -1 (all) by default.
- The threshold property, which defines the minimum value that an attribute has to get in the evaluator in order to be kept. The default value for this property is the minimum Long integer in Java.
In consequence, if we want to keep those attributes scoring over 0, we have just to write that number in the threshold area of the window we get when we click on the Ranker at the previous window:
By clicking OK on all the previous windows, we get a configuration of the AttributeSelection filter which involves keeping those attributes with Information Gain score over 0. If we apply that filter to our current collection, we get the following result:
As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), but many of them are spam ones, so it is quite predictive for this particular class. We can see as well that we only keep a 5.93% of the original attributes (82 over 1382).
Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART") to be evaluated on the training collection itself ("Test options" area, "Use training set option"), getting the next result:
We get an accuracy of 95.5%, much better than the results I reported in my previous post. Of course, these results cannot be compared because this quick experiment is a test on the training collection, not done with 3-fold CV and the FilteredClassifier. But if we want to run a CV experiment, how to do it as we have 2 filters instead of one, in our set up?
What we need now is to start with the original text collection in ARFF format (no STWV yet), and to use the MultiFilter that WEKA provides for these situations. We start then with the original collection, and go to the "Classify" tab. If we try to choose any classic learner (J48 for the C4.5 decision tree learner, SMO for Support Vector Machines, etc.), it will be impossible because we have just one attribute (the text of the SMS messages) along with the class, but we can use the weka.classifiers.meta.FilteredClassifier. After selecting it, we will see something similar to the next picture:
If we click on the name of the classifier at the "Classifier" area and we select weka.classifiers.rules.PART as the classifier (with default options), we get the next set up in the FilteredClassifier editor window:
Then we can choose the weka.filters.MultiFilter in the filter area, which starts with a dummy AllFilter. Time to set up our filter combining STWV and AttributeSelection. We click on the filter name area and we get a new filter edition window with an area to define the filters to be applied. If we click on it, we get a new window that allows to add, configure and delete filters. The selected filters will be applied in the order we add them, so we start deleting the AllFilter and adding a STWV filter with the default options, getting something similar to the next picture:
Filters are added by clicking on the "Choose" button to select them, and clicking on the "Add" button to add them to the list. We can now add the AttributeSelection filter with the Information Gain evaluator and the Ranker with threshold 0 search, by editing the filter when clicking on the "Edit" button with the AttributeSelection filter selected in the list. If you manually re-dimension the window, you can see a set up similar to this one:
The set up is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" button at the MultiFilter and FilteredClassifier windows. In the "Classify" tab at the explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select the class attribute as "spamclass". Having done this, we can just click on the "Start" button to get the next result:
So we get an accuracy of 83.5%, which is worse than the one we got without using feature selection (which was 86.5%). Oh oh, all this clever (?) set up to get a drop of 3 points in accuracy! :-(
But what happens if, instead of using a relatively weak learner on text problems like PART, we turn to Support Vector Machines? WEKA includes the weka.classifiers.functions.SMO classifier, which implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. If we choose this classifier with default options, we get a quite different results:
- Using only the STWV filter, we get an accuracy of 90.5% with 18 spam messages classified as legitimate ("ham"), and 1 false positive.
- Using the MultiFilter with AttributeSelection in the same setup, we get an accuracy of 91% with 16 spam messages classified as ham, and 2 false positives.
So we get an improvement of accuracy on a more accurate learner, what is nice. However, the difference is just 0.5% (1 message in our 200 instances collection), so it is moderate. Moreover, we get one more false positive, what is bad for this particular problem. In spam filtering, it is much worse to make a false positive (sending a legitimate message to the spam folder) than the opposite, because the user has the risk to miss an important message. Check my paper on cost sensitive evaluation of spam filtering at ACM SAC 2002.
But all in all, I expect this post shows the merits of feature selection in text classification problems, and how to do it with my favourite library, WEKA. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!