Nihil Obstat: Command Line Functions for Text Mining in WEKA

In previous posts I have explained how to chain filters and classifiers in WEKA, in order to avoid incorrect results when evaluating text classifiers by using cross-fold validation, and how to integrate feature selection in the text classification process. For this purpose, I have used the FilteredClassifier and the MultiFilter in the Explorer GUI provided by WEKA. Now it is time to do so in the command line.

WEKA essentially provides three usage modes:

Using the Explorer, and other GUIs like the Experimenter, which allow to setup experiments and to examine the results graphically.
Using the command line functions, which allow to setup filters, classifiers and clusterers with plenty of configuration options.
Using the classes programmatically, that is, in your own programs in Java.

One major difference between modes 1 and 2 is that in the first mode, you spend some of the memory in the GUI, while in the second one, you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command line options like -Xms, -Xms and so, but it may be interesting to save the memory used in the graphic elements in order to be able to deal with bigger datasets.

I will deal with the usage of WEKA in your programs in the future, in this post I focus on the command line. Before trying the following examples, please ensure weka.jar is added to your CLASSPATH. The first thing we must know is that WEKA filters and classifiers can be called in the command line, and that the call without arguments will show their configuration options. For instance, when you call a rule learner like PART (which I used in my previous posts), you get the following options:

$>java weka.classifiers.rules.PART Weka exception: No training file and no object input file given. General options: -h or -help Output help information. -synopsis or -info Output synopsis for classifier (use in conjunction with -h) -t <name of training file> Sets training file. -T <name of test file> Sets test file. If missing, a cross-validation will be performed on the training data. ... Options specific to weka.classifiers.rules.PART: -C <pruning confidence> Set confidence threshold for pruning. (default 0.25) ...

I omit the full list of options. Options are divided into two groups, those that are accepted by any classifier and those specific to the PART classifier. General options include three usage modes:

Evaluating the classifier on the training collection it self, possibly using cross validation, or on a test collection.
Training a classifier and storing the model in a file for further use.
Training a classifier and getting its output (classification of instances) on a test collection.

However, when calling a filter in the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file by using the appropriate operator (<), or to use the option -h to get the options of the filter.

In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the SMS Spam Collection, namely the smsspam.small.arff file. As every instance is of the form [spam|ham],"message text", we have to transform the text of the message into a term weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class as in the following command:

$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation === Correctly Classified Instances 173 86.5 % Incorrectly Classified Instances 27 13.5 % Kappa statistic 0.4181 Mean absolute error 0.1625 Root mean squared error 0.3523 Relative absolute error 58.2872 % Root relative squared error 94.9031 % Total Number of Instances 200 === Confusion Matrix === a b <-- classified as 13 20 | a = spam 7 160 | b = ham

Which is exactly the one I showed in my previous post. I have used the following general options:

-t smsspam.small.arff to specify the dataset to train (and on default, to evaluate on by using cross-validation).
-c 1 to specify the first attribute as the class.
-x 3 to specify that the number of folds to be used in the cross-validation evaluation is 3.
-v and -o to avoid outputting the classifiers and statistics on the training collection, respectively.

Plus the specific options of the FilteredClassifier -F to define the filter, and -W to define the classifier.

In my subsequent post on chaining filters, I proposed to make use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:

$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART

To get the following output:

=== Stratified cross-validation === Correctly Classified Instances 167 83.5 % Incorrectly Classified Instances 33 16.5 % Kappa statistic 0.1959 Mean absolute error 0.1967 Root mean squared error 0.38 Relative absolute error 70.53 % Root relative squared error 102.3794 % Total Number of Instances 200 === Confusion Matrix === a b <-- classified as 6 27 | a = spam 6 161 | b = ham

Which in turn, it is the same I got in that post. If we replace PART by the SMO implementation of Support Vector Machines included in WEKA (by changing weka.classifiers.rules.PART to weka.classifiers.functions.SMO), we get the accuracy figure of 91%, as described in the post.

While most of the options are the same as in the previous command, two things deserve special attention in this one:

We chain the StringToWordVector and the AttributeSelection filters by using the MultiFilter described in the previous post. The order of calls is obviously relevant, as we first need to tokenize the messages into words, and then selecting the most informative words. Moreover, while we apply StringToWordVector with the default options, the AttributeSelection filter makes use of the InfoGainAttributeEval function as quality metric, and the Ranker class as the search method. The Ranker class is applied with the option -T 0.0 in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but to keep only which score is over the threshold defined by T, that is 0.0.
As the order of options is not relevant, it is required to link the options to the appropriate class by using the quotation mark symbol ("). Unfortunately, we have three nested expressions:
- The whole MultiFilter filter, enclosed by the isolated quotation marks (").
- The AttributeSelection filter, enclosed by the escaped quotation mark (\").
- The Ranker search method, enclosed by the double escaped quotation mark (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.
So many escaping symbols make it a bit dirty, but still functional.

Si I have shown how we can chain filters and classifiers, and apply several chained filters as well, in the command line. In next posts I will explain how to train, store and then evaluate a classifier by using the command line, and how to make use of WEKA filters and classifiers in your Java programs.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

NOTE: You can find the collection I used in this post, along with other stuff related to WEKA and text mining in my Text Mining in WEKA page.

Nihil Obstat

1.4.13

Command Line Functions for Text Mining in WEKA

No hay comentarios: