Text Mining in WEKA: Chaining Filters and Classifiers

One of the most interesting features of WEKA is its flexibility for text classification. Over the years, I have had the chance to run many experiments on text collections with WEKA, most of them in supervised tasks commonly known as Text Categorization, that is, classifying text segments (documents, paragraphs, collocations) into a set of predefined classes. Examples of Text Categorization tasks include assigning topic labels to news items, classifying email messages into folders, or, closer to my research, classifying messages as spam or not (Bayesian spam filters) and web pages as inappropriate or not (e.g. pornographic content vs. educational resources).

WEKA's support for Text Categorization is impressive. A prominent feature is that the package supports breaking text utterances into indexing terms (word stems, collocations) and assigning them a weight in term vectors, a required step in nearly every text classification task. This tokenization and indexing process is achieved by using a very flexible filter named StringToWordVector. Let me show an example of how it works.

I will start with a simple text collection, which is a small sample of the publicly available SMS Spam Collection. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters, and it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources. I will use a small subset in order to better show my points in this post. The subset is made of the first 200 messages, formatted in the WEKA ARFF format; its header and first messages look like this:

@relation sms_test

@attribute spamclass {spam,ham}
@attribute text String

@data
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
ham,'Ok lar... Joking wif u oni...'
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
ham,'U dun say so early hor... U c already then say...'
ham,'Nah I don\'t think he goes to usf, he lives around here though'
spam,'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv'
ham,'Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx'

In the first 200 messages of the collection, 33 of them are spam and 167 are legitimate ("ham"). This collection can be loaded in the WEKA Explorer, showing something similar to the following window:

The point is that messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules like:

if ("urgent" in message) then class(message) == spam
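To make the idea of breaking messages into words concrete, here is a minimal sketch in plain Java that does not use WEKA at all. The class name `WordVectorSketch`, the delimiter set and the 0/1 presence encoding are assumptions made for illustration; they are meant to resemble what an indexing filter does, not to reproduce WEKA's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeSet;

class WordVectorSketch {
    // Assumed delimiter set for this sketch; it is not read from WEKA.
    static final String DELIMITERS = " \r\n\t.,;:'\"()?!";

    // Split a message into tokens, keeping the original case.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Collect the sorted list of distinct tokens over all messages.
    static List<String> vocabulary(List<String> messages) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String message : messages) {
            vocab.addAll(tokenize(message));
        }
        return new ArrayList<>(vocab);
    }

    // One 0/1 value per vocabulary entry: 1 if the token occurs in the message.
    static int[] toVector(String message, List<String> vocab) {
        int[] vector = new int[vocab.size()];
        for (String token : tokenize(message)) {
            int index = vocab.indexOf(token);
            if (index >= 0) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
            "Ok lar... Joking wif u oni...",
            "U dun say so early hor... U c already then say...");
        List<String> vocab = vocabulary(messages);
        System.out.println(vocab);
        System.out.println(Arrays.toString(toVector(messages.get(0), vocab)));
    }
}
```

Because the sketch keeps the original case, "U" and "u" end up as two different tokens, which is why a rule learner can test each of them separately.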

Here is where the StringToWordVector filter comes to help. You can just select it by clicking the "Choose" button in the "Filter" area, and browsing the folders to "weka > filters > unsupervised > attribute" one. Once selected, you should be able to see something like this:

If you click on the name of the filter, you will get a lot of options, which I leave for another post. For my goals in this one, you can just apply this filter with the default options to get an indexed collection of 200 messages and 1,382 indexing tokens (plus the class attribute), shown in the next picture:

If you want to see colors showing the distribution of attributes (tokens) according to the class, you can just select the "spamclass" attribute as the class for the collection in the bottom-left area of the WEKA Explorer. You can then see that the attribute "Available" occurs in just one message, which happens to be a legitimate (ham) one:

Now we can run our experiments in the Classify tab. We can just select cross-validation using 3 folds (1), point to the appropriate attribute to be used as a class (which is the "spamclass" one) (2), and select a rule learner like PART in the classifier area (3). You can find that classifier in the "weka > classifiers > rules" folder when clicking on the "Choose" button in the "Classifier" area. This setup is shown in the next figure:

The selected evaluation method, cross-validation, instructs WEKA to divide the collection into 3 sub-collections (folds) and perform three experiments. Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. The sub-collections are sampled randomly, so that each instance belongs to exactly one of them, and the class distribution (16.5% spam in our example, that is, 33 out of 200 messages) is kept inside each fold.
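The fold construction just described can be sketched in plain Java as follows. This is only an illustration under assumptions: the class `StratifiedFoldsSketch` and its round-robin assignment per class are hypothetical and are not WEKA's own procedure, but they show how 33 spam and 167 ham messages can be spread over 3 folds while keeping the class proportions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class StratifiedFoldsSketch {
    // Returns, for each instance, the index (0..k-1) of the fold it is tested in.
    static int[] assignFolds(String[] labels, int k, long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < labels.length; i++) {
            order.add(i);
        }
        Collections.shuffle(order, new Random(seed)); // random sampling

        // Deal the instances of each class to the folds in turn (round-robin),
        // so that every fold receives roughly the same share of each class.
        Map<String, Integer> nextFold = new HashMap<>();
        int[] fold = new int[labels.length];
        for (int index : order) {
            int f = nextFold.getOrDefault(labels[index], 0);
            fold[index] = f;
            nextFold.put(labels[index], (f + 1) % k);
        }
        return fold;
    }

    // Number of instances with the given label assigned to fold f.
    static int count(String[] labels, int[] fold, String label, int f) {
        int n = 0;
        for (int i = 0; i < labels.length; i++) {
            if (fold[i] == f && labels[i].equals(label)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String[] labels = new String[200];
        for (int i = 0; i < 200; i++) {
            labels[i] = i < 33 ? "spam" : "ham"; // 33 spam, 167 ham
        }
        int[] fold = assignFolds(labels, 3, 1L);
        for (int f = 0; f < 3; f++) {
            System.out.println("fold " + f + ": spam=" + count(labels, fold, "spam", f)
                + " ham=" + count(labels, fold, "ham", f));
        }
    }
}
```

In each of the three experiments, the instances of one fold are held out for testing and the other two folds are used for training.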

So, if we click on the "Start" button, we will get the output of our experiment, featuring the classifier learnt over the full collection, and the values for the typical accuracy metrics averaged over the three experiments, along with the confusion matrix. The classifier learnt over the full collection is the following one:

PART decision list

or <= 0 AND
to <= 0 AND
2 <= 0: ham (119.0/3.0)

£1000 <= 0 AND
call <= 0 AND
Reply <= 0 AND
i <= 0 AND
all <= 0 AND
final <= 0 AND
50 <= 0 AND
mobile <= 0 AND
ur <= 0 AND
text <= 0: ham (26.0/2.0)

i <= 0 AND
all <= 0: spam (30.0/3.0)

: ham (25.0/1.0)

Number of Rules : 4

This notation can be read as:

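if (("or" not in message) and ("to" not in message) and ("2" not in message)) then class(message) == ham
else if (("£1000" not in message) and ("call" not in message) and ("Reply" not in message) and ("i" not in message) and ("all" not in message) and ("final" not in message) and ("50" not in message) and ("mobile" not in message) and ("ur" not in message) and ("text" not in message)) then class(message) == ham
else if (("i" not in message) and ("all" not in message)) then class(message) == spam
otherwise class(message) == ham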
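The pair of numbers at the end of each rule can be checked with a few lines of plain Java. The sketch below assumes the usual reading of WEKA rule and tree output, where the first number is how many training instances the rule covers and the optional second number is how many of those it misclassifies; the class `RuleCountsSketch` and its regular expression are hypothetical helpers written for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleCountsSketch {
    // Matches a trailing "(covered)" or "(covered/errors)" group.
    private static final Pattern COUNTS =
        Pattern.compile("\\((\\d+(?:\\.\\d+)?)(?:/(\\d+(?:\\.\\d+)?))?\\)\\s*$");

    // Returns {covered, misclassified} for the last line of a rule.
    static double[] parseCounts(String ruleLine) {
        Matcher m = COUNTS.matcher(ruleLine);
        if (!m.find()) {
            throw new IllegalArgumentException("No counts found in: " + ruleLine);
        }
        double covered = Double.parseDouble(m.group(1));
        double errors = m.group(2) == null ? 0.0 : Double.parseDouble(m.group(2));
        return new double[] {covered, errors};
    }

    public static void main(String[] args) {
        String[] rules = {
            "2 <= 0: ham (119.0/3.0)",
            "text <= 0: ham (26.0/2.0)",
            "all <= 0: spam (30.0/3.0)",
            ": ham (25.0/1.0)"
        };
        double covered = 0;
        double errors = 0;
        for (String rule : rules) {
            double[] c = parseCounts(rule);
            covered += c[0];
            errors += c[1];
        }
        System.out.println("covered=" + covered + " misclassified=" + errors);
    }
}
```

Under that reading, the covered counts of the four rules add up to the 200 messages of the collection, which is consistent with every training message being handled by exactly one rule of the list.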

And the confusion matrix is the next one:

=== Confusion Matrix ===

a b <-- classified as
17 16 | a = spam
12 155 | b = ham

This means that the PART learner gets 17+155 = 172 classifications right and makes 12+16 = 28 mistakes, which leads to an accuracy of 86%.

But we have done it wrong!

Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training (making the learner try to generalize from a token that does not occur in the test collection). And when it is in the test fold, the learner should not even know about it! Moreover, what happens with attributes that are highly predictive for the full collection (according to their statistics when computing e.g. the Information Gain metric)? They may have worse (or better) statistics when a subset of their occurrences is not seen, as those occurrences may be in the test fold!
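The leak can be made concrete with a small sketch in plain Java (no WEKA involved). The class `FoldVocabularySketch`, its tokenizer and the choice of which messages play the training and test roles are assumptions for illustration only, and the spam message is shortened. The point is simply that a vocabulary built from the whole collection contains tokens that a vocabulary built from the training messages alone cannot know about.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class FoldVocabularySketch {
    // Distinct tokens of one message, split on whitespace and basic punctuation.
    static Set<String> tokens(String text) {
        Set<String> result = new TreeSet<>();
        for (String token : text.split("[\\s.,;:'\"()?!]+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Vocabulary of a set of messages.
    static Set<String> vocabulary(List<String> messages) {
        Set<String> vocab = new TreeSet<>();
        for (String message : messages) {
            vocab.addAll(tokens(message));
        }
        return vocab;
    }

    // Tokens of a message that survive when only the given vocabulary is known.
    static Set<String> project(String message, Set<String> vocab) {
        Set<String> kept = tokens(message);
        kept.retainAll(vocab);
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical split: two messages for training, one held out for testing.
        List<String> training = Arrays.asList(
            "Ok lar... Joking wif u oni...",
            "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005.");
        String test = "Go until jurong point, crazy.. Available only in bugis n great world la e buffet...";

        Set<String> trainVocab = vocabulary(training);
        Set<String> fullVocab = vocabulary(Arrays.asList(training.get(0), training.get(1), test));

        System.out.println("Indexed on everything: " + fullVocab.contains("Available"));
        System.out.println("Indexed on training only: " + trainVocab.contains("Available"));
        System.out.println("Test tokens known from training: " + project(test, trainVocab));
    }
}
```

Indexing before splitting corresponds to the first case; indexing inside each cross-validation run corresponds to the second.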

The right way to perform a text classification experiment with cross-validation in WEKA is to feed the indexing process into the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that we index and train on each training subset in the cross-validation run. Thus, you have to use the FilteredClassifier class provided by WEKA.

In fact, this is not that difficult. Let us go back to the original test collection, which features two attributes: the message (as a string) and the class. Then you can go to the Classify tab, and choose the FilteredClassifier learner, which is available in the "weka > classifiers > meta" folder, and shown in the next picture:

Then you must choose the filter and the classifier you are going to apply to the collection, by clicking on the classifier name in the "Classifier" area. I choose StringToWordVector and PART with their default options:

If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results:

=== Confusion Matrix ===

a b <-- classified as
13 20 | a = spam
7 160 | b = ham

This gives an accuracy of 86.5%, a bit better than the one obtained with the wrong setup. However, we catch 4 fewer spam messages, and the True Positive ratio goes down from 0.515 to 0.394. This setup is more realistic and it better mimics what will happen in the real world, in which we will find highly relevant but unseen events, and our statistics may change dramatically over time.
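The figures quoted for both setups follow directly from the two confusion matrices. The plain Java sketch below recomputes them; the class `ConfusionMatrixSketch` and its method names are hypothetical, and rows are taken to be the actual classes (spam first, then ham), as in the WEKA output above.

```java
class ConfusionMatrixSketch {
    // Fraction of instances on the diagonal (correct classifications).
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                total += matrix[row][col];
                if (row == col) {
                    correct += matrix[row][col];
                }
            }
        }
        return (double) correct / total;
    }

    // Fraction of the instances of class cls that were classified as cls.
    static double truePositiveRate(int[][] matrix, int cls) {
        int rowTotal = 0;
        for (int value : matrix[cls]) {
            rowTotal += value;
        }
        return (double) matrix[cls][cls] / rowTotal;
    }

    public static void main(String[] args) {
        int[][] indexedFirst = {{17, 16}, {12, 155}}; // filter applied before cross-validation
        int[][] chained = {{13, 20}, {7, 160}};       // filter applied inside each run
        System.out.printf("indexed first: accuracy=%.3f spam TP rate=%.3f%n",
            accuracy(indexedFirst), truePositiveRate(indexedFirst, 0));
        System.out.printf("chained:       accuracy=%.3f spam TP rate=%.3f%n",
            accuracy(chained), truePositiveRate(chained, 0));
    }
}
```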

So now we can run our experiment safely, as no unseen events will be used in the classification. Moreover, if we apply any kind of Information Theory based filter like e.g. ranking the attributes according to their Information Gain value, the statistics will be correct, as they will be based on the training set for each cross-validation run.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

23 comments:

Lavoisier Farias said...

Congratulations on this excellent post!!

Nirmala said...

Thanks for the excellent post.
The test collection is not balanced: there are more negative instances (ham) than positive instances (spam).
Doesn't it affect the model performance? Can you say a few words on this?

I have tried to run the classifier on sample spam data. My sample contains 118 ham instances and 82 spam. Following is the part of PART classifier output. Can you explain how to read PART decision list notation and the 'predictions on test data' part, especially for the instance 13.

Is the following interpretation correct for the first one:
if ("call" is in message) and ("u" is not in message) then class(message) = spam

and in the second one, how the fraction (105.0/4.0) can be understood?

PART decision list

call > 0 AND
u <= 0: spam (34.0)

Call <= 0 AND
to <= 0 AND
is <= 0: ham (105.0/4.0)

my <= 0 AND
tomorrow <= 0 AND
I <= 0: spam (45.0/2.0)

: ham (16.0/1.0)

Number of Rules : 4

Time taken to build model: 1.03 seconds

=== Predictions on test data ===

inst#, actual, predicted, error, probability distribution
1 1:spam 1:spam *1 0
2 1:spam 1:spam *1 0
3 1:spam 1:spam *1 0
4 1:spam 1:spam *1 0
5 1:spam 1:spam *1 0
6 1:spam 1:spam *1 0
7 1:spam 1:spam *1 0
8 1:spam 1:spam *1 0
9 1:spam 1:spam *1 0
10 1:spam 1:spam *1 0
11 1:spam 1:spam *1 0
12 1:spam 1:spam *1 0
13 1:spam 2:ham + 0.041 *0.959


Nirmala said...

small correction: there are more POSITIVE instances (ham) than NEGATIVE instances (spam).

Jose Maria Gomez Hidalgo said...

Dear Nirmala

First, thank you for your comment. About your questions:

Q1: "The test collection is not balanced, there are more negative instances (ham) than positive instances (spam)
doesn't it effect the model performance? can you say few words on this."

A1: In either case (more spam than ham or the opposite, independently of ham or spam being the positive class), an imbalanced distribution does affect performance. Real imbalance means over 80% of one of the classes. If you are below that distribution, most learning algorithms will be able to handle it.

In cases with e.g. 95% imbalance, many learning algorithms fall into the trivial acceptor or rejector (mark everything as ham, or as spam) because the algorithms try to optimize accuracy, and the trivial classifier has an accuracy of 95%. For those cases, I recommend using weighting as a variation of stratification. See my post on it: http://jmgomezhidalgo.blogspot.com.es/2008/03/class-imbalanced-distribution-and-weka.html. In the case of e.g. 95%/5%, you can give the majority class a weight of 5 and the minority one a weight of 95 as a rule of thumb.
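The rule of thumb above amounts to weighting each class by the percentage of instances in the other class. A minimal sketch in plain Java, assuming a two-class problem (the class `ClassWeightSketch` and its method are hypothetical, and how the weights are then passed to a learner is not shown):

```java
class ClassWeightSketch {
    // Two-class rule of thumb: the weight of a class is the percentage
    // of instances that belong to the other class.
    static double weight(int classCount, int total) {
        return 100.0 * (total - classCount) / total;
    }

    public static void main(String[] args) {
        // 95%/5% example: 950 majority and 50 minority instances.
        System.out.println("majority weight: " + weight(950, 1000));
        System.out.println("minority weight: " + weight(50, 1000));

        // SMS sample from the post: 167 ham and 33 spam messages.
        double hamWeight = weight(167, 200);
        double spamWeight = weight(33, 200);
        System.out.println("ham weight: " + hamWeight + ", weighted total: " + 167 * hamWeight);
        System.out.println("spam weight: " + spamWeight + ", weighted total: " + 33 * spamWeight);
    }
}
```

With these weights, both classes contribute the same total weight, which is what removes the incentive to predict only the majority class.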

Q2: I have tried to run the classifier on sample spam data. My sample contains 118 ham instances and 82 spam. Following is the part of PART classifier output. Can you explain how to read PART decision list notation and the 'predictions on test data' part, especially for the instance 13.

A2: That notation gives the actual class first (the number is the index of the class, class 1 in this case), then the predicted class, and then the probability of the prediction for each class:

12 1:spam 1:spam *1 0 => probability of class 1 (spam) = 1, probability of class 2 (ham) = 0; the (*) marks the chosen class for this instance.
13 1:spam 2:ham + 0.041 *0.959 => the same, but the probability for class 2 (ham) is nearly 1, so it classifies the instance as ham. The + in the error column flags that this prediction is wrong.

Q3: Is the following interpretation correct for the first one:
if ("call" is in message) and ("u" is not in message) then class(message) = spam


Q4: and in the second one, how the fraction (105.0/4.0) can be understood?

A4: The first number is how many training instances the rule covers, and the second is how many of those it classifies incorrectly. In the case of the rule:

call > 0 AND
u <= 0: spam (34.0)

It means the rule covers (or fits) 34 instances from the training collection, and all of them are correctly classified.

In the case of the rule:

Call <= 0 AND
to <= 0 AND
is <= 0: ham (105.0/4.0)

It means that the rule covers 105 instances, classifying 101 of them correctly and 4 wrongly. As a check, the covered counts of your four rules (34 + 105 + 45 + 16) add up to your 200 instances.

Thanks again for your comment.

Jose Maria

Nirmala said...

That's very clear. Thank you.

Anonymous said...

Hi. Thanks for the great posts. I am new to Weka and find your material extremely useful.

I am now confused about when to use a Filtered classifier.
Is it meant to be used in certain instances say when cross validation is used, what about when a percentage split or a supplied test set is used?

Specifically to my task (similar to Text Categorization) , I would like to use the StringToWord vector to create N-Grams, then use that data on different classifiers. For instance for 2-Grams test many classifiers (about 8), then build 3-Grams and test many classifiers, then for 4-Grams, 5-Grams, … to find the best combination (of N-grams and classifiers). I aim to do this using iteration in Java (a for loop).

In this case should I use the Filtered Classifier for all experiments? I plan to use the TextDirectoryLoader. I have 2 folders with the positive and negative classes. I will also be using other pre-processing techniques such as Information Gain.

Your assistance will be highly appreciated,

Jose Maria Gomez Hidalgo said...

Dear Anonymous

As far as I know, WEKA does not support creating your n-grams once unless you are going to use the same folds for each run of a CV experiment. In consequence, the simplest solution is to use the FilteredClassifier including the StringToWordVector each time; you cannot reuse the n-grams from each run for a given size (bigrams, trigrams, etc.).

If you make the folds for cross-validation in advance (e.g. three folds, thus six files for training and testing), then you can do batch filtering: you apply the StringToWordVector filter to both training and test sets in batch mode, and the test file will keep only the n-grams occurring in the training dataset (the correct way). You can do this as well for AttributeSelection (ranking n-grams according to Information Gain).

This approach has the advantage of efficiency, plus another one: keeping the intermediate ARFF files will allow you to dig into them, checking which are the most informative n-grams (in each fold), thus allowing you to debug and better understand the results, check your intuitions, etc.

Best regards

Jose Maria

Saeedeh Alimardani said...

I downloaded the SMS collection and imported it into WEKA. When I choose StringToWordVector I get this error: attribute names are not unique: causes 'class'.
What's the problem?

Anonymous said...

All classifications always have the same value. Do you know the problem?

Shaundra said...

Hi, unfortunately we made a terrible mistake before coming upon your blog. We applied the StringToWordVector filter on the training data and built a model using SMO and 5-fold cross-validation. Then when we tried to run the model on the test set, we found that the test set was not compatible with our training set. It looks like we have to use the FilteredClassifier but we are not quite sure how to do this. Could you please help us? Thank you very much.

Jose Maria Gomez Hidalgo said...

@Shaundra : Well, if you are going to do it in the WEKA Explorer, this post should make it clear. On the other hand, if you want to do it on the command line, the command should be something similar to this:

$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART

This is taken from another of my posts: http://jmgomezhidalgo.blogspot.pt/2013/04/command-line-functions-for-text-mining.html

Gaurav Kandoi said...


I'm new to weka and most of the time What I do is, make a training set with X features and N instances. Select the best classifier after checking for multiple classifier, save a model, now use 'supplied test set' option, select my test file and use 're-evaluate model on current test set'. Almost always, I get the following error:

Problem Evaluating classifier:
Index 1, Size 1

Can someone please tell me whats wrong? I replace the class by '?' in my test set! I work with .csv files!

Any help would be much appreciated!

Jose Maria Gomez Hidalgo said...

Gaurav, unless you give more information, I cannot see where the problem is.

These blog posts are focused on text mining with WEKA; I can hardly help with everything.

I suggest trying the WEKA list: http://list.waikato.ac.nz/mailman/listinfo/wekalist.

Leonardo Lion said...

Hi Gaurav,

You should train on the training set and evaluate on the test set in the same run (there is no need to save the model and load it again, because that is what introduces the difference between the training and test sets). Let's say you are performing text mining: first, load your dataset from the Preprocess panel. Then, from the Classify panel, choose the classifier "weka.classifiers.meta.FilteredClassifier". Set its base classifier to e.g. J48, and its filter parameter to "weka.filters.unsupervised.attribute.StringToWordVector". After that, load your test set from the Classify panel using the option "Supplied test set". Now, hit Start and collect the results. If you want to train another classifier, e.g. NaiveBayes, to compare it with J48, simply repeat the process, changing only the base classifier from J48 to NaiveBayes.

Please note that this process applies to text mining, as I mentioned previously. If you are performing regular classification, there is no need to use FilteredClassifier; just choose the classifiers that you want to use directly.

P.S.: Always make sure that the test set matches the training set in terms of attribute type, order and name.


Jose Maria Gomez Hidalgo said...

Thanks for your contribution, Leonardo!

Moohebat said...

To solve the error "attribute names are not unique: causes 'class'", just change the name of the attribute 'class' to something else in the SMS collection. It causes a conflict in newer versions of WEKA.

irhsanaB said...

Thanks for the excellent post.
I am new to WEKA and this post helped me to get a fair idea on using WEKA for data mining.
I am doing tweet analysis for my project where I have to categorize tweets.
I used the following code for learning the classifier.

if (trainData.classIndex() == -1)
    trainData.setClassIndex(trainData.numAttributes() - 1);

filter = new StringToWordVector();
classifier = new FilteredClassifier();

AttributeSelection as = new AttributeSelection();
as.setEvaluator(new InfoGainAttributeEval());
Ranker r = new Ranker();

MultiFilter mf = new MultiFilter();
mf.setFilters(new Filter[]{ filter, as });
multiFilter = mf;
classifier.setClassifier(new NaiveBayesMultinomial());

Then I evaluated the result and saved the model to the "twitterClassifier.binary" file.

For testing purpose, I wrote the following code.
DataSource testDataSrc = new DataSource("test.arff");
Instances testData = testDataSrc.getDataSet();
if (testData.classIndex() == -1)
    testData.setClassIndex(testData.numAttributes() - 1);

//load classifier from file
FilteredClassifier cls_co = (FilteredClassifier) weka.core.SerializationHelper.read("twitterClassifier.binary");
double score = cls_co.classifyInstance(testData.instance(0));
System.out.println("Class predicted: "+testData.attribute("class_attr").value((int)score));

But I am getting exception.
Exception in thread "main" java.lang.NullPointerException: No output instance format defined

My test.arff file contains the following data.
@relation disaster_tweets

@attribute text string
@attribute class_attr {derailment, fire, crash, wildfire, earthquake, bombings, shooting, flood, building_collapse, typhoon, explosion}

'declare state emergency due flood resident be move community centre albert',?

Would you please help me to find out the problem in the code? Why exception is thrown?

Jose Maria Gomez Hidalgo said...

Hi, irhsanaB

I suggest trying to isolate which call raises the exception. For this, either catch it and use e.printStackTrace(), or add printlns after the calls.

Thank you for reading and regards

irhsanaB said...

Thanks for the quick reply.
I am getting exception from the line :
double score = cls_co.classifyInstance(testData.instance(0));

java.lang.NullPointerException: No output instance format defined
at weka.filters.Filter.numPendingOutput(Filter.java:596)
at weka.classifiers.meta.FilteredClassifier.distributionForInstance(FilteredClassifier.java:416)
at weka.classifiers.Classifier.classifyInstance(Classifier.java:84)
at com.test.TestWeka.main(TestWeka.java:109)

I am still clueless about how to resolve it. I debugged the code. Both classifiers look the same after deserialization.

N A said...

I've recently read this article.
Would it give the same result as the method that you have explained here, given that they have applied the filter to both the training and the testing data sets? I also noticed that in both your example and theirs, the filters are applied to the testing file when training. What if I want to test a different file every time? Does that mean I have to retrain every time? I want to build my model and then test multiple files, for example testing data categorised by year.

Balachandar said...

Thanks for the post. It was really helpful. If possible, could you also share how to do this in Java with the WEKA API? Especially, how do we save the feature vector from the training data and use it for testing? Most of the examples on the web do not throw light on how to save the feature vector derived from training and use it in testing.