Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers

There are several ways of evaluating a (text) classifier: cross validation, splitting your dataset into train and test subsets, or even evaluating the classifier on the training set itself (not recommended). I will not discuss the merits of each method; instead, I will focus on a train/test split evaluation.

When you start to work with your train and test text datasets, you have two labelled text collections, such as those I make available at my GitHub project: smsspam.small.train.arff and smsspam.small.test.arff. In this case, we have two collections that are a 50% split of my original simple collection smsspam.small.arff, which in turn is a subset of the original SMS Spam Collection. The files are formatted according to the WEKA ARFF format:

@relation sms_test

@attribute spamclass {spam,ham}
@attribute text String

@data

ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'

That is, one text instance per line, the first attribute being the nominal class spam/ham, and the second attribute being the text itself.

In text classification, you have to transform this original representation into a vector of terms/words/stems/etc. in order to allow the classifier to learn expressions like: "if the word 'win' occurs in a text, then classify it as spam". In other words, you have to represent your texts as feature vectors, where the features are words and the values are, for example, binary weights, TF weights, or TF.IDF weights. In fact, WEKA provides the handy StringToWordVector filter for this purpose (Thanks, WEKA!).
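
To make the idea of "words as features" concrete, here is a minimal plain-Java sketch that turns one message into TF (term frequency) weights. It is only an illustration and not what StringToWordVector does internally: the class name, the method name and the tokenization (lowercasing and splitting on anything that is not a letter or a digit) are my own assumptions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of WEKA: one text in, term-frequency features out.
final class BagOfWordsSketch {

    // Assumed tokenization: lowercase, then split on runs of non-alphanumeric characters.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum); // add 1 to the count of this word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each key is a feature (a word), each value is its TF weight in this message.
        System.out.println(termFrequencies("Free entry to win, text WIN to 87121"));
    }
}
```

Replacing each count by 1 would give binary weights, while TF.IDF weights additionally need document frequencies computed over the whole collection.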

However, the vocabulary used in your training set and in your test set is most likely not identical. For instance, if you directly apply the StringToWordVector filter to the previous files, you get slightly different results, summarized in the following table:

Obviously, to enable learning you have to ensure that the representation of both datasets is the same. For instance, imagine that the root of the decision tree you have learnt on your training collection poses a test on an attribute that does not exist in your test collection. What happens then?

Fortunately, WEKA provides at least three ways of getting the same vocabulary in your train and test subcollections. Here they are:

  1. Using a batch filter that takes both the training and the test collections at the same time, using the former to get the attributes and representing the latter with those attributes.
  2. Using a FilteredClassifier (which I have discussed in previous posts), which combines the filter and the classifier into a single classifier that takes the original class/text representation as input for both the training and the test sets.
  3. A more recent method: getting the two representations separately and using an InputMappedClassifier, which acts as a wrapper of an underlying classifier and tries to match the attributes of the training collection to the corresponding ones in the test subset.

The first method is quite simple: it just makes use of the -b option of the WEKA filters. The corresponding command line calls are as follows:

$> java weka.filters.unsupervised.attribute.StringToWordVector -b -i smsspam.small.train.arff -o smsspam.small.train.vector.arff -r smsspam.small.test.arff -s smsspam.small.test.vector.arff
$> java weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
=== Confusion Matrix ===
a b <-- classified as
1 15 | a = spam
0 84 | b = ham
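
Conceptually, batch filtering means that the vocabulary is learnt from the training texts only, and that both collections are then encoded with it. The following plain-Java sketch illustrates that idea under my own assumptions (made-up class and method names, a naive tokenizer, binary weights); it is not the code WEKA runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical illustration, not WEKA code.
final class SharedVocabularySketch {

    // Assumed tokenization: lowercase, split on runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    // Sorted list of the distinct training words; the position is the attribute index.
    static List<String> buildVocabulary(List<String> trainingTexts) {
        TreeSet<String> words = new TreeSet<>();
        for (String text : trainingTexts) words.addAll(tokenize(text));
        return new ArrayList<>(words);
    }

    // Binary presence vector; words that are not in the vocabulary are dropped.
    static int[] vectorize(String text, List<String> vocabulary) {
        int[] vector = new int[vocabulary.size()];
        for (String token : tokenize(text)) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) vector[index] = 1;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> vocabulary =
                buildVocabulary(Arrays.asList("win a free prize", "see you at lunch"));
        System.out.println(vocabulary);
        // "tomorrow" never occurred in training, so it gets no attribute at all.
        System.out.println(Arrays.toString(vectorize("free lunch tomorrow", vocabulary)));
    }
}
```

Since every vector has one position per training word, train and test instances always share the same attributes, whatever words the test messages contain.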

The second method, conveniently discussed in my previous post, can be applied with the following call:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.train.arff -T smsspam.small.test.arff -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.lazy.IBk -i -c first
=== Confusion Matrix ===
a b <-- classified as
1 15 | a = spam
0 84 | b = ham

As the previous outputs show, both methods achieve the same results. In this case, I have opted for using StringToWordVector without parameters (default tokenization, term weights, no stemming, etc.) with the relatively weak classifier IBk, which implements a k-Nearest-Neighbor learner. Instead of building a model from the training collection, it searches for the training instance closest to the test instance (k is 1 by default) and assigns its class to the test instance.
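
For readers who have never seen a nearest-neighbor learner, this is roughly what happens for k = 1. The sketch below is my own simplification in plain Java (hypothetical names, Euclidean distance assumed, ties resolved in favour of the first instance) and it is not the IBk implementation.

```java
// Hypothetical 1-nearest-neighbor sketch; it is not WEKA's IBk.
final class NearestNeighborSketch {

    // Squared Euclidean distance between two vectors of the same length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the label of the training vector that is closest to the query.
    static String classify(double[][] trainVectors, String[] trainLabels, double[] query) {
        int best = 0;
        for (int i = 1; i < trainVectors.length; i++) {
            if (squaredDistance(trainVectors[i], query)
                    < squaredDistance(trainVectors[best], query)) {
                best = i;
            }
        }
        return trainLabels[best];
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1, 0}, {0, 0, 1}}; // toy word vectors
        String[] labels = {"spam", "ham"};
        System.out.println(classify(train, labels, new double[] {1, 0, 0}));
    }
}
```

Note that the distance is computed position by position, which is one more reason why train and test vectors must share exactly the same attributes.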

However, the third method achieves different results, because the mapping makes some attributes from the training collection disappear (those with no match in the test collection) and ignores the attributes that are new in the test collection. It is invoked as follows:

$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.train.arff -o smsspam.small.train.vector.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.test.arff -o smsspam.small.test.vector.arff
$> java weka.classifiers.misc.InputMappedClassifier -W weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
Attribute mappings:
Model attributes Incoming attributes
------------------------------ ----------------
(nominal) spamclass --> 1 (nominal) spamclass
(numeric) #&gt --> 2 (numeric) #&gt
(numeric) $1 --> - missing (no match)
(numeric) &amp --> - missing (no match)
(numeric) &lt --> 6 (numeric) &lt
(numeric) *9 --> 7 (numeric) *9
(numeric) + --> - missing (no match)
(numeric) - --> 8 (numeric) -
=== Confusion Matrix ===
a b <-- classified as
2 14 | a = spam
1 83 | b = ham

In fact, this time we catch a bit more spam (2 out of 16 messages, instead of 1) at the cost of one false positive, although the overall accuracy is exactly the same: 85%. You can see how some of the attributes are missing (they do not occur in the test dataset), such as "$1", "+", etc. This certainly affects the performance of the classifier, so beware.
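
If you want to check the accuracy figures yourself, they follow directly from the confusion matrices: the correct decisions are on the diagonal. A minimal plain-Java sketch (the class and method names are mine):

```java
// Hypothetical helper for reading a confusion matrix; not part of WEKA.
final class ConfusionMatrixSketch {

    // Rows are the actual classes, columns the predicted ones, as in the WEKA output.
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) correct += matrix[i][j]; // diagonal cells are the hits
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] batchOrFiltered = {{1, 15}, {0, 84}}; // methods 1 and 2
        int[][] inputMapped = {{2, 14}, {1, 83}};     // method 3
        System.out.println(accuracy(batchOrFiltered));
        System.out.println(accuracy(inputMapped));
    }
}
```

Both matrices add up to 100 test messages with 85 on the diagonal (1 + 84 and 2 + 83), hence the identical accuracy in spite of the different individual decisions.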

Given these options, my recommendation is to use the first method, as it allows you to fully examine the representation of the datasets (term weight vectors) and it decouples filtering from training, which may be convenient in terms of efficiency.

Before ending this post, I have to thank Tiago Pasqualini Silva, Tiago Almeida and Igor Santos for our experiments with the SMS Spam Collection, and Tiago Pasqualini in particular for showing me the InputMappedClassifier.

And last but not least, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!

5 comments:

Guillermo Barbadillo Villanueva said...

Thank you JM for this useful post.

I have a problem that I'm not sure if it has a solution in weka.

Imagine that I have a text training set and that I use it to build a model with weka. Then I save the model.

I would like to use that saved model on new datasets to make predictions. But the new datasets may have different words and weka says that the "Training header of classifier and filter dataset don't match" when I try to use an AddClassification filter.

Is there a way to use a stored model on new data that may have different attributes?
I think that would be very convenient, because otherwise I have to train the model again.


Jose Maria Gomez Hidalgo said...

Hi, Guillermo

The easiest and most suitable way of doing what you want is following the second option above, that is, using a FilteredClassifier. I have provided some guidance on it in previous posts, please read them.

Thanks for reading, and best wishes


Rana said...

Hello Jose,

Thanks for this post too.
I have the same concern as Guillermo.
I did follow your advice to use the second option above, but I got the error shown below. But first, let me brief you on what I did:
-I used your MyFilteredLearner.java to create the Filtered classifier models.
-I modified the MyFilteredClassifier.java to evaluate the filtered classifier using an unseen dataset (instead of using the makeInstance())
-I applied the STWV filter on the dataset before using it with the loaded classifier model (that was created using MyFilteredLearner).

Here is a code segment of the evaluate() method

filter = new StringToWordVector();
Instances filteredInstances = Filter.useFilter(instances, filter);
Evaluation eval = new Evaluation(filteredInstances);

When executing the program, I got the following error on the last statement of the code segment above: 'java.lang.IllegalArgumentException: Attribute isn't nominal, string or date!'

Would appreciate your advice.


sss0350 said...

Hello Jose,

First, thank you for your great posts.
I have read them all, and they really help me a lot.

Let me describe my scenario here.
I use method 2 to take our training and testing data sets and select the most relevant attributes (by information gain) in the Preprocess step, generating a vector matrix, say OutputA.
In the Classify step, I use FilteredClassifier as you recommended (say, with RandomForest as my classifier and a MultiFilter including STWV and AttributeSelection), and set the cross-validation folds to 3.

Here are my questions.
I'm a little bit confused about method 2 (FilteredClassifier) and method 3 (InputMappedClassifier).

We use OutputA as input, which contains features from both the training and the testing data.
What really happens in each of these two methods when the testing data doesn't contain an attribute?

My understanding is this:
With method 2, although we use OutputA, FilteredClassifier allows us to chain the STWV and IG filters and train on every subset that CV cuts.

I'm not sure if I have misunderstood anything here, since OutputA contains all the attributes we get from the training and testing data. We still use the same attribute dimensions we selected in preprocessing.

a) For each subset of our training and testing process, how exactly does it represent the attributes that the testing data are missing? Does WEKA just ignore them, as they don't appear in this testing data?

b) Or, for each subset, since we run STWV and IG again on every fold of our data, will those missing attributes not even be selected if they are not significant enough in each CV run?

Thank you for your time.

sss0350 said...

Hello Jose again,

Another question: like Guillermo, I have a similar question about how to predict unsupervised data in the future.

My goal is to build a sentiment analysis model to predict whether future news data is positive or negative.

It seems we have to add our new testing data (usually unsupervised) and run over the steps we did before.
Otherwise, the attribute dimensions will be different when testing with method 2 (FilteredClassifier).

In this case, does method 3, using InputMappedClassifier, seem to be a viable way?
It will try to match attributes from our training set to the corresponding ones in the test set.

Can you give us some insight into what you think about using a currently trained model to predict future data? That would be really great.

Thank you again.