Imbalanced class distribution and WEKA cost-sensitive learning

I was recently asked how to address the imbalanced class distribution problem using WEKA's cost-sensitive classifiers. In particular, the weighting method supported by WEKA can be used to simulate stratification, which avoids downsampling the majority class and thus takes advantage of all the available data.

The idea is simple. WEKA supports increasing the weight of examples. If class A makes up only 1% of the data, most classifiers will learn a trivial rejector, because it is 99% accurate. But you can increase the weight of mistakes on class A (false negatives, FN), for instance in a 10:1 ratio. The classifier will then try to avoid false negatives, because each one is equivalent to 10 false positives (FP).
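The arithmetic behind this can be sketched in a few lines of Python. The dataset size (1,000 examples), the error counts of the second classifier, and the helper names `accuracy` and `total_cost` are illustrative assumptions, not part of WEKA.

```python
# Illustrative numbers only: 1,000 examples, 1% of them in class A (the positive class).
N_POSITIVE = 10
N_NEGATIVE = 990

def accuracy(fn, fp):
    """Fraction of correct predictions, given the number of FN and FP."""
    return 1.0 - (fn + fp) / (N_POSITIVE + N_NEGATIVE)

def total_cost(fn, fp, fn_cost=10.0, fp_cost=1.0):
    """Total misclassification cost with a 10:1 FN:FP cost ratio."""
    return fn * fn_cost + fp * fp_cost

# Trivial rejector: never predicts class A, so it misses all 10 positives.
rejector = {"fn": 10, "fp": 0}
# Hypothetical classifier that finds 8 of the 10 positives at the price of 30 false alarms.
detector = {"fn": 2, "fp": 30}

for name, errors in (("rejector", rejector), ("detector", detector)):
    print(name, round(accuracy(**errors), 3), total_cost(**errors))
```

Under these assumed numbers the rejector is the more accurate of the two (0.99 against 0.968), but it costs twice as much (100 against 50), so a learner that minimizes cost rather than error has no reason to prefer it.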

To do this in the WEKA Explorer:

  1. Load a data collection in the Preprocess tab, and go to the Classify tab.
  2. Select meta.CostSensitiveClassifier, and click on the classifier text box to open its properties.
  3. Click on the cost matrix field, select a 2x2 matrix and configure the costs. For instance, set the FN cost to 10.0 and the FP cost to 1.0. True positives and true negatives should usually cost 0.0, because a correct prediction rarely has a cost.
  4. Click on the classifier field to select the base classifier. Every classifier that tries to optimize accuracy or error can be made cost-sensitive, in particular all decision trees, rule learners, and even Support Vector Machines.
  5. Go on with your experiment.
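To make the link between the cost matrix and instance weights concrete, here is a minimal Python sketch of one way to turn the 2x2 matrix from step 3 into per-class weights. It assumes that rows are actual classes and columns are predicted classes, and that each class is weighted in proportion to its misclassification cost; the function name `class_weights` and the class counts are hypothetical, and WEKA's own implementation may differ in its details.

```python
# Assumed layout: rows = actual class, columns = predicted class.
# Class index 0 = majority (negative), 1 = minority class A (positive).
COST_MATRIX = [
    [0.0, 1.0],   # actual negative: correct = 0, false positive = 1
    [10.0, 0.0],  # actual positive: false negative = 10, correct = 0
]

def class_weights(cost_matrix, class_counts):
    """Per-class instance weights proportional to each class's misclassification
    cost, rescaled so that the total weight equals the number of instances."""
    # The diagonal is zero here, so each row sum is just the misclassification cost.
    row_costs = [sum(row) for row in cost_matrix]
    total = sum(class_counts)
    weighted_total = sum(c * n for c, n in zip(row_costs, class_counts))
    return [c * total / weighted_total for c in row_costs]

# Illustrative class counts: 990 majority examples, 10 examples of class A.
weights = class_weights(COST_MATRIX, class_counts=[990, 10])
print(weights)
```

With these assumed counts, each class A example weighs ten times as much as a majority example, and class A accounts for about 9% of the total weight instead of 1% of the examples.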

Here you can see a screenshot of the cost matrix editor.

This is my 5 cent tip for WEKA. More coming :-)


6 comments:

Deviously Alien said...

Thanks for your nice article.
I have one question though: how do you manage to set up a 2x2 cost matrix instead of a 1x1?

José María Gómez Hidalgo said...

First, select CostSensitiveClassifier in the Explorer window.

Second, click on the classifier text bar; you will have access to the properties.

Third, click on the costMatrix field.

Fourth, select 2 classes: there you can edit the costs.

Best -

volkan said...

Thanks so much, saved my life. This is the only source that I could find which explains using cost matrices with weka. Thanks again

Anonymous said...

I ran through this approach, since I didn't want to use WEKA's Resample filter to bias the instances to a uniform distribution by either subsampling majority classes, or resampling minority classes.

However, what I'm asking myself is whether this proposed method of weighting really differs from the ones above. I have the impression that this method biases the data just as if I resampled the minority classes.

However, it might be that I'm wrong. So my question is: am I wrong in my understanding? What is better about this approach?

Jose Maria Gomez Hidalgo said...

Hi, anonymous

Using weighting (by over-weighting the minority class) is fully equivalent to oversampling the minority class. It is just like duplicating the instances in the minority class, multiplying their occurrences by the weight you have assigned to them.

In my view, this is better than downsampling the majority class (because you do not lose instances that may be important).

It differs from resampling the minority class when the resampling does not duplicate each instance exactly as many times as the weight you have selected. For instance, if resampling is done by randomly duplicating instances in the minority class, you may get several occurrences of one instance and just one of another. Using costs forces each instance to be replicated the same number of times. In my view, it is better this way because you get predictable behaviour. I tend to avoid random samples because the results may not be reproducible.
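The difference can be seen in a small Python sketch. The instance ids, the weight of 3 and the random seed are made up for illustration, and the code does not use WEKA; it only counts how often each minority instance ends up being represented under each scheme.

```python
import random
from collections import Counter

minority = ["a1", "a2", "a3", "a4"]   # hypothetical minority-class instance ids
WEIGHT = 3                            # weight assigned to each minority instance

# Weighting: every instance counts exactly WEIGHT times.
weighted_counts = Counter({inst: WEIGHT for inst in minority})

# Deterministic oversampling: duplicate each instance exactly WEIGHT times.
duplicated_counts = Counter(inst for inst in minority for _ in range(WEIGHT))

# Random oversampling: draw the same total number of instances with replacement.
# The seed is fixed here only to make this sketch repeatable.
rng = random.Random(0)
random_counts = Counter(rng.choices(minority, k=WEIGHT * len(minority)))

print(weighted_counts == duplicated_counts)   # True
print(sorted(random_counts.items()))
```

Weighting and exact duplication give identical counts, while the random draw has the same total but nothing guarantees that each instance appears the same number of times.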

I hope this explanation helps you. In any case, feel free to ask more questions :)


Anonymous said...

Thank you very much for your clear answer! That helped me a lot!