Nihil Obstat: Class imbalanced distribution and WEKA cost sensitive learning

6.3.08

Class imbalanced distribution and WEKA cost sensitive learning

I have recently been asked how to address the imbalanced class distribution problem using WEKA's cost-sensitive classifiers. In particular, the weighting method supported by WEKA can be used to simulate stratification, which avoids downsampling the majority class and thus takes advantage of all the available data.

The idea is simple. WEKA supports increasing the weight of examples. If class A accounts for 1% of the examples, most classifiers will learn a trivial rejector (one that never predicts A), because it is 99% accurate. But you can increase the weight of mistakes on class A (false negatives, FN), for instance in a 10:1 ratio. The classifier will then try to avoid false negatives, because each one costs as much as 10 false positives (FP).
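A small worked example makes the arithmetic concrete. The sketch below assumes a hypothetical test set of 1,000 examples with 10 in class A, plus a made-up "sensitive" classifier; the function names and all the counts are illustrative assumptions, not WEKA output.

```python
def total_cost(fn, fp, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost; correct predictions cost 0."""
    return fn * cost_fn + fp * cost_fp


def accuracy(tp, tn, fp, fn):
    """Fraction of examples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical test set: 10 examples of class A (positive), 990 others.
# Trivial rejector: never predicts A.
rejector = dict(tp=0, tn=990, fp=0, fn=10)
# Made-up alternative: finds 8 of the 10 positives at the price of 30 false alarms.
sensitive = dict(tp=8, tn=960, fp=30, fn=2)

for name, m in [("rejector", rejector), ("sensitive", sensitive)]:
    print(name, accuracy(**m), total_cost(m["fn"], m["fp"]))
```

With these made-up counts the rejector wins on accuracy (0.99 against 0.968), but under 10:1 costs it is twice as expensive (100 against 50), which is why a cost-aware learner would move away from it.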

To do this in the WEKA Explorer:

Load a data collection in the Preprocess tab, and go to the Classify tab.
Select meta.CostSensitiveClassifier, then click on the classifier text box to open its properties.
Click on the cost matrix field, select a 2x2 matrix, and configure the costs. For instance, set the FN cost to 10.0 and the FP cost to 1.0. The costs of true positives and true negatives should usually be 0.0, because a correct prediction rarely has a cost.
Click on the classifier field to select the base classifier. Any classifier that tries to optimize accuracy or error can be made cost-sensitive, including decision trees, rule learners, and even Support Vector Machines.
Go on with your experiment.
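To see what a 2x2 cost matrix like the one in the steps above can mean in terms of instance weights, here is a minimal sketch. It assumes a matrix layout of rows = actual class and columns = predicted class, and it weights each class by the sum of its row, rescaled so the total weight stays equal to the number of instances. Both the layout and the rescaling are assumptions for illustration; I am not claiming this is exactly what WEKA computes internally, and `class_weights` is a hypothetical helper.

```python
def class_weights(cost_matrix, class_counts):
    """Return one instance weight per actual class.

    cost_matrix[actual][predicted] holds the cost of that outcome (assumed layout).
    Each class is weighted by the total cost of misclassifying it (its row sum),
    then rescaled so that the total weight equals the number of instances.
    """
    row_cost = [sum(row) for row in cost_matrix]
    total = sum(class_counts)
    weighted_total = sum(c * n for c, n in zip(row_cost, class_counts))
    scale = total / weighted_total
    return [c * scale for c in row_cost]


# Class 0 = A (rare, positive); class 1 = everything else.
costs = [
    [0.0, 10.0],  # actual A: correct = 0, predicted not-A (FN) = 10
    [1.0, 0.0],   # actual not-A: predicted A (FP) = 1, correct = 0
]
counts = [10, 990]  # hypothetical 1% / 99% distribution

w = class_weights(costs, counts)
share_a = w[0] * counts[0] / sum(counts)
print(w, share_a)
```

Under these assumptions each class A instance weighs ten times as much as any other instance, and the effective share of class A rises from 1% to roughly 9%.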

[Screenshot: editing the cost matrix in the WEKA Explorer.]

This is my 5 cent tip for WEKA. More coming :-)

10 comments:

NightlordTW said...: hello
thanks for your nice article
I have one question though: how do you manage to set up a 2x2 cost matrix instead of a 1x1?; 10:11 p. m.
Jose Maria Gomez Hidalgo said...: First, select CostSensitiveClassifier in the Explorer window.

Second, click on the classifier text bar; you will have access to the properties.

Third, click on the costMatrix field.

Fourth, select 2 classes: there you can edit the costs.

Best -; 10:36 p. m.
volkan said...: Thanks so much, saved my life. This is the only source that I could find which explains using cost matrices with WEKA. Thanks again; 7:52 a. m.
Anonymous said...: I ran through this approach, since I didn't want to use WEKA's Resample filter to bias the instances towards a uniform distribution by either subsampling the majority classes or resampling the minority classes.

However, what I'm asking myself is whether this proposed weighting method really differs from the ones above. I have the impression that this method biases the data just as if I had resampled the minority classes.

However, I might be wrong. So my question is: am I wrong in my understanding? What is better about this approach?; 5:25 a. m.
Jose Maria Gomez Hidalgo said...: Hi, anonymous

Using weighting (by over-weighting the minority class) is fully equivalent to oversampling the minority class. It is just like duplicating the instances in the minority class, multiplying their occurrences by the weight you have assigned to them.

In my view, this is better than downsampling the majority class, because you do not lose instances that may be important.

It differs from resampling the minority class when the resampling does not simply duplicate each instance exactly as many times as the weight you have selected. For instance, if resampling is done by randomly duplicating instances in the minority class, you may end up with several occurrences of one instance and just one of another. Using costs forces each instance to be replicated the same number of times. In my view, it is better this way because you get predictable behaviour: I tend to run from random samples because you get results that may not be reproducible.

I hope this explanation helps you. In any case, feel free to ask more questions :)

Best; 7:55 a. m.
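The point made in this reply can be checked on a toy example: an integer weight gives the same weighted statistic as exact duplication, while random oversampling depends on the seed. The data, the weight of 4, and all function names below are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy data: (feature value, class label); "pos" is the minority.
data = [(5.0, "pos"), (7.0, "pos")] + [(1.0, "neg")] * 8
WEIGHT = 4  # weight assigned to every minority instance


def weighted_mean(rows, weight):
    """Mean of the feature when minority instances carry `weight`."""
    num = sum(x * (weight if y == "pos" else 1) for x, y in rows)
    den = sum((weight if y == "pos" else 1) for _, y in rows)
    return num / den


def duplicate(rows, weight):
    """Copy every minority instance exactly `weight` times."""
    out = []
    for x, y in rows:
        out.extend([(x, y)] * (weight if y == "pos" else 1))
    return out


def random_oversample(rows, weight, seed):
    """Draw the same number of minority copies, at random with replacement."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == "pos"]
    neg = [r for r in rows if r[1] == "neg"]
    return neg + [rng.choice(pos) for _ in range(weight * len(pos))]


def mean(rows):
    return sum(x for x, _ in rows) / len(rows)


print(weighted_mean(data, WEIGHT), mean(duplicate(data, WEIGHT)))  # 3.5 3.5
for seed in range(3):
    # The mean varies with the seed, since each instance is drawn a random number of times.
    print(seed, mean(random_oversample(data, WEIGHT, seed)))
```

Fixing the seed makes random oversampling repeatable, but the result still depends on which seed was picked, whereas the weighted and duplicated versions always agree.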
Anonymous said...: Thank you very much for your clear answer! That helped me a lot!; 5:01 a. m.
Anonymous said...: Hi!
In your comment above you said: "Using weighting (by over-weighting the minority class) is fully equivalent to oversampling the minority class. It is just like duplicating the instances in the minority class by multiplying their occurrences by the weight you have assigned to them."

If that is the case, then why would the number of false positive instances increase? What has oversampling got to do with classifying negative cases as positives? Shouldn't oversampling ideally just improve the true positive rate (aka decrease false negative rate)? Thank you!; 9:28 p. m.
Jose Maria Gomez Hidalgo said...: Thanks for your comment.

When you oversample, you make the learning algorithm focus on the over-weighted examples, which it was missing previously. This will most likely increase the TP rate, which is the goal: catching the examples you were missing.

However, there is a trade-off between the TP and FP rates, as there is between recall and precision. By focusing on the under-represented examples, the algorithm will put some examples from the negative class into the positive class (FPs), thus increasing the FP rate. This behavior is to be expected.

Let's look at it from a different point of view. If the learning algorithm focuses on examples that were previously not considered (because they were under-represented), some variables will change in importance, and variables that are relevant to the minority class will now matter. Those variables can change the classification of some examples in the majority negative class.

All in all, when you increase your TP rate, you typically increase your FP rate as well, just as you typically decrease your precision when you increase your recall. However, the benefit is that you miss fewer examples in your target positive class, which is the goal... at some cost.

Hope it is clear.; 6:02 p. m.
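The trade-off described in this reply can be illustrated with a different but related mechanism: moving the decision threshold on predicted probabilities instead of reweighting the training data. The sketch assumes zero cost for correct predictions, in which case predicting positive has the lower expected cost whenever the estimated probability exceeds cost_FP / (cost_FP + cost_FN). The scores and labels are made up.

```python
# Hypothetical scores: (estimated probability of the positive class, true label).
scored = [
    (0.95, 1), (0.70, 1), (0.40, 1), (0.15, 1), (0.05, 1),
    (0.60, 0), (0.30, 0), (0.20, 0), (0.12, 0), (0.08, 0),
    (0.04, 0), (0.03, 0), (0.02, 0), (0.02, 0), (0.01, 0),
]


def rates(rows, threshold):
    """Return (TP rate, FP rate) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in rows if y == 1 and s >= threshold)
    fp = sum(1 for s, y in rows if y == 0 and s >= threshold)
    pos = sum(1 for _, y in rows if y == 1)
    neg = len(rows) - pos
    return tp / pos, fp / neg


def cost_threshold(cost_fn, cost_fp):
    """Probability above which predicting positive has the lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


print(rates(scored, 0.5))                        # (0.4, 0.1)
print(rates(scored, cost_threshold(10.0, 1.0)))  # (0.8, 0.4)
```

On this made-up data, the 10:1 costs lower the threshold from 0.5 to about 0.09: the TP rate doubles, and the FP rate rises with it.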
Anonymous said...: Thank you very much for the informative answer.
I have a question if that's okay. So if cost sensitivity's reweighting is equivalent to over-sampling, that means that one should be careful with the weight assigned. We do not want the underrepresented class to be duplicated so much that it ends up being the majority class, which would bring us back to having imbalanced classes. Is my thinking correct?

Finally, I am questioning the generalizability of cost-sensitivity across different datasets, because there are no guidelines as to which ratio to use and it is solely based on experimenting with different costs until a desired Recall score is reached. For example, after applying Random Forest on a dataset of chemical compounds that, hypothetically, cure cancer, I may find that a misclassification cost of 100 results in a satisfactory Recall score. But if I use 100 again as a misclassification cost on a different set, with a different ratio of classes, I may not get a good Recall score.

Thanks for the help!; 3:29 p. m.
Jose Maria Gomez Hidalgo said...: Thanks for your comment.

You are right, we have to define the weights carefully in order to avoid too much oversampling. Depending on the problem, you may have access to an expert or to domain knowledge to assess misclassification costs. For instance, an expert can tell you that one FN is as expensive as 100 FPs; thus, you can overweight the under-represented class by 100.

If you do not have access to such knowledge, you can just set the weights to ensure that both classes (positive and negative) have the same total weight (50%/50%), or run tests with different weights (10/90, 20/80, etc.). This experimental approach requires several tests and is time-consuming. Moreover, there is no guarantee that any of the weighting schemes you test will lead to better results than using the default 1/1.

In addition, you are right regarding the generalization of results from one domain or dataset to another. Breast cancer is different from spam detection which, in turn, is different from fraud detection, and so on. In fact, spam may have one particular distribution in one dataset and a different one in another. There are no general results: it all depends on the domain and dataset, and you have to experiment and test. That is the task of the data scientist! A good data scientist starts with good hypotheses and quickly improves the results over several testing cycles, reaching good results fast. What matters here is experience; in the end, this task is making educated guesses guided by experience...

Hope this helps.; 5:20 p. m.
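For the 50%/50% and 10/90, 20/80 options mentioned in this reply, the weight per minority instance follows from simple algebra. If majority instances keep weight 1 and the minority should hold a share s of the total weight, then w * n_min / (w * n_min + n_maj) = s, which gives w = s * n_maj / ((1 - s) * n_min). The sketch below reads "10/90" as a 10% minority share, which is my interpretation; `minority_weight` is a hypothetical helper and the class counts are made up.

```python
def minority_weight(n_minority, n_majority, target_share=0.5):
    """Weight per minority instance so the minority holds `target_share` of the total weight.

    Majority instances are assumed to keep weight 1.
    """
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    return target_share * n_majority / ((1.0 - target_share) * n_minority)


# Hypothetical dataset with a 1% / 99% class distribution.
n_min, n_maj = 10, 990
for share in (0.1, 0.2, 0.5):
    print(share, round(minority_weight(n_min, n_maj, share), 2))
```

For this distribution a weight of 99 balances the two classes exactly; any larger weight makes the original minority the effective majority, which is the situation the previous commenter was worried about.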
