Comments on Nihil Obstat: Class imbalanced distribution and WEKA cost sensitive learning
Jose Maria Gomez Hidalgo (2020-04-13 17:20):

Thanks for your comment.

You are right: we have to define the weights carefully in order to avoid too much oversampling. Depending on the problem, you may have access to an expert, or to domain knowledge, for assessing misclassification costs. For instance, an expert may tell you that one FN is as expensive as 100 FPs; in that case, you can overweight the under-represented class by 100.

If you do not have access to such knowledge, you can set the weights so that both classes (positive and negative) carry the same total weight (50%/50%), or run tests with different weights (10/90, 20/80, etc.). This experimental approach requires several tests and is time-consuming. Moreover, there is no guarantee that any of the weighting schemas you test will lead to better results than the default 1/1.

You are also right regarding the generalization of results from one domain or dataset to another. Breast cancer is different from spam detection, which in turn is different from fraud detection, and so on. In fact, spam in one dataset may have one particular distribution, and a different one in another dataset. There are no general results; everything depends on the domain and the dataset, and you have to experiment and test. That is the task of the data scientist! A good data scientist starts with good hypotheses and quickly improves the results over several testing cycles, reaching good results fast. What matters here is experience; in the end, this task is making educated guesses guided by experience.

Hope this helps.

---
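As an illustration of the 50%/50% weighting mentioned above, here is a minimal Python sketch (plain Python, not WEKA; the class counts are made up for illustration). It computes the minority-class weight that balances the total class weights, and shows that weighting by w is equivalent to duplicating each minority instance w times:

```python
from collections import Counter

# Hypothetical imbalanced dataset: 900 negatives, 100 positives.
labels = ["neg"] * 900 + ["pos"] * 100
counts = Counter(labels)

# Weight that gives the minority class the same total weight as the majority.
w = counts["neg"] / counts["pos"]  # 9.0

# Weighted totals: each positive counts w times, each negative counts once.
weighted_pos = counts["pos"] * w      # 900.0
weighted_neg = counts["neg"] * 1.0    # 900.0

# Equivalent oversampling: duplicate each positive instance int(w) times
# in total, i.e. add int(w) - 1 extra copies of each one.
oversampled = labels + ["pos"] * (counts["pos"] * (int(w) - 1))
over_counts = Counter(oversampled)

print(weighted_pos, weighted_neg)               # 900.0 900.0
print(over_counts["pos"], over_counts["neg"])   # 900 900
```

Both routes give each class the same effective mass, which is why reweighting behaves like deterministic oversampling.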
Anonymous (2020-04-13 15:29):

Thank you very much for the informative answer.

I have a question, if that's okay. If cost-sensitive reweighting is equivalent to oversampling, that means one should be careful with the weight assigned. We do not want the under-represented class to be duplicated so much that it ends up being the majority class, which would bring us back to an imbalanced distribution. Is my thinking correct?

Finally, I am questioning the generalizability of cost-sensitive settings across datasets, because there are no guidelines as to which ratio to use; it is based solely on experimenting with different costs until a desired recall score is reached. For example, after applying Random Forest on a dataset of chemical compounds that, hypothetically, cure cancer, I may find that a misclassification cost of 100 yields a satisfactory recall score. But if I use 100 again as the misclassification cost on a different set, with a different class ratio, I may not get a good recall score.

Thanks for the help!

---
Jose Maria Gomez Hidalgo (2020-04-11 18:02):

Thanks for your comment.

When you do oversampling, you make the learning algorithm focus on the over-weighted examples, which it was missing previously. This will most likely increase the TP rate, which is the goal: catching examples you were missing.

However, there is a trade-off between the TP and FP rates, just as there is between recall and precision. By focusing on the under-represented examples, the algorithm will put some instances of the negative class into the positive class (FPs), thus increasing the FP rate. This behavior is to be expected.

Let's look at it from a different point of view. If the learning algorithm focuses on samples that were previously not considered (because they were under-represented), some variables will change their importance, and variables that are relevant in the minority class will now matter. Those variables can change the classification of some examples in the majority (negative) class.

All in all, when you manage to increase your TP rate, you typically increase your FP rate as well; likewise, when you increase your recall, you decrease your precision. The benefit is that you miss fewer examples of your target positive class, which is the goal... at some cost.

Hope it is clear.

---
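The recall/precision trade-off described above can be made concrete with a small sketch. The confusion-matrix counts below are invented for illustration: after overweighting the minority class, TP goes up, but so does FP, so recall rises while precision falls:

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical results before and after overweighting the minority class.
before = precision_recall(tp=40, fp=10, fn=60)  # misses many positives
after = precision_recall(tp=80, fp=40, fn=20)   # catches more, at a cost

# Recall rises from 0.4 to 0.8, while precision drops from 0.8 to about 0.67.
print(before)
print(after)
```

This is exactly the trade-off in the comment above: fewer missed positives, paid for with extra false alarms.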
Anonymous (2020-04-10 21:28):

Hi!

In your comment above you said: "Using weighting (by over-weighting the minority class) is fully equivalent to oversampling the minority class. It is just like duplicating the instances in the minority class by multiplying their occurrences by the weight you have assigned to them."

If that is the case, then why would the number of false positive instances increase? What has oversampling got to do with classifying negative cases as positive? Shouldn't oversampling ideally just improve the true positive rate (i.e., decrease the false negative rate)? Thank you!

---

Anonymous (2013-05-17 05:01):

Thank you very much for your clear answer! That helped me a lot!

---
Jose Maria Gomez Hidalgo (2013-05-16 07:55):

Hi, anonymous.

Using weighting (by over-weighting the minority class) is fully equivalent to oversampling the minority class. It is just like duplicating the instances in the minority class by multiplying their occurrences by the weight you have assigned to them.

In my view, this is better than downsampling the majority class, because you do not lose instances that may be important.

It differs from resampling the minority class when the resampling does not follow the criterion of duplicating each instance exactly the number of times you have selected as its weight. For instance, if resampling is done by randomly duplicating instances in the minority class, you may end up with several occurrences of one instance and just one of another. Using costs forces each instance to be replicated the same number of times. In my view, this is better because you get predictable behavior; I tend to avoid random samples because they can yield results that are not reproducible.

I hope this explanation helps you. In any case, feel free to ask more questions :)

Best

---

Anonymous (2013-05-16 05:25):

I went with this approach because I didn't want to use WEKA's Resample filter to bias the instances towards a uniform distribution by either subsampling the majority classes or resampling the minority classes.

However, I am asking myself whether this proposed weighting method really differs from the ones above. I have the impression that it biases the learner as if I had resampled the minority classes.

It might be that I'm wrong, though. So my question is: am I wrong in my understanding?
What is better about this approach?

---

volkan (2009-06-05 07:52):

Thanks so much, you saved my life. This is the only source I could find that explains using cost matrices with WEKA. Thanks again.

---

Jose Maria Gomez Hidalgo (2008-06-10 22:36):

First, select CostSensitiveClassifier in the Explorer window.

Second, click on the classifier text bar; you will get access to its properties.

Third, click on the costMatrix field.

Fourth, select 2 classes: there you can edit the costs.

Best

---

NightlordTW (2008-06-10 22:11):

Hello,
Thanks for your nice article. I have one question, though: how do you manage to set up a 2x2 cost matrix instead of a 1x1?
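To illustrate what a 2x2 cost matrix does once it is set up, here is a plain Python sketch of the general idea (not WEKA's API; the costs and probabilities are made up). A cost-sensitive classifier predicts the class with the minimum expected cost, so a high false-negative cost shifts borderline cases to the positive class:

```python
# Rows = actual class, columns = predicted class: cost[actual][predicted].
# Hypothetical 2x2 costs: a false negative (actual pos, predicted neg)
# costs 100, a false positive costs 1, correct predictions cost 0.
cost = {
    "neg": {"neg": 0.0, "pos": 1.0},
    "pos": {"neg": 100.0, "pos": 0.0},
}

def min_expected_cost_class(p_pos):
    """Pick the prediction with the lowest expected cost, given P(pos)."""
    p = {"pos": p_pos, "neg": 1.0 - p_pos}
    expected = {
        predicted: sum(p[actual] * cost[actual][predicted] for actual in p)
        for predicted in ("neg", "pos")
    }
    return min(expected, key=expected.get)

# With P(pos) = 0.05 a cost-blind classifier would predict "neg",
# but the high FN cost makes "pos" the cheaper prediction here.
print(min_expected_cost_class(0.05))   # pos
print(min_expected_cost_class(0.001))  # neg
```

The off-diagonal cells are what you edit in WEKA's costMatrix field; the diagonal (correct predictions) is normally left at zero.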