Comments on Nihil Obstat: "Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters" (Jose Maria Gomez Hidalgo)

Rahul (2016-11-23):

Very informative posts to get one started with WEKA-based ML. I have a very basic question: assume that I have multiple text attributes (bug report, summary, description, related conversation, etc.), and the class can have multiple values (instead of yes/no or spam/not spam). Let's say my ARFF looks like:

    @attribute Summary string
    @attribute Desc string
    @attribute @@PROBLEM-CLASS@@ {Severe,Moderate,Regular,NoFixNeeded}

    @data
    ...

Now the questions: can I handle such cases in WEKA? I have been trying to take one string attribute at a time and apply the StringToWordVector transformation (with TF-IDF), followed by SMO (I had to apply SMOTE to the class, as there were too few Severe records in the sample). My confusion comes from the fact that the class attribute is not of a 1/0 type; it has multiple values. I would be grateful if you could guide me on this.

Regards,
Rahul

Anonymous (2015-10-28):

Hi Jose!
Usually in the literature, LSA is a method related to unstructured data. I used LSA as an attribute selector for my structured data, increasing the accuracy of the classifier. Is it correct to use it on structured data? Thanks!

Jose Maria Gomez Hidalgo (2015-04-30):

Hi, Abhishek
I am currently very busy, not able to perform self-supported research, and doing only very limited freelancing. I am afraid I cannot help you; sorry for that.
Good luck

Anonymous (2015-04-29):

Hello! I found your post very useful and informative. I have an .arff file and, using the same approach as yours, I have achieved an accuracy of 45% under 10-fold cross-validation using the rule-based PART classifier. Is there a way I could send you my results? Would it be possible for you to give me suggestions for improving the accuracy to about 85%?
Thanks
Rana (2015-01-09, 11:48):

With reference to my previous message, I forgot to mention that I had the following instance variables defined, in addition to those already defined in the MyFilteredLearner class:

    MultiFilter multiFilter;
    Filter[] filters;
    StringToWordVector filter1;
    AttributeSelection filter2;

Thanks
Rana

Rana (2015-01-09, 11:39):

Dear Jose,

Inspired by your post :), I have used it to update your MyFilteredLearner.java so that a MultiFilter chains the STWV and AttributeSelection filters as shown below, but I am getting a NullPointerException at the statement filters[0] = filter1;

Here is the code in the try block of the evaluate() method:

    trainData.setClassIndex(0);
    multiFilter = new MultiFilter();

    filter1 = new StringToWordVector();
    filter1.setAttributeIndices("last");
    filters[0] = filter1;

    filter2 = new AttributeSelection();
    ASEvaluation attEvaluator = new InfoGainAttributeEval();
    filter2.setEvaluator(attEvaluator);
    Ranker ranker = new Ranker();
    ranker.setThreshold(0); // <0 ignored
    ASSearch asSearch = ranker;
    filter2.setSearch(asSearch);
    filters[1] = filter2;

    // Apply the chained filters
    classifier = new FilteredClassifier();
    multiFilter.setFilters(filters);
    classifier.setFilter(multiFilter);
    classifier.setClassifier(new NaiveBayes());

    Evaluation eval = new Evaluation(trainData);
    eval.crossValidateModel(classifier, trainData, 4, new Random(1));

The rest is the same as yours. Also, do I have to repeat the same setup in the learn() method, especially given that applying the evaluation and search is expensive? Can't I just reuse the multiFilter defined in evaluate()?

I would appreciate your soonest reply.

Regards
Rana
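A likely cause of the NullPointerException above is visible in the snippet itself: the `Filter[] filters` field is declared but never allocated, so `filters[0] = filter1;` dereferences a null array; the missing line would be `filters = new Filter[2];` before the first assignment. A minimal pure-Java illustration of the failure and the fix (no WEKA on the classpath; placeholder strings stand in for the filter objects):

```java
// Illustrates the NullPointerException reported above: a declared array
// reference stays null until it is allocated with "new".
public class FiltersArrayDemo {
    // Declared but not allocated, like the "Filter[] filters;" field above.
    static Object[] filters;

    public static void main(String[] args) {
        try {
            filters[0] = "StringToWordVector"; // like "filters[0] = filter1;" -> NPE
        } catch (NullPointerException e) {
            System.out.println("NPE: the array was never allocated");
        }
        filters = new Object[2]; // the missing line: "filters = new Filter[2];"
        filters[0] = "StringToWordVector";
        filters[1] = "AttributeSelection";
        System.out.println("assignments succeed after allocation");
    }
}
```

As for the second question: the multiFilter configured in evaluate() should be reusable in learn(), since FilteredClassifier re-trains its filter on whatever data it is built with; repeating the configuration code should not be required.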
Jose Maria Gomez Hidalgo (2015-01-09, 09:17):

Hi, Sir_Kay

I guess there is a problem in the script or in the configuration in the Explorer, most likely related to setting up the class attribute. However, unless I see an example or more details, I cannot be sure :(

Regards,
JM

Jose Maria Gomez Hidalgo (2015-01-09, 09:16):

Hi, Rana

Well, it depends on your PC and operating system. For datasets like yours, I usually use the Explorer on a very small subset (e.g. 100 tweets) in order to set up the configuration I want (filters, classifiers, etc.); then I write it into a script (bat or bash) and run the script on the full dataset.

There is some guidance on this on the WEKA page: http://www.cs.waikato.ac.nz/ml/weka/bigdata.html

Best regards,
JM

Rana (2015-01-08):

Hello,

Thank you for this post. I am facing problems removing attributes from my sparse ARFF file. But first I would appreciate your help with a problem in the Explorer when I use it with that file: it hangs if I try to set my class attribute. The 'Edit' tab takes a long time to get the list of attributes, and if I then right-click the attribute to set it as my class attribute, it just hangs and stops responding. Can the Explorer handle a sparse file of around 22,800 attributes (mostly unigrams generated from around 6,000 tweets)?

Thank you in advance

Rana

Sir_Kay (2014-11-28):

Dear Jose,

Your discussion is really marvelous. It has helped me get better accuracy in my classification task. But I got an error message when I applied the MultiFilter concept. The error is: "Problem evaluating classifier: weka.attributeSelection.InfoGainAttributeEval: Cannot handle string attributes".

Please kindly help out.
Thanks

Jose Maria Gomez Hidalgo (2014-04-07, 22:36):

For applying LSA to a dataset split into train and test subsets with WEKA, you only need to use the LatentSemanticAnalysis evaluator with the Ranker search in the AttributeSelection (AS) filter. Something like:

    java weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.LatentSemanticAnalysis -R 0.95 -A 5" -S "weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1"

As AS is a filter, you can apply it in batch mode (-b option) to a training and a test subset at once, thus representing the test dataset according to the dimensions defined on the training set.
Anonymous (2014-04-07, 21:23):

Hi!
First of all, I would like to say that your blog posts are amazing! They really helped me resolve some issues in my research. However, I now wish to apply LSA to my dataset, which is divided into a train and a test set. Could you please give me some directions on how to apply LSA to both the train and the test set?

Thanks in advance!

Jose Maria Gomez Hidalgo (2014-04-02, 10:32):

Very nice question. One of the techniques I suggested in the article is unsupervised: Latent Semantic Indexing. This technique does not make use of the class information; it is intended to reduce the dimensionality of the (otherwise sparse) text representation, while capturing the semantic dimensions of the texts. It fits clustering perfectly.

Other alternatives are the use of lexical databases or thesauri to map related terms or synonyms into single classes of semantically related words. For instance, using Roget's Thesaurus, we could map all occurrences of ball, bat, pitcher, etc. into a single class [baseball]. This reduces dimensionality as well, but it runs into the problem of Word Sense Disambiguation (which still has limited effectiveness). Thesauri can be constructed manually or automatically (see the classic books by Salton or van Rijsbergen).

Most unsupervised dimensionality reduction techniques come from Information Retrieval, as text retrieval is in general an unsupervised task (although some subtasks are supervised: Relevance Feedback, Learning to Rank, etc.).

Of these alternatives, WEKA only supports LSI via SVD. It is not trivial, but you can use BabelNet or WordNet as thesauri in Java; they both have APIs. BabelNet provides a very simple WSD algorithm as well, implemented in Java.

Anonymous (2014-04-02, 10:12):

Dear Jose,
Thanks for your informative sharing. Your posts focus on feature selection/reduction in text classification, which is a supervised problem. But when it comes to unsupervised problems, such as text clustering, what are your solutions for feature selection? Thank you in advance for your reply!

khadim (2014-01-20):

Dear Jose,

Your blog is very informative and helpful; I commend you for your efforts. We have developed a machine learning based method for information retrieval, in which different attributes like TF.IDF, length, etc. are used to classify textual documents. Now I want to know which attributes are the most useful in the classification process. I think I should use AttributeSelection with InfoGainAttributeEval as the attribute evaluator and Ranker as the search method. Using these classes in Java code, how can I get the different attributes with their corresponding information gain? Sample code would be appreciated.

Thanks in advance for any help.
Best regards,

khadim
Jose Maria Gomez Hidalgo (2014-01-02, 17:22):

Hi

Yes, it will be applied to the training and test datasets in CV, but correctly. For instance, in 10-fold CV, WEKA automatically generates 10 subsets and uses 9 for training and 1 for testing, round-robin. If you use FilteredClassifier with MultiFilter and AttributeSelection, WEKA will apply AttributeSelection with the training folds as input and map the result over the testing fold. This way:

- Only information from the training folds is used for selecting the attributes.
- The testing fold is represented according to the attributes selected using the training folds.

This is equivalent to having separate training and test datasets and using AttributeSelection (and/or StringToWordVector) in batch mode, as I demonstrate, for instance, in the following posts:

http://jmgomezhidalgo.blogspot.com.es/2013/05/mapping-vocabulary-from-train-to-test.html
http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html

Regards, and thanks for reading!

Anonymous (2014-01-02, 16:39):

Dear Jose Maria,

I have a question concerning this approach. When combining the STWV filter and the IG filter through the MultiFilter in a FilteredClassifier for cross-validation, will the IG filter be applied to both the test and training sets? I would think you want to apply the IG filter only to the training sets, since the IG filter is supervised and shouldn't be applied to the test data.

Best regards

Kashif (2014-01-02, 14:27):

Dear Jose Maria,
OK, great; perhaps you can help me with a thought (whether it is right or not). I am thinking of applying both of the following in text classification:
i) feature selection via Information Gain
ii) feature reduction/transformation via LSI or PCA

When I apply the feature selection filter, it works very well; however, the feature reduction filter is too slow. Do we need to apply both of them, or is only one of them enough for text classification? The reason I ask is that I have found different views: some prefer to apply both, while others end up with feature selection only (IG or chi-squared).

Thanks for the valued information.

Regards, Kashif

Jose Maria Gomez Hidalgo (2014-01-02, 11:44):

Dear Kashif

Do not worry; perhaps I misunderstood the question.

WEKA supports PCA via another filter, weka.filters.unsupervised.attribute.PrincipalComponents. I will write another post on the topic in the near future, because it is a very interesting technique for Text Mining, but for the moment, here is a quick example:

    java weka.filters.unsupervised.attribute.PrincipalComponents -i weather.numeric.arff

If you apply Naive Bayes to this dataset without PCA, you get an accuracy of 64.2857%; however, if you apply it after PCA, you get 85.7143%.

Regards

Kashif (2014-01-02, 11:17):

Dear Jose Maria,
Ah, thanks, but I realize I misspelled my question above and missed what I really wanted to ask. I want to ask about feature reduction like PCA or LSI (Latent Semantic Indexing). I searched in WEKA but couldn't find any filter that can be applied the way we applied feature selection via Information Gain. Do you know of any filter or other way in which I can apply LSI for text classification?

Thanks a lot for the valued information.

Jose Maria Gomez Hidalgo (2014-01-02, 10:49):

Dear Kashif

Regarding other posts dealing with dimensionality reduction, there are several of them; check the label WEKA in the blog: http://jmgomezhidalgo.blogspot.com.es/search/label/WEKA. Most of them are application-oriented.

Regarding your second question, the answer is yes: the same technique can be applied to multiclass problems. Entropy, and thus Information Gain, applies to multiclass problems as well, so the same commands should work in a multiclass problem with non-overlapping classes.

In case you have N overlapping classes, you may think about using N binary classifiers (one class against the rest), or the MULAN library: http://mulan.sourceforge.net/starting.html.

Regards

kashif (2014-01-01):

Hi. First, this post is very informative and definitely helped a lot. I just want to ask: you mentioned above that you would share the dimensionality reduction approach in another post; can you redirect me to that post (if it is written)?

Secondly, if I have more than two classes, let's say 20, would the way of applying the feature selection technique be the same as discussed in this post?

Thanks a lot in advance for the information.
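The point above that entropy, and hence Information Gain, extends beyond binary classes can be checked in a few lines of plain Java: IG(A) = H(C) - H(C|A) is defined for any number of class values. The toy counts below (one word's presence against four class labels, in the spirit of the Severe/Moderate/Regular/NoFixNeeded example earlier in the thread) are invented purely for illustration; WEKA's InfoGainAttributeEval performs this same computation per attribute:

```java
import java.util.Arrays;

// Information Gain on a toy 4-class problem: IG(A) = H(C) - H(C|A).
// The counts are invented for illustration only.
public class InfoGainDemo {
    // Shannon entropy (base 2) of a vector of class counts.
    static double entropy(double[] counts) {
        double total = Arrays.stream(counts).sum(), h = 0.0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // counts[v][c] = number of instances with attribute value v and class c.
    static double infoGain(double[][] counts) {
        int numClasses = counts[0].length;
        double[] classTotals = new double[numClasses];
        double total = 0.0;
        for (double[] row : counts)
            for (int c = 0; c < numClasses; c++) { classTotals[c] += row[c]; total += row[c]; }
        double hCond = 0.0; // H(C|A): entropy of each value's row, weighted by its frequency
        for (double[] row : counts) {
            double rowTotal = Arrays.stream(row).sum();
            hCond += (rowTotal / total) * entropy(row);
        }
        return entropy(classTotals) - hCond;
    }

    public static void main(String[] args) {
        // word present / word absent vs. classes {Severe, Moderate, Regular, NoFixNeeded}
        double[][] counts = { {8, 1, 1, 0},    // word present
                              {2, 9, 9, 10} }; // word absent
        System.out.printf("IG = %.4f bits%n", infoGain(counts));
    }
}
```

With four equally frequent classes, H(C) is 2 bits rather than 1, and the same formula applies unchanged, which is why the binary-class commands in the post carry over to multiclass data.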
Jose Maria Gomez Hidalgo (2013-09-05, 09:06):

Dear Harold

I do not know exactly which kind of number you want to generate from the bug description text. Specifically, I do not understand why it should depend on the class of a bug.

So, just guessing: if you need to transform the text in order to predict the class (which I do not know), the best strategy is to decompose the text into words and then select the predictive ones. That is, you should apply the StringToWordVector filter and then the AttributeSelection filter. You can do it programmatically, as explained here: http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html, or on the command line, as here: http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html

If you provide me with more details, I may be able to give you better advice.

Regards

Harold Valdivia Garcia (2013-09-05, 04:42):

Hi Jose.

Your post is really, really helpful. I've been reading your posts all night.

I am working with software bug repositories (e.g. Bugzilla, Jira), and one of my features is the bug's description. I was looking for a method to convert or transform this feature into a numerical value. I thought PCA or LSA could help me, but they don't take the class information into account. Do you know if there are other methods that use the class information?