11.6.13

Baseline Sentiment Analysis with WEKA

Sentiment Analysis (or Opinion Mining) is one of the hottest topics in Natural Language Processing nowadays. The task, defined in a simplistic way, consists of determining the polarity of a text utterance according to the opinion or sentiment of the speaker or writer, as positive or negative. This task has multiple applications, including, e.g., Customer Relationship Management and predicting the outcome of political elections.

While initial results, dating back to the early 2000s, seemed very promising, it is not such a simple task. The challenges range from informal Twitter language to the fact that opinions can be faceted (for instance, I may like the software but not the hardware of a device), to opinion spam and fake reviews, along with traditional, complex problems in Natural Language Processing such as irony, sarcasm and negation. For a good overview of the task, please check the survey paper on opinion mining and sentiment analysis by Bo Pang and Lillian Lee. A more practical overview is the Sentiment Tutorial with LingPipe by Alias-i.

In general, there are two main approaches to this task:

  • Counting and/or weighting sentiment-related words that have been evaluated and tagged by experts, forming a lexical resource like SentiWordNet.
  • Learning a text classifier on a previously labelled text collection, such as the SFU Review Corpus.

The SentiWordNet home page offers a simple Java program that follows the first approach. I will follow the second one in order to show how to use an essential WEKA text mining class (weka.core.converters.TextDirectoryLoader), and to provide another example of the weka.filters.unsupervised.attribute.StringToWordVector class.

I will follow the process outlined in the previous post about Language Identification using WEKA.

Data Collection and Preprocessing

For this demonstration, I will make use of a relatively small but interesting dataset named the SFU Review Corpus. This corpus consists of 400 reviews in English extracted from the Epinions website in 2004, divided into 25 positive and 25 negative reviews for each of 8 product categories (Books, Cars, Computers, etc.). It also contains 400 reviews in Spanish extracted from Ciao.es, divided into the same categories (except that the English Cookware category maps, more or less, to Lavadoras (Washing Machines) in Spanish).

The original format of the collections is one directory per product category, each including 25 positive reviews with the word "yes" in the file name and 25 negative reviews with the word "no" in the file name. Unfortunately, this format cannot be used directly in WEKA, but a couple of handy scripts transform it into a new format: two directories, one including the positive reviews (directory yes) and the other including the negative reviews (directory no). I have kept the category in the file names (with patterns like bookyes1.txt) in order to allow a more detailed per-category analysis.
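
The reorganization step can be sketched in a few lines of plain Java. This is not the actual script I used; the category directory and file naming below are assumptions about the corpus layout, made only to illustrate the idea:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.stream.Stream;

public class ReorganizeCorpus {
    // Copy every review file from the one-directory-per-category layout into
    // a two-directory yes/no layout, deciding the class from the file name
    // and prefixing the category, so that e.g. BOOKS/yes1.txt becomes
    // yes/booksyes1.txt.
    static void reorganize(Path source, Path target) throws IOException {
        Files.createDirectories(target.resolve("yes"));
        Files.createDirectories(target.resolve("no"));
        try (Stream<Path> files = Files.walk(source)) {
            for (Path file : (Iterable<Path>) files.filter(Files::isRegularFile)::iterator) {
                String name = file.getFileName().toString();
                String label = name.contains("yes") ? "yes" : "no";
                String category = file.getParent().getFileName().toString().toLowerCase();
                Files.copy(file, target.resolve(label).resolve(category + name),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path source = Paths.get(args.length > 0 ? args[0] : "SFU_Review_Corpus");
        if (Files.isDirectory(source)) {
            reorganize(source, Paths.get(source + "_WEKA"));
        }
    }
}
```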

Comparing the structure of the original and the new format of the text collections:

In order to construct an ARFF file from this structure, we can use the weka.core.converters.TextDirectoryLoader class, which is an evolution of a previously existing helper class named TextDirectoryToArff.java, available at the WEKA Documentation at wikispaces. Using this class is as simple as issuing the following command:

$> java weka.core.converters.TextDirectoryLoader -dir SFU_Review_Corpus_WEKA > SFU_Review_Corpus.arff

You have to issue this command from the parent directory of SFU_Review_Corpus_WEKA, and the parameter -dir sets the input directory. This class expects a single directory containing one directory per class value (yes and no in our case), which in turn should contain a number of files pertaining to the corresponding class. As the output of this command goes to the standard output, I have to redirect it to a file.

I have left the output of the execution of this command for both the English (SFU_Review_Corpus.arff) and the Spanish (SFU_Spanish_Review.arff) collections at the OpinionMining folder of my GitHub repository.

Data Analysis

Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, even a trivial analysis of the problem suggests that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can lead to better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose to compare word unigrams, 3-grams, and 1-to-3-grams. The latter representation includes everything from unigrams to 3-grams, with the hope of getting the best of all of them.
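
To make the representations concrete, here is a sketch of how word n-grams of sizes min to max are generated from a token sequence. This mimics in spirit what weka.core.tokenizers.NGramTokenizer produces; it is not WEKA's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {
    // All word n-grams of length min..max over a token sequence.
    static List<String> ngrams(String[] tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = "very bad hardware".split("\\W+");
        // Unigrams first, then bigrams, then trigrams:
        System.out.println(ngrams(tokens, 1, 3));
        // [very, bad, hardware, very bad, bad hardware, very bad hardware]
    }
}
```

With min = 1 and max = 3 this yields the 1-to-3-gram representation; min = max = 1 gives plain unigrams.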

Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can use standard punctuation (for each of the languages) since the texts are long comments (several paragraphs each), I derive the following calls to the weka.filters.unsupervised.attribute.StringToWordVector class:

$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff

We follow the notation vector.uni to denote that the dataset is vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are similar to these ones.

The most important thing about these calls is that we are no longer using the weka.core.tokenizers.WordTokenizer class. Instead, we are using weka.core.tokenizers.NGramTokenizer, which uses the options -min and -max to set the minimum and maximum size of the n-grams. The key point is that there is a major difference between the two classes regarding the usage of delimiters:

  • The weka.core.tokenizers.WordTokenizer class uses the legacy Java class java.util.StringTokenizer, even in the latest versions of the WEKA package (as of this writing). In StringTokenizer, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly define which characters will be the "spaces" in your text.
  • The weka.core.tokenizers.NGramTokenizer class uses the recommended Java String method String[] split(String regex), in which the argument (and thus the delimiters string) is a Regular Expression (regex) in Java. The text is split into tokens separated by substrings that match the regex, so you can use all the power of regexes, including e.g. special character codes. In this case I am using the code \W, which denotes any non-word character, in order to get only alphanumeric character sequences.
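
The difference can be seen in a few lines of plain Java. Note that I write \W+ in the call to split() to avoid the empty strings that consecutive delimiters would otherwise produce, a detail the WEKA tokenizer takes care of on its own:

```java
import java.util.Arrays;
import java.util.StringTokenizer;

public class DelimiterDemo {
    public static void main(String[] args) {
        String text = "I liked it, really!";

        // WordTokenizer style: the delimiters are literal characters, so
        // punctuation sticks to words unless every mark is listed explicitly.
        StringTokenizer st = new StringTokenizer(text, " ");
        StringBuilder legacy = new StringBuilder();
        while (st.hasMoreTokens()) legacy.append(st.nextToken()).append('|');
        System.out.println(legacy);          // I|liked|it,|really!|

        // NGramTokenizer style: the delimiter string is a regex, so \W matches
        // any non-word character and punctuation is stripped for free.
        System.out.println(Arrays.toString(text.split("\\W+")));
        // [I, liked, it, really]
    }
}
```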

After splitting the text into word n-grams (or more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are most predictive. As in the Language Identification post, we make use of the weka.filters.supervised.attribute.AttributeSelection class:

$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff
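
As a reference, the score computed by weka.attributeSelection.InfoGainAttributeEval for a binary term can be sketched as follows. This is a simplified reimplementation for illustration, not WEKA's code, and the counts in main are made up:

```java
public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Binary entropy (in bits) of a probability p.
    static double h(double p) {
        if (p == 0 || p == 1) return 0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information Gain of a binary term for a binary class, from the four
    // document counts (class yes/no crossed with term present/absent).
    static double infoGain(int yesWith, int yesWithout, int noWith, int noWithout) {
        double n = yesWith + yesWithout + noWith + noWithout;
        double with = yesWith + noWith, without = yesWithout + noWithout;
        double prior = h((yesWith + yesWithout) / n);
        double cond = (with / n) * (with == 0 ? 0 : h(yesWith / with))
                    + (without / n) * (without == 0 ? 0 : h(yesWithout / without));
        return prior - cond;
    }

    public static void main(String[] args) {
        // Illustrative counts: a term present in 40 of 50 positive reviews
        // but only 10 of 50 negative ones is strongly predictive.
        System.out.printf("%.3f%n", infoGain(40, 10, 10, 40)); // 0.278
    }
}
```

The Ranker threshold -T 0.0 in the commands above keeps exactly those attributes whose score is strictly positive.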

After the selection of the most predictive n-grams, we get the following statistics for the two collections:

The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections have comparable statistics (in the same order of magnitude). The original unigram counts are quite similar, but there are fewer bigrams and trigrams in Spanish (despite there being more isolated words -- unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and bigrams, but less so for trigrams.

Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by 4 to 5 times). However, trigrams alone add only a small number of features, so bigrams must be playing the main role here. The resulting number of features is quite manageable, and allows us to run quick experiments.

As discussed in my previous post on setting up experiments with WEKA text classifiers and how to chain filters and classifiers, you must note that these are not the final features if we configure a cross-validation experiment -- we have to chain the filters (StringToWordVector and AttributeSelection) and the classifier in order to perform a valid experiment, as the selected features should be different for each fold.

Experiments and Results

In order to simplify the example, and expecting to get good results, we will use the same algorithms we used for the Language Identification problem: Naive Bayes (NB, weka.classifiers.bayes.NaiveBayes), PART (weka.classifiers.rules.PART), J48 (weka.classifiers.trees.J48), k-Nearest Neighbors (weka.classifiers.lazy.IBk) with k = 1, 3, 5, and Support Vector Machines (weka.classifiers.functions.SMO); all of them with the default options, except for kNN, which uses 1, 3 and 5 neighbors. I am testing the three proposed representations (based on unigrams, trigrams and 1-to-3-grams) by 10-fold cross-validation. An example experiment command line is the following one:

$> java weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer \\\"weka.core.tokenizers.NGramTokenizer -delimiters \\\\\\\"\\\\\\\W\\\\\\\" -min 1 -max 1\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.bayes.NaiveBayes -v -i -t SFU_Review_Corpus.arff > tests/uniNB.txt

You can change the size of n-grams with the -min and -max parameters. Also, you can change the learning algorithm with the most external -W option. I am storing the results in a tests folder, in files with the convention <rep><alg>.txt. The results of this test for the English language collection are the following ones:

Considering the class yes (positive sentiment) as the positive class, in each column we show the True Positives (hits on the yes class), False Positives (members of the no class mistakenly classified as yes), False Negatives (members of the yes class mistakenly classified as no) and True Negatives (hits on the no class); along with the macro-averaged F1 (standard average F1 over both classes) and the general accuracy.
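
For clarity, here is how those two summary metrics follow from the four counts. The counts used in main are made up purely to exercise the formulas:

```java
public class MacroF1 {
    // Macro-averaged F1 (the plain average of the per-class F1 scores) and
    // overall accuracy, computed from the binary confusion matrix counts.
    static double[] macroF1AndAccuracy(int tp, int fp, int fn, int tn) {
        double f1Yes = 2.0 * tp / (2.0 * tp + fp + fn); // F1 of the yes class
        double f1No  = 2.0 * tn / (2.0 * tn + fn + fp); // F1 of the no class
        double macroF1 = (f1Yes + f1No) / 2.0;
        double accuracy = (double) (tp + tn) / (tp + fp + fn + tn);
        return new double[] { macroF1, accuracy };
    }

    public static void main(String[] args) {
        // Made-up counts over 400 reviews, just to exercise the formulas:
        double[] r = macroF1AndAccuracy(140, 56, 60, 144);
        System.out.printf("macro-F1 = %.3f, accuracy = %.3f%n", r[0], r[1]);
        // macro-F1 = 0.710, accuracy = 0.710
    }
}
```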

Additionally, the results for the Spanish language collection are the following ones:

So these are the results. Let us start the analysis...

Results Analysis

We can perform an analysis regarding different aspects:

  • What is the overall performance?
  • How does performance compare across the two languages?
  • Which are the best learning algorithms?
  • What effect do the different text representations have on classifier performance?

All in all, and taking into account that the class balance is 50% (thus a trivial acceptor, a trivial rejector, or a random classifier would reach 50% accuracy), most of the classifiers beat this baseline but not by a wide margin, and even the best one across all algorithms, languages and representations (SVMs on English 1-to-3-grams) reaches only a modest 71% -- far from a satisfying 90% or more. Let me note that we are facing a relatively simple problem -- long texts, few of them, and binary classification. Most approaches in the literature get much better results in similar setups.

Results are better for English than for Spanish, comparing configuration by configuration. To explain this, I will examine the representations used in Spanish, listing the first 20 n-grams of each representation:

Some of the n-grams (highlighted in italics) are simply incorrect, because of the incorrect recognition of accents due to the inappropriate pattern I have used in the tokenization step. The tokenizer makes use of the string "\W" in order to recognize alphanumeric strings -- which in Java regular expressions do not include accented vowels ("á", "é", "í", "ó", "ú") or other language-specific symbols (e.g. "ñ"). Moreover, most of the n-grams are simply not opinionated words or expressions; instead, they are either intensifiers (like e.g. "muy" -- "very") or just contingent on the training collection (e.g. "en el taller" -- "in the garage"; "tarjeta de memoria" -- "memory card"). The few clearly opinionated words are highlighted in boldface. From this we can conclude that the training collection is too small.
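
The problem, and a possible fix using a Unicode-aware delimiter pattern, can be reproduced in a couple of lines of Java:

```java
import java.util.Arrays;

public class SpanishTokens {
    public static void main(String[] args) {
        String text = "El coche está en el taller, qué pena";

        // \W is ASCII-oriented in Java by default: accented vowels count as
        // non-word characters, so "está" is chopped into "est" and "qué"
        // into "qu".
        System.out.println(Arrays.toString(text.split("\\W+")));
        // [El, coche, est, en, el, taller, qu, pena]

        // A Unicode-aware delimiter: [^\p{L}] matches any character that is
        // not a letter in any script, keeping the Spanish words intact.
        System.out.println(Arrays.toString(text.split("[^\\p{L}]+")));
        // [El, coche, está, en, el, taller, qué, pena]
    }
}
```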

If we examine the performance of different classifiers, we can cluster them in three groups: top performers (SVMs, NB), medium performers (PART, J48) and losers for this problem (kNN). These groups are intuitive:

  • Both SVMs and NB have often demonstrated their high performance on sparse datasets, and in text classification problems in particular. They both build a linear classifier with weights (or probabilities) for each of the features. Linear classifiers perform reasonably well here even though the representations clearly promote over-fitting, as we have seen that many of the most predictive n-grams are collection-dependent.
  • Both PART and J48 (C4.5) are based on reducing error by progressively partitioning the dataset according to tests on the most predictive features. But the predictive features we have for such a small collection are not very good, indeed.
  • All versions of kNN perform very badly, most likely because the dataset is sparse and relatively small.

However, we have to keep in mind that we have used the algorithms with their default configurations. For instance, kNN allows using the cosine similarity instead of the Euclidean distance -- the cosine similarity being much better suited to text classification problems, as demonstrated many times over 50 years of research in Information Retrieval.
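
As a sketch of why cosine similarity suits text better: it normalizes out document length and compares only the direction of the term-weight vectors, whereas Euclidean distance penalizes length differences:

```java
public class Cosine {
    // Cosine similarity between two term-weight vectors: the dot product
    // normalized by the vector lengths, so it compares direction only.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // A short and a long document with the same term proportions are
        // maximally similar by cosine, but far apart by Euclidean distance.
        double[] shortDoc = {1, 2, 0};
        double[] longDoc  = {10, 20, 0};
        System.out.printf("%.3f%n", cosine(shortDoc, longDoc)); // 1.000
    }
}
```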

Regarding dataset representations, the behavior is not uniform -- we do not systematically get better results with one representation over the others. In general, 1-to-3-grams perform better than the other representations in English, while unigrams are best in Spanish, and trigrams are most often the worst representation in both languages. If we focus on the top-performing classifiers (NB and SVMs), this latter observation always holds. In consequence, trigrams have --to some extent-- demonstrated their power in English (as a complement to uni- and bigrams), but not in Spanish (keeping in mind that the Spanish representation is flawed because of character encoding).

Concluding Remarks

So all in all, we have a baseline learning-based method for Sentiment Analysis in English (and probably in Spanish, after correcting the representation), which is -- not surprisingly -- based on 1-to-3-grams and Support Vector Machines. And it is a baseline because its performance is relatively poor (with an accuracy of 71%), and we have not taken full advantage of the configuration, text representation and other parameters yet.

After this long (again!) post, I propose the next steps -- some of them left for the reader as an exercise:

  • Build a Java class that classifies text files according to their sentiment, for English at least, taking my previous post on Language Identification as an example -- left for the reader.
  • Test other algorithms, and in particular: play with the SVM configuration, and add Boosting (using weka.classifiers.meta.AdaBoostM1) to Naive Bayes -- left for the reader.
  • Check differences of accuracy in terms of product type -- cars, movies, etc. -- left for the reader.
  • Improve the Spanish language representation using the appropriate regex in the tokenizer to cover Spanish letters and accents -- I will take this one myself.
  • Check the accuracy of the basic keyword-based algorithm available in the SentiWordNet page -- I will take this one as well.

So that is all for the moment. You can expect one or more posts from me on this hot topic. Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

9 comments:

matienzo said...

Hello,
First of all, I congratulate you for all these posts you have done... making DM with WEKA more accessible.
I'm writing my Computer Engineering thesis on sentiment analysis of tweets.
I'm comparing SMO, J48 and BayesNet. I would like to ask you whether it is enough to run the classifiers with the default configuration, because that is how you usually do it in your posts. What other configurations can I try? Maybe there is some post you made about this.
Thanks
Francisco Boato

Jose Maria Gomez Hidalgo said...

Dear Francisco

Thank you very much for reading and for your feedback.

Unfortunately, I have not written about the possible algorithm configurations yet. I focus on simple tutorials for doing relatively simple things. When performing comprehensive research (like that for your thesis), you may need to test the algorithms in many other configurations, apart from the default ones. For instance, the parameter "C" in SMO (Support Vector Machines) has been demonstrated to strongly affect results, especially in the case of text classification.

However you can hardly test all potential configurations, and most times you have to use your knowledge about the algorithms to choose the parameters' values.

And it depends on the focus of your research as well. For instance, if you are focusing on the representation of tweets, the machine learning algorithm is rather a black box; in this case, choosing 2 or 3 configurations of 5 or 6 algorithms may be enough. If some algorithm is particularly strong, you can then test several more configurations of that algorithm in order to check which is the best representation across all of them.

In other words, you need your advisor to check with you which configurations are most appropriate and how many you need to test.

Good luck with your experiments!

Jose Maria

Apicio said...

I'd like to thank you for your hard work and for your Git repo. I'm writing a thesis about sentiment analysis on TripAdvisor. I've manually tagged 10,000+ reviews and now I have to build an app using your tutorial. Could you confirm these steps:
1) I need an ARFF file with the text of the reviews.
2) After that I need to run Sentiment Analysis following your SentimentClassifier.java.
3) I check the results.

Thanks a lot!

Jose Maria Gomez Hidalgo said...

Hi, Apicio

Thanks for reading. Yes, the steps you outline should be OK. Just check the ARFF files I provide to ensure that the class tags you use are the same, or change them in the SentimentClassifier.java code to fit your needs.

Please remember the class targets classification, not evaluation. So you would need to evaluate within WEKA or to try my FilteredClassifier examples.

Good luck and regards,

JM

Apicio said...
This comment has been removed by the author.
Apicio said...
This comment has been removed by the author.
Apicio said...

Hi Jose, thanks for your answers and for sharing your knowledge.
I have another (I think simple) question: may I use an already compiled training set for training, or do I have to use a training set of my own? I would like to use Pang and Lee's training set, and I want to apply the classifier trained on it to a test set of mine.

Best regards.

Jose Maria Gomez Hidalgo said...

Hi Apicio

Yes you can, as long as you use the same tags in your test set as the ones used in the Pang training set. For instance, you must use {POS, NEU, NEG} in both sets. Of course, you will only get good accuracy if the genre and the language are the same in the training and test sets -- for instance, it makes no sense to train on Pang's dataset to classify or test tweets in Spanish.

Good luck and regards

Jatin Mistry said...

Hello,

I would like to thank you for this wonderful post.
I would like to know more about the regex you have used. During data analysis, to get the n-gram ARFF files, you used [ \"\\\\W\" ] as the regex. But later on you used [ \\\\\\\"\\\\\\\W\\\\\\\" ] as the regex. Can you please explain this in a little more detail?

Regards,
Jatin.