Sentiment Analysis (also known as Opinion Mining) is one of the hottest topics in Natural Language Processing nowadays. The task, defined in a simplistic way, consists of determining the polarity of a text utterance -- positive or negative -- according to the opinion or sentiment of the speaker or writer. This task has multiple applications, including e.g. Customer Relationship Management or predicting political elections.
While initial results dating back to the early 2000s seemed very promising, it is not such a simple task. The challenges range from informal Twitter language to the fact that opinions can be faceted (for instance, I may like the software of a device but not its hardware), plus opinion spam and fake reviews, along with traditional, complex problems in Natural Language Processing such as irony, sarcasm or negation. For a good overview of the task, please check the survey paper on opinion mining and sentiment analysis by Bo Pang and Lillian Lee. A more practical overview is the Sentiment Tutorial with LingPipe by Alias-i.
In general, there are two main approaches to this task:
- Counting and/or weighting sentiment-related words that have been evaluated and tagged by experts, forming a lexical resource like SentiWordNet.
- Learning a text classifier on a previously labelled text collection, like e.g. the SFU Review Corpus.
The SentiWordNet home page offers a simple Java program that follows the first approach. I will follow the second one in order to show how to use an essential WEKA text mining class (weka.core.converters.TextDirectoryLoader), and to provide another example of the process outlined in my previous post about Language Identification using WEKA.
Data Collection and Preprocessing
For this demonstration, I will make use of a relatively small but interesting dataset named the SFU Review Corpus. This corpus consists of 400 reviews in English extracted from the Epinions website in 2004, divided into 25 positive and 25 negative reviews for each of 8 product categories (Books, Cars, Computers, etc.). It also contains 400 reviews in Spanish extracted from Ciao.es, divided into the same categories (except that the Cookware category in English more or less maps to Lavadoras -- Washing Machines -- in Spanish).
The original format of the collection is one directory per category of products, each including 25 positive reviews with the word "yes" in the file name and 25 negative reviews with the word "no" in the file name. Unfortunately, this format cannot be used directly in WEKA, but a couple of handy scripts transform it into a new one: two directories, one including the positive reviews (directory yes), and the other one including the negative reviews (directory no). I have kept the category in the file names (with patterns like bookyes1.txt) in order to allow others to make a more detailed analysis per category.
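The renaming logic of those scripts boils down to deciding the target class directory from each file name; a minimal sketch in plain Java (the class and method names are mine):

```java
import java.util.Locale;

public class ReviewSorter {
    // Decide the target class directory ("yes" or "no") from a review
    // file name such as "bookyes1.txt" or "carno7.txt". The "yes" test
    // comes first because "no" could appear as a substring elsewhere.
    static String targetDir(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.contains("yes")) return "yes";
        if (lower.contains("no")) return "no";
        throw new IllegalArgumentException("No polarity marker in: " + fileName);
    }

    public static void main(String[] args) {
        System.out.println(targetDir("bookyes1.txt")); // yes
        System.out.println(targetDir("carno7.txt"));   // no
    }
}
```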
Comparing the structure of the original and the new format of the text collections:
In order to construct an ARFF file from this structure, we can use the
weka.core.converters.TextDirectoryLoader class, which is an evolution of a previously existing helper class named
TextDirectoryToArff.java and available at WEKA Documentation at wikispaces. Using this class is as simple as issuing the next command:
$> java weka.core.converters.TextDirectoryLoader -dir SFU_Review_Corpus_WEKA > SFU_Review_Corpus.arff
You have to call this command at the parent directory of SFU_Review_Corpus_WEKA, and the parameter -dir sets up the input directory. This class expects a single directory containing one directory per class value (yes and no in our case), which in turn should contain a number of files belonging to the corresponding classes. As the output of this command goes to the standard output, I have to redirect it to a file.
I have left the output of the execution of this command for both the English (
SFU_Review_Corpus.arff) and the Spanish (
SFU_Spanish_Review.arff) collections at the OpinionMining folder of my GitHub repository.
Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, a trivial analysis of the problem easily leads us to think that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can be better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose to compare word unigrams, 3-grams, and 1-to-3-grams. The latter representation will include uni- to 3-grams, with the hope of getting the best of all of them.
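To make the three representations concrete, here is a minimal sketch of word n-gram extraction in plain Java (the class and method names are mine; WEKA's NGramTokenizer does essentially this over its own token stream):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGrams {
    // Generate word n-grams of sizes min..max from a token sequence,
    // mimicking what weka.core.tokenizers.NGramTokenizer produces.
    static List<String> ngrams(String[] tokens, int min, int max) {
        List<String> out = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] toks = {"very", "bad", "camera"};
        System.out.println(ngrams(toks, 1, 1)); // [very, bad, camera]
        System.out.println(ngrams(toks, 3, 3)); // [very bad camera]
        System.out.println(ngrams(toks, 1, 3)); // unigrams, bigrams and the trigram
    }
}
```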
Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can use standard punctuation (for each of the languages) as texts are long comments (several paragraphs each), I derive the following calls to the weka.filters.unsupervised.attribute.StringToWordVector class:
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff
We follow the notation
vector.uni to denote that the dataset is vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are similar to these ones.
The most important thing in these calls is that we are no longer using the weka.core.tokenizers.WordTokenizer class. Instead, we are using weka.core.tokenizers.NGramTokenizer, which provides the options -min and -max to set the minimum and maximum size of the n-grams. But there is also a major difference between both classes regarding the usage of delimiters:
- The weka.core.tokenizers.WordTokenizer class uses the deprecated Java class java.util.StringTokenizer, even in the latest versions of the WEKA package (as of the day of this writing). In StringTokenizer, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly define which characters will act as the "spaces" in your text.
- The weka.core.tokenizers.NGramTokenizer class uses the recommended Java String method String split(String regex), in which the argument (and thus the delimiter string) is a Regular Expression (regex) in Java. The text is split into tokens separated by substrings that match the regex, so you can use all the power of regexes, including e.g. special codes for character classes. In this case I am using the code \W, which denotes any non-word character, in order to keep only alphanumeric character sequences.
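The practical difference between the two delimiter styles can be checked with a few lines of plain Java (the class and method names are mine, and I use \W+ rather than \W here so that consecutive delimiters do not yield empty tokens):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerDemo {
    // StringTokenizer: the delimiter argument is a literal set of
    // characters, so every possible "space" must be listed explicitly.
    static List<String> oldStyle(String text, String delimiterChars) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiterChars);
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    // String.split: the delimiter argument is a regular expression;
    // \W+ matches any run of non-word characters, with no explicit list.
    static String[] regexStyle(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String text = "A must-have, really!";
        System.out.println(oldStyle(text, " ,!-"));          // [A, must, have, really]
        System.out.println(Arrays.toString(regexStyle(text))); // [A, must, have, really]
    }
}
```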
After splitting the text into word n-grams (or more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are the most predictive ones. As in the Language Identification post, we make use of the weka.filters.supervised.attribute.AttributeSelection filter:
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff
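As a reminder of what InfoGainAttributeEval computes for each n-gram, here is a small numeric sketch in plain Java; the contingency counts are invented for illustration, and the Ranker threshold -T 0.0 simply keeps the attributes whose score is above zero:

```java
public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a binary distribution given the counts of the two classes.
    static double entropy(double a, double b) {
        double n = a + b, h = 0;
        if (a > 0) h -= (a / n) * log2(a / n);
        if (b > 0) h -= (b / n) * log2(b / n);
        return h;
    }

    // IG(C; t) = H(C) - P(t) H(C|t) - P(!t) H(C|!t), computed from a 2x2
    // contingency table of (term present/absent) x (class yes/no).
    static double infoGain(double posT, double negT, double posNoT, double negNoT) {
        double n = posT + negT + posNoT + negNoT;
        double hC = entropy(posT + posNoT, negT + negNoT);
        return hC - ((posT + negT) / n) * entropy(posT, negT)
                  - ((posNoT + negNoT) / n) * entropy(posNoT, negNoT);
    }

    public static void main(String[] args) {
        // Invented counts: a term occurring in 40 "yes" and 10 "no" reviews
        // out of a balanced 50/50 collection of 100 documents.
        System.out.printf("%.4f%n", infoGain(40, 10, 10, 40)); // prints 0.2781
    }
}
```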
After the selection of the most predictive n-grams, we get the following statistics in the test collections:
The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections have comparable statistics (in the same order of magnitude). The numbers of original unigrams are quite similar, but there are fewer bigrams and trigrams in Spanish (despite the fact that there are more isolated words -- unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and bigrams, but less so for trigrams.
Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by 4 to 5 times). However, trigrams alone result in only a small increment of features, so bigrams must play a major role here. The number of features is quite manageable, and allows us to make quick experiments.
According to my previous post on setting up experiments with WEKA text classifiers and how to chain filters and classifiers, you must note that these are not the final features if we configure a cross-validation experiment -- we have to chain the filters (StringToWordVector and AttributeSelection) and the classifier in order to perform a valid experiment, as the features for each fold should be different.
Experiments and Results
In order to simplify the example, and expecting to get good results, we will use the same algorithms we used in the Language Identification problem. These are: Naive Bayes (NB, weka.classifiers.bayes.NaiveBayes), PART (weka.classifiers.rules.PART), J48 (weka.classifiers.trees.J48), k-Nearest Neighbors (kNN, weka.classifiers.lazy.IBk) and Support Vector Machines (SVM, weka.classifiers.functions.SMO); all of them with the default options, except for kNN, which uses 1, 3 and 5 neighbors. I am testing the three proposed representations (based on unigrams, trigrams and 1-to-3-grams) by 10-fold cross-validation. An example experiment command line is the following one:
$> java weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer \\\"weka.core.tokenizers.NGramTokenizer -delimiters \\\\\\\"\\\\\\\W\\\\\\\" -min 1 -max 1\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.bayes.NaiveBayes -v -i -t SFU_Review_Corpus.arff > tests/uniNB.txt
You can change the size of the n-grams with the -min and -max parameters. Also, you can change the learning algorithm with the outermost -W option. I am storing the results in a tests folder, in files following the convention <rep><alg>.txt. The results of this test for the English language collection are the following ones:
Considering the class yes (positive sentiment) as the positive class, in each column we show the True Positives (hits on the yes class), False Positives (members of the no class mistakenly classified as yes), False Negatives (members of the yes class mistakenly classified as no) and True Negatives (hits on the no class), along with the macro-averaged F1 (the standard average of F1 over both classes) and the overall accuracy.
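For reference, these metrics can be derived from the four confusion-matrix cells as follows (a plain Java sketch with an invented confusion matrix; for the no class the cell roles are swapped):

```java
public class Metrics {
    // F1 for one class from its TP, FP, FN counts.
    static double f1(double tp, double fp, double fn) {
        double p = tp / (tp + fp);
        double r = tp / (tp + fn);
        return 2 * p * r / (p + r);
    }

    // Macro-averaged F1 over the two classes: for the "no" class, its own
    // true positives are the "yes" true negatives, and FP/FN swap roles.
    static double macroF1(double tp, double fp, double fn, double tn) {
        return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2;
    }

    static double accuracy(double tp, double fp, double fn, double tn) {
        return (tp + tn) / (tp + fp + fn + tn);
    }

    public static void main(String[] args) {
        // Invented confusion matrix for a 400-review collection.
        System.out.printf("%.3f %.3f%n",
                macroF1(150, 60, 50, 140), accuracy(150, 60, 50, 140)); // prints 0.725 0.725
    }
}
```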
Additionally, the results for the Spanish language collection are the following ones:
So these are the results. Let us start the analysis, addressing the following questions:
- What is the overall performance?
- How does performance compare between the two languages?
- Which are the best learning algorithms?
- What effect do different text representations have on classifier performance?
All in all, and taking into account that the class balance is 50% (so a trivial acceptor, a trivial rejector, or a random classifier would reach 50% accuracy), most of the classifiers beat this baseline, but not by a wide margin; even the best one among all algorithms, languages and representations (SVMs on English 1-to-3-grams) reaches only a modest 71% -- far from a satisfying 90% or over. Let me remind you that we are facing a relatively simple problem -- a binary classification over few, long texts. Most approaches in the literature get much better results in similar setups.
Results are better for English than for Spanish, comparing one on one. In order to explain this, I will check the representations used in Spanish, for instance by listing the first 20 n-grams for each representation:
Some of the n-grams (highlighted in italics) are just incorrect, because of the incorrect recognition of accents due to the inappropriate pattern I have used in the tokenization step. The tokenizer makes use of the string "\W" in order to recognize alphanumeric strings -- which in Java regexes do not include vowels with accents ("á", "é", "í", "ó", "ú") or other language-specific symbols (e.g. "ñ"). Moreover, most of the n-grams are just not opinionated words or expressions; instead, they are either intensifiers (like e.g. "muy" -- "very") or just contingent (dependent on the training collection, e.g. "en el taller" -- "in the garage"; "tarjeta de memoria" -- "memory card"). The clearly opinionated words, highlighted in boldface, are very few. From this issue, we can conclude that the training collection is too small.
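A quick test in plain Java confirms the diagnosis and points to the fix (the class and method names are mine; (?U) enables Pattern.UNICODE_CHARACTER_CLASS):

```java
import java.util.Arrays;

public class SpanishTokens {
    // Default \W is effectively ASCII-only, so accented vowels act as delimiters.
    static String[] asciiTokens(String text) {
        return text.split("\\W+");
    }

    // The (?U) inline flag makes \w/\W honor Unicode letters;
    // [^\p{L}\p{N}]+ would be an equivalent explicit pattern.
    static String[] unicodeTokens(String text) {
        return text.split("(?U)\\W+");
    }

    public static void main(String[] args) {
        String text = "cámara muy buena";
        System.out.println(Arrays.toString(asciiTokens(text)));   // [c, mara, muy, buena]
        System.out.println(Arrays.toString(unicodeTokens(text))); // [cámara, muy, buena]
    }
}
```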
If we examine the performance of different classifiers, we can cluster them in three groups: top performers (SVMs, NB), medium performers (PART, J48) and losers for this problem (kNN). These groups are intuitive:
- Both SVMs and NB have often demonstrated their high performance in sparse datasets, and in text classification problems in particular. They both build a linear classifier with weights (or probabilities) for each of the features. Linear classifiers perform well here given that the dataset is built on representations that clearly promote over-fitting the dataset, as we have seen that many of the most predictive n-grams are collection-dependent.
- Both PART and J48 (C4.5) are based on reducing error by progressively partitioning the dataset according to tests on the most predictive features. But the predictive features we have for such a small collection are not very good, indeed.
- All versions of kNN perform very badly, most likely because the dataset is sparse and relatively small.
However, we have to keep in mind that we have used the algorithms with their default configurations. For instance, kNN allows using the cosine similarity instead of the Euclidean distance -- and the cosine similarity is much better suited to text classification problems, as demonstrated many times during 50 years of research in Information Retrieval.
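For reference, this is the measure in question, sketched in plain Java over two term-weight vectors (the class name is mine; plugging it into IBk would additionally require a custom WEKA distance function):

```java
public class Cosine {
    // Cosine similarity between two term-weight vectors: the dot product
    // normalized by both vector norms, so document length does not dominate.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double[] shortDoc = {1, 1, 0};
        double[] longDoc  = {10, 10, 0}; // same direction, ten times longer
        // Identical orientation gives similarity ~1.0, while the Euclidean
        // distance between these two vectors would be large.
        System.out.printf("%.3f%n", cosine(shortDoc, longDoc)); // prints 1.000
    }
}
```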
And regarding dataset representations, the behavior is not uniform -- we do not systematically get better results with one representation over the others. In general, 1-to-3-grams perform better than the other representations in English, while unigrams are best in Spanish, and trigrams are most often the worst representation for both languages. If we focus on the top-performing classifiers (NB and SVMs), this latter observation always holds. In consequence, trigrams have -- to some extent -- demonstrated their power in English (as a complement to uni- and bigrams), but not in Spanish (where we know the representation is flawed because of character encoding).
So all in all, we have a baseline learning-based method for Sentiment Analysis in English (and probably in Spanish, after correcting the representation), which is -- not surprisingly -- based on 1-to-3-grams and Support Vector Machines. And it is a baseline because its performance is relatively poor (with an accuracy of 71%), and we have not taken full advantage of the configuration, text representation and other parameters yet.
After this long (again!) post, I propose the next steps -- some of them left for the reader as an exercise:
- Build a Java class that classifies text files according to their sentiment, for English at least, taking my previous post on Language Identification as an example -- left for the reader.
- Test other algorithms, and in particular: play with the SVM configuration, and add Boosting (using weka.classifiers.meta.AdaBoostM1) to Naive Bayes -- left for the reader.
- Check differences of accuracy in terms of product type -- cars, movies, etc. -- left for the reader.
- Improve the Spanish language representation using the appropriate regex in the tokenizer to cover Spanish letters and accents -- I will take this one myself.
- Check the accuracy of the basic keyword-based algorithm available in the SentiWordNet page -- I will take this one as well.
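The essence of that keyword-based approach can be sketched in a few lines of plain Java; the mini-lexicon below is invented for illustration, while SentiWordNet actually provides graded positivity and negativity scores per WordNet synset:

```java
import java.util.Map;

public class LexiconBaseline {
    // Toy polarity lexicon standing in for a real resource like SentiWordNet.
    static final Map<String, Integer> LEXICON = Map.of(
            "good", 1, "great", 1, "excellent", 1,
            "bad", -1, "poor", -1, "terrible", -1);

    // Sum the polarity of known words; ties default to the positive class.
    static String classify(String text) {
        int score = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            score += LEXICON.getOrDefault(token, 0);
        }
        return score >= 0 ? "yes" : "no";
    }

    public static void main(String[] args) {
        System.out.println(classify("A great camera with excellent zoom")); // yes
        System.out.println(classify("Terrible battery and poor support"));  // no
    }
}
```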
So that is all for the moment. You can expect one or more posts from me on this hot topic. Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!