23.6.13

Sample Code for Text Indexing with WEKA

Following the example in which I demonstrated how to develop your own classifier in Java based on WEKA, I propose an additional example on how to index a collection of texts in your Java code. This post is inspired by and builds on the WEKA "Use WEKA in your Java code" wiki page. To index a text collection is to generate a mapping between documents and words (or other indexing units), as represented in the next graph:

The fundamental class for text indexing in WEKA is weka.filters.unsupervised.attribute.StringToWordVector. This class provides an impressive range of indexing options, including custom tokenizers, stemmers and stoplists; binary, Term Frequency and TF.IDF weights; etc. For some applications its default options may be enough; however, I recommend getting familiar with all of its options in order to take full advantage of it.

With the purpose of showing how to use StringToWordVector in your code, I have created a simple class named IndexTest.java, stored in my GitHub repository. Apart from the relatively simple methods for loading and storing Attribute-Relation File Format (ARFF) files, the core of the class is the method void index(), which creates and employs a StringToWordVector object.
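For reference, loading and saving ARFF files in WEKA typically looks like the following minimal sketch (the actual helper methods in IndexTest.java may differ slightly in naming and error handling):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class ArffIODemo {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset from disk
        Instances inputInstances = DataSource.read("smsspam.small.arff");
        // ... index/filter the instances here ...
        // Save the resulting dataset back to an ARFF file
        ArffSaver saver = new ArffSaver();
        saver.setInstances(inputInstances);
        saver.setFile(new File("result.arff"));
        saver.writeBatch();
    }
}

Back in the index() method, the first piece of code is the following one: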

// Set the tokenizer (weka.core.tokenizers.NGramTokenizer)
NGramTokenizer tokenizer = new NGramTokenizer();
tokenizer.setNGramMinSize(1); // unigrams only
tokenizer.setNGramMaxSize(1);
tokenizer.setDelimiters("\\W"); // any non-word character acts as a delimiter

This snippet creates and configures a tokenizer, that is, the object responsible for breaking the original text into individual strings named tokens, which represent the indexing units (typically words). In this case I am using a weka.core.tokenizers.NGramTokenizer, which I find more useful than the usual weka.core.tokenizers.WordTokenizer, as I describe in the post about sentiment analysis with WEKA. This tokenizer is able to recognize n-grams, that is, sequences of tokens. Here I use the methods void setNGramMinSize(int value) and void setNGramMaxSize(int value) to restrict the n-grams to unigrams.

Another interesting aspect of the tokenizer setup is that we use the regular expression "\\W" as the delimiter. This regex specifies that any non-word character (anything other than letters, digits and the underscore) is considered a delimiter; as a result, only alphanumeric character strings will be considered tokens. For a detailed reference on regular expressions in Java, check the lesson on the topic in the Java Tutorial.
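As a quick illustration of how this regex behaves, consider the following standalone sketch (plain Java, independent of WEKA):

public class RegexSplitDemo {
    public static void main(String[] args) {
        // "\\W+" splits on runs of non-word characters, leaving only
        // alphanumeric (and underscore) character strings as tokens
        String text = "Hello, world! It's 2013.";
        for (String token : text.split("\\W+")) {
            System.out.println(token); // prints: Hello, world, It, s, 2013
        }
    }
}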

The second code snippet is the following one:

// Set the filter (weka.filters.unsupervised.attribute.StringToWordVector)
StringToWordVector filter = new StringToWordVector();
filter.setTokenizer(tokenizer);
filter.setWordsToKeep(1000000);
filter.setDoNotOperateOnPerClassBasis(true);
filter.setLowerCaseTokens(true);
// setInputFormat() must come last: it fixes the filter's output
// format according to the options set above
filter.setInputFormat(inputInstances);

This second snippet creates and configures the StringToWordVector object, which is a subclass of the weka.filters.Filter class. Every filter has to be given a reference dataset, in this case the inputInstances dataset, via the filter.setInputFormat(inputInstances) call; note that this call must be the last one before applying the filter, as it determines the output format from the options set before it.

We set up the tokenizer and some other options as an example. Both DoNotOperateOnPerClassBasis and WordsToKeep should be standard in most text classifiers. The first one tells the filter to extract the tokens from all classes as a whole, instead of doing it class by class (the default); I simply fail to understand why one would want different indexing tokens per class in a text classification problem. The second option sets the number of words to keep, and I recommend using a large integer here in order to cover all possible tokens.

The third and last code snippet shows the invocation of the filter on the inputInstances reference:

// Filter the input instances into the output ones
// (Filter.useFilter is a static method of weka.filters.Filter)
outputInstances = Filter.useFilter(inputInstances, filter);

This is the standard way of applying a filter, according to the "Use WEKA in your Java code" wiki page. The output of running this class on a simple dataset such as smsspam.small.arff is the following:

$> javac IndexTest.java
$> java IndexTest
Usage: java IndexTest <fileInput> <fileOutput>
$> java IndexTest smsspam.small.arff result.arff
===== Loaded dataset: smsspam.small.arff =====
Started indexing at: 1371939800703
===== Filtering dataset done =====
Finished indexing at: 1371939800812
Total indexing time: 109
===== Saved dataset: result.arff =====
$> more result.arff
@relation 'sms_test-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000000-prune-rate-1.0-N0-L-stemmerweka.core.stemmers.NullStemmer-M1-O-
tokenizerweka.core.tokenizers.NGramTokenizer -delimiters "\\W" -max 1 -min 1'

@attribute spamclass {spam,ham}
@attribute 000 numeric
@attribute 03 numeric
@attribute 07046744435 numeric
@attribute 07732584351 numeric
../..

As a note, the name of the relation in the generated ARFF file (the @relation tag) encodes the properties of the applied filter, including some default options I have not explicitly configured.

So that is all. More examples on these topics are coming in the next weeks. And as always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!

19.6.13

Comparing baselines of keyword and learning based sentiment analysis

In my previous post, I presented a simple example of using WEKA for Sentiment Analysis (or Opinion Mining). As with most of my blog posts on text mining with WEKA, I approach interesting, hot or easy tasks as a way to present this package's capabilities for text mining; in consequence, these posts are tutorials in essence.

In that particular post, I left several open tasks for anybody who may be interested in completing them, and I picked two for myself. One of the tasks left for the reader was coding a class and training a model to actually classify texts according to sentiment; as I have been asked for the code, I did it myself, and it is available at my GitHub repository.

Another task I left pending, and picked for myself, was applying a keyword-based approach using SentiWordNet to the same (SFU Review Corpus) collection and comparing its accuracy to the learning (WEKA) approach. So this is the topic of this post.

Goal

The goal of this post is to build a simple keyword-based sentiment analysis program based on SentiWordNet and evaluate it on the SFU Review Corpus, in order to compare its accuracy with the one obtained via (WEKA) learning as described in my previous post "Baseline Sentiment Analysis with WEKA".

About SentiWordNet

SentiWordNet is a collection of concepts (synonym sets, synsets) from WordNet that have been evaluated from the point of view of their polarity (whether they convey a positive or a negative feeling). Some interesting features include:

  • As it is based on WordNet, only English and the four most significant parts of speech (nouns, adjectives, adverbs and verbs) are covered. Multi-word expressions are included, encoded with underscore (e.g. "too_bad", "at_large").
  • Each concept has attached polarity scores. For instance:

# POS ID PosScore NegScore SynsetTerms Gloss
a 01125429 0 0.625 bad#1 having undesirable or negative qualities; "a bad report card"; "his sloppy appearance made a bad impression"; "a bad little boy"; "clothes in bad shape"; "a bad cut"; "bad luck"; "the news was very bad"; "the reviews were bad"; "the pay is bad"; "it was a bad light for reading"; "the movie was a bad choice"
a 01052038 0.222 0.778 too_bad#1 regrettable#1 deserving regret; "regrettable remarks"; "it's regrettable that she didn't go to college"; "it's too bad he had no feeling himself for church"

So SentiWordNet comes in a tab-separated format: the first column is the Part Of Speech (POS), the second the synset ID, the third and fourth the positive and negative polarity scores (between 0 and 1), the fifth the synset (synonym set: a list of synonyms tagged with their sense as word#sense_number), and the last one the WordNet gloss (roughly speaking, the definition).
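As an illustration, a single line of this tab-separated format can be parsed as in the following sketch (the gloss is abbreviated here for readability):

public class SWNLineDemo {
    public static void main(String[] args) {
        // Columns: POS, ID, PosScore, NegScore, SynsetTerms, Gloss
        String line = "a\t01125429\t0\t0.625\tbad#1\thaving undesirable or negative qualities";
        String[] cols = line.split("\t");
        String pos = cols[0];                          // "a" (adjective)
        double posScore = Double.parseDouble(cols[2]); // 0.0
        double negScore = Double.parseDouble(cols[3]); // 0.625
        String[] terms = cols[4].split(" ");           // ["bad#1"]
        System.out.println(pos + " " + terms[0] + " " + posScore + "/" + negScore);
    }
}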

Another interesting feature is that the SentiWordNet researchers have provided us with a very basic Java class named SWN3.java to query the database for a word/POS pair. This class loads the database and provides a function that outputs "positive", "strong_positive", "negative", "strong_negative" or "neutral" for a given pair, according to the manual scores assigned to the synsets. It is very basic because it performs neither Word Sense Disambiguation nor POS Tagging, and the labels are heuristically defined (other definitions are possible). However, we can take advantage of it to implement a very basic sentiment classifier, as described below.

In order to make use of the SWN3.java class, you have to:

  1. Download a copy of SentiWordNet.
  2. Rename the file to SentiWordNet_3.0.0.txt and put it in a data folder -- relative to the place you located your SWN3.java file. Alternatively, you can modify this class to use a different path or data file name.
  3. Delete all lines starting with the symbol "#" from the SentiWordNet_3.0.0.txt file. HINT: these are the header and the last line of the file.

And that's it.
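Once the data file is in place, querying the class is simple. Here is a minimal sketch, assuming SWN3 exposes a no-argument constructor that loads the database from the data folder, plus the extract(String word, String pos) function described below:

public class SWN3Demo {
    public static void main(String[] args) {
        SWN3 swn = new SWN3();                  // loads data/SentiWordNet_3.0.0.txt
        String label = swn.extract("bad", "a"); // "a" stands for adjective
        System.out.println(label);              // expected output: "negative"
    }
}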

The Algorithm and Its Parameters/Heuristics

I have sketched a very simple algorithm for sentiment classification using the SWN3.java querying class described above. Given the output of its function public String extract(String word, String pos), that is, "positive" etc., the algorithm consists of:

  1. Tokenize the target text into alphanumeric strings (eventually, words).
  2. Start with a polarity score of 0.
  3. For each token, look it up with the extract function and add +1 (positive), +2 (strong_positive), -1 (negative) or -2 (strong_negative) to the score.
  4. Return "yes" if the final polarity score is over 0, and "no" if it is below 0.

Let me recall that the class tags used in the SFU Review Corpus are "yes" (positive) and "no" (negative).
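Put together, the core of the algorithm looks like the following sketch (hypothetical method name; the actual SentiWordNetDemo.java methods additionally vary the POS used and the tie-breaking rule, as discussed below):

public class PolarityScorer {
    // Score a text against SentiWordNet; adjectives only, ties default to "no"
    public static String classify(String text, SWN3 swn) {
        int score = 0;
        for (String token : text.split("\\W+")) {             // step 1: tokenize
            String label = swn.extract(token.toLowerCase(), "a");
            if ("strong_positive".equals(label))      score += 2;
            else if ("positive".equals(label))        score += 1;
            else if ("negative".equals(label))        score -= 1;
            else if ("strong_negative".equals(label)) score -= 2;
        }
        return (score > 0) ? "yes" : "no";                    // the ADJ/N variant
    }
}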

That's all. No rocket science here.

However, there are two basic parameters:

  • What should we do if we get a neutral score (0)? We can default to positive (Y: return "yes" when the score is greater than or equal to 0) or to negative (N: return "no" when the score is less than or equal to 0).
  • Which Part of Speech should we use in the SentiWordNet search? I have crafted two options: (1) looking up (and summing over) all available POS (AllPOS), and (2) looking up only adjectives (ADJ).

So I have coded four methods, named classifyAllPOSY(), classifyAllPOSN(), classifyADJY() and classifyADJN() for the four possible combinations. These functions are available in the SentiWordNetDemo.java class at the GitHub repository. And these are the approaches I test below.

The rationale for the first parameter is that we have a 50% class balance among the 400 reviews, so it is not clear which default we should prefer. In an imbalanced problem, we could choose the most populated class. An alternative is analyzing SentiWordNet to check whether it is positively or negatively biased (that is, whether it has more positive or negative words), or even refining this with an additional corpus (counting words and weighting according to the frequencies of positive/negative words).

The rationale for the second parameter is that adjectives tend to be less ambiguous (discarding sarcasm or irony), but it is easy to test with any other POS. Using all of them is incorrect (as every word has only one POS in a given context), but it is practical, and it will give more extreme scores (assuming that a negative word is negative under each of its possible POS).

Results and Analysis

So we are testing four approaches, and I will be using the same metrics as in the previous blog post on sentiment analysis with WEKA, namely averaged F1 and accuracy (along with the Confusion Matrix itself). The test is performed over the 400 text documents in the dataset, as this algorithm requires no training. The following table shows the results I have obtained:

I have added to the table the two best-performing configurations of the learning-based classifiers presented in the previous blog post. However, the comparison is not 100% fair, as the learning approach has been evaluated by 10-fold Cross-Validation, which involves using the full dataset as test set, but in 10%-sized batches.

All in all, it seems that the keyword-based (SentiWordNet) approach is competitive (it beats many learning-based classifiers in my previous experiment), getting its best results when using only adjectives and outputting "no" in case of neutral scores. The effectiveness on the "yes" class is better than that of the SVMs with 1-to-3-grams, in terms of recall. I believe that, with some adjustments, the keyword-based approach can be very competitive in this case, and it has the additional advantage that it does not rely on the quality or amount of training data.

Comparing the parameters, the default "no" is consistently better than the default "yes". Using all POS is worse than using only adjectives: even in the case of default "yes" (where the adjectives-only setup is beaten by both ALL setups in terms of accuracy), we get more balanced decisions, while the ALL setup leads to extremely positive scores and a clear bias towards the "yes" class.

Concluding Discussion

As discussed above, I consider this test a baseline because of the number of simple heuristics employed in the algorithm. Actually, there are a number of possible improvements to be made, although some of them are not trivial. I tag them as [easy|hard] according to my experience in text mining. For instance:

  • Recognizing multi-word expressions [easy]. This can be done by making simple searches for token n-grams in the SentiWordNet database, just modifying the SentiWordNetDemo class.
  • Using a validation dataset to optimize the score threshold [easy]. We have assumed that an overall score of 0 is neutral, and tested classifying it as positive or negative (the second option being better). We have general evidence that the database is positively oriented, so we can set a threshold over 0 (e.g. 10, 20...) for classifying a text as positive, in order to correct this effect. The simplest way of doing this is selecting 10% of the corpus as a validation set, sorting the decisions according to the score, and defining the threshold that optimizes the accuracy (or F1) -- see the sketch after this list.
  • Testing different scoring models [easy], e.g. modifying the SWN3.java program to output the original scores instead of tags and using those scores in the final polarity computation. Alternatively, we can play with different definitions of "strong_positive" etc. in terms of the weights [easy], or use different score thresholds for assigning the polarity labels [easy]. These can be more difficult to test, but we can use a validation set as in the previous point.
  • Performing POS Tagging by using the majority tag [easy], coding a POS Tagger based on learning [hard], or using an existing off-the-shelf POS Tagger (like e.g. Freeling or CoreNLP) [easy]. After using a POS Tagger, the tags must be normalized or processed in order to retain the basic POS, as most POS Taggers make use of sophisticated tag sets that represent morphology and so on. Obviously, the algorithm should then be changed to search only for the appropriate POS tag.
  • Performing Word Sense Disambiguation by using the first sense [easy], coding a WSD system based on learning using a dataset like Semcor [hard], coding a WSD system based on dictionaries -- e.g. using the WordNet glosses in the database itself [easy], or using an existing off-the-shelf WSD system like e.g. SenseLearner [easy]. You may need to perform data transformations if different database versions (and formats) are used for WSD and for sentiment analysis.
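Regarding the threshold optimization, a minimal sketch could look like this (a hypothetical helper, not part of the repository; scores[i] is the summed polarity of validation document i, and labels[i] is true when its class is "yes"):

public class ThresholdTuner {
    // Return the score threshold in [-20, 20] that maximizes validation accuracy
    static int bestThreshold(int[] scores, boolean[] labels) {
        int best = 0;
        double bestAcc = -1.0;
        for (int t = -20; t <= 20; t++) {
            int hits = 0;
            for (int i = 0; i < scores.length; i++)
                if ((scores[i] > t) == labels[i]) hits++;
            double acc = (double) hits / scores.length;
            if (acc > bestAcc) { bestAcc = acc; best = t; }
        }
        return best;
    }
}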

Beyond these concrete improvements, more exploratory work is certainly possible. However, I am not sure if I will be making any other tests with the keyword-based approach to sentiment analysis, as I want to keep my focus on WEKA features for text mining.

Anyway, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

11.6.13

Baseline Sentiment Analysis with WEKA

Sentiment Analysis (and/or Opinion Mining) is one of the hottest topics in Natural Language Processing nowadays. The task, defined in a simplistic way, consists of determining the polarity of a text utterance according to the opinion or sentiment of the speaker or writer, as positive or negative. This task has multiple applications, including e.g. Customer Relationship Management or predicting the outcome of political elections.

While initial results dating back to the early 2000s seem very promising, it is not such a simple task. The challenges range from informal Twitter language to the fact that opinions can be faceted (for instance, I may like the software but not the hardware of a device), or opinion spam and fake reviews, along with traditional and complex problems in Natural Language Processing such as irony, sarcasm or negation. For a good overview of the task, please check the survey paper on opinion mining and sentiment analysis by Bo Pang and Lillian Lee. A more practical overview is the Sentiment Tutorial with LingPipe by Alias-i.

In general, there are two main approaches to this task:

  • Counting and/or weighting sentiment-related words that have been evaluated and tagged by experts, forming a lexical collection like SentiWordNet.
  • Learning a text classifier on a previously labelled text collection, like e.g. the SFU Review Corpus.

The SentiWordNet home page offers a simple Java program that follows the first approach. I will follow the second one in order to show how to use an essential WEKA text mining class (weka.core.converters.TextDirectoryLoader), and to provide another example of the weka.filters.unsupervised.attribute.StringToWordVector class.

I will follow the process outlined in the previous post about Language Identification using WEKA.

Data Collection and Preprocessing

For this demonstration, I will make use of a relatively small but interesting dataset named the SFU Review Corpus. This corpus consists of 400 reviews in English extracted from the Epinions website in 2004, divided into 25 positive and 25 negative reviews for each of 8 product categories (Books, Cars, Computers, etc.). It also contains 400 reviews in Spanish extracted from Ciao.es, divided into the same categories (except that the Cookware category in English maps, more or less, to Lavadoras -- Washing Machines -- in Spanish).

The original format of the collection is one directory per product category, each including 25 positive reviews with the word "yes" in the file name and 25 negative reviews with the word "no" in the file name. Unfortunately, this format cannot be used directly in WEKA, but a couple of handy scripts transform it into a new format: two directories, one including the positive reviews (directory yes), and the other one including the negative reviews (directory no). I have kept the category in the name of the files (with patterns like bookyes1.txt) in order to allow a more detailed analysis per category.
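For the record, such a reorganization only takes a few lines of code. A hypothetical sketch (my actual scripts in the repository may differ) that copies each review into yes/ or no/ depending on its file name:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ReorganizeCorpus {
    public static void main(String[] args) throws IOException {
        Path src = Paths.get("SFU_Review_Corpus_Raw");  // one directory per category
        Path dst = Paths.get("SFU_Review_Corpus_WEKA"); // one directory per class
        Files.createDirectories(dst.resolve("yes"));
        Files.createDirectories(dst.resolve("no"));
        try (DirectoryStream<Path> categories = Files.newDirectoryStream(src)) {
            for (Path category : categories) {
                try (DirectoryStream<Path> files = Files.newDirectoryStream(category)) {
                    for (Path file : files) {
                        // Positive reviews contain "yes" in the file name
                        String cls = file.getFileName().toString().contains("yes") ? "yes" : "no";
                        Files.copy(file, dst.resolve(cls).resolve(file.getFileName()),
                                   StandardCopyOption.REPLACE_EXISTING);
                    }
                }
            }
        }
    }
}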

Comparing the structure of the original and the new format of the text collections:

In order to construct an ARFF file from this structure, we can use the weka.core.converters.TextDirectoryLoader class, which is an evolution of a previously existing helper class named TextDirectoryToArff.java and available at WEKA Documentation at wikispaces. Using this class is as simple as issuing the next command:

$> java weka.core.converters.TextDirectoryLoader -dir SFU_Review_Corpus_WEKA > SFU_Review_Corpus.arff

You have to run this command from the parent directory of SFU_Review_Corpus_WEKA; the -dir parameter sets the input directory. This class expects a single directory containing one subdirectory per class value (yes and no in our case), which in turn contain the files pertaining to the corresponding classes. As the output of this command goes to the standard output, I redirect it to a file.
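The same conversion can also be done programmatically, which is handy inside a larger application (a minimal sketch using the same class):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

public class CorpusLoaderDemo {
    public static void main(String[] args) throws Exception {
        TextDirectoryLoader loader = new TextDirectoryLoader();
        // Parent directory containing one subdirectory per class (yes/, no/)
        loader.setDirectory(new File("SFU_Review_Corpus_WEKA"));
        Instances data = loader.getDataSet();
        System.out.println(data.numInstances() + " documents loaded");
    }
}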

I have left the output of the execution of this command for both the English (SFU_Review_Corpus.arff) and the Spanish (SFU_Spanish_Review.arff) collections at the OpinionMining folder of my GitHub repository.

Data Analysis

Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, a trivial analysis of the problem easily leads us to think that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can be better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose to compare word unigrams, 3-grams, and 1-to-3-grams. The latter representation includes uni-grams up to 3-grams, with the hope of getting the best of all of them.

Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can use standard punctuation (for each of the languages) as the texts are long comments (several paragraphs each), I derive the following calls to the weka.filters.unsupervised.attribute.StringToWordVector class:

$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff

We follow the notation vector.uni to denote that the dataset is vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are similar to these ones.

The key point in these calls is that we are no longer using the weka.core.tokenizers.WordTokenizer class. Instead, we are using weka.core.tokenizers.NGramTokenizer, which uses the options -min and -max to set the minimum and maximum size of the n-grams. More importantly, there is a major difference between the two classes regarding the usage of delimiters:

  • The weka.core.tokenizers.WordTokenizer class uses the deprecated Java class java.util.StringTokenizer, even in the latest versions of the WEKA package (as of this writing). In StringTokenizer, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly define which characters will be the "spaces" in your text.
  • The weka.core.tokenizers.NGramTokenizer class uses the recommended Java String method String[] split(String regex), in which the argument (and thus the delimiters string) is a Regular Expression (regex) in Java. The text is split into tokens separated by substrings that match the regex, so you can use the full power of regexes, including e.g. special codes for character classes. In this case I am using the code \W, which denotes any non-word character, in order to get only alphanumeric character sequences, as illustrated below.
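The contrast between the two mechanisms can be seen in a small standalone sketch (plain Java, independent of WEKA):

import java.util.StringTokenizer;

public class DelimiterDemo {
    public static void main(String[] args) {
        String text = "very bad, indeed!";
        // StringTokenizer (WordTokenizer-style): delimiters are literal characters
        StringTokenizer st = new StringTokenizer(text, " ,!");
        while (st.hasMoreTokens()) System.out.println(st.nextToken());
        // String.split() (NGramTokenizer-style): the delimiter is a regex
        for (String token : text.split("\\W+")) System.out.println(token);
        // Both loops print: very, bad, indeed
    }
}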

After splitting the text into word n-grams (or more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are most predictive. As in the Language Identification post, we make use of the weka.filters.supervised.attribute.AttributeSelection class:

$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff

After the selection of the most predictive n-grams, we get the following statistics in the test collections:

The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections have comparable statistics (in the same order of magnitude). The numbers of original unigrams are quite similar, but there are fewer bigrams and trigrams in Spanish (despite the fact that there are more isolated words -- unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and bigrams, but less so for trigrams.

Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by a factor of 4 to 5). However, trigrams alone add only a small increment of features, so bigrams must play the major role here. The number of features is quite manageable, and allows us to run quick experiments.

As discussed in my previous posts on setting up experiments with WEKA text classifiers and on chaining filters and classifiers, note that these are not the final features if we configure a cross-validation experiment: we have to chain the filters (StringToWordVector and AttributeSelection) and the classifier in order to perform a valid experiment, as the features for each fold should be different.
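In Java, this chaining corresponds to a FilteredClassifier wrapping a MultiFilter, as in the following sketch (filters shown with default options; configure the tokenizer and the InfoGain/Ranker selection as in the commands above):

import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.meta.FilteredClassifier;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainedClassifierDemo {
    public static FilteredClassifier build() {
        // Chain vectorization and feature selection so that both are
        // re-computed on the training portion of every cross-validation fold
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] {
            new StringToWordVector(),  // set the tokenizer, -W, etc. as needed
            new AttributeSelection()   // set the evaluator and search as needed
        });
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(multi);
        fc.setClassifier(new NaiveBayes());
        return fc;
    }
}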

Experiments and Results

In order to simplify the example, and expecting to get good results, we will use the same algorithms we used in the Language Identification problem. These are: Naive Bayes (NB, weka.classifiers.bayes.NaiveBayes), PART (weka.classifiers.rules.PART), J48 (weka.classifiers.trees.J48), k-Nearest Neighbors (kNN, weka.classifiers.lazy.IBk) with k = 1, 3, 5, and Support Vector Machines (weka.classifiers.functions.SMO); all of them with the default options, except for kNN, which is run with 1, 3 and 5 neighbors. I test the three proposed representations (based on unigrams, trigrams and 1-to-3-grams) by 10-fold cross-validation. An example experiment command line is the following one:

$> java weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer \\\"weka.core.tokenizers.NGramTokenizer -delimiters \\\\\\\"\\\\\\\W\\\\\\\" -min 1 -max 1\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.bayes.NaiveBayes -v -i -t SFU_Review_Corpus.arff > tests/uniNB.txt

You can change the size of the n-grams with the -min and -max parameters, and the learning algorithm with the outermost -W option. I store the results in a tests folder, in files named with the convention <rep><alg>.txt. The results of this test for the English language collection are the following ones:

Considering the class yes (positive sentiment) as the positive class, the columns show the True Positives (hits on the yes class), False Positives (members of the no class mistakenly classified as yes), False Negatives (members of the yes class mistakenly classified as no) and True Negatives (hits on the no class), along with the macro-averaged F1 (the average of the F1 scores over both classes) and the overall accuracy.
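For reference, these figures derive from the confusion matrix as in the following sketch (the counts used here are hypothetical, purely for illustration):

public class MetricsDemo {
    // Per-class F1 from that class's true positives, false positives and false negatives
    static double f1(double tp, double fp, double fn) {
        double precision = tp / (tp + fp);
        double recall = tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }
    public static void main(String[] args) {
        double tp = 150, fp = 60, fn = 50, tn = 140;  // hypothetical counts
        double f1Yes = f1(tp, fp, fn);
        double f1No  = f1(tn, fn, fp);  // the error roles swap for the "no" class
        double macroF1  = (f1Yes + f1No) / 2;
        double accuracy = (tp + tn) / (tp + fp + fn + tn);
        System.out.printf("macro F1 = %.3f, accuracy = %.3f%n", macroF1, accuracy);
    }
}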

Additionally, the results for the Spanish language collection are the following ones:

So these are the results. Let us start the analysis...

Results Analysis

We can perform an analysis regarding different aspects:

  • What is the overall performance?
  • How does the performance compare across the two languages?
  • Which are the best learning algorithms?
  • What effect do the different text representations have on classifier performance?

All in all, and taking into account that the class balance is 50% (thus a trivial acceptor, a trivial rejector, or a random classifier would reach 50% accuracy), most of the classifiers beat this baseline but not by a wide margin; even the best one across all algorithms, languages and representations (SVMs on English 1-to-3-grams) reaches only a modest 71%, far from a satisfying 90% or more. Let me recall that we are facing a relatively simple problem: long texts, few of them, and a binary classification. Most approaches in the literature get much better results in similar setups.

Results are better for English than for Spanish, comparing configurations one to one. To explain this, I examine the representations used in Spanish, listing the top 20 n-grams for each representation:

Some of the n-grams (highlighted in italics) are just incorrect, because of the incorrect recognition of accents due to the inappropriate pattern I have used in the tokenization step. The tokenizer makes use of the string "\W" in order to recognize alphanumeric strings -- which in Java do not include accented vowels ("á", "é", "í", "ó", "ú") and other language-specific symbols (e.g. "ñ"). Moreover, most of the n-grams are not opinionated words or expressions; instead, they are either intensifiers (like e.g. "muy" -- "very") or just contingent (dependent on the training collection, e.g. "en el taller" -- "in the garage"; "tarjeta de memoria" -- "memory card"). The clearly opinionated words are highlighted in boldface, and there are very few of them. So regarding this issue, we can conclude that the training collection is too small.

If we examine the performance of the different classifiers, we can cluster them into three groups: top performers (SVMs, NB), medium performers (PART, J48) and losers for this problem (kNN). These groups are intuitive:

  • Both SVMs and NB have often demonstrated high performance on sparse datasets, and on text classification problems in particular. They both build a linear classifier with weights (or probabilities) for each of the features. Linear classifiers perform well here given that the dataset is built on representations that clearly promote over-fitting, as we have seen that many of the most predictive n-grams are collection-dependent.
  • Both PART and J48 (C4.5) are based on reducing error by progressively partitioning the dataset according to tests on the most predictive features. But the predictive features we have for such a small collection are not very good, indeed.
  • All versions of kNN perform very badly, most likely because the dataset is sparse and relatively small.

However, we have to keep in mind that we have used the algorithms with their default configurations. For instance, kNN allows using the cosine similarity instead of the Euclidean distance, the cosine similarity being much better suited to text classification problems, as demonstrated many times during 50 years of research in Information Retrieval.

Regarding dataset representations, the behavior is not uniform: we do not systematically get better results with one representation over the others. In general, 1-to-3-grams perform better than the other representations in English, while unigrams are best in Spanish, and trigrams are most often the worst representation for both languages. If we focus on the top-performing classifiers (NB and SVMs), this latter observation always holds. In consequence, trigrams have, to some extent, demonstrated their power in English (as a complement to uni- and bigrams), but not in Spanish (keeping in mind that the Spanish representation is flawed because of the character-encoding issue).

Concluding Remarks

So all in all, we have a baseline learning-based method for Sentiment Analysis in English (and probably in Spanish, after correcting the representation), which is -- not surprisingly -- based on 1-to-3-grams and Support Vector Machines. And it is a baseline because its performance is relatively poor (an accuracy of 71%), and we have not yet taken full advantage of the configuration, text representation and other parameters.

After this long (again!) post, I propose the next steps -- some of them left for the reader as an exercise:

  • Build a Java class that classifies text files according to their sentiment, for English at least, taking my previous post on Language Identification as an example -- left for the reader.
  • Test other algorithms; in particular, play with the SVM configuration, and add Boosting (using weka.classifiers.meta.AdaBoostM1) to Naive Bayes -- left for the reader.
  • Check differences of accuracy in terms of product type -- cars, movies, etc. -- left for the reader.
  • Improve the Spanish language representation using the appropriate regex in the tokenizer to cover Spanish letters and accents -- I will take this one myself.
  • Check the accuracy of the basic keyword-based algorithm available in the SentiWordNet page -- I will take this one as well.

So that is all for the moment. You can expect one or more posts from me on this hot topic. Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!