In my previous post, I presented a simple example of using WEKA for Sentiment Analysis (or Opinion Mining). As in most of my blog posts on text mining with WEKA, I approach interesting, hot or easy tasks as a way to present the capabilities of this package for text mining -- in consequence, these posts are tutorials in essence.
In that particular post, I left several open tasks for anybody who may be interested in completing them, and I picked two for myself. One of the tasks left for the reader was coding a class and training a model to actually classify texts according to sentiment -- and as I have been asked for the code, I did it myself and it is available at my GitHub repository.
Another task I left pending, and picked for myself, was applying a keyword-based approach using SentiWordNet to the same (SFU Review Corpus) collection and comparing its accuracy to the learning (WEKA) approach. So this is the topic of this post.
The goal of this post is to build a simple keyword-based sentiment analysis program based on SentiWordNet and evaluate it on the SFU Review Corpus, in order to compare its accuracy with the one obtained via (WEKA) learning as described in my previous post "Baseline Sentiment Analysis with WEKA".
SentiWordNet is a collection of concepts (synonym sets, synsets) from WordNet that have been evaluated from the point of view of their polarity (whether they convey a positive or a negative feeling). Some interesting features include:
- As it is based on WordNet, only English and the four most significant parts of speech (nouns, adjectives, adverbs and verbs) are covered. Multi-word expressions are included, encoded with underscore (e.g. "too_bad", "at_large").
- Each concept has attached polarity scores. For instance:
# POS ID PosScore NegScore SynsetTerms Gloss
a 01125429 0 0.625 bad#1 having undesirable or negative qualities; "a bad report card"; "his sloppy appearance made a bad impression"; "a bad little boy"; "clothes in bad shape"; "a bad cut"; "bad luck"; "the news was very bad"; "the reviews were bad"; "the pay is bad"; "it was a bad light for reading"; "the movie was a bad choice"
a 01052038 0.222 0.778 too_bad#1 regrettable#1 deserving regret; "regrettable remarks"; "it's regrettable that she didn't go to college"; "it's too bad he had no feeling himself for church"
So SentiWordNet comes in a tab-separated format: the first column is the Part Of Speech (POS), the second and third are the polarity scores (between 0 and 1), the next column is the synset (synonym set; a list of synonyms tagged with their sense -- word#sense_number), and the last one is the WordNet gloss (roughly speaking, the definition).
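As a minimal sketch of that format, the following Java snippet parses one such tab-separated line into its fields. The class and field names are my own; only the column layout comes from the database itself:

```java
// Sketch of a parser for one SentiWordNet 3.0 line (tab-separated columns:
// POS, synset id, positive score, negative score, synset terms, gloss).
public class SwnLineParser {

    public static final class Entry {
        public final String pos;      // a, n, v or r
        public final String id;      // synset offset in WordNet
        public final double posScore; // positivity, in [0, 1]
        public final double negScore; // negativity, in [0, 1]
        public final String[] terms;  // synonyms tagged as word#sense_number
        public final String gloss;    // definition plus usage examples

        Entry(String pos, String id, double p, double n, String[] t, String g) {
            this.pos = pos; this.id = id; this.posScore = p;
            this.negScore = n; this.terms = t; this.gloss = g;
        }
    }

    // Returns null for header/comment lines (those starting with '#').
    public static Entry parseLine(String line) {
        if (line.startsWith("#")) return null;
        String[] f = line.split("\t");
        return new Entry(f[0], f[1],
                Double.parseDouble(f[2]), Double.parseDouble(f[3]),
                f[4].split(" "), f[5]);
    }
}
```

For instance, parsing the "too_bad" line above yields a positive score of 0.222, a negative score of 0.778, and two tagged terms.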
Another interesting feature is that the SentiWordNet researchers have provided us with a very basic Java class named SWN3.java to query the database for a word/POS pair. This class loads the database and provides a function that outputs a label like "strong_negative" or "neutral" for a given pair, according to the manual scores assigned to the synsets. It is very basic because it does not perform Word Sense Disambiguation or even POS Tagging, and the labels are heuristically defined (some other definitions are possible). However, we can take advantage of it in order to implement a very basic sentiment classifier, as described below.
In order to make use of the SWN3.java class, you have to:
- Download a copy of SentiWordNet.
- Rename the file to SentiWordNet_3.0.0.txt and put it in a data folder -- relative to the place where you located your SWN3.java file. Alternatively, you can modify this class to use a different path or data file name.
- Delete all lines starting with the symbol "#" from the SentiWordNet_3.0.0.txt file. HINT: these are the header and the last line of the file.
And that's it.
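If you prefer not to edit the file by hand, the comment-stripping step can be sketched in Java. The file path follows the layout described above; the class name and the idea of overwriting the file in place are my own choices:

```java
// One-off helper (not part of the SentiWordNet distribution) that rewrites
// data/SentiWordNet_3.0.0.txt, dropping every line that starts with '#',
// as SWN3.java expects a file without header or trailer comments.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

public class StripComments {

    // Keeps only the lines that do not start with the '#' symbol.
    public static List<String> strip(List<String> lines) {
        return lines.stream()
                .filter(l -> !l.startsWith("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data", "SentiWordNet_3.0.0.txt");
        Files.write(file, strip(Files.readAllLines(file)));
    }
}
```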
The Algorithm and Its Parameters/Heuristics
I have sketched a very simple algorithm for sentiment classification using the querying class SWN3.java provided with SentiWordNet. Given the output of its function public String extract(String word, String pos), which is "positive" etc., the algorithm consists of:
- Tokenizing the target text into alphanumeric strings (eventually, words).
- Starting a polarity score at 0.
- For each token, searching for it using the extract function and adding +1 (positive), +2 (strong_positive), -1 (negative) or -2 (strong_negative) to the score.
- Returning "yes" if the final polarity score is over 0, and "no" if it is below 0.
Let me remind you that the class tags used in the SFU Review Corpus are "yes" (positive) and "no" (negative).
That's all. No rocket science here.
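The steps above can be sketched as follows. The Lexicon interface stands in for SWN3.extract so the loop can be shown (and tested) without loading the database; apart from the label strings, all names here are my own:

```java
// Sketch of the keyword-based scoring loop: tokenize into alphanumeric
// strings, sum label weights, and map the total to the SFU tags "yes"/"no".
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolarityScorer {

    public interface Lexicon {
        // Same contract as SWN3.extract: returns "strong_positive",
        // "positive", "neutral", "negative" or "strong_negative".
        String extract(String word, String pos);
    }

    private static final Pattern TOKEN = Pattern.compile("[a-zA-Z0-9]+");

    public static String classify(String text, Lexicon lex, String pos) {
        int score = 0;
        Matcher m = TOKEN.matcher(text.toLowerCase());
        while (m.find()) {
            String label = lex.extract(m.group(), pos);
            if (label == null) continue;
            switch (label) {
                case "strong_positive": score += 2; break;
                case "positive":        score += 1; break;
                case "negative":        score -= 1; break;
                case "strong_negative": score -= 2; break;
                default: break; // neutral or unknown tokens do not count
            }
        }
        return score > 0 ? "yes" : "no"; // ties default to "no" in this sketch
    }
}
```

Note that the tie-breaking choice on a score of 0 is exactly the first parameter discussed next.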
However, there are two basic parameters:
- What to do if you get a neutral score (0)? We can be positive (Y, return "yes" when the score is greater than or equal to 0), or negative (N, return "no" when the score is less than or equal to 0).
- Which Part of Speech do we use in the SentiWordNet search? I have crafted two options: (1) looking up (and summing over) all available POS (AllPOS), and (2) looking up only adjectives (ADJ).
So I have coded four methods, one per possible combination of these two parameters, named after the POS option and the default class (e.g. classifyADJN()). These functions are available in the SentiWordNetDemo.java class at the GitHub repository, and these are the approaches I test below.
The rationale for the first parameter is that we have a 50/50 class balance among the 400 reviews, so it is not clear which default we should prefer. In an imbalanced problem, we could choose the most populated class. An alternative is analyzing SentiWordNet to check whether it is positively or negatively biased (that is, whether it has more positive or negative words), or even refining this with an additional corpus (counting words and weighting according to the frequencies of positive/negative words).
The rationale for the second parameter is that adjectives tend to be less ambiguous (sarcasm and irony aside), but it is easy to test with any other POS. Using all of them is incorrect (as every word has only one POS in a given context), but it is practical, and it will give more extreme scores (assuming that a negative word is negative under each of its possible POS).
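Since the two parameters are independent, the four variants can be folded into one parameterized decision routine. This is a sketch of my own (the GitHub code keeps separate classify* methods such as classifyADJN() instead):

```java
// Sketch of the two parameters behind the four classify* variants:
// defaultYes controls the answer on a neutral (0) score, and
// adjectivesOnly controls which POS are queried in SentiWordNet.
public class ClassifierVariants {

    // Maps a final polarity score to the SFU class tags.
    public static String decide(int score, boolean defaultYes) {
        if (score > 0) return "yes";
        if (score < 0) return "no";
        return defaultYes ? "yes" : "no"; // the Y/N parameter
    }

    // The AllPOS/ADJ parameter: which POS tags to look up (and sum over).
    public static String[] posToQuery(boolean adjectivesOnly) {
        return adjectivesOnly ? new String[]{"a"}
                              : new String[]{"n", "a", "r", "v"};
    }
}
```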
Results and Analysis
So we are testing four approaches, and I will be using the same metrics as in the previous blog post on sentiment analysis with WEKA, which are averaged F1 and accuracy (along with the Confusion Matrix itself). The test is performed over the 400 text documents in the dataset, as we do not need training for this algorithm. The following table shows the results I have obtained:
I have added to the table the two best-performing configurations for a learning-based classifier, as presented in the previous blog post. However, the comparison is not 100% fair, as the learning approach has been evaluated by 10-fold Cross-Validation -- which involves using the full dataset as a test set, but in 10%-sized batches.
All in all, it seems that the keyword-based (SentiWordNet) approach is competitive (it beats many learning-based classifiers in my previous experiment), getting its best results when using only adjectives and outputting "no" in case of neutral scores. The effectiveness on the "yes" class is better than that of the SVMs with 1-to-3-grams, in terms of recall. I believe that, with some adjustments, the keyword-based approach can be very competitive in this case, and it has the additional advantage that it does not rely on the quality or amount of training data.
Comparing the parameters, the default "no" is consistently better than the default "yes". Using all POS is worse than using only adjectives because, even in the case of default "yes" (which is beaten by both ALL cases in terms of accuracy), we get more balanced decisions -- the ALL setup leads to extremely positive scores and a clear bias to the "yes" class.
As discussed above, I consider this test a baseline because of the number of simple heuristics employed in the algorithm. Actually, there are a number of possible improvements to be made, although some of them are not trivial. I tag them as [easy|hard] according to my experience in text mining. For instance:
- Recognizing multiword expressions [easy]. This can be done by making simple searches for token n-grams in the SentiWordNet database, just modifying the SWN3.java querying class.
- Using a validation dataset to optimize the score threshold [easy]. We have assumed that an overall score of 0 is neutral, and tested to classify it as positive or negative (being the second option better). We have general evidence that the database is positively oriented, so we can set a threshold over 0 (e.g. 10, 20...) for classifying a text as positive, in order to correct this effect. The most simple way of doing this is selecting a 10% of the corpus as a validation set, sorting the decisions according to the score, and defining a threshold that optimizes the accuracy (or F1).
- Testing different scoring models, e.g. by modifying the SWN3.java program to output the original scores instead of tags [easy] and using those scores for the final polarity computation. Alternatively, we can play with different definitions of "strong_positive" etc. in terms of the weights [easy], or use different scores for assigning polarity labels in the database [easy]. This can be more difficult to test, but we can use a validation set as in the previous point.
- Performing POS Tagging by using the majority tag [easy], coding a POS Tagger based on learning [hard], or using an existing off-the-shelf POS Tagger (e.g. Freeling or CoreNLP) [easy]. After using a POS Tagger, the tags must be normalized or processed in order to retain the basic POS, as most POS Taggers make use of sophisticated tag sets that represent morphology and so on. Obviously, the algorithm should be changed to search only for the appropriate POS tag.
- Performing Word Sense Disambiguation by using the first sense [easy], coding a WSD system based on learning using a dataset like SemCor [hard], coding a WSD system based on dictionaries -- e.g. using the WordNet glosses in the database itself [easy], or using an existing off-the-shelf WSD system like e.g. SenseLearner [easy]. You may need to perform data transformations, both if you use different database versions for WSD and for sentiment analysis, and in terms of format.
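As an illustration of the threshold-optimization idea in the list above, here is a small sketch of my own: given the (score, gold label) pairs of a validation set, it scans integer cutoffs and keeps the one that maximizes accuracy when "score >= threshold" is read as "yes". The names and the brute-force scan are assumptions, not code from the repository:

```java
// Picks the score threshold that maximizes accuracy on a validation set.
// scores[i] is the polarity score of document i; positive[i] is true when
// its gold class is "yes". Thresholds in [lo, hi] are tried exhaustively.
public class ThresholdTuner {

    public static int bestThreshold(int[] scores, boolean[] positive,
                                    int lo, int hi) {
        int best = lo, bestCorrect = -1;
        for (int t = lo; t <= hi; t++) {
            int correct = 0;
            for (int i = 0; i < scores.length; i++) {
                boolean predictedYes = scores[i] >= t;
                if (predictedYes == positive[i]) correct++;
            }
            if (correct > bestCorrect) { bestCorrect = correct; best = t; }
        }
        return best;
    }
}
```

The same scan could optimize F1 instead of accuracy by changing the inner count.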
In a more exploratory line of work, I suggest the following:
- Test the algorithm on other datasets, like the classical Movie Review Datasets by Bo Pang and Lillian Lee, or with other semantic lexicons (opinionated word databases), like the Opinion Lexicon by Bing Liu et al. or the Subjectivity Lexicon by Janyce Wiebe et al.
- Perform an exploratory analysis of the distribution of polarities at SentiWordNet and its implications on the basic algorithm.
I am not sure if I will be making any other tests with the keyword-based approach to sentiment analysis, as I want to keep my focus on WEKA features for text mining.
Anyway, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!