19.6.13

Comparing baselines of keyword and learning based sentiment analysis

In my previous post, I presented a simple example of using WEKA for Sentiment Analysis (or Opinion Mining). As in most of my blog posts on text mining with WEKA, I approach interesting, hot or easy tasks as a way to present this package's capabilities for text mining -- in consequence, these posts are tutorials in essence.

In that particular post, I left several open tasks for anybody who might be interested in completing them, and I picked two for myself. One of the tasks left for the reader was coding a class and training a model to actually classify texts according to sentiment -- and as I have been asked for the code, I wrote it myself and it is available at my GitHub repository.

Another task I left pending, and picked for myself, was applying a keyword-based approach using SentiWordNet to the same collection (the SFU Review Corpus) and comparing its accuracy to that of the learning (WEKA) approach. So that is the topic of this post.

Goal

The goal of this post is to build a simple keyword-based sentiment analysis program based on SentiWordNet and evaluate it on the SFU Review Corpus, in order to compare its accuracy with that obtained via learning with WEKA, as described in my previous post "Baseline Sentiment Analysis with WEKA".

About SentiWordNet

SentiWordNet is a collection of concepts (synonym sets, synsets) from WordNet that have been evaluated from the point of view of their polarity (whether they convey a positive or a negative feeling). Some interesting features include:

  • As it is based on WordNet, it covers only English and the four most significant parts of speech (nouns, adjectives, adverbs and verbs). Multi-word expressions are included, encoded with underscores (e.g. "too_bad", "at_large").
  • Each concept has attached polarity scores. For instance:

# POS ID PosScore NegScore SynsetTerms Gloss
a 01125429 0 0.625 bad#1 having undesirable or negative qualities; "a bad report card"; "his sloppy appearance made a bad impression"; "a bad little boy"; "clothes in bad shape"; "a bad cut"; "bad luck"; "the news was very bad"; "the reviews were bad"; "the pay is bad"; "it was a bad light for reading"; "the movie was a bad choice"
a 01052038 0.222 0.778 too_bad#1 regrettable#1 deserving regret; "regrettable remarks"; "it's regrettable that she didn't go to college"; "it's too bad he had no feeling himself for church"

So SentiWordNet is in a tab-separated format: the first column is the Part Of Speech (POS), the second and third ones are the polarity scores (between 0 and 1), the next column is the synset (synonym set, a list of synonyms tagged with their sense -- word#sense_number), and the last one is the WordNet gloss (roughly speaking, the definition).
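For illustration, parsing one of these lines is straightforward; here is a minimal sketch in Java (class and field names are mine, not part of any SentiWordNet distribution):

```java
// Minimal sketch of parsing one SentiWordNet 3.0 line into its fields.
// Assumes the tab-separated layout described above; class/field names are mine.
public class SwnLine {
    public final String pos;        // a, n, v, r
    public final String id;         // synset offset in WordNet
    public final double posScore;   // positivity score in [0,1]
    public final double negScore;   // negativity score in [0,1]
    public final String[] terms;    // word#sense_number entries
    public final String gloss;      // definition, examples

    public SwnLine(String line) {
        // Columns: POS, ID, PosScore, NegScore, SynsetTerms, Gloss
        String[] f = line.split("\t");
        pos = f[0];
        id = f[1];
        posScore = Double.parseDouble(f[2]);
        negScore = Double.parseDouble(f[3]);
        terms = f[4].split(" ");    // synset terms are space-separated
        gloss = f[5];
    }
}
```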

Another interesting feature is that the SentiWordNet researchers have provided a very basic Java class named SWN3.java to query the database for a word/POS pair. This class loads the database and provides a function that outputs "positive", "strong_positive", "negative", "strong_negative" or "neutral" for a given pair, according to the manual scores assigned to the synsets. It is very basic because it performs neither Word Sense Disambiguation nor POS Tagging, and the labels are heuristically defined (other definitions are possible). However, we can take advantage of it to implement a very basic sentiment classifier, as described below.

In order to make use of the SWN3.java class, you have to:

  1. Download a copy of SentiWordNet.
  2. Rename the file to SentiWordNet_3.0.0.txt and put it in a data folder -- relative to the location of your SWN3.java file. Alternatively, you can modify this class to use a different path or data file name.
  3. Delete all lines starting with the symbol "#" from the SentiWordNet_3.0.0.txt file. HINT: these are the header and the last line of the file.

And that's it.
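If you prefer not to edit the file by hand, step 3 can also be done programmatically; a minimal sketch (the class and method names are mine, not part of SWN3):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Helper matching step 3 above: strip every line starting with "#" from the
// raw SentiWordNet download. Class and method names are mine, not part of SWN3.
public class CleanSentiWordNet {

    // Keep only non-comment lines (those not starting with "#").
    static List<String> clean(List<String> lines) {
        return lines.stream()
                    .filter(l -> !l.startsWith("#"))
                    .collect(Collectors.toList());
    }

    // Reads the raw file and writes the cleaned copy expected by SWN3.java.
    static void cleanFile(Path raw, Path cleaned) throws IOException {
        Files.write(cleaned, clean(Files.readAllLines(raw)));
    }
}
```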

The Algorithm and Its Parameters/Heuristics

I have sketched a very simple algorithm for sentiment classification using the querying class SWN3.java described above. Given the output of its function public String extract(String word, String pos) -- that is, "positive" etc. -- the algorithm consists of:

  1. Tokenize the target text into alphanumeric strings (roughly, words).
  2. Start a polarity score at 0.
  3. For each token, look it up with the extract function and add +1 ("positive"), +2 ("strong_positive"), -1 ("negative"), or -2 ("strong_negative") to the score.
  4. Return "yes" if the final polarity score is over 0, and "no" if it is below 0.

Let me remind you that the class tags used in the SFU Review Corpus are "yes" (positive) and "no" (negative).

That's all. No rocket science here.

However, there are two basic parameters:

  • What to do if we get a neutral score (0)? We can default to positive (Y: return "yes" when the score is greater than or equal to 0) or to negative (N: return "no" when the score is less than or equal to 0).
  • Which Part of Speech do we use in the SentiWordNet search? I have crafted two options: (1) looking up (and summing over) all available POS (AllPOS), and (2) looking up only adjectives (ADJ).

So I have coded four methods, named classifyAllPOSY(), classifyAllPOSN(), classifyADJY() and classifyADJN() for the four possible combinations. These functions are available in the SentiWordNetDemo.java class at the GitHub repository. And these are the approaches I test below.
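The four variants share the same core loop; here is a minimal sketch of it (my own simplification, not the exact code in SentiWordNetDemo.java -- only extract(word, pos) and its output labels are taken from SWN3):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the classifier variants described above; my own
// simplification, not the exact code in SentiWordNetDemo.java. Assumes an
// SWN3-like lexicon whose extract(word, pos) returns "positive",
// "strong_positive", "negative", "strong_negative" or "neutral".
public class KeywordSentiment {
    interface Lexicon { String extract(String word, String pos); }

    // Map the label returned by the lexicon to its score contribution.
    static int weight(String tag) {
        switch (tag) {
            case "positive":        return 1;
            case "strong_positive": return 2;
            case "negative":        return -1;
            case "strong_negative": return -2;
            default:                return 0;   // "neutral" or unknown
        }
    }

    // posTags: {"a"} for the ADJ variants, {"a","n","v","r"} for AllPOS;
    // defaultYes: class returned on a neutral (0) overall score.
    static String classify(String text, Lexicon swn, String[] posTags, boolean defaultYes) {
        int score = 0;
        Matcher m = Pattern.compile("[a-zA-Z0-9]+").matcher(text.toLowerCase());
        while (m.find())                      // 1. tokenize into alphanumeric strings
            for (String pos : posTags)        // 3. sum over the requested POS
                score += weight(swn.extract(m.group(), pos));
        if (score > 0) return "yes";          // 4. decide by the sign of the score
        if (score < 0) return "no";
        return defaultYes ? "yes" : "no";     // tie-break for neutral texts
    }
}
```

For instance, classifyADJN() corresponds to calling classify with posTags = {"a"} and defaultYes = false in this sketch.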

The rationale for the first parameter is that the 400 reviews are evenly balanced (50% per class), so it is not clear which default we should prefer. In an imbalanced problem, we could choose the most populated class. An alternative is analyzing SentiWordNet to check whether it is positively or negatively biased (that is, whether it contains more positive or more negative words), or even refining this with an additional corpus (counting words and weighting according to the frequencies of positive/negative words).

The rationale for the second parameter is that adjectives tend to be less ambiguous (setting aside sarcasm or irony), but it is easy to test with any other POS. Using all of them is incorrect (as every word has only one POS in context) but it is practical, and it will give more extreme scores (assuming that a negative word is negative with each of its possible POS).

Results and Analysis

So we are testing four approaches, and I will be using the same metrics as in the previous blog post on sentiment analysis with WEKA, that is, averaged F1 and accuracy (along with the Confusion Matrix itself). The test is performed over the 400 text documents in the dataset, as this algorithm requires no training. The following table shows the results I have obtained:

I have added to the table the two best performing configurations of the learning-based classifier presented in the previous blog post. However, the comparison is not 100% fair, as the learning approach was evaluated by 10-fold Cross-Validation -- which involves using the full dataset as a test set, but in 10%-sized batches.

All in all, the keyword-based (SentiWordNet) approach seems competitive (it beats many of the learning-based classifiers in my previous experiment), getting its best results when using only adjectives and outputting "no" for neutral scores. Its effectiveness on the "yes" class is better than that of the SVMs with 1-to-3-grams, in terms of recall. I believe that, with some adjustments, the keyword-based approach can be very competitive in this case, and it has the additional advantage of not relying on the quality or amount of training data.

Comparing the parameters, the default "no" is consistently better than the default "yes". Using all POS is worse than using only adjectives: even in the case of default "yes" (where ADJ is beaten by both ALL setups in terms of accuracy), the ADJ setup makes more balanced decisions -- the ALL setup leads to extremely positive scores and a clear bias towards the "yes" class.

Concluding Discussion

As discussed above, I consider this test a baseline because of the number of simple heuristics employed in the algorithm. Actually, there are a number of possible improvements, although some of them are not trivial. I tag them as [easy|hard] according to my experience in text mining. For instance:

  • Recognizing multiword expressions [easy]. This can be done by making simple searches for token n-grams in the SentiWordNet database, just modifying the SentiWordNetDemo class.
  • Using a validation dataset to optimize the score threshold [easy]. We have assumed that an overall score of 0 is neutral, and tested classifying it as positive or negative (the second option being better). We have general evidence that the database is positively oriented, so we could set a threshold above 0 (e.g. 10, 20...) for classifying a text as positive, in order to correct this effect. The simplest way of doing this is selecting 10% of the corpus as a validation set, sorting the decisions according to their score, and choosing the threshold that optimizes accuracy (or F1).
  • Testing different scoring models, e.g. modifying the SWN3.java program to output the original scores instead of tags [easy] and using those scores in the final polarity computation. Alternatively, we can play with different definitions of "strong_positive" etc. in terms of the weights [easy], or use different score thresholds for assigning the polarity labels in the database [easy]. This can be more difficult to test, but we can use a validation set as in the previous point.
  • Performing POS Tagging by using the majority tag [easy], coding a POS Tagger based on learning [hard], or using an existing off-the-shelf POS Tagger (like e.g. Freeling or CoreNLP) [easy]. After using a POS Tagger, the tags must be normalized or processed in order to retain the basic POS, as most POS Taggers make use of sophisticated tag sets that represent morphology and so on. Obviously, the algorithm should then be changed to search only for the appropriate POS tag.
  • Performing Word Sense Disambiguation by using the first sense [easy], coding a WSD system based on learning using a dataset like Semcor [hard], coding a WSD system based on dictionaries -- e.g. using the WordNet glosses in the database itself [easy], or using an existing off-the-shelf WSD system like e.g. SenseLearner [easy]. You may need to perform data transformations if different database versions are used for WSD and for sentiment analysis, and to align their formats.
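For instance, the validation-set threshold search from the second point above could look like this (a rough sketch; all names are mine, and I assume we already have the summed polarity score of each validation text):

```java
// Sketch of picking the neutral threshold on a validation set: try each
// candidate cut-off and keep the one maximizing accuracy. All names are mine;
// scores[i] is the summed polarity score of text i, labels[i] is "yes"/"no".
public class ThresholdTuner {

    static int bestThreshold(int[] scores, String[] labels, int lo, int hi) {
        int best = lo, bestCorrect = -1;
        for (int t = lo; t <= hi; t++) {           // candidate thresholds
            int correct = 0;
            for (int i = 0; i < scores.length; i++) {
                String pred = scores[i] > t ? "yes" : "no";
                if (pred.equals(labels[i])) correct++;
            }
            if (correct > bestCorrect) {           // keep the first best cut-off
                bestCorrect = correct;
                best = t;
            }
        }
        return best;
    }
}
```

The same loop could optimize F1 instead of accuracy by swapping the inner counting logic.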

I am not sure if I will be making any other, more exploratory tests with the keyword-based approach to sentiment analysis, as I want to keep my focus on WEKA features for text mining.

Anyway, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

4 comments:

Anonymous said...

Great post, thanks.

I'm new to all this sentiment analysis stuff, so please excuse the dumb question, but do you know why there is a discontinuity between 0.5 and 0.75 in SWN3? Also, if the score is 0.3 (say), won't it be assigned "positive" and then overridden as "weak_positive"? I appreciate this class is not your code.

Cheers,

Bewildered Bob

Anonymous said...

Sorry, with the 0.3 example, I missed the else if due to the formatting, so the assignment won't be overwritten. It still seems odd that weak_positive ranges from > 0 to >= 0.25 (i.e. including 0.6, which I assumed would be either positive or strong_positive, though I admit I don't understand the maths).

Bob

Jose Maria Gomez Hidalgo said...

Hi Bob

Thanks a lot for your comment.

I sincerely do not know the rationale behind the scores in SWN. I know that when downloading it, your help to tag synsets is requested, so there should be some statistics in there; however I have not traced them to the papers (where they may be explained, I guess).

Regarding the ranges, I understand you are referring to the SWN3.java example provided by the SWNers. For me, the ranges are clear in the sequence of if-thens. They should be:


[0.75,1.0] => strong_positive
(0.25,0.5] => positive
...


So you are right, there is a discontinuity that is not explained in the code. It seems odd, unless scores in the interval (0.5,0.75) are not reachable, according to the computation done some lines above the sequence of if-thens:


double score = 0.0;
double sum = 0.0;
for (int i = 0; i < v.size(); i++)
    score += ((double) 1 / (double) (i + 1)) * v.get(i);
for (int i = 1; i <= v.size(); i++)
    sum += (double) 1 / (double) i;
score /= sum;


Which in turn depends on what is stored in the arrays.

Please let me think about it for some days in order to give an explanation...

Thanks again

Jose Maria

Anonymous said...

Thanks Jose Maria.

Good idea to check whether the score is never reachable. I just ran the algorithm against SentiWordNet_3.0.0.txt. Among other examples of odd scores, I found a score of 0.625 is given for the weak_positive word "liveable" and a score of 0.125 is given for the uncategorised word "intuitively".

To me, it's curious that weak_positive words have a higher score than positive scores, but maybe that's a consequence of the equations used. All the weak_positive words I examined didn't seem particularly positive (as the category would suggest), but I guess looking at the words in isolation is invalid, as many of the words I checked had categories that jarred with my expectations - e.g. depravation being weak_positive.

Cheers,

Bob