27.8.09

NY Times: Opinion Mining in Social Networks, Twitter

I have read an excellent news story in the NY Times, Mining the Web for Feelings, Not Facts (via KDDNuggets), on what Opinion Mining/Sentiment Analysis is, how it applies to Social Networks, and its growing expected impact on Web search. Technically, it is a very good popular-science article, and it is worth the reading time even if you have a strong background on the topic.

It is not surprising that Twitter is considered a primary source of opinions. When I started a research project on related things some time ago, I quickly discovered that there are plenty of test collections in English, but far fewer in Spanish (this is a topic for other posts). I considered building a test collection in Spanish with Twitter search, which is surprisingly easy: type your product, brand, etc., collect tweets and evaluate them (a task for crowdsourcing, though). There is the problem of language, although language identification is an easy Natural Language Processing task: letter trigrams can give you over 99% accuracy on Western languages.
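
To make the trigram idea concrete, here is a minimal sketch in Python. It is only an illustration under strong assumptions: the two sample texts, the `identify` function and the cosine-similarity scoring are my own choices for the example, and real profiles would be built from large corpora per language, which is where figures like 99% come from.

```python
from collections import Counter
import math

def trigrams(text):
    """Count letter trigrams of a lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy training samples (hypothetical); use large corpora in practice.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and "
          "this is what people think about the new phone",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y "
          "esto es lo que la gente piensa del nuevo teléfono",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Return the language whose trigram profile is closest to the text."""
    query = trigrams(text)
    return max(PROFILES, key=lambda lang: cosine(query, PROFILES[lang]))

if __name__ == "__main__":
    for tweet in ("I think the new phone is great",
                  "creo que el nuevo teléfono es genial"):
        print(identify(tweet), "<-", tweet)
```

Adding a language means adding one more sample text to `SAMPLES`; closely related languages (Spanish, Portuguese, Italian) need much larger samples to be told apart reliably.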

So I was not surprised to find that there are already applications doing this (as you can read in the article): search engines for Twitter with tweet polarity analysis. I have searched for "Twitter" in them, with the following results ("+" means positive, "=" means neutral, "-" means negative) and screenshots:

Tweetfeel: + 65%, - 35% (over 350 tweets).

Twendz: + 34%, = 49%, - 17% (over an unknown number of tweets).

Twitrratr: + 11.50%, = 85.55%, - 2.95% (over 35561 tweets).

Since the tweets vary widely (even though they were searched at the very same moment), and the techniques and criteria differ as well (e.g. there is no neutral class at Tweetfeel), it is not strange that the results are quite different. There are some hints in the article, but I recommend following the links in my previous post about the Opinion Mining tutorial by Bing Liu at WWWC 2008 to get better informed about the techniques.

5 comments:

Mariana Soffer said...

Nice post, I have several points regarding it.
1. I know that with trigrams you can get 99 percent accuracy in language identification. Nevertheless, if you do a Twitter search with the Twitter API, which allows you to specify the language in which you are searching, around 15 percent of the tweets are in the wrong language. I tried it a lot in Spanish and I got lots of tweets in Portuguese and in Italian as well.
2. I am surprised that Twitter is the main place used for opinion mining, because, as you can see, the polarity results are often wrong. I would say that they have very low accuracy: I wrote a program myself, a very simple one indeed, and my accuracy was better than that of the 3 new websites that deal with this subject. The problem with analyzing tweets, anyway, is that you have very little context, and the meaning of the text can change greatly depending on it, so it is almost impossible to evaluate something if you do not know what field you are dealing with. For example, take the expressions from the famous paper where it says the beer is hot, which is indeed a criticism, and the wine is old, which is a compliment; for most of the other products you qualify with those words, the result must be the opposite, because being old is generally considered a bad thing.
If you want to check, I give a brief introduction to this in my last post. It is really basic, but it might be useful for those who do not know anything about the subject.
http://singyourownlullaby.blogspot.com/2009/08/opinion-mining-and-sentiment-analysis.html

Cheers nihil
P.S.: are you into Spanish opinion mining yourself?

Jose Maria Gomez Hidalgo said...

Hi Mariana (again)

Regarding concern #1, I see it, and what I suggest is to perform language identification yourself instead of relying on the Twitter language feature. That is, Twitter probably bases language identification on the language the user states in their preferences and so on. I propose building your own trigram model and ignoring the Twitter language features.

Regarding #2, that is the fact. How many apps have you seen that publicly do opinion mining on other networks? People work on Twitter because it is open; however, it is very hard because of the lack of context. In that sense, you are absolutely right. For clearer opinions, I will try going through Facebook to brand sites/users (e.g. Burger King) and collecting opinions about posts announcing new services, products and so on.

Finally, yes, I am working on Opinion Mining for a work-related project. I cannot release details about it, but you could guess the applications from a previous post on this blog.

Jose Maria Gomez Hidalgo said...

Besides, I read your post and it looks nice :-)

Mariana Soffer said...

Hi Jose, thanks a lot for your responses to my 2 comments on the new opinion mining software.

1. I already did what you told me about using n-grams to recognize the language and it works perfectly well, but I was surprised that it did not work properly in the online applications.

2. You are completely right about the reason why they mostly use Twitter. I implemented opinion mining on blogs as well, and started to analyze discussion lists too. I do not find blogs to be very complex if you have a certain expertise with search engines, because we had to implement a crawler and a blog ranker (EigenRumors) for this, which is not easy at all to do, and also to extract the relevant textual content from the pages before we could analyze the text.

Great advice about searching for information on Facebook; it did not occur to me to do it that way.

I could infer from your previous post that you are working on the FlaxSentiment module of the Flax software. It seems to be extremely interesting, but there is no information online about it.
Do you mind if we chat a little bit about strategies for doing this? And by any chance, are you doing it in Spanish? Because I am, though not only in Spanish.

Also, thank you very much for the compliment, Jose; you are very kind.

Jose Maria Gomez Hidalgo said...

Certainly, the easiest way to collect "good" opinions (with context, properly categorized as good, bad, neutral, maybe scaled) is not Twitter but Internet forums, review sites like ciao.es, and of course, blogs like xataka.com. Facebook and real Social Networks are challenging because the API usually prevents mass data collection. This is why I suggest going through brand/product "users"/"groups", and that is what I am planning after a trial through Twitter as a proof of concept, because SNs are the focus of my work for a Spanish OM project, which is not for Flax.

As for what we shared some time ago about using Freeling, SentimentWordnet, and so on: they are still on my "pending" list, since I will first test the classical bag of words and put it into an architecture for my application; then I can go for harder things. Besides, my deadline for the project is the end of the year, so I have to hurry up! :-)
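
For readers who have not seen it, the "classical bag of words" baseline could look roughly like the sketch below: a multinomial Naive Bayes classifier over word counts with add-one smoothing. Everything here is an assumption made for illustration (the `BagOfWordsClassifier` class, the four invented training sentences, the "+"/"-" labels borrowed from the post); it is not the code of the project.

```python
from collections import Counter, defaultdict
import math
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

class BagOfWordsClassifier:
    """Multinomial Naive Bayes over word counts, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of texts
        self.vocabulary = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        scores = {}
        for label, docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            score = math.log(docs / total_docs)  # log prior
            for token in tokens:                 # log likelihoods
                count = self.word_counts[label][token] + 1
                score += math.log(count / denominator)
            scores[label] = score
        return max(scores, key=scores.get)

# Invented toy examples; a real collection needs manually labeled tweets.
TRAINING = [
    ("I love this phone, it is great", "+"),
    ("great service and a nice product", "+"),
    ("I hate this phone, it is awful", "-"),
    ("awful service and a bad product", "-"),
]

classifier = BagOfWordsClassifier()
classifier.train(TRAINING)

if __name__ == "__main__":
    for tweet in ("what a great product", "awful, I hate it"):
        print(classifier.classify(tweet), "<-", tweet)
```

A baseline like this ignores word order and domain, so it will miss cases such as "hot beer" versus "old wine" mentioned above; that is part of what the harder, resource-based approaches are meant to address.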