While browsing the literature on opinion mining and sentiment analysis, I discovered that Bing Liu, a leading reference on the topic, has posted his tutorial from the World Wide Web Conference 2008 on his home page. While the survey by Bo Pang and Lillian Lee for Foundations and Trends in Information Retrieval is a must, the tutorial is (obviously) easier to read. My opinion :-D is to read the tutorial slides first, then the survey.
A good point about Bing Liu's material is that he has posted data on the page of his Opinion Mining, Sentiment Analysis, and Opinion Spam Detection project. The data consists of (short) customer reviews of several products such as cameras, routers, cell phones and so on. This is an extract from the Linksys router opinions:
router[+2]##This router does everything that it is supposed to do, so i dont really know how to talk that bad about it.
setup[+2], installation[+2] ##It was a very quick setup and installation, in fact the disc that it comes with pretty much makes sure you cant mess it up.
install[+3]##By no means do you have to be a tech junkie to be able to install it, just be able to put a CD in the computer and it tells you what to do.
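Judging only from the three lines above, each record seems to hold a comma-separated list of product features with a signed strength in brackets, then `##`, then the review sentence. A minimal Python sketch for reading such lines could look like this; the function name `parse_review_line` is my own, the format is inferred from the extract, and any other markers the full data set may contain are simply not handled:

```python
import re

# Matches one feature annotation such as "setup[+2]" or "installation[-1]".
ASPECT_RE = re.compile(r"\s*([^,\[\]]+?)\s*\[([+-]\d+)\]")


def parse_review_line(line):
    """Split 'feature[+n], feature[-m]##sentence' into (features, sentence)."""
    annotations, separator, sentence = line.partition("##")
    if not separator:
        # No '##' found: treat the whole line as unannotated text.
        return [], line.strip()
    features = [(name, int(score)) for name, score in ASPECT_RE.findall(annotations)]
    return features, sentence.strip()


if __name__ == "__main__":
    example = "setup[+2], installation[+2] ##It was a very quick setup and installation."
    print(parse_review_line(example))
```

The list of (feature, score) pairs can then be turned into a sentence-level label, for instance by summing the scores, although that aggregation is a design choice and not something the data prescribes.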
Besides, Bing has worked on opinion spam!
As soon as I find the time, I will try to write a tutorial on how to train and test a baseline opinion classifier with WEKA.
10 comments:
Hi, my name is Mariana, I am from Argentina, and I am now working on opinion mining. I have a pretty good background in NLP and machine learning, and I also write my own algorithms, mostly in Python. I am interested in sharing information with you (I hope you are too). What I need now is some advice about which strategy to adopt: using an already tagged ontology or group of synsets, or using unsupervised learning. Bear in mind that I need to evaluate opinions from people living in every Spanish-speaking country.
Do you know of any tagged opinion mining corpus for the Spanish language?
Thank you very much for your time
Best regards
MS
Hi, Mariana
Regarding the corpus, Fermín Cruz (Universidad de Sevilla) has made public a corpus of film reviews in Spanish, available at his home page: http://www.lsi.us.es/~fermin/index.php/Main_Page. He has used it in several papers.
Regarding the techniques, I have not worked on the topic yet. My suggestion is to start with a simple statistical analysis as a baseline (you know: using word stems and no other lexical resources, binary/TF*IDF weights on a Vector Space Model, that is, a bag of words, and simple learning algorithms: Bayes, SVM, C4.5, etc.; all this can be provided by WEKA). Then I would identify the hard examples (those that are misclassified, or for which we have a very small margin) and then consider using specialized resources: a polarity lexicon (words with polarity tags assigned: pretty is good, hot is good, damn is bad, etc.), some parsing (FreeLing can provide that), perhaps the Spanish WordNet (you know, pretty small), etc.
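To make the baseline idea concrete, here is a minimal sketch in plain Python rather than WEKA: a bag-of-words multinomial Naive Bayes classifier with Laplace smoothing. It skips stemming and TF*IDF weighting, the function names are my own, and the four training sentences are invented toy data, so take it as an illustration of the approach and not as a tested recipe:

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split into word tokens (no stemming in this sketch)."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())


def train_naive_bayes(labeled_docs):
    """Collect the counts needed by a multinomial Naive Bayes model."""
    class_docs = Counter()               # number of documents per class
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, word_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-probability (Laplace smoothing)."""
    class_docs, word_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    # Words never seen in training carry no evidence, so they are dropped.
    tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for illustration only.
TRAIN = [
    ("very quick setup and easy installation", "pos"),
    ("this router does everything it is supposed to do", "pos"),
    ("the connection drops all the time", "neg"),
    ("terrible support and a broken installation disc", "neg"),
]
model = train_naive_bayes(TRAIN)

if __name__ == "__main__":
    print(classify(model, "quick and easy setup"))
```

The examples that such a model gets wrong, or decides by a narrow difference in score, are the hard cases worth inspecting before bringing in a lexicon or a parser.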
Thank you very much, it is really kind of you. I am impressed, because everything you mention is what I am already working with. I downloaded the movie corpus, with the 5 files per movie. I installed FreeLing, although with the SVM, not the Bayes, but I guess I should start with that. I built the bag of words with TF-IDF weights, to cluster them. I have used WEKA, and Python NLTK before. As for polarity, I was examining SentiWordNet (I think that is what it is called), the one with the colored triangles showing 3 values for the synsets.
But my problem is that I guess I need to train on the Spanish corpus to add polarity to the words. If so, what method should I use for that? That is the part I am rather confused about. I have lots of doubts. Please let me know if I can help you in any way. I will write to you soon, if you do not mind.
Mariana
To be honest, I have never used SentiWordNet; maybe it is worth a try. About polarity, what I suggest is to detect the polarity of words by learning. I would apply FreeLing, feed WEKA with syntactic groups (NPs, prepositional phrases, and so on), get the phrases with the highest Information Gain, and take a look at them.
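As a rough illustration of the ranking step, this Python sketch computes Information Gain for the presence or absence of a phrase over labeled documents. It assumes the phrases have already been extracted (by FreeLing or any other chunker) into one set per document; the function names and the four toy documents are invented for the example:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, phrase):
    """Entropy reduction obtained by splitting docs on presence of the phrase."""
    labels = [label for _, label in docs]
    with_phrase = [label for phrases, label in docs if phrase in phrases]
    without_phrase = [label for phrases, label in docs if phrase not in phrases]
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    remainder = sum(
        (len(part) / len(docs)) * entropy(part)
        for part in (with_phrase, without_phrase)
        if part
    )
    return entropy(labels) - remainder


# Toy documents as (set of phrases, polarity label), invented for illustration.
DOCS = [
    ({"quick setup", "the disc"}, "pos"),
    ({"quick setup", "the router"}, "pos"),
    ({"the router", "bad signal"}, "neg"),
    ({"the disc", "bad signal"}, "neg"),
]

if __name__ == "__main__":
    all_phrases = set().union(*(phrases for phrases, _ in DOCS))
    for phrase in sorted(all_phrases, key=lambda p: information_gain(DOCS, p), reverse=True):
        print(f"{phrase}: {information_gain(DOCS, phrase):.3f}")
```

Information Gain only says that a phrase separates the classes, not in which direction, so the top-ranked phrases still need a look to decide whether they are positive or negative.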
I do not know whether it is possible to get polarity from a "Spanish" SentiWordNet, and I do not know whether there is any polarity lexicon for Spanish out there. May I suggest taking a look at what Fermín has written?
Besides, I have taken a look at your LinkedIn profile, blog, and so on: pretty brilliant (I must say!).
Do try SentiWordNet, it will take you only 1 minute. Click the following link: http://sentiwordnet.isti.cnr.it/browse/sentiwn.pl?word=nice&show=position
There you have an online opinion analyzer, just so you get an idea of how it works.
Thanks for your advice about talking to Fermín; I will definitely talk to him soon. I have been doing some research on Spanish opinion mining. If you are interested, I can send you the papers I found on the subject.
Thank you very much for your compliments, I would love to stay in touch with you.
I have had no time yet, but I will give it a try!
Feel free to contact me!
Thanks for the reply, and do not worry about the time, I am swamped too. Now I need to implement a crawler for all of Latin America. I am going with Hounder, which is based on Nutch. Do you know who could give me some information about it, or where I can find it?
Thank you very much for your time.
M
To be honest, I have not used crawlers apart from those I developed for some of my oldest projects :-(
I believe that Ricardo Baeza-Yates and Nivio Ziviani have things that are specialized for the Latin American area. Ricardo is hard to contact, as he is the VP of Yahoo! Research for Latin America, Europe and Israel! Nivio may be easier to contact...
A short comment: I spent all Friday with the guys from Flaptor, the ones who made Hounder. They are amazing, and I learned a lot. They are working on crawling the whole Latin American web. If you ever need something in that area, I strongly recommend them: check their website, their products, and their experimental products. They are the ones who made the first blog searcher, the one for WordPress.
Just letting you know
Take care
M
I have browsed their web page for a while, and it looks very nice! Plenty of challenging demos!
I have tested the tag recommender (you know, with a history in text categorization, that was the test to do) with the text of this very post, and these are the tags: install, tutorial, opensuse, gutsy gibbon, setup, citizen journalism, survey, opinion, methodology, web analytics. They are all given scores between 5 and 8, and of course, Autotagger has not been trained on my own tags! Nice anyway.
Very good recommendation, thanks!!!