tag:blogger.com,1999:blog-365893032024-03-13T18:33:17.571+01:00Nihil Obstat<b>Blog de/by José María Gómez Hidalgo</b><br><br>
Mis reflexiones sobre tecnología e Internet, seguridad e inteligencia artificial<br>
My opinions about technology, Internet, security and Artificial IntelligenceJose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.comBlogger390125tag:blogger.com,1999:blog-36589303.post-38338332968815317712014-10-10T11:23:00.001+02:002014-10-10T11:23:49.102+02:00Carlos Laorden nominated for "Born To Be Discovery" for Negobot<a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a>, PhD in Information Systems from the University of Deusto, and a colleague and friend from <a href="http://www.deustotech.deusto.es/" target="_blank">DeustoTech</a>, has been nominated in the Science and Technology category of the "<a href="http://borntobediscovery.discoverymax.es/" target="_blank">Born to be Discovery</a>" awards for the anti-paedophile bot NEGOBOT. I have already voted for him. Will you?Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-42154637942777434342014-05-21T11:49:00.001+02:002014-05-21T11:58:16.678+02:00WEKA Text Mining Trick: Copying Options from the Explorer to the Command Line<p>From previous posts (especially <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">Command Line Functions for Text Mining in WEKA</a>), you may know that writing command-line calls to WEKA can be far from trivial, mostly because you may need to nest <tt><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/attributeSelection/AttributeSelection.html" target="_blank">AttributeSelection</a></tt> and a classifier into a single command with plenty of options -- <em>and nested strings with escaped characters</em>.</p>
<p>For instance, consider the following need: I want to test the classifier J48 on the <tt><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></tt> file, which contains pairs of <tt>{class,text}</tt> lines. However, I want to:</p>
<ul>
<li>Apply <tt>StringToWordVector</tt> with specific options: lowercased tokens, specific string delimiters, etc.</li>
<li>Get only those words with Information Gain over zero, which implies using the filter <tt>AttributeSelection</tt> with <tt>InfoGainAttributeEval</tt> and <tt>Ranker</tt> with threshold <tt>0.0</tt>.</li>
<li>Make use of 10-fold cross-validation, which implies using <tt>FilteredClassifier</tt> so that the filters are applied within each fold; and as I have two filters (<tt>StringToWordVector</tt> and <tt>AttributeSelection</tt>), I need to make use of <tt>MultiFilter</tt> as well.</li>
</ul>
<p>With some experience, this is not too hard to do by hand. However, it is much easier to configure your test in the WEKA Explorer, make a quick test with a very small subset of your dataset, then copy the configuration to a text file and edit it to fully fit your needs. For this specific example, I start by loading the dataset at the Preprocess tab, and then I configure the classifier by:</p>
<ol>
<li>Choosing <tt>FilteredClassifier</tt>, and <tt>J48</tt> as the classifier.</li>
<li>Choosing <tt>MultiFilter</tt> as the filter, then deleting the default <tt>AllFilter</tt> and adding <tt>StringToWordVector</tt> and <tt>AttributeSelection</tt> filters to it.</li>
<li>Editing the <tt>StringToWordVector</tt> filter to specify lowercased tokens, no per-class operation, and my list of delimiters.</li>
<li>Editing the <tt>AttributeSelection</tt> filter to choose <tt>InfoGainAttributeEval</tt> as the evaluator, and <tt>Ranker</tt> with threshold <tt>0.0</tt> as the search method.</li>
</ol>
<p>Here is a screenshot taken in the middle of the process, while editing the <tt>StringToWordVector</tt> filter:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh5.googleusercontent.com/-Z2yipUR1JLg/U3xghaWvjwI/AAAAAAAACFE/YIyV4KjZS5U/w644-h595-no/weka.explorer.configure.process.png" style="WIDTH: 550px; DISPLAY: inline; HEIGHT: 507px" height="507" width="550"/></p>
<p>Then you can specify <tt>spamclass</tt> as the class and run it to get something like:</p>
<p><tt>=== Run information ===
<br/>
Scheme: weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2
<br/>
<br/>
Relation: sms_test
<br/>
Instances: 200
<br/>
Attributes: 2 spamclass text
<br/>
Test mode: 10-fold cross-validation
<br/>
(../..)
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
16 17 | a = spam
<br/>
6 161 | b = ham</tt></p>
<p>As you can see, the <tt>Scheme</tt> line gives us the exact command options we need to get that result! You can just copy and edit it (after saving the result buffer) to get what you want. Alternatively, you can right-click on the classifier configuration in the Explorer, as in the following picture:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh4.googleusercontent.com/-CwMOzx3uh38/U3xii8GgYeI/AAAAAAAACFc/jG9S8btfGho/w796-h595-no/weka.explorer.cut.options.png" style="WIDTH: 550px; DISPLAY: inline; HEIGHT: 411px" height="411" width="550"/></p>
<p>In any case, you get the following messy thing:</p>
<p><tt>weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2</tt></p>
<p>Then you can strip the options you do not need. For instance, some default options in <tt>StringToWordVector</tt> are <tt>-R first-last</tt>, <tt>-prune-rate -1.0</tt>, <tt>-N 0</tt>, the stemmer, etc. You can find the default options by issuing the help command:</p>
<p><tt>$>java weka.filters.unsupervised.attribute.StringToWordVector -h
<br/>
Help requested.
<br/>
<br/>
Filter options:
<br/>
-C
<br/>
Output word counts rather than boolean word presence.
<br/>
-R <index1,index2-index4,...>
<br/>
Specify list of string attributes to convert to words (as weka Range).
<br/>
(default: select all string attributes)
<br/>
...</tt></p>
<p>So after cleaning the default options (in all filters and the classifier), adding the dataset file and the class index (<tt>-t smsspam.small.arff -c 1</tt>), and with some pretty printing for clarity, you can easily build the following command:</p>
<p><tt>java weka.classifiers.meta.FilteredClassifier
<br/>
-c 1
<br/>
-t smsspam.small.arff
<br/>
-F "weka.filters.MultiFilter
<br/>
-F \"weka.filters.unsupervised.attribute.StringToWordVector
<br/>
-W 100000
<br/>
-L
<br/>
-O
<br/>
-tokenizer \\\"weka.core.tokenizers.WordTokenizer
<br/>
-delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\"
<br/>
-F \"weka.filters.supervised.attribute.AttributeSelection
<br/>
-E \\\"weka.attributeSelection.InfoGainAttributeEval \\\"
<br/>
-S \\\"weka.attributeSelection.Ranker -T 0.0 \\\"\""
<br/>
-W weka.classifiers.trees.J48</tt></p>
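<p>By the way, the nested quoting follows a mechanical rule: every time a scheme becomes an option value of an outer scheme, it is wrapped in double quotes and its existing quotes and backslashes are backslash-escaped, one extra level per nesting depth. Here is a hedged sketch in Python (the helpers <tt>weka_nest</tt> and <tt>weka_quote</tt> are illustrative names of mine, not part of WEKA) that rebuilds a simplified version of the command programmatically:</p>

```python
# Illustrative sketch (not a WEKA API): build a nested WEKA command string
# by applying one level of quote/backslash escaping per nesting depth.
def weka_nest(scheme, *options):
    """Join a scheme class name and its options into a single string."""
    return " ".join([scheme, *options])

def weka_quote(inner):
    """Wrap an inner scheme so it can be passed as one quoted option value
    of an outer scheme: escape backslashes first, then double quotes."""
    return '"' + inner.replace("\\", "\\\\").replace('"', '\\"') + '"'

s2wv = weka_nest("weka.filters.unsupervised.attribute.StringToWordVector",
                 "-W 100000", "-L", "-O")
attsel = weka_nest("weka.filters.supervised.attribute.AttributeSelection",
                   "-E", weka_quote("weka.attributeSelection.InfoGainAttributeEval"),
                   "-S", weka_quote("weka.attributeSelection.Ranker -T 0.0"))
multi = weka_nest("weka.filters.MultiFilter",
                  "-F", weka_quote(s2wv),
                  "-F", weka_quote(attsel))
cmd = weka_nest("java weka.classifiers.meta.FilteredClassifier",
                "-c 1", "-t smsspam.small.arff",
                "-F", weka_quote(multi),
                "-W weka.classifiers.trees.J48")
print(cmd)
```

<p>Schemes quoted at the first level come out as <tt>\"...\"</tt> and those three levels deep as <tt>\\\"...\\\"</tt>, which is exactly the escaping pattern you see in the <tt>Scheme</tt> line.</p>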
<p>Now you can change any other parameters you want, to test other text representations, classifiers, and so on, without having to work out the escaping of options and delimiters by hand.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-63130817043783542292014-01-30T12:43:00.001+01:002014-01-30T12:43:40.636+01:00CFP: Sixth International Conference on Social Informatics<p>The <a href="http://socinfo2014.org/" target="_blank">Sixth International Conference on Social Informatics</a> (SocInfo 2014) will take place in Barcelona, Spain, from November 10th to November 13th. The ultimate goal of Social Informatics is to create a better understanding of socially-centric platforms, not just as a technology, but also as a set of social phenomena. To that end, the organizers invite interdisciplinary papers on applying information technology in the study of social phenomena, on applying social concepts in the design of information systems, on applying methods from the social sciences in the study of social computing and information systems, on applying computational algorithms to facilitate the study of social systems and human social dynamics, and on designing information and communication technologies that consider social context.</p>
<p><strong>Important dates</strong></p>
<ul>
<li>
<div>Full paper submission: August 8, 2014 (23:59 Hawaii Standard Time)</div>
</li>
<li>
<div>Notification of acceptance: October 3, 2014</div>
</li>
<li>
<div>Submission of final version: October 10, 2014</div>
</li>
<li>
<div>Conference dates: November 10-13, 2014</div>
</li>
</ul>
<p><strong>Topics</strong></p>
<ul>
<li>
<div>New theories, methods and objectives in computational social science</div>
</li>
<li>
<div>Computational models of social phenomena and social simulation</div>
</li>
<li>
<div>Social behavior modeling</div>
</li>
<li>
<div>Social communities: discovery, evolution, analysis, and applications</div>
</li>
<li>
<div>Dynamics of social collaborative systems</div>
</li>
<li>
<div>Social network analysis and mining</div>
</li>
<li>
<div>Mining social big data</div>
</li>
<li>
<div>Social Influence and social contagion</div>
</li>
<li>
<div>Web mining and its social interpretations</div>
</li>
<li>
<div>Quantifying offline phenomena through online data</div>
</li>
<li>
<div>Rich representations of social ties</div>
</li>
<li>
<div>Security, privacy, trust, reputation, and incentive issues</div>
</li>
<li>
<div>Opinion mining and social media analytics</div>
</li>
<li>
<div>Credibility of online content</div>
</li>
<li>
<div>Algorithms and protocols inspired by human societies</div>
</li>
<li>
<div>Mechanisms for providing fairness in information systems</div>
</li>
<li>
<div>Social choice mechanisms in the e-society</div>
</li>
<li>
<div>Social applications of the semantic Web</div>
</li>
<li>
<div>Social system design and architectures</div>
</li>
<li>
<div>Virtual communities (e.g., open-source, multiplayer gaming, etc.)</div>
</li>
<li>
<div>Impact of technology on socio-economic, security, defense aspects</div>
</li>
<li>
<div>Real-time analysis or visualization of social phenomena and social graphs</div>
</li>
<li>
<div>Socio-economic systems and applications</div>
</li>
<li>
<div>Collective intelligence and social cognition</div>
</li>
</ul>
<p>My friend <a href="http://boldi.di.unimi.it/" target="_blank">Paolo Boldi</a> is on the organizing committee.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-56639434835643248422013-08-23T11:47:00.001+02:002013-08-23T18:58:38.084+02:00Data Mining for Political Elections, and Isaac Asimov<p>Using Data Mining, Data Science and Big Data is cool in political elections, and in political decision-making. Well, maybe not cool, but it has been a trending topic in Data Science in recent years.</p>
<p>Here are some examples:</p>
<ul>
<li>
<div><a href="http://www.computerworld.com/s/article/9232567/Campaign_2012_Mining_for_voters" target="_blank">Campaign 2012: Mining for voters. Data-driven campaigning goes mainstream</a>.</div>
</li>
<li>
<div><a href="http://analytics.blogspot.com.es/2013/08/obama-for-america-uses-google-analytics.html" target="_blank">Obama for America uses Google Analytics to democratize rapid, data-driven decision making</a>.</div>
</li>
<li>
<div><a href="http://www.computerworld.com/s/article/9233587/Barack_Obama_39_s_Big_Data_won_the_US_election" target="_blank">Barack Obama's Big Data won the US election</a>.</div>
</li>
</ul>
<p>From the research point of view, you can check, for instance, how Twitter information is used in political campaigns in this <a href="https://sites.google.com/site/twitterandtherealworld/home" target="_blank">Twitter and the Real World CIKM'13 Tutorial</a> by Ingmar Weber and Yelena Mejova. It includes an interesting list of references on several ways of using Twitter to predict user political orientation, general public trends, and more. On the opposite side, you can find an interesting paper which provides sound criticism of some of the research performed on Twitter and politics: <a href="http://arxiv.org/pdf/1204.6441v1.pdf" target="_blank">"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper": A Balanced Survey on Election Prediction using Twitter Data</a>, by Daniel Gayo-Avello.</p>
<p>Anyway, it should be clear from multiple points of view that governments (e.g. the <a href="http://en.wikipedia.org/wiki/PRISM_(surveillance_program)" target="_blank">NSA PRISM case</a>) and politicians are collecting and using citizen data in order to predict their tastes and to guide their decisions and actions in political campaigns.</p>
<p>I will avoid the privacy discussion here, as I want to make a case for something different. My point is: <strong><em>Hey, if they can predict election results, then why vote?</em></strong></p>
<p>But my blog is not a political one; it should be a technical one - or at least, a technically-focused one. And like many computer geeks, I am a sci-fi fan. And since one of the greatest authors is <a href="http://en.wikipedia.org/wiki/Isaac_Asimov" target="_blank">Isaac Asimov</a>, I have read a lot of his work.</p>
<p><em><strong>What does Asimov have to do with data mining in politics?</strong></em> Well, <em><strong>he predicted it</strong></em>.</p>
<p>More precisely, he predicted <em><strong>how elections may evolve in the Era of Big Data</strong></em>. And he answered my question. <strong><em>You will not vote</em></strong>.</p>
<p>Asimov used to publish short stories in sci-fi magazines (as many others did, I know). In August 1955, he published a short story titled "<strong><a href="http://en.wikipedia.org/wiki/Franchise_(short_story)" target="_blank">Franchise</a></strong>" in the magazine "If: Worlds of Science Fiction". I read that story many years later, reprinted in one of his short story collections. I was young, and I liked the story, but not too much - there were others in the volume more appealing to my taste. However, I have revisited it recently, and in the light of my technical background, things have changed.</p>
<p>That is <em>real</em> scifi. He technically predicted the future. And it is happening.</p>
<p>The plot is simple; just let me quote the Wikipedia article:</p>
<blockquote>
<p>In the future, the United States has converted to an "electronic democracy" where the computer Multivac selects a single person to answer a number of questions. Multivac will then use the answers and other data to determine what the results of an election would be, avoiding the need for an actual election to be held.</p>
</blockquote>
<p>As the Big Data platform (the computer Multivac in the story) gets to know more and more about the citizens, it will need less and less information to accurately predict election results. The problem is reduced to asking a list of (quite <a href="http://en.wikipedia.org/wiki/Sentiment_analysis" target="_blank">Sentiment Analysis</a> related) questions to a single citizen, selected as representative, in order to refine some details, and that's it.</p>
<p><strong><em>Do not blame him, nor me. It is just happening.</em></strong></p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>
<p><strong>Update 1:</strong> Yet another example: <a href="http://www.newscientist.com/article/mg21929315.500-twitter-hashtags-predict-rising-tension-in-egypt.html" target="_blank">Twitter hashtags predict rising tension in Egypt</a>.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-6776958319527637272013-07-27T17:06:00.001+02:002013-07-27T17:06:43.681+02:00More Clever Tokenization of Spanish Text in Social Networks<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/spanish.tokenizer.header.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 252px" height="252" width="450"/></p>
<p>Text written by users in Social Networks is noisy: emoticons, chat codes, typos, grammar mistakes, and moreover, explicit noise created by users as a style, trend or fashion. Consider the following utterance, taken from a post in the social network <a href="https://www.tuenti.com/" target="_blank">Tuenti</a>:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"felicidadees!! k t lo pases muy bien!! =)Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :DFeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) (heart)"</p>
</blockquote>
<p>This is a real text. Its approximate translation to English would be something like:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"happybirthdaay!! njy it lot!! =)Haaapyyybirthdaaayyy !! (:Happybirthdayyy!!have a great timeee :DHappyyBiirtHdayY :D Enjy! ;) (heart)"</p>
</blockquote>
<p>The last word, in parentheses, is a Tuenti code that is displayed as a heart.</p>
<p>If you want to find more text like this out there, just point your browser to <a href="http://www.fotolog.com/" target="_blank">Fotolog</a>.</p>
<p>As you can imagine, just tokenizing this kind of text for further analysis is quite a headache. During our experiments for the project <a href="http://wendy.optenet.com/" target="_blank">WENDY</a> (link in Spanish), we have designed a relatively simple tokenization algorithm to deal with this kind of text for age prediction. Although the method is designed for Spanish, it is quite language-independent and may well be applied to other languages, although we have not tested this yet. The algorithm is as follows:</p>
<ol>
<li>Separate the initial string into candidate tokens using white spaces.</li>
<li>A candidate token can be:</li>
<li style="list-style: none">
<ol>
<li>A proper sequence of alphabetic characters (a potential word), or a proper sequence of punctuation symbols (a potential emoticon). In this case, the candidate token is already considered a token.</li>
<li>A mixed sequence of alphabetic characters and punctuation symbols. In this case, the character sequence is divided into sequences of alphabetic characters and sequences of punctuation symbols. For instance, "Hola:-)ketal" is further divided into "Hola", ":-)", and "ketal".</li>
</ol>
</li>
</ol>
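<p>The two steps above can be sketched in a few lines of Python. This is a minimal illustrative re-implementation of mine, not the actual WENDY code, using a Unicode-aware regular expression so that accented Spanish characters count as letters:</p>

```python
import re

# Step 1: split on white space. Step 2: break each mixed candidate token
# into runs of alphabetic characters (potential words), runs of
# punctuation (potential emoticons), or runs of digits.
def tokenize(text):
    tokens = []
    for candidate in text.split():
        # [^\W\d_]+ matches letter runs (Unicode-aware, so accented
        # characters count); [^\w\s]+ matches punctuation runs.
        tokens.extend(re.findall(r"[^\W\d_]+|[^\w\s]+|\d+", candidate))
    return tokens

print(tokenize("Hola:-)ketal"))  # → ['Hola', ':-)', 'ketal']
```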
<p>For instance, consider the following (real) text utterance:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"Felicidades LauraHey, felicidades! ^^felicidiadeees;DFelicidades!Un beso! FELIZIDADESS LAURIIIIIIIIIIIIII (LL)felicidadeeeeeees! :D jajaja mira mi tablonme meo jajajajajjajate quiero(:,"</p>
</blockquote>
<p>The output of our algorithm is the list of tokens in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/spanish.tokenizer.example.png" style="WIDTH: 320px; DISPLAY: inline; HEIGHT: 224px" height="224" width="320"/></p>
<p>We have evaluated this algorithm both directly and indirectly. The direct evaluation consists of comparing how many hits we get with a space-only tokenizer and with our tokenizer, against a Spanish dictionary and an SMS-language dictionary. The more hits, the better the words are recognized. On average, per text utterance (comment), we find about 9.5 more words in the Spanish dictionary with our tokenizer, and about 1.13 more words in the SMS-language dictionary.</p>
<p>The indirect evaluation is performed by plugging the algorithm into the full pipeline of the WENDY age recognition system. The new tokenizer increases the accuracy of the age recognition system from 0.768 to 0.770, which may seem marginal except that it accounts for 206 new hits in our collection of Tuenti comments. The new tokenizer also provides relatively important gains in recall and precision for the most under-represented but most critical class, that of users under 14.</p>
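<p>The direct evaluation can be sketched as a toy comparison. This is an illustrative sketch only (the mini-dictionary and the tokenizers here are mine, not the real Spanish/SMS lexicons or the WENDY code): count dictionary hits for a whitespace-only tokenizer versus a tokenizer that splits mixed runs of letters and punctuation.</p>

```python
import re

# Toy dictionary standing in for the real Spanish lexicon (assumption).
DICTIONARY = {"hola", "que", "tal", "felicidades"}

def space_tokens(text):
    """Baseline: split on white space only."""
    return text.split()

def mixed_tokens(text):
    """Also break mixed runs into letter runs and punctuation runs."""
    out = []
    for chunk in text.split():
        out.extend(re.findall(r"[^\W\d_]+|[^\w\s]+|\d+", chunk))
    return out

def hits(tokens):
    """Count tokens found in the dictionary (case-insensitive)."""
    return sum(t.lower() in DICTIONARY for t in tokens)

text = "Hola:-)ketal felicidades!!"
print(hits(space_tokens(text)), hits(mixed_tokens(text)))  # → 0 2
```

<p>The better tokenizer recovers dictionary words that the space-only baseline misses because they are glued to emoticons or punctuation, which is exactly what the 9.5-words-per-comment difference above measures at scale.</p>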
<p>This is the reference of the paper which details the tokenizer, the experiments, and the context of the WENDY project, in Spanish:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>José María Gómez Hidalgo, Andrés Alfonso Caurcel Díaz, Yovan Iñiguez del Rio. <strong><a href="http://linguamatica.com/index.php/linguamatica/article/view/156" target="_blank">Un método de análisis de lenguaje tipo SMS para el castellano</a></strong>. <a href="http://linguamatica.com/index.php/linguamatica" target="_blank">Linguamatica</a>, Vol. 5, No. 1, pp. 31-39, July 2013.</p>
</blockquote>
<p>If you are interested in the first steps of text analysis (tokenization, text normalization, POS Tagging), then these two recent news may be useful for you:</p>
<ul>
<li>The <a href="http://komunitatea.elhuyar.org/tweet-norm/participation/#Results" target="_blank">results of the Tweet Normalization Workshop/Task</a> at <a href="http://www.sepln.org/?news=xxix-conference-of-the-sepln&lang=en" target="_blank">SEPLN 2013</a> have just been published, with interesting data and datasets.</li>
<li><a href="http://derczynski.com/sheffield/" target="_blank">Leon Derczynski</a> <em>et al.</em> have released a <a href="https://gate.ac.uk/wiki/twitter-postagger.html" target="_blank">GATE-based POS-Tagger for Twitter</a> with very good levels of accuracy.</li>
</ul>
<p>And you may want to <a href="http://jmgomezhidalgo.blogspot.com.es/2013/07/chat-or-what-approaching-text.html" target="_blank">take a look at my previous post on text normalization</a>.</p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-29318067815582725152013-07-22T16:54:00.001+02:002013-07-22T18:35:56.723+02:00Negobot is in the news!<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/negobot.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 298px" height="298" width="450"/></p>
<p>... And I must say, <em>it is quite popular out there</em>.</p>
<p>Negobot is a conversational agent posing as a 14-year-old girl, intended to detect paedophilic intentions and adapt to them. Negobot is based on Game Theory, and it is the result of an R&D project performed by the <a href="http://www.deustotech.deusto.es/" target="_blank">Deustotech</a> <a href="http://s3lab.deusto.es/index.php?lang=en" target="_blank">Laboratory for Smartness, Semantics and Security</a> (S3Lab) and <a href="http://www.optenet.com/" target="_blank">Optenet</a>. The members of the team are:</p>
<ul>
<li><a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a></li>
<li><a href="http://paginaspersonales.deusto.es/patxigg/es/inicio.shtml" target="_blank">Patxi Galán-García</a></li>
<li><a href="http://paginaspersonales.deusto.es/isantos/es/about.shtml" target="_blank">Igor Santos</a></li>
<li><a href="http://paginaspersonales.deusto.es/bosanz/es/index.html" target="_blank">Borja Sanz</a></li>
<li><a href="http://www.linkedin.com/in/pablogarciabringas" target="_blank">Pablo García-Bringas</a></li>
</ul>
<p>And myself. Its scientific approach is explained in the following paper:</p>
<blockquote>
<p>Laorden, C., Galán-García, P., Santos, I., Sanz, B., Gómez Hidalgo, J.M., García Bringas, P., 2012. <a href="http://rd.springer.com/chapter/10.1007/978-3-642-33018-6_27#" target="_blank"><strong>Negobot: A Conversational Agent Based on Game Theory for the Detection of Paedophile Behaviour</strong></a>. International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, Advances in Intelligent Systems and Computing, Vol. 189, Springer Berlin Heidelberg, pp. 261-270. (<a href="http://www.esp.uem.es/jmgomez/papers/cisis12.pdf" target="_blank">preprint</a>)</p>
</blockquote>
<p>My friend and colleague <strong><a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a></strong> was interviewed about the project by the <a href="http://www.agenciasinc.es/en/Who-are-we" target="_blank">SINC Agency</a> a few days ago, and the agency released a news story that quickly spread to a wide range of online and offline media: news agencies, newspapers, radio stations, news aggregators, blogs, etc. Here is the original news story in Spanish:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><a href="http://www.agenciasinc.es/Noticias/Una-Lolita-virtual-a-la-caza-de-pederastas" target="_blank"><strong>Una 'Lolita' virtual a la caza de pederastas
<br/></strong></a> SINC | 10 julio 2013 10:40</p>
</blockquote>
<p>The news story featured <a href="http://youtu.be/-RbPeiNhV-E" target="_blank">a video with the interview to Carlos</a>.</p>
<p>And in English, published by SINC at <a href="http://www.alphagalileo.org/" target="_blank">Alpha Galileo</a>:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><strong><a href="http://www.alphagalileo.org/ViewItem.aspx?ItemId=132829&CultureCode=en" target="_blank">A virtual 'Lolita' on the hunt for paedophiles
<br/></a></strong> 10 de julio de 2013 Plataforma SINC</p>
</blockquote>
<p>From there, to <strong>major English-language media</strong>:</p>
<table cellpadding="2" width="450" align="center" cellspacing="2">
<tbody>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/nbc.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 80px" height="80" width="100"/></td>
<td><a href="http://www.nbcnews.com/technology/controversial-lolita-chatbot-catches-online-predators-6C10622694" target="_blank">Controversial 'Lolita' chatbot catches online predators
<br/></a> <strong>NBC News</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/bbc.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 82px" height="82" width="100"/></td>
<td><a href="http://www.bbc.co.uk/news/technology-23268893" target="_blank">'Virtual Lolita' aims to trap chatroom paedophiles
<br/></a> <strong>BBC News Technology</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/huffingtonpost.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 93px" height="93" width="100"/></td>
<td><a href="http://www.huffingtonpost.com/2013/07/11/negobot-virtual-lolita-game-theory_n_3579716.html" target="_blank">Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms
<br/></a> <strong>Huffington Post</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/theindependent.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 69px" height="69" width="100"/></td>
<td><a href="http://www.independent.co.uk/news/science/virtual-lolita-poses-as-schoolgirl-aged-14-to-trap-online-paedophiles-8700920.html" target="_blank">Virtual Lolita poses as schoolgirl aged 14 to trap online paedophiles
<br/></a> <strong>The Independent</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/dailymail.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 84px" height="84" width="100"/></td>
<td><a href="http://www.dailymail.co.uk/sciencetech/article-2359499/How-Lolita-style-virtual-robots-posing-teenage-girls-used-uncover-paedophiles-social-network-sites.html" target="_blank">How 'Lolita style' virtual robots posing as teenage girls are being used to uncover paedophiles on social network sites
<br/></a> <strong>Daily Mail</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/metro.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 68px" height="68" width="100"/></td>
<td><a href="http://metro.co.uk/2013/07/11/virtual-lolita-created-to-trap-paedophiles-in-online-chatrooms-3878742/" target="_blank">'Virtual Lolita' created to trap paedophiles in online chatrooms
<br/></a> <strong>METRO</strong></td>
</tr>
</tbody>
</table>
<p><strong>Major international blogs and news aggregators</strong> have also featured Negobot:</p>
<ul>
<li><strong>Engadget</strong>: <a href="http://www.engadget.com/2013/07/11/negobot-virtual-chat-agent-trap-pedophiles/" target="_blank">Negobot: a virtual chat agent engineered to trap pedophiles</a></li>
<li><strong>Ubergizmo</strong>: <a href="http://www.ubergizmo.com/2013/07/negobot-chatbot-to-trap-pedophiles/" target="_blank">Negobot Chatbot To Trap Pedophiles</a></li>
<li><strong>IO9</strong>: <a href="http://io9.com/sophisticated-chatbot-poses-as-teenage-girl-to-lure-ped-743260398" target="_blank">Sophisticated chatbot poses as teenage girl to lure pedophiles</a></li>
<li><strong>GigaOm</strong>: <a href="http://gigaom.com/2013/07/11/catching-pedophiles-with-text-mining-and-game-theory/" target="_blank">Catching pedophiles with text mining and game theory</a></li>
<li><strong>Gizmag</strong>: <a href="http://www.gizmag.com/negobot-pedophile-hunting-chatbot/28240/" target="_blank">Chatbot hunts for pedophiles</a></li>
<li><strong>BetaBeat</strong>: <a href="http://betabeat.com/2013/07/virtual-teen-can-lure-sexual-predators-with-the-blink-of-an-emoticon/" target="_blank">Virtual Teen Can Lure Sexual Predators With the Blink of an Emoticon</a></li>
<li><strong>Slashdot</strong>: <a href="http://yro.slashdot.org/story/13/07/11/1233215/spanish-chatbot-hunts-for-pedophiles" target="_blank">Spanish Chatbot Hunts For Pedophiles</a></li>
</ul>
<p>As of today, Negobot has got:</p>
<ul>
<li>181 comments in <a href="http://slashdot.org/" target="_blank">Slashdot</a>.</li>
<li>42 diggs in <a href="http://digg.com/" target="_blank">Digg</a>.</li>
<li>124 points and 49 comments in <a href="http://www.reddit.com/" target="_blank">Reddit</a>.</li>
</ul>
<p>Negobot has obtained <strong>worldwide coverage in the news</strong>:</p>
<div style="TEXT-ALIGN: center">
<table width="400" align="center" cellspacing="2">
<tbody>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ar-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Argentine Republic
<br/></strong> <a href="http://www.elintransigente.com/notas/2013/7/13/crearon-programa-informatico-para-atrapar-pedofilos-los-chats-redes-sociales-193644.asp" target="_blank">Crearon un programa informático para atrapar pedófilos en los chats y redes sociales
<br/></a> El Intransigente</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/bk-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Bosnia and Herzegovina
<br/></strong> <a href="http://www.vijesti.ba/magazin/zanimljivosti/156017-Sofisticirani-robot-Negobot-sluzi-namami-otkrije-pedofile.html" target="_blank">Sofisticirani robot "Negobot" služi da namami i otkrije pedofile</a>
<br/>
Vijesti</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/as-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Commonwealth of Australia
<br/></strong> <a href="http://www.news.com.au/technology/artificial-intelligence-poses-as-14yearoldgirl-to-detect-paedophiles-in-social-chatrooms/story-e6frfro0-1226677357656" target="_blank">Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
<br/></a> News Limited Network
<br/>
<a href="http://m.heraldsun.com.au/technology/news/artificial-intelligence-poses-as-14yearoldgirl-to-detect-paedophiles-in-social-chatrooms/story-fni0bzod-1226677357656" target="_blank">Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
<br/></a> Herald Sun, Melbourne</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ez-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Czech Republic
<br/></strong> <a href="http://pej.cz/Wirtualna-Lolita-czyli-czatbot-ktory-wskaze-pedofilow-a7008" target="_blank">"Wirtualna Lolita", czyli czatbot, który wskaże pedofilów
<br/></a> PEJ</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/fr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>French Republic
<br/></strong> <a href="http://www.marieclaire.fr/,negobot-l-adolescente-virtuelle-qui-piege-les-pedophiles-sur-internet,696016.asp" target="_blank">Negobot, l'adolescente virtuelle qui piège les pédophiles sur internet !
<br/></a> Marie Claire
<br/>
<a href="http://www.metronews.fr/info/espagne-negobot-une-lolita-virtuelle-traque-les-pedophiles-sur-internet/mmgo!KDWfEhp1jC02c/" target="_blank">Espagne : une lolita virtuelle traque les pédophiles sur Internet
<br/></a> Metro News
<br/>
<a href="http://www.lepoint.fr/societe/l-adolescente-virtuelle-qui-traquait-les-pedophiles-en-ligne-15-07-2013-1704989_23.php" target="_blank">L'adolescente virtuelle qui traquait les pédophiles en ligne
<br/></a> Le Point</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/gr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 64px" height="64" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Hellenic Republic
<br/></strong> <a href="http://www.naftemporiki.gr/story/674679" target="_blank">Τεχνητή νοημοσύνη- «κυνηγός» παιδόφιλων στο Ίντερνετ
<br/></a> Naftemporiki</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/it-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Italian Republic
<br/></strong> <a href="http://www.iltempo.it/hitech-games/2013/07/13/negobot-il-software-lolita-che-individua-i-pedofili-dialogando-1.1156254" target="_blank">Negobot, il software "Lolita" che individua i pedofili dialogando
<br/></a> Il Tempo
<br/>
<a href="http://www.repubblica.it/tecnologia/2013/07/11/news/robot_anti_pedofili-62788718/" target="_blank">Negobot, la lolita virtuale che stana i pedofili in rete
<br/></a> La Repubblica
<br/>
<a href="http://www.lastampa.it/2013/07/11/italia/cronache/negobot-la-lolita-virtuale-che-incastra-i-pedofili-in-rete-QCEBpn8A29n73FqW3fuVpK/pagina.html" target="_blank">Negobot, la Lolita virtuale che incastra i pedofili in Rete
<br/></a> La Stampa</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sp-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Kingdom of Spain
<br/></strong> <a href="http://www.abc.es/tecnologia/20130712/rc-negobot-contra-pedofilos-201307121320.html" target="_blank">Negobot contra los pedófilos
<br/></a> ABC Tecnología
<br/>
<a href="http://noticias.lainformacion.com/ciencia-y-tecnologia/tecnologia-general/negobot-contra-los-pedofilos_GC7T2ptMK1wNuQ87VJ6c9/" target="_blank">Negobot contra los pedófilos
<br/></a> La Información
<br/>
<a href="http://www.publico.es/458788/una-lolita-virtual-a-la-caza-de-pederastas" target="_blank">Una 'Lolita' virtual a la caza de pederastas
<br/></a> Publico
<br/>
<a href="http://www.lavozdegalicia.es/noticia/galicia/2013/07/11/idean-lolita-virtual-detectar-pedofilos-red/0003_201307G11P5992.htm" target="_blank">Idean una lolita virtual para detectar pedófilos en la Red
<br/></a> La Voz de Galicia
<br/>
<a href="http://www.elcorreogallego.es/tendencias/ecg/lolita-virtual-caza-pederastas/idEdicion-2013-07-11/idNoticia-816343/" target="_blank">Una 'Lolita' virtual para la caza de pederastas
<br/></a> El Correo Gallego
<br/>
<a href="http://www.elespectador.com/noticias/cultura/vivir/articulo-432934-trampa-los-pederastas-red" target="_blank">La trampa para los pederastas en la red
<br/></a> El Espectador
<br/>
<a href="http://ecodiario.eleconomista.es/ciencia/noticias/4981319/07/13/Nuevo-sistema-virtual-a-la-caza-de-posibles-pederastas.html" target="_blank">Nuevo sistema virtual a la caza de posibles pederastas
<br/></a> El Economista</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sw-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 63px" height="63" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Kingdom of Sweden
<br/></strong> <a href="http://nyheter24.se/nyheter/internet/749661-virtuell-lolita-ska-fa-fast-pedofiler-pa-natet" target="_blank">"Virtuell lolita" ska få fast pedofiler på nätet
<br/></a> Nyheter24</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/my-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Malaysia</strong>
<br/>
<a href="http://www.pikiran-rakyat.com/node/242358" target="_blank">Robot Virtual Gadis Remaja Digunakan untuk Menjebak Pedofil
<br/></a> Pikiran Rakyat</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/nl-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Netherlands</strong>
<br/>
<a href="http://www.pcmweb.nl/nieuws/digitale-pedolokker-imiteert-schoolmeisje.html" target="_blank">Digitale pedolokker imiteert schoolmeisje
<br/></a> PCM</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/uy-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 67px" height="67" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Oriental Republic of Uruguay
<br/></strong> <a href="http://www.lr21.com.uy/tecnologia/1117187-desarrollan-lolita-virtual-para-dar-caza-a-pederastas-y-corruptores-de-menores" target="_blank">Desarrollan "Lolita virtual" para dar caza a pederastas y corruptores de menores
<br/></a> La Red 21</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/po-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Portuguese Republic
<br/></strong> <a href="http://hypescience.com/a-adolescente-robotica-cacadora-de-pedofilos/" target="_blank">A adolescente robótica caçadora de pedófilos
<br/></a> Hype Science</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/au-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Austria
<br/></strong> <a href="http://www.style.at/contator/style/news.asp?nnr=61301" target="_blank">Negobot findet Pädophile
<br/></a> style.at Kurzmeldungen
<br/>
<a href="http://derstandard.at/1373512635374/Negobot-Chatprogramm-forscht-Paedophile-aus" target="_blank">"Negobot": Chatprogramm forscht Pädophile aus
<br/></a> Der Standard</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ci-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Chile
<br/></strong> <a href="http://www.24horas.cl/tendencias/mundodigital/nuevo-software-permite-detectar-pedofilos-en-la-red-742074" target="_blank">Nuevo software permite detectar pedófilos en la red
<br/></a> 24 Horas</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/hr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Croatia
<br/></strong> <a href="http://www.radiosarajevo.ba/novost/118880/napravljen-robot-koji-pronalazi-pedofile" target="_blank">Napravljen robot koji pronalazi pedofile
<br/></a> Radio Sarajevo</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/in-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of India
<br/></strong> <a href="http://articles.timesofindia.indiatimes.com/2013-07-12/science/40535516_1_paedophiles-game-theory-police-force" target="_blank">A virtual Lolita on the hunt for paedophiles online
<br/></a> The Times of India</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/kz-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 49px" height="49" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Kazakhstan
<br/></strong> <a href="http://www.safekaznet.kz/en/bez-rubriki/bot-virtualnaya-lolita-ispolzuet-teoriyu-igr-dlya-raspoznaniya-ohotnikov-na-detey-v-internet-chatah" target="_blank">Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms</a>
<br/>
Safekaznet</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/pl-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 62px" height="62" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Poland
<br/></strong> <a href="http://www.autonom.pl/?p=6436" target="_blank">Negobot sieciową pułapką na pedofilów
<br/></a> Autonom</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ri-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Serbia
<br/></strong> <a href="http://www.telegraf.rs/hi-tech/internet/781516-virtuelna-lolita-krece-u-lov-na-manijake" target="_blank">STOP PEDOFILIJI: Virtuelna Lolita kreće u lov na manijake!</a>
<br/>
Telegraf.rs
<br/>
<a href="http://www.novosti.rs/vesti/naslovna/tehnologije/aktuelno.236.html:443846-Virtuelna-Lolita-za-lov-na-pedofile" target="_blank">Virtuelna Lolita za lov na pedofile
<br/></a> Novosti</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ro-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 67px" height="67" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Romania</strong>
<br/>
<a href="http://www.ziare.com/articole/inteligenta+artificiala+negobot+pedofili" target="_blank">Robotul care pozeaza in pustoaica de 14 ani - da de gol pedofilii
<br/></a> Ziare</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/rs-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Russian Federation
<br/></strong> <a href="http://korrespondent.net/business/web/1581297-poiskom-pedofilov-v-seti-zajmetsya-bot-vydayushchij-sebya-za-14-letnyuyu" target="_blank">Поиском педофилов в сети займется бот, выдающий себя за 14-летнюю
<br/></a> Корреспондент.net
<br/>
<a href="http://lenta.ru/news/2013/07/11/bot/" target="_blank">Вычисление педофилов в интернете поручат чат-боту
<br/></a> LENTA</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/vm-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Socialist Republic of Vietnam
<br/></strong> <a href="http://news.com.vn/hi-tech/more-hitech/113439-virtual-lolita-aims-to-trap-chatroom-paedophiles-.html" target="_blank">'Virtual Lolita' aims to trap chatroom paedophiles</a>
<br/>
Info VN</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sz-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 100px" height="100" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Swiss Confederation
<br/></strong> <a href="http://www.ticinonews.ch/articolo.aspx?id=304888&rubrica=15" target="_blank">Spagna: ecco Negobot, 14enne virtuale che scova i pedofili in rete
<br/></a> Ticino News</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/up-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Ukraine</strong>
<br/>
<a href="http://ubr.ua/uk/tv/technologii/v-spanskih-nternet-chatah-pdltkv-vd-pedoflv-zahisha-negobot-240191" target="_blank">В іспанських інтернет-чатах підлітків від педофілів захищає Negobot</a>
<br/>
UBR</p>
</td>
</tr>
</tbody>
</table>
</div>
<p>Carlos Laorden has also been interviewed by <strong>Spanish newspapers and radio stations</strong>:</p>
<ul>
<li>Interview in <strong><a href="http://www.elmundo.es/" target="_blank">El Mundo</a></strong> (<a href="http://www.esp.uem.es/jmgomez/negobot/ElMundo.pdf" target="_blank">Spanish, PDF</a>).</li>
<li>Interview in <strong><a href="http://www.cope.es/programas/La-Noche/Inicio" target="_blank">La Noche de La Cope</a></strong> (<a href="http://www.esp.uem.es/jmgomez/negobot/LaNocheDeLaCope.mp3" target="_blank">Spanish, MP3</a>).</li>
<li>Interview in <a href="http://www.cope.es/programas/La-Manana/inicio" target="_blank"><strong>La Mañana de La Cope</strong></a> (<a href="http://www.esp.uem.es/jmgomez/negobot/LaMananaDeLaCope.mp3" target="_blank">Spanish, MP3</a>).</li>
</ul>
<p>And last but not least, <a href="http://youtu.be/S47IOaPbwXY" target="_blank">Negobot has received some criticism in the form of a (quite funny) video</a>.</p>
<p>You can keep tracking Negobot with <a href="https://www.google.com/search?q=negobot" target="_blank">Google Web Search</a> and <a href="https://www.google.com/search?q=negobot&tbm=nws" target="_blank">Google News</a>.</p>
<p>Finally, sorry for the <em>SSF</em>, and thanks for reading.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-48850311426473802722013-07-08T08:47:00.001+02:002013-07-08T08:47:32.111+02:00Performance Analysis of N-Gram Tokenizer in WEKA<p>The goal of this post is to analyze the <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> class <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">NGramTokenizer</a></code> in terms of performance, as it depends on the complexity of the <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expression</a> used during the tokenization step. There is a potential trade-off between simpler regexes (which produce more tokens) and more complex regexes (which take more time to evaluate). This post provides experimental insights into this trade-off, in order to save you time when using this extremely useful class with the WEKA indexer <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code>.</p>
<p><strong>Motivation</strong></p>
<p>The WEKA <code>weka.core.tokenizers.NGramTokenizer</code> class is responsible for tokenizing a text into pieces which, depending on the configured n-gram size, can be token <a href="http://en.wikipedia.org/wiki/N-gram" target="_blank">unigrams, bigrams and so on</a>. This class relies on the method <code><a href="http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)" target="_blank">String[] split(String regex)</a></code> to break a text string into tokens, which are further combined into n-grams.</p>
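<p>To make that combination step concrete, here is a minimal sketch in plain Java (my own illustration, not the actual WEKA implementation) of how the tokens returned by <code>split()</code> can be assembled into unigrams and bigrams:</p>

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {

    // Build all n-grams of size 1..maxSize from an array of tokens,
    // joining consecutive tokens with a single space.
    static List<String> ngrams(String[] tokens, int maxSize) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < n; j++) {
                    sb.append(' ').append(tokens[i + j]);
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "This is a text".split("\\W+");
        // Prints the 4 unigrams followed by the 3 bigrams
        System.out.println(ngrams(tokens, 2));
    }
}
```

<p>Note that any empty strings left in the token array by the splitting regex would end up inside the n-grams, which is precisely why the choice of regex matters here.</p>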
<p>This method, in turn, depends on the complexity of the regular expression used to split the text. For instance, let us examine this simple example:</p>
<blockquote>
<p><code>public class TextSplitTest {
<br/>
public static void main(String[] args) {
<br/>
String delimiters = "\\W";
<br/>
String s = "This is a text &$% string";
<br/>
System.out.println(s);
<br/>
String[] tokens = s.split(delimiters);
<br/>
System.out.println(tokens.length);
<br/>
for (int i = 0; i &lt; tokens.length; ++i)
<br/>
System.out.println("#"+tokens[i]+"#");
<br/>
}
<br/>
}</code></p>
</blockquote>
<p>In this call to the <code>split()</code> method, we are using the regex "<code>\\W</code>", which <a href="http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html" target="_blank">matches any single non-word character</a> (anything other than a letter, a digit or the underscore) as a delimiter. The output of executing this class is:</p>
<blockquote>
<p><code>$> java TextSplitTest
<br/>
This is a text &$% string
<br/>
9
<br/>
#This#
<br/>
#is#
<br/>
#a#
<br/>
#text#
<br/>
##
<br/>
##
<br/>
##
<br/>
##
<br/>
#string#</code></p>
</blockquote>
<p>This is because every individual non-word character is a match, and we have five delimiters between "<code>text</code>" and "<code>string</code>". As a consequence, we find four empty (but not null) strings among those matches. If we instead use the regex "<code>\\W+</code>" as the delimiter string, which matches sequences of one or more non-word characters, we get the following output:</p>
<blockquote>
<p><code>$> java TextSplitTest
<br/>
This is a text &$% string
<br/>
5
<br/>
#This#
<br/>
#is#
<br/>
#a#
<br/>
#text#
<br/>
#string#</code></p>
</blockquote>
<p>This is much closer to what we expected in the first place.</p>
<p>When tokenizing a text, it seems wise to avoid computing empty strings as potential tokens, because we have to invest some time to discard them -- and we may have thousands of instances! On the other hand, a more complex regular expression clearly takes more time to evaluate. So there is a trade-off between using a one-character delimiter pattern and using a more sophisticated regex that avoids empty strings. To what extent does this trade-off impact the <code>StringToWordVector</code>/<code>NGramTokenizer</code> classes?</p>
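<p>The trade-off can be explored in isolation, outside WEKA, with a micro-benchmark sketch like the one below (this is an illustration of the question, not the benchmark code used in the experiments; absolute timings will vary with the JVM and the machine):</p>

```java
public class RegexTradeoffDemo {

    // Split with the given regex and count the non-empty tokens,
    // i.e. do the extra filtering work that "\\W" forces on the caller.
    static int countNonEmpty(String text, String regex) {
        int count = 0;
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "This is a text &$% string";
        // Both regexes yield the same 5 non-empty tokens...
        System.out.println(countNonEmpty(s, "\\W"));   // 5
        System.out.println(countNonEmpty(s, "\\W+"));  // 5
        // ...so the open question is which one is faster overall
        long t0 = System.nanoTime();
        for (int i = 0; i < 100_000; i++) countNonEmpty(s, "\\W");
        long t1 = System.nanoTime();
        for (int i = 0; i < 100_000; i++) countNonEmpty(s, "\\W+");
        long t2 = System.nanoTime();
        System.out.println("\\W  took " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("\\W+ took " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

<p>The experiments below answer the same question at the scale of full collections, where the n-gram construction step also comes into play.</p>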
<p><strong>Experiment Setup</strong></p>
<p>I ran these experiments on my laptop (CPU: Intel Core2 Duo P8700 @ 2.53GHz; RAM: 2.90GB @ 1.59 GHz). For some of the tests, especially those involving a large number of n-grams, I needed to use the <code>-Xmx</code> option in order to increase the heap space.</p>
<p>I am using the class <code><a href="https://github.com/jmgomezh/tmweka/blob/master/WEKAExamples/IndexTest.java" target="_blank">IndexTest.java</a></code> available at <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub repository</a>. I have commented out all the output in order to retain only the computation time of the method <code>index()</code>, which creates the tokenizer and filter objects and performs the filtering process. This process actually indexes the documents, that is, it transforms the text string in each instance into a dictionary-based representation -- each instance becomes a sparse list of (token_number, weight) pairs, where the weight is a binary value. I have also modified the class to set lowercasing to false, in order to accumulate as many distinct tokens as possible.</p>
<p>I have performed experiments using the following two collections:</p>
<ul>
<li>The <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, which is a dataset of 5,568 short messages classified as spam/ham (not spam).</li>
<li>The classical <a href="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html" target="_blank">Reuters-21578 text collection (ModApte split)</a>, which is a dataset of 21,578 relatively short news stories, classified according to a number of economic categories (acquisitions, earnings reports, products like rubber, tin or sugar, etc.). I have downloaded it from the <a href="http://nltk.org/nltk_data/" target="_blank">NLTK data directory</a>.</li>
</ul>
<p>I am comparing the strings "<code>\\W</code>" and "<code>\\W+</code>" as delimiters in the <code>NGramTokenizer</code> instance of the <code>index()</code> method, for unigrams, uni-to-bigrams and uni-to-trigrams. In the case of the SMS Spam Collection, I have divided the dataset into subsets of 20%, 40%, 60%, 80% and 100% in order to evaluate the effect of the collection size.</p>
<p>Finally, I have run the program 10 times per experiment, in order to average the results and get more stable numbers. All times are expressed in milliseconds.</p>
<p><strong>Results and Analysis</strong></p>
<p>We will first examine the results on the SMS Spam Collection. The results obtained for unigrams are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.unigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 314px" height="314" width="450"/></p>
<p>It is a bar chart that shows the time in milliseconds for each collection size (20%, 40%, etc.). The results for uni-to-bigrams are:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.bigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 310px" height="310" width="450"/></p>
<p>And the results for uni-to-trigrams on the SMS Spam Collection are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.trigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 329px" height="329" width="450"/></p>
<p>As could be expected, the times grow very quickly from unigrams to uni-to-bigrams to uni-to-trigrams. While for unigrams the simple regex "<code>\\W</code>" is more efficient, the more sophisticated regex "<code>\\W+</code>" is more efficient for bigrams and trigrams. There is one anomalous point (at 60% on trigrams), but I believe it is an outlier. So it seems that the cost of using a more sophisticated regex does not pay off for unigrams, where matching the regex is more expensive than discarding empty strings. The opposite holds for uni-to-bigrams and uni-to-trigrams, where the empty strings seem to hurt the algorithm that builds the bi- and trigrams.</p>
<p>The results on the Reuters-21578 collection are the next ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.reuters.chart.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 341px" height="341" width="450"/></p>
<p>These results are fully aligned with those obtained on the SMS Spam Collection, and the difference even widens in the case of uni-to-trigrams, as the number of distinct tokens in the Reuters-21578 collection is much bigger (there are more texts, and they are longer).</p>
<p>All in all, the biggest performance improvements obtained are 4.59% on the SMS Spam Collection (uni-to-trigrams, 40% subset) and 4.15% on the Reuters-21578 collection, which I consider marginal. In short, there is not a big difference between these two regexes after all.</p>
<p><strong>Conclusions</strong></p>
<p>In the potential trade-off between using a simple regular expression to recognize text tokens and using a more sophisticated regular expression that avoids spurious tokens, my simple experiment with the WEKA indexer classes shows that <em>both approaches are more or less equivalent in terms of performance</em>.</p>
<p>However, when using only unigrams, it is better to use the simple regular expression, because the extra time spent matching the more sophisticated one does not pay off.</p>
<p>On the other hand, the algorithm that builds the bi- and trigrams seems to be sensitive to the empty strings generated by the simple regex, and you can get around a 4% performance improvement by using the more sophisticated regular expression and avoiding those empty strings.</p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-64723401944841885432013-07-04T21:45:00.001+02:002013-07-05T06:33:59.555+02:00Chat or What: Approaching Text Normalization in Chats and Social Networks<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/text.normalization.header.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 124px" height="124" width="400"/></p>
<p>It is not strange that, with the overload of user-generated content, there is an increasing interest in processing chat/SMS-like language. Social Networks, virtual worlds, <a href="http://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game" target="_blank">MMORPGs</a> and chat rooms are plagued with emoticons, abbreviations, typos and channel codes that make the task of processing user-generated text a nightmare. In this post I list a number of resources and approaches that may be useful for researchers and practitioners of <a href="https://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">Natural Language Processing</a> regarding this problem, which, following the course by <a href="http://www.cslu.ogi.edu/~sproatr/" target="_blank">Richard Sproat</a> and <a href="http://www.bedrick.org/" target="_blank">Steven Bedrick</a>, I call <a href="http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/" target="_blank"><em>Text Normalization</em></a>.</p>
<p>Text Normalization can be seen as <em>translation from informal language to standard English-Spanish-whatever</em>. The simplest approach you can follow is a <em>word-by-word translation</em> using a dictionary. This approach is followed by online lingo translators like <a href="http://www.lingo2word.com/" target="_blank">Lingo2Word</a> and <a href="http://transl8it.com/" target="_blank">Transl8it!</a>. In fact, you can reproduce this work using <a href="http://www.lingo2word.com/dictionary.php" target="_blank">the Lingo2Word dictionary</a> (click on the header links). I have followed this approach as a baseline in several projects and works, like <a href="http://wendy.optenet.com/" target="_blank"><em>WENDY - WEb-access coNfidence for chilDren and Young</em></a> (web page in Spanish; the paper "<a href="http://www.clef-initiative.eu/documents/71612/271dd606-53d1-4cad-9852-fb5336e8587e" target="_blank"><em>Combining Predation Heuristics and Chat-Like Features in Sexual Predator Identification</em></a>" is in English).</p>
<p>Another knowledge-based alternative is manually coding normalization rules. An example is the tool <a href="https://code.google.com/p/deflog/" target="_blank">Deflog</a>, which is a program that decodes the usual expressions used in the picture-oriented social network <a href="http://www.fotolog.com/" target="_blank">Fotolog</a>. In this network, the majority of (Spanish-language) users make use of specific language codes like repeating vowels ("I liiiiiiiiiiiiike iiiiiiit"), alternating upper and lowercase ("YoU WiLL LiKe It"), and so on. The program encodes a number of functions that "correct" word tokens, each function for a particular code. While the functions mostly apply to Spanish and Fotolog, a linguist may derive their own rules for another domain (e.g. Twitter).</p>
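<p>Both baselines -- the word-by-word dictionary and the hand-coded rules -- can be combined in a few lines of Java. The sketch below is merely illustrative: the three dictionary entries are invented for the example (a real lingo dictionary such as Lingo2Word's is far larger), and the single rule only collapses runs of three or more repeated letters:</p>

```java
import java.util.Map;

public class NormalizeDemo {

    // Toy lingo dictionary -- illustrative entries only
    static final Map<String, String> LINGO =
            Map.of("u", "you", "gr8", "great", "2day", "today");

    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            // Rule: collapse runs of 3+ repeated letters ("liiiike" -> "like")
            String t = token.replaceAll("(\\p{L})\\1{2,}", "$1");
            // Dictionary: word-by-word lookup, keeping unknown tokens as-is
            out.append(LINGO.getOrDefault(t, t)).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("u r gr8 2day I liiiiike it"));
        // -> you r great today i like it
    }
}
```

<p>Anything beyond this quickly calls for statistical methods, which can handle unseen variants and context-dependent expansions.</p>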
<p>These are obviously baselines. There are much more sophisticated, mostly statistical, methods; I provide a list here that complements the reading list in the course by Sproat and Bedrick:</p>
<ul>
<li>Bo Han, Paul Cook and Timothy Baldwin, <a href="http://dl.acm.org/citation.cfm?id=2414430&CFID=332163528&CFTOKEN=34685198" target="_blank">Lexical Normalisation of Short Text Messages</a>, In ACM Transactions on Intelligent Systems and Technology (TIST) 4(1), pp. 5:1-5:27, 2013.</li>
<li>Tim Schlippe, Chenfei Zhu, Daniel Lemcke, and Tanja Schultz. <a href="http://csl.ira.uka.de/~schlippe/pubs/ICASSP2013-Schlippe_SMTTextNormalization.pdf" target="_blank">Statistical Machine Translation based Text Normalization with Crowdsourcing</a>. In Proceedings of The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, Canada, 26-31 May 2013.</li>
<li>Bo Han, Paul Cook and Timothy Baldwin, <a href="http://aclweb.org/anthology/D/D12/D12-1039.pdf" target="_blank">Automatically Constructing a Normalisation Dictionary for Microblogs</a>, In EMNLP-CoNLL 2012, 421-432, Jeju, Republic of Korea.</li>
<li>Bo Han and Timothy Baldwin, <a href="http://aclweb.org/anthology/P/P11/P11-1038.pdf" target="_blank">Lexical normalisation of short text messages: Makn sens a #twitter</a>, In ACL 2011, 368-378, Portland, OR, USA.</li>
<li>Tim Schlippe, Chenfei Zhu, Jan Gebhardt, Tanja Schultz. <a href="http://csl.ira.uka.de/~schlippe/pubs/Interspeech2010-Schlippe_SMTNormalization.pdf" target="_blank">Text Normalization based on Statistical Machine Translation and Internet User Support</a>. In Proceedings of The 11th Annual Conference of the International Speech Communication Association (Interspeech 2010), Makuhari, Japan, 26-30 September 2010.</li>
<li>Carlos Henriquez, Adolfo Hernández H., <a href="http://www2009.eprints.org/255/3/Henriquez_Hernandez_CAW2009.pdf" target="_blank">A ngram-based statistical machine translation approach for text normalization on chat-speak style communications</a>. Proceedings of the CAW2 (Content Analysis in Web 2.0) Workshop, April 2009.</li>
</ul>
<p>You can get some more papers by tracking the referenced literature or by searching these papers for citations.</p>
<p>As a final note, remember that text normalization is not always a good idea. For some problems it may be better to keep the original abbreviations, emoticons and so on, as they can be representative of the style, the genre, an author or a particular age group.</p>
<p>I hope these works will suggest other methods for your problem at hand. As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-79746532251500991212013-06-23T01:18:00.001+02:002013-06-23T01:36:46.050+02:00Sample Code for Text Indexing with WEKA<p>Following the example in which <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html" target="_blank">I demonstrated how to develop your own classifier in Java based on WEKA</a>, I propose an additional example on <em>how to index a collection of texts in your Java code</em>. This post is inspired and supported by the WEKA <a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">"Use WEKA in your Java code"</a> wiki page. To index a text collection is to generate a mapping between documents and words (or other indexing units), as represented in the following graph:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/indexing.demo.index.graph.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 274px" height="274" width="400"/></p>
<p>The fundamental class for text indexing in WEKA is <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a></code>. This class provides an impressive range of indexing options that include using custom <a href="http://en.wikipedia.org/wiki/Tokenization" target="_blank">tokenizers</a>, <a href="http://en.wikipedia.org/wiki/Stemming" target="_blank">stemmers</a> and <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stoplists</a>; binary, <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">Term Frequency and TF.IDF</a> weights, etc. For some applications, its default options may be enough -- however, I recommend getting familiar with all of its options in order to take full advantage of it.</p>
<p>With the purpose of showing how to use <code>StringToWordVector</code> in your code, I have created a simple class named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/WEKAExamples/IndexTest.java" target="_blank">IndexTest.java</a></code>, stored <a href="https://github.com/jmgomezh/tmweka/tree/master/WEKAExamples" target="_blank">in my GitHub repository</a>. Apart from the relatively simple methods for loading and storing <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">Attribute-Relation File Format (ARFF)</a> files, the core of the class is the method <code>void index()</code>, which creates and employs a <code>StringToWordVector</code> object. The first piece of the code is the following one:</p>
<blockquote>
<p><code>// Set the tokenizer
<br/>
NGramTokenizer tokenizer = new NGramTokenizer();
<br/>
tokenizer.setNGramMinSize(1);
<br/>
tokenizer.setNGramMaxSize(1);
<br/>
tokenizer.setDelimiters("\\W");</code></p>
</blockquote>
<p>This snippet creates and configures a tokenizer, that is the object responsible for breaking the original text into individual strings named tokens, representing the indexing units (typically words). In this case I am using a <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">weka.core.tokenizers.NGramTokenizer</a></code>, which I find more useful than the usual <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code>, as I describe <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">in the post about sentiment analysis with WEKA</a>. This tokenizer is able to recognize <a href="http://en.wikipedia.org/wiki/N-gram" target="_blank">n-grams</a>, that is, sequences of tokens. Here I use the methods <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html#setNGramMaxSize(int)" target="_blank">void setNGramMaxSize(int value)</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html#setNGramMinSize(int)" target="_blank">void setNGramMinSize(int value)</a></code> to define the size of the n-grams as unigrams.</p>
<p>Another interesting aspect of the tokenizer part is that we set up the regular expression <code>"\\W"</code> as the delimiter. This regex specifies that any non-alphanumeric character is considered a delimiter; as a result, only alphanumeric character strings will be considered tokens. For a detailed reference on regular expressions in Java, check <a href="http://docs.oracle.com/javase/tutorial/essential/regex/" target="_blank">the lesson on the topic in the Java Tutorial</a>.</p>
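To see what this delimiter regex does in practice, here is a tiny standalone sketch (plain Java, no WEKA dependency) that splits a sentence on non-word characters, approximating what the tokenizer produces with these delimiters:

```java
import java.util.Arrays;

// Standalone illustration of the "\\W" delimiter behavior: any
// non-alphanumeric character separates tokens, so only alphanumeric
// strings survive. This mimics the tokenizer configuration above,
// but does not use WEKA itself.
public class DelimiterDemo {
    public static void main(String[] args) {
        String text = "Don't forget: WEKA 3.6 rocks!";
        // Split on one or more non-word characters
        String[] tokens = text.split("\\W+");
        System.out.println(Arrays.toString(tokens));
        // Note how "Don't" breaks into "Don" and "t",
        // and "3.6" into "3" and "6"
    }
}
```

This also shows a limitation of the approach: contractions and decimal numbers get split into separate tokens.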
<p>The second code snippet is the following one:</p>
<blockquote>
<p><code>// Set the filter
<br/>
StringToWordVector filter = new StringToWordVector();
<br/>
filter.setInputFormat(inputInstances);
<br/>
filter.setTokenizer(tokenizer);
<br/>
filter.setWordsToKeep(1000000);
<br/>
filter.setDoNotOperateOnPerClassBasis(true);
<br/>
filter.setLowerCaseTokens(true);</code></p>
</blockquote>
<p>This second snippet creates and configures the <code>StringToWordVector</code> object, which is a subclass of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/Filter.html" target="_blank">weka.filters.Filter</a></code> class. Any filter has to make reference to a dataset, which is the inputInstances dataset in this case, as done with the <code>filter.setInputFormat(inputInstances)</code> call.</p>
<p>We set up the tokenizer and some other options as an example. Both <code>DoNotOperateOnPerClassBasis</code> and <code>WordsToKeep</code> should be standard in most text classifiers. The first one tells the filter to extract the tokens from all classes as a whole, instead of doing it class by class (the default option); I simply fail to understand why one would want different indexing tokens per class in a text classification problem. The second option sets the number of words to keep, and I recommend using a big integer here in order to cover all possible tokens.</p>
<p>The third and last code snippet shows the invocation of the filter on the <code>inputInstances</code> reference:</p>
<blockquote>
<p><code>// Filter the input instances into the output ones
<br/>
outputInstances = Filter.useFilter(inputInstances,filter);</code></p>
</blockquote>
<p>This is the standard method for applying a filter, according to the "<a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">Use WEKA in your Java code</a>" wiki page. The output of calling this class on a simple dataset such as <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></code> is the following:</p>
<blockquote>
<p><code>$> javac IndexTest.java
<br/>
$>java IndexTest
<br/>
Usage: java IndexTest <fileInput> <fileOutput>
<br/>
$>java IndexTest smsspam.small.arff result.arff
<br/>
===== Loaded dataset: smsspam.small.arff =====
<br/>
Started indexing at: 1371939800703
<br/>
===== Filtering dataset done =====
<br/>
Finished indexing at: 1371939800812
<br/>
Total indexing time: 109
<br/>
===== Saved dataset: result.arff =====
<br/>
$>more result.arff
<br/>
@relation 'sms_test-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000000-prune-rate-1.0-N0-L-stemmerweka.core.stemmers.NullStemmer-M1-O-
<br/>
tokenizerweka.core.tokenizers.NGramTokenizer -delimiters "\\W" -max 1 -min 1'
<br/>
<br/></code><code>@attribute spamclass {spam,ham}
<br/>
@attribute 000 numeric
<br/>
@attribute 03 numeric
<br/>
@attribute 07046744435 numeric
<br/>
@attribute 07732584351 numeric
<br/>
../..</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">As a note, the name of the relation in the generated ARFF file (tag <code>@relation</code>) encodes the properties of the applied filter, including some default options I have not configured in it.</p>
<p>So that is all. More examples on these topics are coming in the next weeks. And as always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com5tag:blogger.com,1999:blog-36589303.post-25578626734757090622013-06-19T17:13:00.001+02:002013-06-19T17:30:45.752+02:00Comparing baselines of keyword and learning based sentiment analysis<p><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.keep-calm-and-stay-positive.png" style="TEXT-ALIGN: center; WIDTH: 300px; DISPLAY: block; HEIGHT: 349px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="349" width="300"/></p>
<p>In my previous post, I presented <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">a simple example of using WEKA for Sentiment Analysis (or Opinion Mining)</a>. As with most of <a href="http://jmgomezhidalgo.blogspot.com.es/search/label/WEKA" target="_blank">my blog posts on text mining with WEKA</a>, I approach interesting, hot or easy tasks as a way to present this package's capabilities for text mining -- in consequence, these posts are <em>tutorials</em> in essence.</p>
<p>In that particular post, I left <em>several open tasks</em> for anybody who may be interested in completing them, and I picked two for myself. One of the tasks left for the reader was <em>coding a class and training a model</em> to actually classify texts according to sentiment -- and as I have been asked for the code, I did it myself and <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">it is available at my GitHub repository</a>.</p>
<p>Another task I left pending, and picked for myself, was applying a keyword-based approach using <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet</a> to the same (<a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">SFU Review Corpus</a>) collection and comparing its accuracy to the learning (<a href="http://www.cs.waikato.ac.nz/~ml/weka/" target="_blank">WEKA</a>) approach. So this is the topic of this post.</p>
<p><strong>Goal</strong></p>
<p>The goal of this post is to build a simple keyword-based sentiment analysis program based on SentiWordNet and evaluate it on the SFU Review Corpus, in order to compare its accuracy with the one obtained via (WEKA) learning as described in my previous post "<a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">Baseline Sentiment Analysis with WEKA</a>".</p>
<p><strong>About SentiWordNet</strong></p>
<p>SentiWordNet is a collection of concepts (<a href="http://en.wikipedia.org/wiki/Synonym_ring" target="_blank">synonym sets, synsets</a>) from <a href="http://wordnet.princeton.edu/" target="_blank">WordNet</a> that have been evaluated from the point of view of their polarity (if they convey a positive or a negative feeling). Some interesting features include:</p>
<ul>
<li>As it is based on WordNet, only English and the four most significant parts of speech (nouns, adjectives, adverbs and verbs) are covered. Multi-word expressions are included, encoded with underscore (e.g. "too_bad", "at_large").</li>
<li>Each concept has attached polarity scores. For instance:</li>
</ul>
<blockquote>
<p><code># POS ID PosScore NegScore SynsetTerms Gloss
<br/>
a 01125429 0 0.625 bad#1 having undesirable or negative qualities; "a bad report card"; "his sloppy appearance made a bad impression"; "a bad little boy"; "clothes in bad shape"; "a bad cut"; "bad luck"; "the news was very bad"; "the reviews were bad"; "the pay is bad"; "it was a bad light for reading"; "the movie was a bad choice"
<br/>
a 01052038 0.222 0.778 too_bad#1 regrettable#1 deserving regret; "regrettable remarks"; "it's regrettable that she didn't go to college"; "it's too bad he had no feeling himself for church"</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">So SentiWordNet is in a tab-separated format: the first column is the <a href="http://en.wikipedia.org/wiki/Part_of_speech" target="_blank">Part Of Speech</a> (POS), the second and third are the polarity scores (between 0 and 1), the next column is the synset (synonym set, a list of synonyms tagged with their sense -- word#sense_number), and the last one is the WordNet gloss (roughly speaking, the definition).</p>
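As a quick illustration of this layout, the following self-contained snippet parses one such line into its fields. The <code>SwnEntry</code> class is my own illustrative structure, not part of the official <code>SWN3.java</code> helper:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of parsing one SentiWordNet 3.0 data line into its fields,
// following the tab-separated layout described above (POS, ID, PosScore,
// NegScore, SynsetTerms, Gloss). SwnEntry is illustrative only.
public class SwnLineDemo {
    static final class SwnEntry {
        final String pos;
        final double posScore, negScore;
        final List<String> terms;  // word#sense_number items
        SwnEntry(String pos, double posScore, double negScore, List<String> terms) {
            this.pos = pos; this.posScore = posScore;
            this.negScore = negScore; this.terms = terms;
        }
    }

    static SwnEntry parse(String line) {
        String[] f = line.split("\t");
        return new SwnEntry(f[0], Double.parseDouble(f[2]), Double.parseDouble(f[3]),
                            Arrays.asList(f[4].split(" ")));
    }

    public static void main(String[] args) {
        String line = "a\t01052038\t0.222\t0.778\ttoo_bad#1 regrettable#1\tdeserving regret";
        SwnEntry e = parse(line);
        System.out.println(e.pos + " " + e.negScore + " " + e.terms);
        // a 0.778 [too_bad#1, regrettable#1]
    }
}
```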
<p>Another interesting feature is that SentiWordNet researchers have provided us with a very basic Java class named <code><a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">SWN3.java</a></code> to query the database for a pair word/POS. This class loads the database and provides a function that outputs "<code>positive</code>", "<code>strong_positive</code>", "<code>negative</code>", "<code>strong_negative</code>" or "<code>neutral</code>" for a given pair according to the manual scores assigned to the synsets. It is very basic because it does not perform <a href="http://en.wikipedia.org/wiki/Word-sense_disambiguation" target="_blank">Word Sense Disambiguation</a> nor even <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" target="_blank">POS Tagging</a>, and the labels are heuristically defined (some other definitions are possible). However, we can take advantage of it in order to implement a very basic sentiment classifier, as described below.</p>
<p>In order to make use of the <code>SWN3.java</code> class, you have to:</p>
<ol>
<li><a href="http://sentiwordnet.isti.cnr.it/download.php" target="_blank">Download a copy of SentiWordNet</a>.</li>
<li>Rename the file to <code>SentiWordNet_3.0.0.txt</code> and put it in a <code>data</code> folder -- relative to the place you located your <code>SWN3.java</code> file. Alternatively, you can modify this class to use a different path or data file name.</li>
<li>Delete all lines starting with the symbol "<code>#</code>" from the <code>SentiWordNet_3.0.0.txt</code> file. HINT: The header and the last line of the file.</li>
</ol>
<p>And that's it.</p>
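Step 3 can be done with any editor, but as a sketch, this is what it amounts to programmatically (shown over an in-memory list of lines; for the real file you would read and rewrite <code>SentiWordNet_3.0.0.txt</code> the same way):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of step 3 above: dropping the comment lines (those starting
// with "#") from the SentiWordNet data.
public class StripComments {
    static List<String> stripComments(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.startsWith("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# SentiWordNet v3.0.0 header",
                "a\t01125429\t0\t0.625\tbad#1\thaving undesirable qualities",
                "#\t\t0\t0");
        System.out.println(stripComments(lines).size()); // 1
    }
}
```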
<p><strong>The Algorithm and Its Parameters/Heuristics</strong></p>
<p>I have sketched a very simple algorithm for sentiment classification using the querying function provided by the <code>SWN3.java</code> class. Given the output of its function <code>public String extract(String word, String pos)</code>, that is "positive" etc., the algorithm consists of:</p>
<ol>
<li>Tokenize the target text into alphanumeric strings (typically, words).</li>
<li>Start a polarity score at 0.</li>
<li>For each token, look it up with the extract function and add +1 (positive), +2 (strong_positive), -1 (negative), or -2 (strong_negative) to the score.</li>
<li>Return "<code>yes</code>" if the final polarity score is above 0, and "<code>no</code>" if it is below 0.</li>
</ol>
<p>Let me remind that the class tags used in the SFU Review Corpus are "<code>yes</code>" (positive) and "<code>no</code>" (negative).</p>
<p>That's all. No rocket science here.</p>
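The algorithm can be sketched in a few lines of plain Java. The lexicon map here is a toy stand-in for the lookup that <code>SWN3.java</code>'s extract function performs against SentiWordNet, so its entries are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the scoring algorithm above. The `lexicon` map is a toy
// stand-in for SWN3.extract(word, pos); real lookups would query
// SentiWordNet instead.
public class PolarityDemo {
    static final Map<String, String> lexicon = new HashMap<>();
    static {
        lexicon.put("good", "positive");
        lexicon.put("great", "strong_positive");
        lexicon.put("bad", "negative");
        lexicon.put("awful", "strong_negative");
    }

    static String classify(String text) {
        int score = 0;                                           // step 2
        for (String token : text.toLowerCase().split("\\W+")) {  // step 1
            String tag = lexicon.getOrDefault(token, "neutral"); // step 3
            switch (tag) {
                case "positive":        score += 1; break;
                case "strong_positive": score += 2; break;
                case "negative":        score -= 1; break;
                case "strong_negative": score -= 2; break;
            }
        }
        return score > 0 ? "yes" : "no";  // step 4; a tie (0) defaults to "no" here
    }

    public static void main(String[] args) {
        System.out.println(classify("A great camera, despite the bad manual."));
        // great(+2) + bad(-1) = +1 -> yes
    }
}
```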
<p>However, there are two basic parameters:</p>
<ul>
<li>What to do if you get a <em>neutral</em> score (0)? We can be positive (<code>Y</code>, return "<code>yes</code>" when the score is greater than or equal to 0), or negative (<code>N</code>, return "<code>no</code>" when the score is less than or equal to 0).</li>
<li>Which <em>Part of Speech</em> should we use in the SentiWordNet search? I have crafted two options: (1) looking up (and summing over) all available POS (<code>AllPOS</code>), and (2) looking up only adjectives (<code>ADJ</code>).</li>
</ul>
<p>So I have coded four methods, named <code>classifyAllPOSY()</code>, <code>classifyAllPOSN()</code>, <code>classifyADJY()</code> and <code>classifyADJN()</code> for the four possible combinations. These functions are available in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SentiWordNetDemo.java" target="_blank">SentiWordNetDemo.java</a></code> class <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">at the GitHub repository</a>. And these are the approaches I test below.</p>
<p>The <em>rationale for the first parameter</em> is that we have a 50% balance among the 400 reviews, so it is not clear which default we should prefer. In an imbalanced problem, we could choose the most populated class. An alternative is analyzing SentiWordNet to check whether it is positively or negatively biased (that is, whether it has more positive or negative words), or even refining this with an additional corpus (counting words and weighting according to the frequencies of positive/negative words).</p>
<p>The <em>rationale for the second parameter</em> is that adjectives tend to be less ambiguous (discarding sarcasm or irony), but it is easy to test with any other POS. Using all of them is incorrect (as every word has only one POS in context) but practical, and it will give more extreme scores (assuming that a negative word is negative with each of its possible POS).</p>
<p><strong>Results and Analysis</strong></p>
<p>So we are testing four approaches, and I will be using the same metrics as in the previous blog post on sentiment analysis with WEKA, namely averaged <a href="http://en.wikipedia.org/wiki/F1_score" target="_blank">F1</a> and accuracy (along with the <a href="http://en.wikipedia.org/wiki/Confusion_matrix" target="_blank">Confusion Matrix</a> itself). The test is performed over the 400 text documents in the dataset, as this algorithm requires no training. The following table shows the results I have obtained:</p>
<p><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.sentiwordnet.png" style="TEXT-ALIGN: center; DISPLAY: block; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="120" width="340"/></p>
<p>I have added to the table the two best performing configurations for a learning based classifier <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">as presented in the previous blog post</a>. However, the comparison is not 100% fair, as the learning approach has been evaluated by 10 fold <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">Cross Validation</a> -- which involves using the full dataset as test set, but in 10% size batches.</p>
<p>All in all, it seems that the keyword-based (using SentiWordNet) approach is competitive (it beats many learning-based classifiers in my previous experiment), getting its best results using only adjectives and outputting "<code>no</code>" in case of neutral scores. The effectiveness on the "<code>yes</code>" class is better than the SVMs with 1-to-3-grams, in terms of recall. I believe that, with some adjustments, the keyword-based approach can be very competitive in this case, and it has the additional advantage that it does not rely on the quality or amount of training data.</p>
<p>Comparing the parameters, the default "<code>no</code>" is consistently better than the default "<code>yes</code>". Using all POS is worse than using only adjectives, because even in the case of default "<code>yes</code>" (which is beaten by both ALL cases in terms of accuracy), we get more balanced decisions -- the ALL setup leads to extremely positive scores, and a clear bias to the "<code>yes</code>" class.</p>
<p><strong>Concluding Discussion</strong></p>
<p>As discussed above, I consider this test as <strong>a baseline</strong> because of the wide number of simple heuristics employed in the algorithm. Actually, there are a number of possible improvements to be done, although some of them are not trivial. I tag them as [<em>easy</em>|<em>hard</em>] according to my experience in text mining. For instance:</p>
<ul>
<li>Recognizing <strong>multiword expressions</strong> [<em>easy</em>]. This can be done by making simple searches for token n-grams in the SentiWordNet database, just modifying the <code>SentiWordNetDemo</code> class.</li>
<li>Using a validation dataset to <strong>optimize the score threshold</strong> [<em>easy</em>]. We have assumed that an overall score of 0 is neutral, and tested to classify it as positive or negative (being the second option better). We have general evidence that the database is positively oriented, so we can set a threshold over 0 (e.g. 10, 20...) for classifying a text as positive, in order to correct this effect. The most simple way of doing this is selecting a 10% of the corpus as a validation set, sorting the decisions according to the score, and defining a threshold that optimizes the accuracy (or F1).</li>
<li>Test <strong>different scoring models</strong>, e.g. modifying the <code>SWN3.java</code> program to output the original scores instead of tags [<em>easy</em>] and using those scores for the final polarity score calculation. Alternatively, we can play with different definitions of "strong_positive" etc. in terms of the weights [<em>easy</em>], or use different score thresholds for assigning polarity labels in the database [<em>easy</em>]. This can be more difficult to test, but we can use a validation set as in the previous point.</li>
<li>Performing <strong>POS Tagging</strong> by using the majority tag [<em>easy</em>], coding a POS Tagger based on learning [<em>hard</em>], or using an existing off-the-shelf POS Tagger (like e.g. <a href="http://nlp.lsi.upc.edu/freeling/" target="_blank">Freeling</a> or <a href="http://nlp.stanford.edu/software/corenlp.shtml" target="_blank">CoreNLP</a>) [<em>easy</em>]. After using a POS Tagger, the tags must be normalized or processed in order to retain the basic POS, as most of POS Taggers make use of sophisticated tag sets that represent morphology and so on. Obviously, the algorithm should be changed to perform only the search for the appropriate POS tag.</li>
<li>Performing <strong>Word Sense Disambiguation</strong> by using the first sense [<em>easy</em>], coding a WSD system based on learning using a dataset like <a href="http://www.cse.unt.edu/~rada/downloads.html#semcor" target="_blank">Semcor</a> [<em>hard</em>], coding a WSD system based on dictionaries -- e.g. using the WordNet glosses in the database itself [<em>easy</em>], or using an existing off-the-shelf WSD system such as <a href="http://www.cse.unt.edu/~rada/downloads.html#senselearner" target="_blank">SenseLearner</a> [<em>easy</em>]. You may need to perform data transformations, both in content and in format, if different database versions are used for WSD and sentiment analysis.</li>
</ul>
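The score-threshold optimization mentioned in the list above is straightforward to sketch: score the validation texts with the algorithm, then sweep candidate thresholds and keep the one with the best accuracy. The data below is invented for illustration:

```java
// Illustrative sketch of picking a polarity-score threshold on a validation
// set: try each candidate cut-off and keep the one with highest accuracy.
public class ThresholdDemo {
    static double accuracy(int[] scores, boolean[] positive, int threshold) {
        int hits = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predictedYes = scores[i] > threshold;
            if (predictedYes == positive[i]) hits++;
        }
        return (double) hits / scores.length;
    }

    public static void main(String[] args) {
        // Toy validation data: algorithm scores and gold labels (true = "yes")
        int[] scores = { -5, 3, 8, 1, 12, -2, 6, 4 };
        boolean[] gold = { false, false, true, false, true, false, true, true };
        int best = Integer.MIN_VALUE;
        double bestAcc = -1;
        for (int t = -20; t <= 20; t++) {
            double acc = accuracy(scores, gold, t);
            if (acc > bestAcc) { bestAcc = acc; best = t; }
        }
        System.out.println("best threshold = " + best + ", accuracy = " + bestAcc);
    }
}
```

On this toy data the positive reviews all score above 3, so the sweep finds a cut-off above the naive 0; the same positive bias is what we might expect from SentiWordNet itself.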
<p>As more exploratory work, I suggest the following:</p>
<ul>
<li>Test the algorithm on other datasets like the classical <a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/" target="_blank">Movie Review Datasets</a> by <a href="http://research.yahoo.com/Bo_Pang" target="_blank">Bo Pang</a> and <a href="http://www.cs.cornell.edu/home/llee" target="_blank">Lilian Lee</a>, or with other semantic lexicons (opinionated word databases) like the <a href="http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon" target="_blank">Opinion Lexicon</a> by <a href="http://www.cs.uic.edu/~liub/" target="_blank">Bing Liu</a> <em>et al</em>. or the <a href="http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/" target="_blank">Subjectivity Lexicon</a> by <a href="http://people.cs.pitt.edu/~wiebe/" target="_blank">Janyce Wiebe</a> <em>et al</em>..</li>
<li>Perform an exploratory analysis of the distribution of polarities at SentiWordNet and its implications on the basic algorithm.</li>
</ul>
<p>I am not sure if I will be making any other tests with the keyword-based approach to sentiment analysis, as I want to keep my focus on <a href="http://www.esp.uem.es/jmgomez/tmweka/" target="_blank">WEKA features for text mining</a>.</p>
<p>Anyway, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com10tag:blogger.com,1999:blog-36589303.post-46247701481452844112013-06-11T13:21:00.001+02:002013-06-11T13:21:32.147+02:00Baseline Sentiment Analysis with WEKA<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.headpic.png" style="DISPLAY: inline" height="143" width="447"/></p>
<p><a href="http://en.wikipedia.org/wiki/Sentiment_analysis" target="_blank">Sentiment Analysis (and/or Opinion Mining)</a> is one of the hottest topics in <a href="http://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">Natural Language Processing</a> nowadays. The task, defined in a simplistic way, consists of determining the polarity of a text utterance according to the opinion or sentiment of the speaker or writer, as positive or negative. This task has multiple applications, including e.g. Customer Relationship Management or predicting political elections.</p>
<p>While initial results dating back to the early 2000s seem very promising, it is not such a simple task. We face problems ranging from <a href="http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf" target="_blank">the informal Twitter language</a> to the fact that <a href="http://times.cs.uiuc.edu/czhai/pub/www07-sent.pdf" target="_blank">opinions can be faceted</a> (for instance, I may like the software but not the hardware of a device), or <a href="http://www.cs.uic.edu/~liub/FBS/fake-reviews.html" target="_blank">opinion spam and fake reviews</a>, along with traditional and complex Natural Language Processing problems such as irony, sarcasm or negation. For a good overview of the task, please check <a href="http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html" target="_blank">the survey paper on opinion mining and sentiment analysis by Bo Pang and Lillian Lee</a>. A more practical overview is the <a href="http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html" target="_blank">Sentiment Tutorial with LingPipe by Alias-i</a>.</p>
<p>In general, there are two main approaches to this task:</p>
<ul>
<li>Counting and/or weighting sentiment-related words that have been evaluated and tagged by experts, conforming a lexical collection like <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet</a>.</li>
<li>Learning a text classifier on a previously labelled text collection, like e.g. the <a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">SFU Review Corpus</a>.</li>
</ul>
<p>The SentiWordNet home page offers <a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">a simple Java program that follows the first approach</a>. I will follow the second one in order to show how to use an essential WEKA text mining class (<code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/TextDirectoryLoader.html" target="_blank">weka.core.converters.TextDirectoryLoader</a></code>), and to provide another example of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a></code> class.</p>
<p>I will follow the process outlined in <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">the previous post about Language Identification using WEKA</a>.</p>
<p><strong>Data Collection and Preprocessing</strong></p>
<p>For this demonstration, I will make use of a relatively small but interesting dataset named <a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">the SFU Review Corpus</a>. This corpus consists of 400 reviews in English extracted from the <em>Epinions</em> website in 2004, divided into 25 positive and 25 negative reviews for each of 8 product categories (Books, Cars, Computers, etc.). It also contains 400 reviews in Spanish extracted from <em>Ciao.es</em>, divided into the same categories (except for the Cookware category in English, which --more or less-- maps to Lavadoras --Washing Machines-- in Spanish).</p>
<p>The original format of the collection is one directory per product category, each including 25 positive reviews with the word "yes" in the file name and 25 negative reviews with the word "no" in the file name. Unfortunately, this format does not allow WEKA to work with it directly, but a couple of handy scripts transform it into a new format: two directories, one including the positive reviews (directory <code>yes</code>), and the other one including the negative reviews (directory <code>no</code>). I have kept the category in the name of the files (with patterns like <code>bookyes1.txt</code>) in order to allow others to make a more detailed analysis per category.</p>
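The scripts themselves are not included in the post; a minimal sketch of the transformation in plain Java (the <code>yes</code>/<code>no</code> directory names and file-name convention follow the description above, while the category-prefix naming is an assumption) could be:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the layout transformation described above: files
// whose names contain "yes" go to a single yes/ directory, the rest to no/.
public class ReorganizeCorpus {
    public static void reorganize(Path source, Path target) throws IOException {
        Path yesDir = Files.createDirectories(target.resolve("yes"));
        Path noDir = Files.createDirectories(target.resolve("no"));
        try (DirectoryStream<Path> categories = Files.newDirectoryStream(source)) {
            for (Path category : categories) {
                if (!Files.isDirectory(category)) continue;
                try (DirectoryStream<Path> reviews = Files.newDirectoryStream(category)) {
                    for (Path review : reviews) {
                        Path dest = review.getFileName().toString().contains("yes")
                                ? yesDir : noDir;
                        // Keep the category in the file name, e.g. bookyes1.txt
                        Files.copy(review, dest.resolve(
                                category.getFileName() + review.getFileName().toString()));
                    }
                }
            }
        }
    }
}
```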
<p>Comparing the structure of the original and the new format of the text collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/structure.collections.sfu.opinion.mining.png" style="DISPLAY: inline" height="202" width="180"/></p>
<p>In order to construct an <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF</a> file from this structure, we can use the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/TextDirectoryLoader.html" target="_blank">weka.core.converters.TextDirectoryLoader</a></code> class, which is an evolution of a previously existing helper class named <code><a href="http://weka.wikispaces.com/Text+categorization+with+WEKA" target="_blank">TextDirectoryToArff.java</a></code> and available at <a href="http://weka.wikispaces.com/" target="_blank">WEKA Documentation at wikispaces</a>. Using this class is as simple as issuing the next command:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><code>$> java weka.core.converters.TextDirectoryLoader -dir SFU_Review_Corpus_WEKA > SFU_Review_Corpus.arff</code></p>
</blockquote>
<p>You have to call this command at the parent directory of <code>SFU_Review_Corpus_WEKA</code>, and the parameter <code>-dir</code> sets up the input directory. This class expects to have a single directory containing a directory per class value (<code>yes</code> and <code>no</code> in our case), which in turn should contain a number of files pertaining to the corresponding classes. As the output of this command goes to the standard output, I have to redirect it to a file.</p>
<p>I have left the output of the execution of this command for both the English (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SFU_Review_Corpus.arff" target="_blank">SFU_Review_Corpus.arff</a></code>) and the Spanish (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SFU_Spanish_Review.arff" target="_blank">SFU_Spanish_Review.arff</a></code>) collections at <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">the OpinionMining folder</a> of <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub repository</a>.</p>
<p><strong>Data Analysis</strong></p>
<p>Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, a trivial analysis of the problem easily leads us to think that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can be better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose comparing word unigrams, 3-grams, and 1-to-3-grams. The latter representation includes uni- to 3-grams, with the hope of getting the best of all of them.</p>
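To make the three representations concrete, here is a standalone sketch (plain Java, independent of WEKA's <code>NGramTokenizer</code>) of word n-gram extraction; with min=1 and max=3 it yields the 1-to-3-gram vocabulary of a text:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of word n-gram extraction (e.g. 1-to-3-grams),
// analogous to what NGramTokenizer produces with -min 1 -max 3.
public class NGramDemo {
    static List<String> ngrams(String text, int min, int max) {
        String[] words = text.split("\\W+");
        List<String> result = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) sb.append(' ').append(words[i + j]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("this camera is very bad", 1, 3));
        // includes unigrams like "bad", bigrams like "very bad",
        // and trigrams like "is very bad"
    }
}
```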
<p>Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can rely on standard punctuation (for each of the languages) because the texts are long comments (several paragraphs each), I derive the following calls to the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a> class:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff</code></p>
</blockquote>
<p>We follow the notation <code>vector.uni</code> to denote that the dataset is vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are similar to these ones.</p>
<p>The most important thing in these calls is that we are no longer using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code> class. Instead, we are using <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">weka.core.tokenizers.NGramTokenizer</a></code>, which uses the options <code>-min</code> and <code>-max</code> to set the minimum and maximum size of the n-grams. The key point is that the two classes differ substantially in how they handle delimiters:</p>
<ul>
<li>The <code>weka.core.tokenizers.WordTokenizer</code> class uses the deprecated Java class <code><a href="http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html" target="_blank">java.util.StringTokenizer</a></code> , even in the latest versions of the WEKA package (as of the day of this writing). In <code>StringTokenizer</code>, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly define which will be the "spaces" in your text.</li>
<li>The <code>weka.core.tokenizers.NGramTokenizer</code> class uses the recommended Java String method <code><a href="http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String)" target="_blank">String[] split(String regex)</a></code>, in which the argument (and thus the delimiters string) is a Java <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">Regular Expression</a> (regex). The text is split into tokens separated by substrings that match the regex, so you can use the full power of regexes, including special codes for characters. In this case I am using the code <code>\W</code>, which denotes any non-word character, in order to keep only alphanumeric character sequences.</li>
</ul>
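<p>The contrast between the two mechanisms is easy to reproduce in plain JDK code — the following sketch mimics the underlying behaviors (literal delimiter characters vs. a separator regex); it is my own illustration, not WEKA code:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerContrast {

    // WordTokenizer-style: delimiters is a literal list of "space" characters.
    public static List<String> byDelimiterChars(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    // NGramTokenizer-style: delimiters is a regex matched against separators.
    public static List<String> byRegex(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(regex)) if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "good camera, ¿no?";
        // Literal delimiter list: any separator not listed leaks into a token
        // (here the opening question mark sticks to "no").
        System.out.println(byDelimiterChars(text, " .,;:?!"));
        // Regex: \W stands for any non-word character, nothing has to be listed.
        System.out.println(byRegex(text, "\\W+"));
    }
}
```

<p>This is why the regex-based tokenizer is more convenient when the texts contain symbols you did not anticipate in a delimiter list.</p>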
<p>After splitting the text into word n-grams (or more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are most predictive. As <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">in the Language Identification post</a>, we make use of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a></code> class:</p>
<blockquote>
<p><code>$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
<br/>
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
<br/>
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff</code></p>
</blockquote>
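<p>The <code>Ranker -T 0.0</code> setting discards attributes whose Information Gain score does not exceed the 0.0 threshold. As a reminder of what that score measures, here is a minimal, WEKA-independent sketch for a binary term and a binary class (the counts in <code>main</code> are invented for illustration):</p>

```java
public class InfoGain {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Binary entropy in bits.
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information gain of a binary term for a binary class, from counts:
    // posWith/negWith = positive/negative documents containing the term,
    // posTotal/negTotal = class sizes.
    public static double infoGain(int posWith, int negWith, int posTotal, int negTotal) {
        double n = posTotal + negTotal;
        double withTerm = posWith + negWith;
        double withoutTerm = n - withTerm;
        double prior = entropy(posTotal / n);
        double condWith = withTerm == 0 ? 0 : entropy(posWith / withTerm);
        double condWithout = withoutTerm == 0 ? 0 : entropy((posTotal - posWith) / withoutTerm);
        return prior - (withTerm / n) * condWith - (withoutTerm / n) * condWithout;
    }

    public static void main(String[] args) {
        // A term in 180 of 200 positive reviews but only 20 of 200 negative
        // ones is highly predictive...
        System.out.println(infoGain(180, 20, 200, 200));
        // ...while a term occurring evenly in both classes scores exactly 0,
        // and is therefore dropped by the 0.0 threshold.
        System.out.println(infoGain(100, 100, 200, 200));
    }
}
```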
<p>After the selection of the most predictive n-grams, we get the following statistics in the test collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.term.stats.png" style="DISPLAY: inline" height="171" width="240"/></p>
<p>The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections show comparable statistics (the same order of magnitude). The numbers of original unigrams are quite similar, but there are fewer bigrams and trigrams in Spanish (despite it having more isolated words -- unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and bigrams, but less so for trigrams.</p>
<p>Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by 4 to 5 times). However, trigrams alone contribute only a small increment, so bigrams must be playing the main role here. The number of features is quite manageable, which allows us to run quick experiments.</p>
<p>As discussed in my previous post on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">setting up experiments with WEKA text classifiers and how to chain filters and classifiers</a>, note that these are not the final features if we configure a cross-validation experiment -- we have to chain the filters (<code>StringToWordVector</code> and <code>AttributeSelection</code>) and the classifier in order to perform a valid experiment, as the features for each fold should be different.</p>
<p><strong>Experiments and Results</strong></p>
<p>In order to simplify the example, and expecting to get good results, we will use <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">the same algorithms we used in the Language Identification problem</a>. These are: Naive Bayes (NB, <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">weka.classifiers.bayes.NaiveBayes</a></code>), PART (<code><a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a></code>), J48 (<code><a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">weka.classifiers.trees.J48</a></code>), k-Nearest Neighbors (<code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">weka.classifiers.lazy.IBk</a></code>) with k = 1,3,5, and Support Vector Machines (<code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a></code>); all of them with the default options, except for kNN which uses 1, 3 and 5 neighbors. I am testing the three proposed representations (based on unigrams, trigrams and 1-3grams) by 10-fold cross-validation. An example experiment command line is the following one:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer \\\"weka.core.tokenizers.NGramTokenizer -delimiters \\\\\\\"\\\\\\\W\\\\\\\" -min 1 -max 1\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.bayes.NaiveBayes -v -i -t SFU_Review_Corpus.arff > tests/uniNB.txt</code></p>
</blockquote>
<p>You can change the size of n-grams with the <code>-min</code> and <code>-max</code> parameters. Also, you can change the learning algorithm with the outermost <code>-W</code> option. I am storing the results in a <code>tests</code> folder, in files with the convention <code>&lt;rep&gt;&lt;alg&gt;.txt</code>. The results of this test for the English language collection are the following ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.english.png" style="DISPLAY: inline" height="376" width="338"/></p>
<p>Considering the class <code>yes</code> (positive sentiment) as the positive class, in each column we show the True Positives (hits on the <code>yes</code> class), False Positives (members of the <code>no</code> class mistakenly classified as <code>yes</code>), False Negatives (members of the <code>yes</code> class mistakenly classified as <code>no</code>) and True Negatives (hits on the <code>no</code> class); along with the <a href="http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization" target="_blank">macro-averaged</a> <a href="http://en.wikipedia.org/wiki/F1_score" target="_blank">F1</a> (standard average F1 over both classes) and the general <a href="http://en.wikipedia.org/wiki/Accuracy_and_precision" target="_blank">accuracy</a>.</p>
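<p>The macro-averaged F1 shown in the tables is just the plain average of the per-class F1 scores. A minimal sketch of how those figures derive from the four confusion-matrix cells (the counts in <code>main</code> are invented for illustration, not taken from the actual results):</p>

```java
public class SentimentMetrics {

    // F1 for one class, treating it as the positive class.
    static double f1(int tp, int fp, int fn) {
        double precision = (tp + fp == 0) ? 0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0 : (double) tp / (tp + fn);
        return (precision + recall == 0) ? 0 : 2 * precision * recall / (precision + recall);
    }

    // Macro-averaged F1 over the two classes: for the "no" class the roles of
    // the cells are swapped (its TP are the original TN, its FP the original FN).
    public static double macroF1(int tp, int fp, int fn, int tn) {
        return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2.0;
    }

    public static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    public static void main(String[] args) {
        // Hypothetical confusion matrix: TP=140, FP=60, FN=55, TN=145.
        System.out.println(macroF1(140, 60, 55, 145));
        System.out.println(accuracy(140, 60, 55, 145));
    }
}
```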
<p>Additionally, the results for the Spanish language collection are the following ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.spanish.png" style="DISPLAY: inline" height="376" width="341"/></p>
<p>So these are the results. Let us start the analysis...</p>
<p><strong>Results Analysis</strong></p>
<p>We can perform an analysis regarding different aspects:</p>
<ul>
<li>What is the overall performance?</li>
<li>How does performance compare across the two languages?</li>
<li>Which are the best learning algorithms?</li>
<li>What effect do the different text representations have on classifier performance?</li>
</ul>
<p>All in all, and taking into account that the class balance is 50% (so a trivial acceptor, a trivial rejector, or a random classifier would reach 50% accuracy), most of the classifiers beat this baseline, but not by a wide margin; even the best combination among all algorithms, languages and representations (SVMs on English 1-to-3-grams) reaches only a modest 71% -- far from a satisfying 90% or more. Let me remind you that we are facing a relatively simple problem -- a few long texts and a binary classification. Most approaches in the literature get much better results in similar setups.</p>
<p>Results are better for English than for Spanish in one-to-one comparisons. To explain this, let us check the representations used for Spanish by listing the top 20 n-grams of each representation:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.top.spanish.terms.png" style="DISPLAY: inline" height="339" width="290"/></p>
<p>Some of the n-grams (highlighted in <em>italics</em>) are just incorrect, because accented characters are not recognized due to the inappropriate pattern I have used in the tokenization step. The tokenizer uses the string "<code>\W</code>" to recognize alphanumeric strings -- which in Java does not include vowels with accents ("á", "é", "í", "ó", "ú") nor other language-specific symbols (e.g. "ñ"). Moreover, most of the n-grams are not opinionated words or expressions; they are either intensifiers (e.g. "muy" -- "very") or just contingent on the training collection (e.g. "en el taller" -- "in the garage"; "tarjeta de memoria" -- "memory card"). The clearly opinionated words, highlighted in <strong>boldface</strong>, are very few. Regarding this issue, we can conclude that the training collection is too small.</p>
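<p>The accent problem is easy to reproduce in plain Java, and it also hints at a possible fix: enabling the <code>UNICODE_CHARACTER_CLASS</code> mode (the <code>(?U)</code> inline flag, available since Java 7), which makes <code>\W</code> respect accented letters. Whether this particular flag can be passed through WEKA's <code>-delimiters</code> option is something I still have to verify:</p>

```java
import java.util.Arrays;

public class SpanishTokens {

    // Splits on non-word characters; unicodeAware toggles the (?U) flag.
    public static String[] tokens(String text, boolean unicodeAware) {
        return text.split(unicodeAware ? "(?U)\\W+" : "\\W+");
    }

    public static void main(String[] args) {
        String text = "la cámara es muy pequeña";
        // Default \W: á and ñ count as non-word characters, so words break apart.
        System.out.println(Arrays.toString(tokens(text, false)));
        // Unicode-aware \W keeps the Spanish words intact.
        System.out.println(Arrays.toString(tokens(text, true)));
    }
}
```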
<p>If we examine the performance of different classifiers, we can cluster them in three groups: top performers (SVMs, NB), medium performers (PART, J48) and losers for this problem (kNN). These groups are intuitive:</p>
<ul>
<li>Both SVMs and NB have often demonstrated their high performance in sparse datasets, and in text classification problems in particular. They both build a linear classifier with weights (or probabilities) for each of the features. Linear classifiers perform well here given that the dataset is built on representations that clearly promote over-fitting the dataset, as we have seen that many of the most predictive n-grams are collection-dependent.</li>
<li>Both PART and J48 (C4.5) are based on reducing error by progressively partitioning the dataset according to tests on the most predictive features. But the predictive features we have for such a small collection are not very good, indeed.</li>
<li>All versions of kNN perform very badly, most likely because the dataset is sparse and relatively small.</li>
</ul>
<p>However, we have to keep in mind that we have used the algorithms with their default configurations. For instance, kNN allows using the <a href="http://en.wikipedia.org/wiki/Cosine_similarity" target="_blank">cosine similarity</a> instead of the <a href="http://en.wikipedia.org/wiki/Euclidean_distance" target="_blank">Euclidean distance</a> -- and the cosine similarity is much better suited to text classification problems, as demonstrated by 50 years of research in <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">Information Retrieval</a>.</p>
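<p>The intuition behind that claim is easy to see in a toy sketch: cosine similarity compares the direction of term-weight vectors, so a short review and a long review with the same word proportions look identical, while Euclidean distance penalizes the length difference (the vectors below are invented for illustration):</p>

```java
public class Similarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two "documents" with the same word proportions but different lengths,
        // and a third one with a disjoint vocabulary.
        double[] shortDoc = {2, 1, 0, 0};
        double[] longDoc  = {8, 4, 0, 0};
        double[] otherDoc = {0, 0, 2, 1};
        // Cosine sees shortDoc and longDoc as identical in direction (1.0)...
        System.out.println(cosine(shortDoc, longDoc));
        // ...while Euclidean distance penalizes the length difference.
        System.out.println(euclidean(shortDoc, longDoc));
        // Disjoint vocabularies yield zero cosine similarity.
        System.out.println(cosine(shortDoc, otherDoc));
    }
}
```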
<p>And regarding dataset representations, the behavior is not uniform -- we do not systematically get better results with one representation in comparison with the others. In general, 1-to-3-grams perform better than the other representations in English, while unigrams are best in Spanish, and trigrams are most often the worst representation for both languages. If we focus on the top-performing classifiers (NB and SVMs), the latter observation always holds. In consequence, trigrams have -- to some extent -- demonstrated their value in English (as a complement to unigrams and bigrams), but not in Spanish (where we know the representation is flawed because of the character-encoding issue).</p>
<p><strong>Concluding Remarks</strong></p>
<p>So all in all, we have a baseline learning-based method for Sentiment Analysis in English (and probably in Spanish, after correcting the representation), which is -- not surprisingly -- based on 1-to-3-grams and Support Vector Machines. And it is a baseline because its performance is relatively poor (with an accuracy of 71%), and we have not taken full advantage of the configuration, text representation and other parameters yet.</p>
<p>After this long (again!) post, I propose the next steps -- some of them left for the reader as an exercise:</p>
<ul>
<li>Build a Java class that classifies text files according their sentiment, for English at least, taking my previous post on Language Identification as an example -- left for the reader.</li>
<li>Test other algorithms, and in particular: play with the SVM configuration, and add Boosting (using <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/AdaBoostM1.html" target="_blank">weka.classifiers.meta.AdaBoostM1</a></code>) to Naive Bayes -- left for the reader.</li>
<li>Check differences of accuracy in terms of product type -- cars, movies, etc. -- left for the reader.</li>
<li>Improve the Spanish language representation using the appropriate regex in the tokenizer to cover Spanish letters and accents -- I will take this one myself.</li>
<li>Check the accuracy of the <a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">basic keyword-based algorithm</a> available in the <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet page</a> -- I will take this one as well.</li>
</ul>
<p>So that is all for the moment. You can expect one or more posts from me on this hot topic. Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com13tag:blogger.com,1999:blog-36589303.post-76692288550493310962013-05-23T18:07:00.001+02:002013-05-23T18:07:30.427+02:00Compilation of Resources for Text-based Age Detection<p><em>Text-based age detection</em> consists of estimating the age of a user according to the kind of texts he/she writes. This task has been attracting some attention in recent years, as for instance it promises to add <em>one of the most interesting demographic features required in ad targeting</em>. There is even an online application, <a href="http://www.tweetgenie.nl/" target="_blank">TweetGenie</a>, which guesses the age of a Twitter user -- it works for Dutch and English.</p>
<p>Text-based age detection is a text classification task closely related to others like genre detection or authorship attribution, as it should rely on stylistic features (e.g. usage of capitalization, average word length, frequencies of prepositions, or even the usage of emoticons) instead of content-bearing words (mostly nouns and verbs), as used e.g. in topical text categorization. However, this does not mean that a purely word-based learner would not be effective.</p>
<p>A particular feature of this task is that <em>it can be approached as classification, if ages are divided into ranges, or as regression</em>, if we try to predict the exact age of the user.</p>
<p>There is a currently ongoing scientific competition on this topic, namely the <a href="http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-web/author-profiling.html" target="_blank">Author Profiling task</a> at the <a href="http://pan.webis.de/" target="_blank">9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013)</a>. With this competition adding new text collections, we have the following resources for trying and testing our approaches to text-based age detection:</p>
<ul>
<li>The <a href="http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-web/author-profiling.html" target="_blank">PAN 2013 Training Corpus for Author Profiling Task</a>, consisting of a big number of posts and chats from three age ranges in Spanish and English.</li>
<li>The <a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">Blog Authorship Corpus</a>, referenced in PAN, consisting of a big number of blog posts from three age ranges in English.</li>
<li>The <a href="http://faculty.nps.edu/cmartell/NPSChat.htm" target="_blank">NPS Chat Corpus</a>, consisting of a relatively small number of chats from five age ranges in English (<a href="http://nltk.org/nltk_data/" target="_blank">download it from the NLTK corpora page</a> or purchase it from the LDC).</li>
</ul>
<p>For your comfort, I summarize some statistics about the collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/agedetection.corpora.statistics.png" style="WIDTH: 476px; DISPLAY: inline; HEIGHT: 204px" height="204" width="476"/></p>
<p>And some notes on the information available in each collection:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/agedetection.corpora.description.png" style="WIDTH: 447px; DISPLAY: inline; HEIGHT: 159px" height="159" width="447"/></p>
<p>The following papers may be of interest in order to avoid repeating others' work.</p>
<ul>
<li>J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). <strong><a href="http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf" target="_blank">Effects of Age and Gender on Blogging</a></strong>, Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.</li>
<li>S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), <strong><a href="http://u.cs.biu.ac.il/~koppel/papers/AuthorshipProfiling-cacm-final.pdf" target="_blank">Automatically profiling the author of an anonymous text</a></strong>, Communications of the ACM 52 (2): 119-123.</li>
<li>M. Koppel, S. Argamon and A. Shimoni (2002), <strong><a href="http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf" target="_blank">Automatically categorizing written texts by author gender</a></strong>, Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.</li>
<li>Jenny K. Tam (2009). <strong><a href="https://www.google.es/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDAQFjAA&url=http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA508858&ei=-_6cUYOdFvTT7Ab_GQ&usg=AFQjCNEw2YM65O_lL2kux4yZvNlwhJXosA&sig2=G3u0NRc-5gOWd1O5FkgeTA" target="_blank">Detecting Age in Online Chat</a></strong>, Master Thesis, Naval Postgraduate School.</li>
<li>Jane Lin (2007). <strong><a href="http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA467087" target="_blank">Automatic Author Profiling of Online Chat Logs</a></strong>, Master Thesis, Naval Postgraduate School.</li>
</ul>
<p>Please feel free to send me a message or comment below if you find any other resource that I should add to this post. Thanks for reading.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-38599562369703264902013-05-22T18:22:00.001+02:002013-05-22T18:22:38.213+02:00Presentación: "Menores y móviles: Usos, riesgos y controles parentales"<p>On April 19th I gave a talk at the Universidad Europea de Madrid, titled "<strong>Menores y móviles: Usos, riesgos y controles parentales</strong>" ("Minors and mobile phones: usage, risks and parental controls"). The talk corresponds to research I have carried out within the project "Protección de usuarios menores de edad de telefonía móvil inteligente" ("Protection of underage users of smart mobile phones"), led by <a href="http://joaquinpe.wordpress.com/" target="_blank">Joaquin Pérez</a> and funded by the <a href="http://www.uem.es/" target="_blank">Universidad Europea de Madrid</a> (P2012 UEM14).</p>
<p>The abstract of the talk <a href="http://www.mavir.net/talks/159-gomezhidalgo-abr2013" target="_blank">is available at the MAVIR network page</a> (<a href="http://www.mavir.net/que-es-mavir" target="_blank">MA2VICMR: Mejorando el Acceso, el Análisis y la Visibilidad de la Información y los Contenidos Multilingüe y Multimedia en Red para la Comunidad de Madrid</a>), and these are the slides I used during the talk:</p>
<p style="TEXT-ALIGN: center"><iframe src="http://www.slideshare.net/slideshow/embed_code/21686368" height="400" width="476" marginwidth="0" marginheight="0" scrolling="no" frameborder="0"/></p>
<p style="TEXT-ALIGN: left">If you are interested in this topic, do not hesitate to ask any question or make any suggestion in the comments of this post.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-45902268791462932022013-05-20T21:28:00.001+02:002013-05-20T21:28:19.615+02:00Language Identification as Text Classification with WEKA<p><a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Language Identification</a>, which consists of guessing the natural language in which a text is written (or an utterance is spoken), is not one of the hardest problems in <a href="http://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a>, and in consequence I believe <em>it is a good starting point for learning about the text analysis capabilities available in WEKA</em>.</p>
<p>This problem has in fact been tackled by others, as in this <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">tutorial on using LingPipe for Language Identification</a>, or by <a href="http://blog.alejandronolla.com/" target="_blank">Alejandro Nolla</a> in his post on <a href="http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/" target="_blank">Detecting Text Language With Python and NLTK</a>. Moreover, you can find a wide range of language identification programs, APIs and demos in the <a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Wikipedia article on Language Identification</a>. We may even consider this function a natural language commodity, as <a href="http://translate.google.com/" target="_blank">Google Translate</a> performs it by default, as shown in the next figure:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/google.translate.langid.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 159px" height="159" width="400"/></p>
<p>The most typical (and rather simple) approach to Language Identification is storing a list of the <em>most frequent character 3-grams</em> of each language and checking the overlap of the target text with each of the lists. Alternatively, you can use stop word lists. Of course, the accuracy depends on how you compute the overlap, but even simple distances make it rather effective.</p>
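<p>That simple overlap scheme fits in a few lines of plain Java. The profiles below are toy ones, built from single sentences and using sets rather than frequency-ranked lists; a real identifier would keep the most frequent 3-grams of large corpora per language:</p>

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrigramGuesser {

    // Builds the set of character 3-grams of a text (lowercased).
    public static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++) grams.add(t.substring(i, i + 3));
        return grams;
    }

    // Guesses the language whose profile overlaps most with the target text.
    public static String guess(String text, Map<String, Set<String>> profiles) {
        Set<String> target = trigrams(text);
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> common = new HashSet<>(target);
            common.retainAll(e.getValue());       // set intersection = overlap
            if (common.size() > bestOverlap) {
                bestOverlap = common.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = new HashMap<>();
        profiles.put("EN", trigrams("the quick brown fox jumps over the lazy dog"));
        profiles.put("SP", trigrams("el rapido zorro marron salta sobre el perro perezoso"));
        System.out.println(guess("the dog jumps over", profiles));
    }
}
```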
<p>However, I will not follow this approach here. Instead, I will show how to build a standard text classifier using <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a>, in order to show the options of (and how to apply) the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code> filter, which is <em>the main tool for text analysis in WEKA</em>.</p>
<p>The steps we have to follow are the next ones:</p>
<ol>
<li>To collect data from different languages in order to build a basic dataset.</li>
<li>To prepare the data for learning, which involves transforming it by using the <code>StringToWordVector</code> filter.</li>
<li>To analyze the resulting dataset, and hopefully, to improve it by using attribute selection.</li>
<li>To test over an independent test collection, which will give us a robust estimation of the accuracy of the approaches on real examples.</li>
<li>To learn the most accurate model as obtained from the previous step, and to use it for our classification program.</li>
</ol>
<p>So this will be a rather long post. Be prepared for it.</p>
<p><strong>Collecting the data and Creating the Datasets</strong></p>
<p>Following the <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">LingPipe Language ID Tutorial</a>, I collect the data from the <a href="http://corpora.uni-leipzig.de/" target="_blank">Leipzig Corpora Home Page</a>. In particular, I will address guessing among English (EN), French (FR) and Spanish (SP), so I have gone to <a href="http://corpora.uni-leipzig.de/download.html" target="_blank">the download page</a>, completed the CAPTCHA to get the list of available corpora, and downloaded:</p>
<ul>
<li>The <a href="http://corpora.uni-leipzig.de/downloads/eng_news_2005_10K-text.tar.gz" target="_blank">2005 English 10k corpus of news in text format</a>.</li>
<li>The <a href="http://corpora.uni-leipzig.de/downloads/fra_news_2009_10K-text.tar.gz" target="_blank">2009 French 10k corpus of news in text format</a>.</li>
<li>The 2001-2002 Spanish 10k corpus of news in text format -- which is no longer there as far as I can see.</li>
</ul>
<p>For your comfort, I have put these corpora <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in my LangID GITHub demo page</a>. The files have the following format:</p>
<blockquote>
<p><code>1 I didn't know it was police housing," officers quoted Tsuchida as saying.
<br/>
2 You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
<br/>
3 He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.</code></p>
</blockquote>
<p>So I have loaded them into an OpenOffice spreadsheet, and replaced the number columns by the corresponding tags for the different languages: <code>EN</code>, <code>FR</code>, and <code>SP</code>. Then I have escaped the <code>"</code> and <code>'</code> characters, because they are string delimiters in WEKA <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">Attribute-Relation File Format</a> (ARFF). In order to build the datasets, I have split the data keeping the first 9K sentences of each language for training, and the remaining 1K for testing. As some learning algorithms may be sensitive to the instance order, I have mixed the instances in batches of 1K texts, so the first 1K sentences are in English, the next 1K sentences are in French, and so on. The training data has the following header:</p>
<blockquote>
<p><code>@relation langid_train
<br/>
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text String
<br/>
<br/>
@data
<br/>
EN,'I didn\'t know it was police housing,\" officers quoted Tsuchida as saying.'
<br/>
EN,'You would be a great client for Southern Indiana Homeownership\'s credit counseling but you are saying to yourself \"Oh, we can pay that off.\"'
<br/>
EN,'He believes the 21st century will be the \"century of biology\" just as the 20th century was the century of IT.'
<br/>
../..</code></p>
</blockquote>
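<p>By the way, the quote escaping I did by hand in the spreadsheet can also be scripted. Here is a tiny helper of my own (not part of WEKA, just an illustration) that turns a raw sentence into an ARFF string value like the ones above:</p>

```java
public class ArffEscaper {

    // Wraps a text in single quotes, escaping the ARFF string delimiters.
    public static String toArffString(String text) {
        return "'" + text.replace("\\", "\\\\")   // escape backslashes first
                         .replace("'", "\\'")
                         .replace("\"", "\\\"") + "'";
    }

    public static void main(String[] args) {
        String sentence =
            "I didn't know it was police housing,\" officers quoted Tsuchida as saying.";
        // Produces an instance line like the first one in the header above.
        System.out.println("EN," + toArffString(sentence));
    }
}
```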
<p>The ARFF files for training and testing are available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GITHub repository for the demo</a> as well. You can open the training file (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/langid.collection.train.arff" target="_blank">langid.collection.train.arff</a></code>) in the WEKA Explorer, and setting the class to be the first attribute, you should be getting something like the following figure:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/explorer.training.langid.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 336px" height="336" width="450"/></p>
<p>So we have a training collection with 9K instances per class (language), and a test collection with 1K instances per class.</p>
<p><strong>Data Transformation</strong></p>
<p>As <a href="http://jmgomezhidalgo.blogspot.com/search/label/WEKA" target="_blank">in previous posts about text classification with WEKA</a>, we need to transform the text strings into term vector to enable learning. This is done by applying the <code>StringToWordVector</code> filter, that is the most remarkable text mining function in WEKA. In previous posts, I have applied this filter with default options, but it offers a wide range of possibilities that can be seen when opening it in the WEKA Explorer. If you click on the <em>Filter</em> button and browse the tree to "<em>weka > filters > unsupervised > attribute > StringToWordVector</em>", and then click on the filter name, you get the next window:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/explorer.stringtowordvector.png" style="WIDTH: 440px; DISPLAY: inline; HEIGHT: 623px" height="623" width="440"/></p>
<p>Those are a lot of options, aren't they? So let us focus on the minimum set of options needed to be productive with this Language Identification example. They are:</p>
<ul>
<li><code>doNotOperateOnPerClassBasis</code> - we set this option to <code>True</code> in order to make the filter collect word tokens over all classes as a whole. This should be the standard setting in nearly all text classification problems.</li>
<li><code>lowerCaseTokens</code> - we set this option to <code>True</code> because we are interested on the words independently of using upper or lower case. In other problems, like e.g. when processing Social Networks text, keeping the capitalization may be critical for getting a good accuracy.</li>
<li><code>tokenizer</code> - WEKA provides several tokenizers, intended to break the original texts into tokens according to a number of rules. The simplest tokenizer is the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code>, which splits the string into tokens using a list of separators that can be set by clicking on the tokenizer name. It is a good idea to take a look at the texts we have before setting up the list of separating characters. In our case, we have several languages, and the default punctuation symbols may not fit our problem -- we need to add opening question and exclamation marks, along with other symbols coming from the HTML format like &, and so on. So our delimiters string will be " \r\n\t.,;:\"\'()?!-¿¡+*&#$%\\/=<>[]_`@" (backslash is escaped).</li>
<li><code>wordsToKeep</code> - we set this option to keep as many words as we can, so as to include the full vocabulary of the dataset. An appropriate value may be one million.</li>
</ul>
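<p>As a side note, the effect of the <code>WordTokenizer</code> with this delimiter set (plus lowercasing) can be sketched in plain Java. This is an illustrative approximation, not WEKA's actual implementation:</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Sketch of WordTokenizer behaviour: split on every character of the
// delimiter set used above, then lowercase (the lowerCaseTokens option).
public class TokenizerSketch {
    static final String DELIMITERS = " \r\n\t.,;:\"'()?!-¿¡+*&#$%\\/=<>[]_`@";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Opening marks (¿ ¡) and HTML leftovers (& < >) act as separators:
        System.out.println(tokenize("Hello, world! ¿Qué tal?"));
    }
}
```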
<p>We leave the rest of the options at their default values. Most notably, we are using neither <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">sophisticated weighting schemes (like TF or TF.IDF)</a>, nor <a href="http://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a> nor <a href="http://en.wikipedia.org/wiki/Stemming" target="_blank">stemming</a>. These options are common in <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">Information Retrieval</a> systems like <a href="http://lucene.apache.org/solr/" target="_blank">Apache Lucene/SOLR</a>, and they often lead to nice accuracy improvements in search systems.</p>
<p>We need to have the same vocabulary both in the training and the testing datasets, so we can apply this filter in the command line by using the batch (<code>-b</code>) option:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_`@\"" -W 10000000 -b -i langid.collection.train.arff -o langid.collection.train.vector.arff -r langid.collection.test.arff -s langid.collection.test.vector.arff</code></p>
</blockquote>
<p>The options <code>-O</code>, <code>-L</code>, <code>-tokenizer</code> and <code>-W</code> correspond to the options above. The delimiter string is escaped because it is included within the tokenizer specification. The resulting files are also <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in the GitHub repository for the LangID example</a>, along with the script <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/stwv.sh" target="_blank">stwv.sh</a></code> (String To Word Vector), which includes this command.</p>
<p><strong>Data Analysis and Improvement</strong></p>
<p>If we take a quick look at the terms or tokens we have got, e.g.:</p>
<blockquote>
<p><code>@attribute archival numeric
<br/>
@attribute archivarlos numeric
<br/>
@attribute archivas numeric
<br/>
@attribute archives numeric
<br/>
@attribute archiving numeric
<br/>
@attribute archivo numeric
<br/>
@attribute archivos numeric</code></p>
</blockquote>
<p>We can imagine that most of them will be useless for Language Identification. This motivates making a more precise analysis of the tokens by using some kind of quality metric, like <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a>. In fact, I am applying the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a></code> filter as I did in my posts on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">selecting attributes by chaining filters</a> and on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">command line functions for text mining</a>. So I issue the following command:</p>
<blockquote>
<p><code>$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -b -i langid.collection.train.vector.arff -o langid.collection.train.vector.ig0.arff -r langid.collection.test.vector.arff -s langid.collection.test.vector.ig0.arff</code></p>
</blockquote>
<p>We apply the filter in batch mode as well, in order to get the same attributes in both the training and the test collections. We also set the first attribute as the class (with the <code>-c</code> option), and set the threshold for keeping attributes to <code>0.0</code> in the <code><a href="http://weka.sourceforge.net/doc.dev/weka/attributeSelection/Ranker.html" target="_blank">weka.attributeSelection.Ranker</a></code> search method. This means that we will keep only those attributes with an Information Gain score above 0, sorted according to their score as well. This command is included in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/asig.sh" target="_blank">asig.sh</a></code> (Attribute Selection by Information Gain) script of <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">the GitHub repository for the LangID example</a>, along with the data files.</p>
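<p>For intuition, the Information Gain score that the ranker uses to order the word attributes can be sketched as follows. The counts in <code>main</code> are hypothetical, for illustration only; this is not WEKA's actual code:</p>

```java
// Information Gain of a binary word attribute W for the class C:
// IG(C, W) = H(C) - H(C|W), computed from co-occurrence counts.
public class InfoGainSketch {
    // Entropy (in bits) of a count distribution.
    static double entropy(double... counts) {
        double total = 0, h = 0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // withWord[i] = docs of class i containing the word,
    // withoutWord[i] = docs of class i not containing it.
    static double infoGain(double[] withWord, double[] withoutWord) {
        int k = withWord.length;
        double[] classTotals = new double[k];
        double nWith = 0, nWithout = 0, n = 0;
        for (int i = 0; i < k; i++) {
            classTotals[i] = withWord[i] + withoutWord[i];
            nWith += withWord[i];
            nWithout += withoutWord[i];
            n += classTotals[i];
        }
        double hCond = (nWith / n) * entropy(withWord)
                     + (nWithout / n) * entropy(withoutWord);
        return entropy(classTotals) - hCond;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the word "the" over classes EN, FR, SP:
        double[] withWord = {8000, 300, 200};
        double[] withoutWord = {1000, 8700, 8800};
        System.out.printf("IG = %.4f%n", infoGain(withWord, withoutWord));
        // A word spread evenly across the classes scores near 0 and is
        // dropped by the Ranker threshold -T 0.0.
    }
}
```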
<p>From the original 65,429 word attributes we got in the previous step, we have kept only 16,840 (25.73% of the original ones). We can be more aggressive by setting the threshold to a higher value (e.g. 0.2).</p>
<p>The first twenty attributes are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/forty.top.ig.terms.langid.png" style="WIDTH: 300px; DISPLAY: inline; HEIGHT: 163px" height="163" width="300"/></p>
<p>As we can see, all of them are very frequent words (in each language) that would be present in the corresponding stop word lists. In consequence, our "pure" data mining approach is quite close to the traditional one based on stop words.</p>
<p>It makes sense to learn a J48 tree to get an idea of the complexity of the term relations. The <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">weka.classifiers.trees.J48</a></code> algorithm implements <a href="http://en.wikipedia.org/wiki/C4.5_algorithm" target="_blank">Quinlan's popular C4.5 learner</a>, and as it outputs a decision tree, it can give us valuable insights into the term relations, e.g. which co-occurring terms are more predictive. We train that classifier on our new training dataset with the following command:</p>
<blockquote>
<p><code>$> java weka.classifiers.trees.J48 -t langid.collection.train.vector.ig0.arff -no-cv</code></p>
</blockquote>
<p>We get a rather complex decision tree, with 273 nodes and 137 leaves. All the tests in the tree look like "<code>word > 0</code>" or "<code>word <= 0</code>". This means that the algorithm induces that only the occurrence of a word matters, not its weight. The root of the tree is, unsurprisingly, a test on "<code>the</code>", and the smaller side of the tree (its right-hand side, with "<code>the > 0</code>") is the following one:</p>
<blockquote>
<p><code>the > 0
<br/>
| de <= 0: EN (5945.0/8.0)
<br/>
| de > 0
<br/>
| | el <= 0
<br/>
| | | and <= 0
<br/>
| | | | for <= 0
<br/>
| | | | | to <= 0: FR (24.0/3.0)
<br/>
| | | | | to > 0: EN (2.0)
<br/>
| | | | for > 0: EN (3.0)
<br/>
| | | and > 0: EN (7.0)
<br/>
| | el > 0: SP (3.0)</code></p>
</blockquote>
<p>This means, for instance, that the word "<code>the</code>" is an excellent predictive feature: if it occurs in a text and the word "<code>de</code>" (frequent in both French and Spanish) does not, then that text is most likely written in English (with an estimated likelihood of 99.86% on the training collection). The overall accuracy of J48 on the training collection is 98.3963%.</p>
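<p>As a quick check of the arithmetic: the J48 leaf annotation <code>(5945.0/8.0)</code> means that 5945 training instances reached that leaf and 8 of them were misclassified, so the quoted likelihood is simply 1 - 8/5945:</p>

```java
// Confidence of a J48 leaf annotated "(reached/misclassified)".
public class LeafConfidence {
    static double confidence(double reached, double misclassified) {
        return 1.0 - misclassified / reached;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%%%n", 100 * confidence(5945.0, 8.0)); // prints 99.8654%
    }
}
```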
<p><strong>Training and then Evaluating on the Test Collection</strong></p>
<p>Before starting to train and evaluate, we have to decide which algorithms are most appropriate for the problem. In my experience with text learning, it is wise to test at least the following ones:</p>
<ul>
<li>The <em>Naive Bayes</em> probabilistic approach, which is quick and gives good results in text learning on average problems. In WEKA, it is implemented by the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">weka.classifiers.bayes.NaiveBayes</a></code> class.</li>
<li>The <em>rule learner PART</em>, which induces a list of rules by learning partial decision trees. It is a symbolic algorithm that produces rules which can be very valuable as they are easy to understand. This algorithm is implemented by the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a></code> class.</li>
<li>Of course, the J48 algorithm because of its visualization capabilities.</li>
<li>The lazy learner <em>k-Nearest Neighbors (kNN)</em>, which occasionally gives excellent results in text classification problems. The WEKA class that implements this algorithm is <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">weka.classifiers.lazy.IBk</a></code>.</li>
<li>The <em>Support Vector Machines</em> algorithm, which is probably the most effective one on text classification problems because of its ability to focus on the most relevant examples in order to separate the classes. It is a very good learning algorithm for sparse datasets, and it is implemented in WEKA by the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a></code> class or by the <a href="http://weka.wikispaces.com/LibSVM" target="_blank">LibSVM</a> library. I choose the Sequential Minimal Optimization (SMO) implementation embedded in WEKA.</li>
</ul>
<p>Also, when Naive Bayes or J48 are effective, I usually get from small to even big accuracy improvements by using <a href="http://en.wikipedia.org/wiki/Boosting_(machine_learning)" target="_blank">boosting</a>, implemented by the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/AdaBoostM1.html" target="_blank">weka.classifiers.meta.AdaBoostM1</a></code> class in WEKA. Boosting takes a weak classifier as input, and builds a classifier committee by iteratively training that weak learner on the dataset subsets on which the previous learners are not effective. In this case, I will not apply boosting, because the weak learners already reach rather high levels of accuracy, and it is most likely that boosting would achieve only a marginal improvement (if any) at the cost of a much longer training time.</p>
<p>I have written a script named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test.sh" target="_blank">test.sh</a></code> to execute all these algorithms with default options, available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID demo</a>. The results obtained by the algorithms are included in the repository as well, and summarized in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/results.test.langid.png" style="WIDTH: 230px; DISPLAY: inline; HEIGHT: 136px" height="136" width="230"/></p>
<p>The different versions of the lazy kNN algorithm tested here appear to be very weak. We could probably improve their performance by changing the way the distance between examples is computed (from the Euclidean distance to one more appropriate for text, such as the cosine similarity), but their performance is so low that they would still not score better than the rest of the algorithms.</p>
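<p>For reference, the cosine similarity just mentioned can be sketched over sparse term vectors (maps from attribute index to weight). This is an illustrative sketch, not WEKA code:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Cosine similarity between two sparse term vectors:
// dot(a, b) / (|a| * |b|), iterating only over non-zero entries.
public class CosineSimilarity {
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two binary-weighted documents sharing 2 of their 3 terms:
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(58, 1.0); doc1.put(94, 1.0); doc1.put(313, 1.0);
        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(58, 1.0); doc2.put(94, 1.0); doc2.put(2644, 1.0);
        System.out.println(cosine(doc1, doc2));
    }
}
```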
<p>The top algorithms in this test are <em>Naive Bayes</em> and <em>Support Vector Machines</em>. There is a trade-off between the two: SVMs are more effective (in fact, they are very effective) but take quite a lot of time to train, while Naive Bayes is less effective but quicker to train. In terms of classification time, both algorithms are linear in the number of attributes.</p>
<p>Even though we have used a large number of attributes, there are some examples with rather weak representations. For instance, let us check the following instances or texts:</p>
<blockquote>
<p><code>{58 1,94 1,313 1,1663 1}
<br/>
{119 1,361 1,2644 1,16840 FR}
<br/>
{2 1,16840 SP}</code></p>
</blockquote>
<p>The first and second examples have only 3 occurring words each (the class value of the first text is <code>EN</code>, since in the sparse ARFF format used by WEKA in this example an omitted value defaults to the first one), and the third example has only one word ("<code>el</code>"). In the first two examples, the attribute numbers (58 and above) indicate that these attributes are not among the most informative ones, while in the third example we find a very informative word. If we apply a more aggressive selection using Information Gain, many examples will be left with null representations, thus falling to the most likely class. As the classes have a balanced distribution, the language chosen in that case will be <code>EN</code>, which is the default value for the class attribute.</p>
<p><strong>Learning the Best Classifier and Using it Programmatically</strong></p>
<p>So after our experiments, we know that the best classifier in our tests is the SVM. It is time to train it and store the classifier into a file for further programmatic use. For this purpose, I have written a script that trains the classifier and stores the model into a file, using the following command-line call:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -t langid.collection.train.arff -c first -no-cv -d smo.model.dat -v -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\r\\\\\\\n\\\\\\\t.,;:\\\\\\\\\\\\\\\"'()?!-¿¡+*&#$%/=<>[]_`@\\\\\\\"\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO</code></p>
</blockquote>
<p>This call is rather painful because of the nested, and nested, and nested, and nested quotes. So I have pretty-printed it in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/learn.sh" target="_blank">learn.sh</a></code> script at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID example</a>. For dealing with nested quotes, follow the advice in <a href="http://en.wikipedia.org/wiki/Nested_quotation" target="_blank">the Wikipedia article about nested quotation</a>.</p>
<p>With this call, we have stored a model in the file <code>smo.model.dat</code>, which chains the <code>StringToWordVector</code> filter, the <code>AttributeSelection</code> filter, and an <code>SMO</code> classifier by using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">weka.classifiers.meta.FilteredClassifier</a></code> and the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">weka.filters.MultiFilter</a></code> classes, as I have explained in the post on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">Command Line Functions for Text Mining in WEKA</a>.</p>
<p>One good point of WEKA is that we can learn a model in the command line and use it in a program. I have modified the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java" target="_blank">MyFilteredClassifier.java</a></code> program I used in my post describing <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html" target="_blank">A Simple Text Classifier in Java with WEKA</a>, and I have committed it to the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository</a> with the name <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/LanguageIdentifier.java" target="_blank">LanguageIdentifier.java</a></code>. I have created three sample test files as well: <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_en.txt" target="_blank">test_en.txt</a></code>, <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_fr.txt" target="_blank">test_fr.txt</a></code> and <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_sp.txt" target="_blank">test_sp.txt</a></code>. The operation of the program is as follows:</p>
<blockquote>
<p><code>$> javac LanguageIdentifier.java
<br/>
<br/>
$> java LanguageIdentifier
<br/>
Usage: java LanguageIdentifier <fileData> <fileModel>
<br/>
$> java LanguageIdentifier test_en.txt smo.model.dat
<br/>
===== Loaded text data: test_en.txt =====
<br/>
This is a sample test for the language identifier demo.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' This is a sample test for the language identifier demo.'
<br/>
===== Classified instance =====
<br/>
Class predicted: EN
<br/>
<br/>
$> java LanguageIdentifier test_fr.txt smo.model.dat
<br/>
===== Loaded text data: test_fr.txt =====
<br/>
Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.'
<br/>
===== Classified instance =====
<br/>
Class predicted: FR
<br/>
<br/>
$> java LanguageIdentifier test_sp.txt smo.model.dat
<br/>
===== Loaded text data: test_sp.txt =====
<br/>
Esto es un texto de prueba para la demostración del identificador de idioma.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' Esto es un texto de prueba para la demostración del identificador de idioma.'
<br/>
===== Classified instance =====
<br/>
Class predicted: SP</code></p>
</blockquote>
<p>So the program is correct on the three examples. Remember that you have to learn the model before using the program. As a side note, since the program only uses a <code>FilteredClassifier</code> object, you can change the script to accommodate a different algorithm. For instance, if you just replace the text "<code>weka.classifiers.functions.SMO</code>" with "<code>weka.classifiers.bayes.NaiveBayes</code>" in the <code>learn.sh</code> script, the program will work the same way -- but with a different model.</p>
<p><strong>Concluding Remarks</strong></p>
<p>While relatively simple, the Language Identification problem helps to identify the essential tasks we have to perform when building text classifiers with WEKA. It is a complete example in the sense that we have not only collected the dataset and learnt from it, but we have also dug a bit into the most suitable representation by playing with attribute selection and a tentative classifier to visualize the data. It also demonstrates some basic configurations of the <code>StringToWordVector</code> filter, which is the most remarkable tool in WEKA for text mining.</p>
<p>If you have had the time to read all of this post, and even tried the program: thank you! I hope it has been a valuable time investment. I am tempted to suggest that you modify the dataset to include more languages, as the problem I have addressed here is relatively simple -- only three, quite different languages.</p>
<p>Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com7tag:blogger.com,1999:blog-36589303.post-16596372708858056612013-05-02T01:41:00.001+02:002013-05-02T09:42:18.440+02:00Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers<p>There are several ways of evaluating a (text) classifier: <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">cross validation</a>, splitting your dataset into train and test subsets, or even evaluating the classifier on the training set itself (not recommended). I will not discuss the merits of each method; instead, I will focus on a train/test split evaluation.</p>
<p>When you start to work with your train and test text datasets, you have got two labelled text collections like e.g. those I make available at <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub project</a>: <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.train.arff" target="_blank"><code>smsspam.small.train.arff</code></a> and <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.test.arff" target="_blank"><code>smsspam.small.test.arff</code></a>. In this case, we have two collections that are a 50% split of my original simple collection <a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank"><code>smsspam.small.arff</code></a>, which in turn is a subset of the original <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>. The files are formatted according to the <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF</a> format:</p>
<blockquote>
<p><code>@relation sms_test
<br/>
<br/>
@attribute spamclass {spam,ham}
<br/>
@attribute text String
<br/>
<br/>
@data
<br/>
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
<br/>
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
<br/>
...</code></p>
</blockquote>
<p>That is, one text instance per line, with the first attribute being the nominal class spam/ham and the second attribute being the text itself.</p>
<p>In text classification, you have to transform this original representation into a vector of terms/words/stems/etc. in order to allow the classifier to learn expressions like: "if the word "win" occurs in a text, then classify it as spam". In other words, you have to represent your texts as feature vectors, where the features are words and the values are e.g. binary weights, <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">TF weights, or TF.IDF weights</a>. In fact, WEKA provides the handy <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank"><code>StringToWordVector</code></a> filter for this purpose (Thanks, WEKA!).</p>
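<p>For intuition, the three weighting schemes just mentioned (binary, TF, TF.IDF) can be sketched as follows. Note that this is one common TF.IDF variant for illustration; the exact transforms applied by <code>StringToWordVector</code> may differ:</p>

```java
// Term weighting sketches for a term t in a document d:
// binary presence, raw term frequency, and a common TF.IDF variant.
public class TermWeights {
    static double binary(int tf) { return tf > 0 ? 1.0 : 0.0; }

    static double tf(int tf) { return tf; }

    // n = total number of documents, df = documents containing the term.
    // Rare terms (small df) get boosted; ubiquitous terms go to 0.
    static double tfIdf(int tf, int n, int df) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document, present in 10 of 1000 docs:
        System.out.println(binary(3));
        System.out.println(tf(3));
        System.out.println(tfIdf(3, 1000, 10));
    }
}
```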
<p>However, it is most likely that the vocabularies used in your training set and in your test set are not identical. For instance, if you directly apply the <code>StringToWordVector</code> filter to the previous files, you get slightly different results, summarized in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/table.train.test.attributes.png" style="DISPLAY: inline" height="185" width="273"/></p>
<p>Obviously, to enable learning you have to ensure that the representation of both datasets is the same. For instance, imagine that the root of the decision tree you have learnt on your training collection poses a test on an attribute that does not exist in your test collection -- then what happens?</p>
<p>Fortunately, WEKA provides at least three ways of getting the same vocabulary in your train and test subcollections. Here they are:</p>
<ol>
<li>Using a <strong>batch filter</strong> that takes both the training and the test collections at the same time, using the first to derive the attributes and representing the second with those attributes.</li>
<li>Using a <strong><code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></code></strong> (which I have discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in previous posts</a>), which feeds both the filter and the classifier into a single classifier that takes the original class/text representation as input for both the training and the test sets.</li>
<li>A more recent method: separately computing the representations and using an <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/misc/InputMappedClassifier.html" target="_blank"><strong>InputMappedClassifier</strong></a></code>, which acts as a wrapper around an underlying classifier and tries to map attributes from the training collection to the corresponding ones in the test subset.</li>
</ol>
<p>The first method is quite simple, and it just makes use of the <code>-b</code> option of the WEKA filters. The corresponding command line calls are the following:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -b -i smsspam.small.train.arff -o smsspam.small.train.vector.arff -r smsspam.small.test.arff -s smsspam.small.test.vector.arff
<br/>
$> java weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
1 15 | a = spam
<br/>
0 84 | b = ham</code></p>
</blockquote>
<p>The second method, conveniently discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in my previous post</a>, can be applied with the following call:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.train.arff -T smsspam.small.test.arff -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.lazy.IBk -i -c first
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
1 15 | a = spam
<br/>
0 84 | b = ham</code></p>
</blockquote>
<p>As shown in the previous results, both methods achieve the same figures. In this case, I have opted for using <code>StringToWordVector</code> without parameters (default tokenization, term weights, no stemming, etc.) with the relatively weak classifier <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">IBk</a></code>, which implements a k-Nearest-Neighbor learner: instead of building a model from the training collection, it searches for the training instance closest to the test instance (<code>k</code> is 1 by default) and assigns its class to the test instance.</p>
<p>However, the third method achieves different results, as the mapping involves some attributes from the training collection disappearing, and new attributes in the test collection being ignored. It is called the following way:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.train.arff -o smsspam.small.train.vector.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.test.arff -o smsspam.small.test.vector.arff
<br/>
$> java weka.classifiers.misc.InputMappedClassifier -W weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
<br/>
Attribute mappings:
<br/>
Model attributes Incoming attributes
<br/>
------------------------------ ----------------
<br/>
(nominal) spamclass --> 1 (nominal) spamclass
<br/>
(numeric) #&gt --> 2 (numeric) #&gt
<br/>
(numeric) $1 --> - missing (no match)
<br/>
(numeric) &amp --> - missing (no match)
<br/>
(numeric) &lt --> 6 (numeric) &lt
<br/>
(numeric) *9 --> 7 (numeric) *9
<br/>
(numeric) + --> - missing (no match)
<br/>
(numeric) - --> 8 (numeric) -
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
2 14 | a = spam
<br/>
1 83 | b = ham</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">In fact, this time we catch a bit more spam (2 hits instead of 1), at the cost of a false positive, although the overall accuracy is exactly the same: 85%. You can see how some of the attributes are missing (they do not occur in the test dataset), like "<code>$1</code>", "<code>+</code>", etc. This certainly affects the performance of the classifier, so beware.</p>
<p>With these options, my recommendation is to use the first method, as it allows you to fully examine the representation of the datasets (term weight vectors) and it decouples filtering from training, which may be convenient in terms of efficiency.</p>
<p>Before ending this post, I have to thank Tiago Pasqualini Silva, <a href="http://www.dt.fee.unicamp.br/~tiago/index.html" target="_blank">Tiago Almeida</a> and <a href="http://paginaspersonales.deusto.es/isantos/en/about.shtml" target="_blank">Igor Santos</a> for our experiments with the SMS Spam Collection, and to Tiago Pasqualini in particular because he showed me the <code>InputMappedClassifier</code>.</p>
<p>And last but not least, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com5tag:blogger.com,1999:blog-36589303.post-72289290153873052312013-04-26T00:46:00.001+02:002013-05-24T12:42:22.742+02:00URL Text Classification with WEKA, Part 1: Data Analysis<p>I have recently came across a website named <a href="http://squidblacklist.org/" target="_blank">SquidBlackList.org</a>, which features a number or URL lists for safe web browsing using the open source proxy <a href="http://www.squid-cache.org/" target="_blank">Squid</a>. In particular, it features a <a href="http://squidblacklist.org/downloads.html" target="_blank">quite big porn domains list</a>, so I wondered: <strong>Is it possible to make a text classification system with <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> to detect porn domains using the text in the URLs?</strong></p>
<p>Just to note that the SquidBlackList porn list (and most of the rest of the lists they provide) is licensed under a Creative Commons Attribution 3.0 Unported License: <span>Blacklists</span> (<a href="http://www.squidblacklist.org" rel="cc:attributionURL">Squidblacklist.org</a>) / <a href="http://creativecommons.org/licenses/by/3.0/" rel="license">CC BY 3.0</a></p>
<p><big><strong>The Filtering Problem</strong></big></p>
<p>Most <a href="http://en.wikipedia.org/wiki/Content-control_software" target="_blank">web filtering systems</a> work by using a database of URLs manually classified into a list of categories, which are used to define filtering profiles (e.g. block <em>porn</em> but allow <em>press</em>). The URL database must be manually maintained, and it has to be quite comprehensive with respect to user browsing behaviour. As (aggregated) web browsing follows a <a href="http://en.wikipedia.org/wiki/Zipf's_law" target="_blank">Zipfian distribution</a> (that is, relatively few URLs accumulate most of the traffic), you can provide a rather effective service by ensuring that your URL database covers the most popular URLs. URL-based filtering is rather efficient (if your database is well implemented), and it can easily cover around 95% of the web traffic (in terms of #requests, not in terms of #URLs).</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/zipf.distribution.url.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 342px" height="342" width="450"/></p>
<p>However, covering the remaining 5% requires performing some kind of analysis. My target here is dynamically classifying that 5% of web requests (which may account for millions of URLs or even just domains) into two classes: <em>notporn</em> and <em>porn</em>. This way, we can cover 100% of the traffic, and it is likely that our classification mistakes (which can occur in the URL database as well) are concentrated in that small 5% -- so our filter can be 98% effective or more.</p>
<p>Why analyze the URL text? As a matter of <strong>efficiency</strong> -- you do not have to go out to the Internet and fetch the actual <em>Web</em> content in order to analyze it, so all the processing is local to the proxy, and you avoid performing unnecessary Web requests from the proxy itself.</p>
<p><big><strong>Collecting the Dataset</strong></big></p>
<p>So we start with an 880k porn domain list, but although it is possible to learn only from positive examples, we may expect better effectiveness if we collect negative examples (not-porn domains) as well. A handy resource is the <a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" target="_blank">Top 1M Sites</a> list by <a href="http://www.alexa.com/" target="_blank">Alexa</a>, a Web research company that provides this ranked list on a daily basis. Having 1M negative examples and 880k positive examples makes for a good class balance and a well-populated dataset -- nice for learning, especially when its instances are relatively short text sequences (e.g. <code>google.com</code> vs. <code>porn.com</code>).</p>
<p>First we have to make both lists comparable. The format of the Alexa list is <code><rank>,<domain></code>, while the format of the Squid black list is <code><dot><domain></code> (in order to match the Squid URL list format). A couple of <code>cut</code> and <code>sed</code> commands will do the trick.</p>
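<p>As a sketch, those commands could look like the following (the input file names and sample lines are invented for illustration; the real files are the full downloads mentioned above):</p>

```shell
# Toy versions of the two downloads (invented lines; the real files are much bigger).
printf '1,google.com\n2,youtube.com\n' > alexa-top1m.csv   # Alexa: <rank>,<domain>
printf '.pornhub.com\n.youporn.com\n' > squid-porn.acl     # Squid: <dot><domain>

# Keep only the domain field from the Alexa list.
cut -d ',' -f 2 alexa-top1m.csv > alexa.csv

# Strip the leading dot from the Squid blacklist entries.
sed 's/^\.//' squid-porn.acl > porn.csv
```

<p>After this step, both files contain one bare domain per line and can be compared directly.</p>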
<p>Then we can just add the class and mix the lists.</p>
<p><big><strong>Cleaning the Dataset, first step</strong></big></p>
<p>But... <em>Hey, Internet is for porn!</em> -- we should expect that some of the URLs in the Alexa ranking are pornographic. In fact, a simple search demonstrates it:</p>
<blockquote>
<p><code><code>$ grep porn alexa.csv | more
<br/>
pornhub.com
<br/>
youporn.com
<br/>
...
<br/>
$ grep porn alexa.csv | wc -l
<br/>
5719</code></code></p>
</blockquote>
<p>We can just subtract the porn list from the Alexa list with a handy grep:</p>
<blockquote>
<p><code><code>grep -f porn.csv -v alexa.csv > alexaclean.csv</code></code></p>
</blockquote>
<p>But it takes a loooooong time, so I prefer to sort the Alexa list, convert it to Unix format (as the original one has DOS line endings), and use <code>comm</code>:</p>
<blockquote>
<p><code><code>$ sort alexa.csv > alexasorted.csv
<br/>
$ fromdos alexasorted.csv
<br/>
$ comm -23 alexasorted.csv porn.csv > alexaclean.csv
<br/>
$ wc -l alexaclean.csv
<br/>
975088 alexaclean.csv</code></code></p>
</blockquote>
<p>So far so good: only about 25k URLs were pornographic... Well, let's check:</p>
<blockquote>
<p><code><code>$ grep porn alexaclean.csv | head
<br/>
001porno.com
<br/>
0dayporn.org
<br/>
1000porno.net
<br/>
...</code></code></p>
</blockquote>
<p>So we still have some porn in there.</p>
<p><big><strong>Cleaning the Dataset, second step</strong></big></p>
<p>Cleaning the Alexa list of porn is a bit more complex. How can we find those popular porn sites if they are not even in a list as comprehensive as the Squidblacklist one? Another resource comes to the rescue: the <a href="http://www.pornmd.com/" target="_blank">sex-related search engine PornMD</a>. This engine has recently published a list of popular porn searches in the form of a dynamic infographic named <a href="http://www.pornmd.com/sex-search" target="_blank">Global Internet Porn Habits</a>:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/infography.pornmd.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 319px" height="319" width="450"/></p>
<p>So, if you collect a list of the top searches in five of the biggest countries, you get:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/top.searches.porn.png" style="DISPLAY: inline" height="178" width="404"/></p>
<p>Removing duplicate words from the list, adding "porn", "sex" and "xxx" (as a rule of thumb), and computing the number of domains in which they occur in the (cleaned) Alexa and the Squidblacklist lists, we get:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/top.searches.distribution.png" style="DISPLAY: inline" height="338" width="227"/></p>
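<p>As an illustration of how per-keyword counts like the ones in the table above can be computed, here is a minimal sketch with toy files (the domains below are invented; a real run would use the full lists and the full keyword set):</p>

```shell
# Toy domain lists, one domain per line.
printf 'teenagemutant.com\nmilfordhotels.com\nnews.com\n' > alexaclean.csv
printf 'teensex.com\nmilfporn.com\nfreeporn.com\n' > porn.csv

# For each candidate keyword, count the domains it occurs in on each list.
for w in teen milf porn; do
  printf '%s %s %s\n' "$w" "$(grep -c "$w" porn.csv)" "$(grep -c "$w" alexaclean.csv)"
done
```

<p>Note that even this toy example shows the ambiguity problem: "teen" and "milf" hit innocent domains too, which is exactly why a ratio between the two counts is needed.</p>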
<p>Looking at the list, a relatively safe ratio between the number of occurrences in Squid's versus Alexa's (clean) list is 9 -- this way, we keep the most obvious words and remove the most ambiguous ones (although there are some borderline examples, such as "asian"). We can see the effects:</p>
<blockquote>
<p><code><code>$ grep "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv | wc -l
<br/>
17389
<br/>
$ grep "porn\|sex\|xxx" alexaclean.csv | wc -l
<br/>
12342
<br/>
$ grep -v "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv > alexacleanfinal.csv
<br/>
$ wc -l alexacleanfinal.csv
<br/>
964735 alexacleanfinal.csv</code></code></p>
</blockquote>
<p>You can see that just "porn", "sex" and "xxx" account for 70.97% of the matching domains, so there is some <strong>domain knowledge</strong> in the process. I must note that I could have used another, much more extensive list of porn-related searches, like the one featured on the <a href="http://www.pornmd.com/most-popular" target="_blank">PornMD Most Popular page</a>.</p>
<p><big><strong>Additional Analysis</strong></big></p>
<p>To get a feeling of how the previous porn-related keywords are distributed across the original Alexa ranking, I have computed the number of lines (domains) in which they occur per 100k-line interval, to get the following chart:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/distribution.keywords.intervals.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 287px" height="287" width="450"/></p>
<p>Where <code>#query1</code> represents the number of occurrences of "porn\|sex\|xxx" and <code>#query2</code> represents the full list of keywords. The growth is nearly linear, with an average of 1234.2 URLs per interval in <code>#query1</code>, and 1738.9 URLs per interval in <code>#query2</code>. The curves are smooth, and there are more domains in the first intervals (e.g. 1482 hits in the first 100k Alexa URLs for <code>#query1</code>) than in the last ones (e.g. 1077 hits in the last 100k Alexa URLs for <code>#query1</code>).</p>
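<p>One possible way of computing those per-interval counts is to split the ranking into fixed-size chunks and count hits in each one. Here is a toy sketch (with a 2-line interval instead of 100k, and invented domains):</p>

```shell
# Toy ranking; the real computation splits the Alexa list into 100000-line chunks.
printf 'a.com\nporn.com\nb.com\nsexsite.com\nc.com\nd.com\n' > ranking.csv
split -l 2 ranking.csv interval.

# Count keyword hits in each interval, in rank order.
# grep -c exits with status 1 when the count is 0, hence the || true guard.
for f in interval.*; do
  grep -c 'porn\|sex\|xxx' "$f" || true
done
```

<p>Plotting the cumulative sums of those counts gives a chart like the one above.</p>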
<p>There are other dataset statistics that may provide better insights regarding the classification problem, or in other words, that may be more informative or predictive in terms of classification accuracy. For instance:</p>
<ul>
<li>What is the length of an average domain name in each category?</li>
<li>How many dots and/or dashes do domains have on average per category?</li>
<li>What is the distribution of different TLDs (<a href="http://en.wikipedia.org/wiki/Top-level_domain" target="_blank">Top Level Domains</a>) across both categories?</li>
</ul>
<p>Can you imagine any other interesting statistics?</p>
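<p>As a hint, statistics like the ones above can be sketched with a few <code>awk</code> one-liners over each category's list (the sample domains below are invented; run the same commands on the real files to get the actual figures):</p>

```shell
# Toy list of domains for one category, one per line.
printf '0000free.com\nexample.co.uk\nporn.com\n' > domains.csv

# Average domain-name length for the category.
awk '{ total += length($0) } END { printf "%.2f\n", total / NR }' domains.csv

# Average number of dots per domain (gsub returns the number of matches).
awk '{ total += gsub(/\./, ".") } END { printf "%.2f\n", total / NR }' domains.csv

# Distribution of TLDs, most frequent first.
awk -F '.' '{ print $NF }' domains.csv | sort | uniq -c | sort -rn
```

<p>Running each one-liner on both the safe and the porn list and comparing the outputs gives a first feeling of how discriminative each statistic might be.</p>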
<p><big><strong>The Dataset</strong></big></p>
<p>Once we have got the original Squidblacklist and the cleaned Alexa list (after the subtraction and the removal of keyword-matching lines), we add some formatting to get a <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">WEKA ARFF</a> file. For instance, <code>0000free.com</code> must be transformed into <code>'0000free.com',safe</code>. A bit of <code>sed</code> trickery does the job, and then we mix the lists with the following command:</p>
<blockquote>
<p><code><code>$ paste -d '\n' alexacleanfinal.csv porn.csv > urllist.csv</code></code></p>
</blockquote>
<p>The rationale behind mixing the lists is that some learning algorithms are dependent on the order of examples, and for those algorithms it is wise not to present all the examples of one class first, and then all of the other class's. As the <code>paste</code> command adds blank lines when one of the lists finishes, we have to remove double newlines (<code>\n\n</code>) with another <code>sed</code> call, and we finally add the ARFF header to get a file starting the following way:</p>
<blockquote>
<p><code><code>@relation URLs
<br/>
<br/>
@attribute urltext String
<br/>
@attribute class {safe,porn}
<br/>
<br/>
@data
<br/>
'0000free.com',safe
<br/>
'0000000000000000000sex.com',porn
<br/>
'0000.jp',safe
<br/>
'000000000gratisporno.ontheweb.nl',porn
<br/>
...</code></code></p>
</blockquote>
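<p>The <code>sed</code> trickery mentioned above could look like the following sketch (the file contents are toy samples and the intermediate file names are my own invention; only the quoting and the class labels follow the format shown above):</p>

```shell
# Toy cleaned lists (stand-ins for the outputs of the previous steps).
printf '0000free.com\n0000.jp\n' > alexacleanfinal.csv
printf '0000000000000000000sex.com\n' > pornfinal.csv

# Quote each domain and append its class label (& is the whole matched line).
sed "s/.*/'&',safe/" alexacleanfinal.csv > safe.csv
sed "s/.*/'&',porn/" pornfinal.csv > pornlabelled.csv

# Interleave the lists; paste emits blank lines once the shorter file ends,
# so a final sed call removes them before prepending the ARFF header.
paste -d '\n' safe.csv pornlabelled.csv | sed '/^$/d' > urllist.csv
```

<p>Prepending the <code>@relation</code>/<code>@attribute</code>/<code>@data</code> header to <code>urllist.csv</code> then yields the ARFF file shown above.</p>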
<p>I have left that file, named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/URLAnalysis/urllist.arff" target="_blank">urllist.arff</a></code>, in <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub folder</a> for your convenience, so you can start playing with it. Beware, it is over 40 MB.</p>
<p>So that is all for the moment. Stay tuned for my next steps if you liked this post.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-50660941768136873872013-04-08T09:31:00.001+02:002013-06-28T06:58:56.924+02:00A Simple Text Classifier in Java with WEKA<p>In previous posts [<a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">1</a>, <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">2</a>, <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">3</a>], I have shown how to make use of the <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> classes <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a></code> in order to properly build and evaluate a text classifier using WEKA. For this purpose, I have made use of the <a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer</a> GUI provided by WEKA, and its <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">command-line interface</a>.</p>
<p>In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feeling of the amazing power of this data mining library. However, where you can take full advantage of its power is in your own Java programs. Now it is time to deal with that.</p>
<p>Following <a href="http://dl.acm.org/citation.cfm?id=1095427" target="_blank">Salton</a>, and <a href="http://dl.acm.org/citation.cfm?id=138861" target="_blank">Belkin and Croft</a>, the process of text classification involves two main steps:</p>
<ul>
<li>Representing your text database in order to enable learning, and to train a classifier on it.</li>
<li>Using the classifier to predict text labels of new, unseen documents.</li>
</ul>
<p>The first step is a batch process, in the sense that you can do it periodically (as long as your labelled data set gets improved over time -- bigger sizes, new labels or categories, corrected predictions via user feedback). The second step is actually the moment in which you take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual; modern text classifiers may retrain on the added documents as soon as they get them, in order to keep or improve accuracy over time.</p>
<p>In consequence, what we need to demonstrate the text classification process is <strong>two programs</strong>: one to <strong>learn</strong> from the text dataset, and another to use the learnt model to <strong>classify</strong> new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredLearner.java" target="_blank">MyFilteredLearner.java</a></code>, and its <code>main()</code> method demonstrates its usage, which involves:</p>
<ol>
<li>Loading the text dataset.</li>
<li>Evaluating the classifier.</li>
<li>Training the classifier.</li>
<li>Storing the classifier.</li>
</ol>
<p>The most interesting parts of the process are:</p>
<ul>
<li>We read the dataset by simply using the method <code>getData()</code> of an <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/ArffLoader.ArffReader.html" target="_blank">ArffReader</a></code> object that wraps a <code>BufferedReader</code>.</li>
<li>We programmatically create the classifier by combining a <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code> filter (in order to represent the texts as feature vectors) and a <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">NaiveBayes</a></code> classifier (for learning), using the <code>FilteredClassifier</code> class discussed in previous posts.</li>
</ul>
<p>The process of creating the classifier is demonstrated in the next code snippet:</p>
<blockquote>
<p><code>trainData.setClassIndex(0);
<br/>
filter = new StringToWordVector();
<br/>
filter.setAttributeIndices("last");
<br/>
classifier = new FilteredClassifier();
<br/>
classifier.setFilter(filter);
<br/>
classifier.setClassifier(new NaiveBayes());</code></p>
</blockquote>
<p>So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the <code>FilteredClassifier</code> object and add the previous filter and a new <code>NaiveBayes</code> classifier to it. Given the setup above, the dataset has to have the class as the first attribute, and the text as the second (and last) one, as in my typical SMS spam subset example (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></code>).</p>
<p>You can execute this class with the following commands to get the following output:</p>
<blockquote>
<p><code><code>$>javac MyFilteredLearner.java
<br/>
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
<br/>
===== Loaded dataset: smsspam.small.arff =====
<br/>
<br/>
Correctly Classified Instances 187 93.5 %
<br/>
Incorrectly Classified Instances 13 6.5 %
<br/>
Kappa statistic 0.7277
<br/>
Mean absolute error 0.0721
<br/>
Root mean squared error 0.2568
<br/>
Relative absolute error 25.8792 %
<br/>
Root relative squared error 69.1763 %
<br/>
Coverage of cases (0.95 level) 94 %
<br/>
Mean rel. region size (0.95 level) 51.75 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Detailed Accuracy By Class ===
<br/>
<br/>
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
<br/>
0,636 0,006 0,955 0,636 0,764 0,748 0,943 0,858 spam
<br/>
0,994 0,364 0,933 0,994 0,962 0,748 0,943 0,986 ham
<br/>
Weighted Avg. 0,935 0,305 0,936 0,935 0,930 0,748 0,943 0,965
<br/>
===== Evaluating on filtered (training) dataset done =====
<br/>
===== Training on filtered (training) dataset done =====
<br/>
===== Saved model: myClassifier.dat =====</code></code></p>
</blockquote>
<p>The evaluation has been performed with default values, except for the number of folds, which has been set to 4, as shown in the next code snippet:</p>
<blockquote>
<p><code><code>Evaluation eval = new Evaluation(trainData);
<br/>
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
<br/>
System.out.println(eval.toSummaryString());</code></code></p>
</blockquote>
<p>In case you do not want to evaluate the classifier on the training data, you can omit the call to the <code>evaluate()</code> method.</p>
<p>Now let us deal with the classification program, which is more complex, but only in the process of creating an instance. The class is named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java" target="_blank">MyFilteredClassifier.java</a></code>, and its <code>main()</code> method demonstrates its usage, which involves:</p>
<ol>
<li>Reading the text to be classified from a file.</li>
<li>Reading the model or classifier from a file.</li>
<li>Creating the instance.</li>
<li>Classifying it.</li>
</ol>
<p>Creating the instance is performed in the <code>makeInstance()</code> method, and its code is the following one:</p>
<blockquote>
<p><code><code>// Create the attributes, class and text
<br/>
FastVector fvNominalVal = new FastVector(2);
<br/>
fvNominalVal.addElement("spam");
<br/>
fvNominalVal.addElement("ham");
<br/>
Attribute attribute1 = new Attribute("class", fvNominalVal);
<br/>
Attribute attribute2 = new Attribute("text",(FastVector) null);
<br/>
// Create list of instances with one element
<br/>
FastVector fvWekaAttributes = new FastVector(2);
<br/>
fvWekaAttributes.addElement(attribute1);
<br/>
fvWekaAttributes.addElement(attribute2);
<br/>
instances = new Instances("Test relation", fvWekaAttributes, 1);
<br/>
// Set class index
<br/>
instances.setClassIndex(0);
<br/>
// Create and add the instance
<br/>
DenseInstance instance = new DenseInstance(2);
<br/>
instance.setValue(attribute2, text);
<br/>
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
<br/>
instances.add(instance);</code></code></p>
</blockquote>
<p>The classifier learnt with <code>MyFilteredLearner.java</code> expects that an instance has two attributes: the first one is the class, a nominal attribute with values <code>"spam"</code> or <code>"ham"</code>; the second one is a <code>String</code>, which is the text to be classified. Instead of creating one instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/Instances.html" target="_blank">Instances</a></code> object (and not in each instance).</p>
<p>So first we create the attributes by using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/FastVector.html" target="_blank">FastVector</a></code> class provided by WEKA. The case of the nominal attribute (<code>"class"</code>) is relatively simple, but the case of the <code>String</code> one is a bit more complex because it requires the second argument of the constructor to be <code>null</code>, but cast to <code>FastVector</code>. Then we create an <code>Instances</code> object by using a <code>FastVector</code> to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the <code>FastVector</code> class is deprecated in the WEKA development version.</p>
<p>The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/DenseInstance.html" target="_blank">DenseInstance</a></code> object. However, if you make use of the stable version, then you can use <code><a href="http://weka.sourceforge.net/doc.stable/weka/core/Instance.html" target="_blank">Instance</a></code> (link to the stable version doc), and must change this code to:</p>
<blockquote>
<p><code><code>Instance instance = new Instance(2);</code></code></p>
</blockquote>
<p>As a note, I have commented in the code a different way of setting the value of the second attribute. I must note that we do not set the value of the first attribute, as it is unknown.</p>
<p>The rest of the methods are (more or less) straightforward if you follow the documentation (<a href="http://weka.wikispaces.com/Programmatic+Use" target="_blank">weka - Programmatic Use</a>, and <a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">weka - Use WEKA in your Java code</a>). You get the class prediction on your text with the following lines:</p>
<blockquote>
<p><code><code>double pred = classifier.classifyInstance(instances.instance(0));
<br/>
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));</code></code></p>
</blockquote>
<p>And if you feed this classifier with a file (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smstest.txt" target="_blank">smstest.txt</a></code>) that stores the text <code>"this is spam or not, who knows?"</code>, and the model learnt with <code>MyFilteredLearner.java</code> (that is stored in <code>myClassifier.dat</code>), then you get the following result:</p>
<blockquote>
<p><code><code>$>javac MyFilteredClassifier.java
<br/>
$>java MyFilteredClassifier smstest.txt myClassifier.dat
<br/>
===== Loaded text data: smstest.txt =====
<br/>
this is spam or not, who knows?
<br/>
===== Loaded model: myClassifier.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
<br/>
@attribute class {spam,ham}
<br/>
@attribute text string
<br/>
<br/>
@data
<br/>
?,' this is spam or not, who knows?'
<br/>
===== Classified instance =====
<br/>
Class predicted: ham</code></code></p>
</blockquote>
<p>It is interesting to see that the class assigned to the instance before classifying it is <code>"?"</code>, which means <em>undefined</em> or <em>unknown</em>.</p>
<p>For those interested on using the classifiers discussed in my previous posts (I mean including <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a></code>, and using <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/PART.html" target="_blank">PART</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">SMO</a></code> as classifiers), the only part of this code that you have to change is the <code>learn()</code> and <code>evaluate()</code> methods in <code>MyFilteredLearner.java</code>. Just play with it, and have fun.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!</p>
<p><strong>UPDATE (June 26th, 2013):</strong> Since I wrote this post, I have moved <a href="https://github.com/jmgomezh/tmweka" target="_blank">my code examples and other stuff to a GitHub repository</a>. I have just updated the links.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com65tag:blogger.com,1999:blog-36589303.post-23818818920909030322013-04-01T18:21:00.000+02:002013-05-02T09:45:30.878+02:00Command Line Functions for Text Mining in WEKA<p>In previous posts I have explained <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">how to chain filters and classifiers in WEKA</a>, in order to avoid incorrect results when evaluating text classifiers by using cross-fold validation, and <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">how to integrate feature selection in the text classification process</a>. For this purpose, I have used the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a> and the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html">MultiFilter</a> in the <a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer</a> GUI provided by <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a>. Now it is time to do so in the command line.</p>
<p>WEKA essentially provides three usage modes:</p>
<ol>
<li>Using the Explorer, and other GUIs like the <a href="http://www.cse.yorku.ca/course_archive/2006-07/W/4412/doc/weka/ExperimenterTutorial-3.5.5.pdf" target="_blank">Experimenter</a>, which allow you to set up experiments and examine the results graphically.</li>
<li>Using the command-line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.</li>
<li>Using the classes programmatically, that is, in your own programs in Java.</li>
</ol>
<p>One major difference between modes 1 and 2 is that in the first mode, you spend some of the memory on the GUI, while in the second one, you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command-line options like <code>-Xms</code>, <code>-Xmx</code> and so on, but it may be interesting to save the memory used by the graphic elements in order to be able to deal with bigger datasets.</p>
<p>I will deal with the usage of WEKA in your programs in the future; in this post I focus on the command line. Before trying the following examples, please ensure <code>weka.jar</code> is added to your <code>CLASSPATH</code>. The first thing we must know is that WEKA filters and classifiers can be called in the command line, and that the call without arguments will show their configuration options. For instance, when you call a rule learner like <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">PART</a> (which I used in my previous posts), you get the following options:</p>
<blockquote>
<p><code>$>java weka.classifiers.rules.PART
<br/>
Weka exception: No training file and no object input file given.
<br/>
General options:
<br/>
-h or -help
<br/>
Output help information.
<br/>
-synopsis or -info
<br/>
Output synopsis for classifier (use in conjunction with -h)
<br/>
-t <name of training file>
<br/>
Sets training file.
<br/>
-T <name of test file>
<br/>
Sets test file. If missing, a cross-validation will be performed
<br/>
on the training data.
<br/>
...
<br/>
Options specific to weka.classifiers.rules.PART:
<br/>
-C <pruning confidence>
<br/>
Set confidence threshold for pruning.
<br/>
(default 0.25)
<br/>
...</code></p>
</blockquote>
<p>I omit the full list of options. Options are divided into two groups: those that are accepted by any classifier, and those specific to the PART classifier. The general options include three usage modes:</p>
<ul>
<li>Evaluating the classifier on the training collection itself, possibly using cross-validation, or on a test collection.</li>
<li>Training a classifier and storing the model in a file for further use.</li>
<li>Training a classifier and getting its output (classification of instances) on a test collection.</li>
</ul>
<p>However, when calling a filter in the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file by using the appropriate operator (<code><</code>), or to use the option <code>-h</code> to get the options of the filter.</p>
<p>In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, namely the <code>smsspam.small.arff</code> file. As every instance is of the form <code>[spam|ham],"message text"</code>, we have to transform the text of the message into a term weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class as in the following command:</p>
<blockquote>
<p><code>$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART</code></p>
</blockquote>
<p>To get the following output:</p>
<blockquote>
<p><code>=== Stratified cross-validation ===
<br/>
Correctly Classified Instances 173 86.5 %
<br/>
Incorrectly Classified Instances 27 13.5 %
<br/>
Kappa statistic 0.4181
<br/>
Mean absolute error 0.1625
<br/>
Root mean squared error 0.3523
<br/>
Relative absolute error 58.2872 %
<br/>
Root relative squared error 94.9031 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Confusion Matrix ===
<br/>
<br/>
a b <-- classified as
<br/>
13 20 | a = spam
<br/>
7 160 | b = ham</code></p>
</blockquote>
<p>Which is exactly the one I showed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in my previous post</a>. I have used the following general options:</p>
<ul>
<li><code>-t smsspam.small.arff</code> to specify the dataset to train (and on default, to evaluate on by using cross-validation).</li>
<li><code>-c 1</code> to specify the first attribute as the class.</li>
<li><code>-x 3</code> to specify that the number of folds to be used in the cross-validation evaluation is 3.</li>
<li><code>-v</code> and <code>-o</code> to avoid outputting the classifiers and statistics on the training collection, respectively.</li>
</ul>
<p>Plus the specific options of the FilteredClassifier <code>-F</code> to define the filter, and <code>-W</code> to define the classifier.</p>
<p>In my subsequent post on chaining filters, I proposed to make use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:</p>
<blockquote>
<p><code><code>$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART</code></code></p>
</blockquote>
<p>To get the following output:</p>
<blockquote>
<p><code><code>=== Stratified cross-validation ===
<br/>
Correctly Classified Instances 167 83.5 %
<br/>
Incorrectly Classified Instances 33 16.5 %
<br/>
Kappa statistic 0.1959
<br/>
Mean absolute error 0.1967
<br/>
Root mean squared error 0.38
<br/>
Relative absolute error 70.53 %
<br/>
Root relative squared error 102.3794 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Confusion Matrix ===
<br/>
<br/>
a b <-- classified as
<br/>
6 27 | a = spam
<br/>
6 161 | b = ham</code></code></p>
</blockquote>
<p>This, in turn, is the same result I got <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">in that post</a>. If we replace PART by the <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">SMO</a> implementation of Support Vector Machines included in WEKA (by changing <code>weka.classifiers.rules.PART</code> to <code>weka.classifiers.functions.SMO</code>), we get an accuracy figure of 91%, as described in the post.</p>
<p>While most of the options are the same as in the previous command, two things deserve special attention in this one:</p>
<ul>
<li>We chain the <a href="http://weka.sourceforge.net/doc/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a> and the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a> filters by using the MultiFilter described in the previous post. The order of calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with the default options, the AttributeSelection filter makes use of the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/InfoGainAttributeEval.html" target="_blank">InfoGainAttributeEval</a> function as the quality metric, and the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/Ranker.html" target="_blank">Ranker</a> class as the search method. The Ranker class is applied with the option <code>-T 0.0</code> in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is over the threshold defined by -T, that is, 0.0.</li>
<li>As the order of the options alone does not tell WEKA which class each option belongs to, it is required to link the options to the appropriate class by using the quotation mark symbol ("). Unfortunately, we have three nested expressions:</li>
<li style="LIST-STYLE-TYPE: none">
<ul class="noindent">
<li>The whole MultiFilter filter, enclosed by the isolated quotation marks (").</li>
<li>The AttributeSelection filter, enclosed by the escaped quotation mark (\").</li>
<li>The Ranker search method, enclosed by the double escaped quotation mark (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.</li>
</ul>
</li>
<li style="LIST-STYLE-TYPE: none">So many escaping symbols make the command a bit <em>dirty</em>, but it is still functional.</li>
</ul>
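<p>Conceptually, what the chained command does is simple, even if the quoting is not: MultiFilter just applies its filters in order. Here is a rough Python sketch of that idea (not WEKA code; the function names and the toy scores are mine):</p>

```python
def tokenize(messages):
    # First filter: break each message into a bag of lowercase tokens
    # (a crude stand-in for StringToWordVector).
    return [set(m.lower().split()) for m in messages]

def select_above_threshold(bags, scores, threshold=0.0):
    # Second filter: keep only tokens whose quality score is above the
    # threshold (a crude stand-in for AttributeSelection with Ranker -T 0.0).
    kept = {tok for tok, s in scores.items() if s > threshold}
    return [bag & kept for bag in bags]

def multi_filter(messages, scores):
    # Order matters: tokenize first, then select the informative tokens.
    return select_above_threshold(tokenize(messages), scores)

msgs = ["Free entry to win", "ok see you later"]
scores = {"free": 0.8, "win": 0.5, "ok": 0.0, "see": 0.0}
print(multi_filter(msgs, scores))  # only "free" and "win" survive
```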
<p>So I have shown how we can chain filters and classifiers in the command line, and how to apply several chained filters as well. In upcoming posts I will explain how to train, store and then evaluate a classifier by using the command line, and how to make use of WEKA filters and classifiers in your own Java programs.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>
<p>NOTE: You can find the collection I used in this post, along with other stuff related to WEKA and text mining in my <a href="http://www.esp.uem.es/jmgomez/tmweka/" target="_blank">Text Mining in WEKA</a> page.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-66524861400986988612013-02-11T10:50:00.001+01:002013-05-02T09:46:53.386+02:00Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters<p>Two weeks ago, I wrote <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">a post on how to chain filters and classifiers in WEKA</a>, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">N Fold Cross Validation</a> (CV) in your data, you should not apply the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a> (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run. Moreover, the STWV filter can extract and use simple statistics to filter out the terms (e.g. minimum number of occurrences), but those statistics over the full collection are not valid because in each CV run you use only a subset of it.</p>
<p>Now I would like to deal with a more general setting in which you want to apply <strong><a href="http://en.wikipedia.org/wiki/Dimension_reduction" target="_blank">dimensionality reduction</a></strong> because, in general text classification tasks, the documents or examples are represented by hundreds (if not thousands) of tokens, which makes the classification problem very hard for many learners. In <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a>, this involves using the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a> filter along with the STWV one. Before applying dimensionality reduction, though, we should reflect a bit on it.</p>
<p>Dimensionality reduction is a typical step in many data mining problems, which involves transforming our data representation (the schema of our table, the list of current attributes) into a shorter, more compact, and hopefully, more predictive one. Basically, this can be done in two ways:</p>
<ul>
<li>With <strong>feature reduction</strong>, which maps the original representation (list of attributes) onto a new and more compact one. The new attributes are synthetic, that is, they somehow combine the information from subsets of the original ones which share statistical properties. Typical feature reduction techniques include algebraic analysis methods like <a href="http://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">Principal Component Analysis</a> (PCA) and <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition" target="_blank">Singular Value Decomposition</a> (SVD). In text analysis, the most popular method is, by far, <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing" target="_blank">Latent Semantic Analysis</a>, which involves obtaining the principal components, or buckets, of the term-to-document sparse matrix.</li>
<li>With <strong>feature selection</strong>, which just selects a subset of the original representation attributes, according to some Information Theory quality metric like <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a> or <a href="http://en.wikipedia.org/wiki/Chi-squared_distribution" target="_blank">X^2 (Chi-Square)</a>. This method can be far simpler and less time-consuming than the previous one, as you only have to compute the value of the metric for each attribute and rank the attributes. Then you simply decide on a threshold for the metric (e.g. 0 for Information Gain) and keep the attributes with a value above it. Alternatively, you can choose a percentage of the original attributes (e.g. 1% and 10% are typical numbers in text classification) and just keep the top-ranking ones. However, there are other, more time-consuming alternatives, like exploring the predictive power of subsets of attributes using search algorithms.</li>
</ul>
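<p>To make the selection idea concrete, the Information Gain of a binary term for a binary class can be computed from four counts. The following is my own minimal sketch of the metric, not WEKA's implementation:</p>

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(n11, n10, n01, n00):
    # IG of a binary term for a binary class, from four counts:
    # n11: term present & spam, n10: present & ham,
    # n01: absent & spam,       n00: absent & ham.
    n = n11 + n10 + n01 + n00
    class_entropy = entropy([n11 + n01, n10 + n00])
    p_present = (n11 + n10) / n
    conditional = (p_present * entropy([n11, n10])
                   + (1 - p_present) * entropy([n01, n00]))
    return class_entropy - conditional

# A term concentrated in one class is informative; an evenly spread one is not.
print(information_gain(40, 10, 10, 140))  # clearly positive
print(information_gain(25, 75, 25, 75))   # 0.0
```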
<p>A major difference between the two methods is that feature reduction leads to <em>synthetic</em> attributes, while feature selection just keeps some of the original ones. This may affect the ability of the data scientist to understand the results, as synthetic attributes can be statistically relevant but meaningless. Another difference is that feature reduction does not make use of the <em>class information</em>, while feature selection does. In consequence, the second method is very likely to lead to a more predictive subset of attributes than the original one. But beware: more theoretical predictive power does not always mean more effectiveness. I recommend reading the old (?) but always helpful <a href="http://dl.acm.org/citation.cfm?id=657137" target="_blank">paper by Yiming Yang & Jan Pedersen</a> on the topic.</p>
<p>The WEKA package supports both methods, mainly with the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/PrincipalComponents.html" target="_blank">weka.attributeSelection.PrincipalComponents</a> (feature reduction) and <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a> (feature selection) filters. But an important question is: do you really need to perform dimensionality reduction in text analysis? There are two clear arguments against it:</p>
<ol>
<li>Some algorithms are not hurt by using all the features, even if there are really many of them and they are very sparse. For instance, Support Vector Machines excel in text classification problems precisely because of this: they are able to deal with thousands of attributes, and they get better results when no reduction is performed. A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering.</li>
<li>If it is a matter of computing time, as with symbolic learners like decision trees (C4.5) or rule learners (Ripper), then there is no need to worry: Big Data techniques come to help, as you can configure big, cheap clusters over e.g. Hadoop to perform your computations!</li>
</ol>
<p>But having the algorithms in my favourite data analysis package, and knowing that they sometimes lead to effectiveness improvements, why not use them?</p>
<p>For the reasons above, I will focus on feature selection. In consequence, I will deal with the AttributeSelection filter, leaving the PrincipalComponents one for another post. Let us start with the same text collection that I used in my previous post about chaining filters and classifiers in WEKA. It is a small subset of the <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, made with the first 200 messages for brevity and simplicity.</p>
<p>Our goal is to perform a 3-fold CV experiment with any algorithm in WEKA. But in order to do it correctly, we know we must chain the STWV filter with the classifier by using the FilteredClassifier learner in WEKA. However, we want to perform feature selection as well, and the FilteredClassifier only allows us to chain a single filter with a single classifier. So, how can we combine both the STWV and AttributeSelection filters into a single one?</p>
<p>Let us start doing it manually. After loading the dataset into the WEKA Explorer, applying the STWV filter with the default settings, and setting the class attribute to the "spamclass" one, we get something like this:</p>
<p><img src="https://lh5.googleusercontent.com/-aVqAh2gsXS0/URirr3HJB5I/AAAAAAAABlw/Chtd-kGNXGs/s800/weka01.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>Now we can either go to the "Select attributes" tab, or just stay in the "Preprocess" tab and choose the AttributeSelection filter. I opt for the second way, so you can browse the filters folder by clicking on the "Choose" button at the "Filters" area. After selecting the "weka > filters > supervised > attribute > AttributeSelection", you can see the selected filter in the "Filters" area, as shown in the next picture:</p>
<p><img src="https://lh6.googleusercontent.com/-Ru7jWvVqFc8/URirsCXFAvI/AAAAAAAABl8/zZdeU8KMQgI/s800/weka02.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>In order to set up the filter, we can click on the name of the filter. The "weka.gui.GenericObjectEditor" window we get is a generic window that allows us to configure filters, classifiers, etc. according to a number of object-defined properties. In this case, it allows us to set up the AttributeSelection filter configuration options, which are:</p>
<ul>
<li>The <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/AttributeEvaluator.html" target="_blank">evaluator</a>, which is the quality metric we use to evaluate the predictive properties of an attribute or a set of them. There you can choose among a wide number of them (which depends on your WEKA version), including specially Chi Square (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/ChiSquaredAttributeEval.html" target="_blank">ChiSquaredAttributeEval</a>), Information Gain (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/InfoGainAttributeEval.html" target="_blank">InfoGainAttributeEval</a>), and Gain Ratio (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/GainRatioAttributeEval.html" target="_blank">GainRatioAttributeEval</a>).</li>
<li>The <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/ASSearch.html" target="_blank">search algorithm</a>, which is the way we will select the remaining group of attributes, and includes very clever but time consuming group search algorithms, and my favourite one, the Ranker (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/Ranker.html" target="_blank">weka.attributeSelection.Ranker</a>). This one just ranks the attributes according to the chosen quality metric, and keeps those meeting some criterion (like e.g. having a value over a predefined threshold).</li>
</ul>
<p>In the next picture, you can see the AttributeSelection configuration window with the evaluator set up to Information Gain, and the search set up as Ranker, with the default options.</p>
<p><img src="https://lh5.googleusercontent.com/-T1b2VbGK7j8/URirsViyp5I/AAAAAAAABl0/kw5Up1j3vi4/s465/weka03.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 185px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="185" width="350"/></p>
<p>The Ranker search method has two main properties:</p>
<ul>
<li>The <em>numToSelect</em> property, which defines the number of attributes to keep, an integer that is -1 (meaning all) by default.</li>
<li>The <em>threshold</em> property, which defines the minimum value that an attribute has to score on the evaluator in order to be kept. The default value for this property is a huge negative number, so that no attribute is discarded by default.</li>
</ul>
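<p>Taken together, the two properties behave roughly like the following sketch (my own illustration; WEKA's exact tie-breaking and threshold semantics may differ slightly):</p>

```python
def ranker(scores, num_to_select=-1, threshold=float("-inf")):
    # Rank attributes by score (best first), drop those at or below the
    # threshold, then optionally keep only the top num_to_select (-1 = all).
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = [a for a in ranked if scores[a] > threshold]
    return kept if num_to_select < 0 else kept[:num_to_select]

# Hypothetical Information Gain scores for a few tokens:
ig = {"to": 0.12, "call": 0.09, "FREE": 0.07, "later": 0.0, "the": 0.0}
print(ranker(ig, threshold=0.0))    # only the attributes scoring above 0
print(ranker(ig, num_to_select=2))  # the two best attributes
```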
<p>In consequence, if we want to keep those attributes scoring over 0, we just have to write that number in the threshold field of the window we get when clicking on the Ranker in the previous window:</p>
<p><img src="https://lh5.googleusercontent.com/-i_NBv6nmAmI/URirs2mLekI/AAAAAAAABmE/rJ1zMMRhEz8/s435/weka04.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 247px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="247" width="350"/></p>
<p>By clicking OK on all the previous windows, we get a configuration of the AttributeSelection filter which involves keeping those attributes with Information Gain score over 0. If we apply that filter to our current collection, we get the following result:</p>
<p><img src="https://lh6.googleusercontent.com/-beT4BfclLfM/URirtSCg3VI/AAAAAAAABmI/DiGsDjn0l7U/s800/weka05.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top-scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), and many of them are spam ones, so it is quite predictive for that particular class. We can see as well that we keep only 5.93% of the original attributes (82 out of 1,382).</p>
<p>Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART") to be evaluated on the training collection itself ("Test options" area, "Use training set" option), getting the next result:</p>
<p><img src="https://lh3.googleusercontent.com/--5_RJkc4KcU/URirtrncGAI/AAAAAAAABmY/qKJN90Unj58/s800/weka06.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>We get an accuracy of 95.5%, much better than <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">the results I reported in my previous post</a>. Of course, these results cannot be compared, because this quick experiment is a test on the training collection itself, not done with 3-fold CV and the FilteredClassifier. But if we want to run a CV experiment, how can we do it, given that we have two filters instead of one in our setup?</p>
<p>What we need now is to start with the original text collection in <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF format</a> (no STWV yet), and to use the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a> that WEKA provides for these situations. We start then with the original collection, and go to the "Classify" tab. If we try to choose any classic learner (<a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">J48 for the C4.5 decision tree learner</a>, <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">SMO for Support Vector Machines</a>, etc.), it will be impossible because we have just one attribute (the text of the SMS messages) along with the class, but we can use the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">weka.classifiers.meta.FilteredClassifier</a>. After selecting it, we will see something similar to the next picture:</p>
<p><img src="https://lh3.googleusercontent.com/-4afnYiVvy2I/URiruOiStAI/AAAAAAAABmU/n8yKF72A1H4/s800/weka07.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>If we click on the name of the classifier at the "Classifier" area and we select <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a> as the classifier (with default options), we get the next set up in the FilteredClassifier editor window:</p>
<p><img src="https://lh5.googleusercontent.com/-4XZFdmv8zvs/URiruLNDKGI/AAAAAAAABmQ/cI6tfCaGkaY/s465/weka08.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 208px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="208" width="350"/></p>
<p>Then we can choose the weka.filters.MultiFilter in the filter area, which starts with a dummy AllFilter. It is time to set up our filter combining STWV and AttributeSelection. We click on the filter name area and get a new filter edition window with an area to define the filters to be applied. If we click on it, we get a new window that allows us to add, configure and delete filters. The selected filters will be applied in the order we add them, so we start by deleting the AllFilter and adding a STWV filter with the default options, getting something similar to the next picture:</p>
<p><img src="https://lh3.googleusercontent.com/-iFaiyK_72F0/URiruwTjqAI/AAAAAAAABms/4qwN0tY_rEU/s260/weka09.PNG" style="TEXT-ALIGN: center; WIDTH: 260px; DISPLAY: block; HEIGHT: 194px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="194" width="260"/></p>
<p>Filters are added by clicking on the "Choose" button to select them, and on the "Add" button to append them to the list. We can now add the AttributeSelection filter with the Information Gain evaluator and the Ranker search with threshold 0, by selecting the AttributeSelection filter in the list and clicking on the "Edit" button. If you manually resize the window, you can see a setup similar to this one:</p>
<p><img src="https://lh5.googleusercontent.com/-1vKcKDynrhE/URirvPPpFpI/AAAAAAAABmk/mtPvr2JHBO0/s629/weka10.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 146px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="146" width="400"/></p>
<p>The set up is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" button at the MultiFilter and FilteredClassifier windows. In the "Classify" tab at the explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select the class attribute as "spamclass". Having done this, we can just click on the "Start" button to get the next result:</p>
<p><img src="https://lh6.googleusercontent.com/-fpzIgIh2h04/URirvv9Db8I/AAAAAAAABmo/oHaVNTGT_RI/s800/weka11.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>So we get an accuracy of 83.5%, which is worse than the one we got without feature selection (86.5%). Oh oh, all this clever (?) setup to get a drop of 3 points in accuracy! :-(</p>
<p>But what happens if, instead of using a relatively weak learner on text problems like PART, we turn to Support Vector Machines? WEKA includes the <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a> classifier, which implements <a href="http://dl.acm.org/citation.cfm?id=299105" target="_blank">John Platt's sequential minimal optimization algorithm</a> for training a support vector classifier. If we choose this classifier with the default options, we get quite different results:</p>
<ul>
<li>Using only the STWV filter, we get an accuracy of 90.5% with 18 spam messages classified as legitimate ("ham"), and 1 false positive.</li>
<li>Using the MultiFilter with AttributeSelection in the same setup, we get an accuracy of 91% with 16 spam messages classified as ham, and 2 false positives.</li>
</ul>
<p>So we get an improvement in accuracy on a more accurate learner, which is nice. However, the difference is just 0.5% (1 message in our 200-instance collection), so it is moderate. Moreover, we get one more false positive, which is bad for this particular problem. In spam filtering, making a false positive (sending a legitimate message to the spam folder) is much worse than the opposite, because the user risks missing an important message. Check <a href="http://dl.acm.org/citation.cfm?id=508911" target="_blank">my paper on cost sensitive evaluation of spam filtering at ACM SAC 2002</a>.</p>
<p>But all in all, I hope this post shows the merits of feature selection in text classification problems, and how to do it with my favourite library, WEKA. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com29tag:blogger.com,1999:blog-36589303.post-35038799970740459232013-01-29T13:21:00.001+01:002013-05-02T09:48:01.472+02:00Text Mining in WEKA: Chaining Filters and Classifiers<p>One of the most interesting features of <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> is its flexibility for text classification. Over the years, I have had the chance to run a lot of experiments on text collections with WEKA, most of them in <a href="http://en.wikipedia.org/wiki/Supervised_learning" target="_blank">supervised tasks</a> commonly referred to as <a href="http://en.wikipedia.org/wiki/Document_classification" target="_blank">Text Categorization</a>, that is, classifying text segments (documents, paragraphs, collocations) into a set of predefined classes. Examples of Text Categorization tasks include assigning topic labels to news items, classifying email messages into folders, or, closer to my research, classifying messages as spam or not (<a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" target="_blank">Bayesian spam filters</a>) and web pages as inappropriate or not (e.g. pornographic content vs. educational resources).</p>
<p>WEKA's support for Text Categorization is <em>impressive</em>. A prominent feature is that this package supports breaking text utterances into indexing terms (word stems, collocations) and assigning them a weight in term vectors, a required step in nearly every text classification task. This tokenization and indexing process is achieved by using a super-flexible filter named <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a>. Let me show an example of how it works.</p>
<p>I will start with a simple text collection, which is a small sample of the publicly available <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters, and it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources. I will make use of a small subset in order to better show my points in this post. The subset is made with the first 200 messages, and it is the following one, formatted in the suitable WEKA ARFF format:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>@relation sms_test</p>
<p>@attribute spamclass {spam,ham}
<br/>
@attribute text String</p>
<p>@data
<br/>
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
<br/>
ham,'Ok lar... Joking wif u oni...'
<br/>
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
<br/>
ham,'U dun say so early hor... U c already then say...'
<br/>
ham,'Nah I don\'t think he goes to usf, he lives around here though'
<br/>
spam,'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv'
<br/>
...
<br/>
ham,'Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx'</p>
</blockquote>
<p>In the first 200 messages of the collection, 33 of them are spam and 167 are legitimate ("ham"). This collection can be loaded in the <a href="https://www.google.es/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDAQFjAA&url=http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf&ei=NLwHUY2FBMSYhQed7oG4Cg&usg=AFQjCNGMB6VSKlDT54vaURKZUzpE84JzSA&sig2=XqaJy2aFRWyNEb8skoVbcw&bvm=bv.41524429,d.ZG4" target="_blank">WEKA Explorer</a>, showing something similar to the following window:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh6.googleusercontent.com/-X1T58FONe78/UQey9KzvS_I/AAAAAAAABkM/nn6PVpXg9J4/s735/wekaexample01.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>The point is that messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules like:</p>
<blockquote>
<p><strong>if</strong> ("urgent" <strong>in</strong> message) <strong>then</strong> class(message) == spam</p>
</blockquote>
<p>Here is where the StringToWordVector filter comes to help. You can just select it by clicking the "Choose" button in the "Filter" area, and browsing the folders to "weka > filters > unsupervised > attribute" one. Once selected, you should be able to see something like this:</p>
<p><img src="https://lh6.googleusercontent.com/-gzV8Vf_venI/UQey9O3vPFI/AAAAAAAABkM/q3KG3PpAF6s/s735/wekaexample02.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>If you click on the name of the filter, you will get a lot of options, which I leave for another post. For my goals in this one, you can just apply this filter with the default options to get an indexed collection of 200 messages and 1,382 indexing tokens (plus the class attribute), shown in the next picture:</p>
<p><img src="https://lh5.googleusercontent.com/-t09zkp9O55c/UQey9LSa0pI/AAAAAAAABkM/CwWsNKVkvI0/s735/wekaexample03.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>If you want to see colors showing the distribution of attributes (tokens) according to the class, you can just select the "class" attribute as the class for the collection in the bottom-left area of the WEKA Explorer. So, you can see that the attribute "Available" occurs just in one message, which happens to be a legitimate (ham) one:</p>
<p><img src="https://lh3.googleusercontent.com/-35XJu0ccyLs/UQey955xiLI/AAAAAAAABkM/QAikmabxlU0/s735/wekaexample04.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>Now, we can make our experiments in the Classify tab. We can just select cross-validation using 3 folds (1), point to the appropriate attribute to be used as a class (which is the "spamclass" one) (2), and select a rule learner like <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">PART</a> in the classifier area (3). You can find that classifier at the "weka > classifiers > rules" folder when clicking on the "Choose" button at the "Classifier" area. This setup is shown in the next figure:</p>
<p><img src="https://lh4.googleusercontent.com/-7EPITpS_vNo/UQey-cGsD-I/AAAAAAAABkM/jOlj3LUM2OU/s735/wekaexample05.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>The selected evaluation method, <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">cross-validation</a>, instructs WEKA to divide the training collection into 3 sub-collections (folds) and perform three experiments. Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. The sub-collections are sampled randomly, so that each instance belongs to only one of them, and the class distribution (16.5% spam in our example) is kept inside each fold.</p>
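<p>The stratified sampling idea can be sketched as follows (my own illustration of the concept, not WEKA's actual code):</p>

```python
import random

def stratified_folds(labels, k=3, seed=42):
    # Assign each instance to exactly one of k folds, dealing the instances
    # of each class round-robin so every fold keeps the class distribution.
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

labels = ["spam"] * 33 + ["ham"] * 167
folds = stratified_folds(labels)
print([len(f) for f in folds])  # roughly equal fold sizes
print([sum(labels[i] == "spam" for i in f) for f in folds])  # 11 spam each
```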
<p>So, if we click on the "Start" button, we will get the output of our experiment, featuring the classifier learnt over the full collection, and the values for the typical accuracy metrics averaged over the three experiments, along with the confusion matrix. The classifier learnt over the full collection is the following one:</p>
<blockquote>
<p>PART decision list
<br/>
------------------</p>
<p>or <= 0 AND
<br/>
to <= 0 AND
<br/>
2 <= 0: ham (119.0/3.0)</p>
<p>£1000 <= 0 AND
<br/>
FREE <= 0 AND
<br/>
call <= 0 AND
<br/>
Reply <= 0 AND
<br/>
i <= 0 AND
<br/>
all <= 0 AND
<br/>
final <= 0 AND
<br/>
50 <= 0 AND
<br/>
mobile <= 0 AND
<br/>
ur <= 0 AND
<br/>
text <= 0: ham (26.0/2.0)</p>
<p>i <= 0 AND
<br/>
all <= 0: spam (30.0/3.0)</p>
<p>: ham (25.0/1.0)</p>
<p>Number of Rules : 4</p>
</blockquote>
<p>This notation can be read as:</p>
<blockquote>
<p><strong>if</strong> (("or" <strong>not in</strong> message) <strong>and</strong> ("to" <strong>not in</strong> message) <strong>and</strong> ("2" <strong>not in</strong> message)) <strong>then</strong> class(message) == ham
<br/>
...
<br/>
<strong>otherwise</strong> class(message) == ham</p>
</blockquote>
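<p>Written out in full, the decision list above is equivalent to the following function, where each rule fires only when the previous ones did not (tokens are tested for presence in the message):</p>

```python
def classify(tokens):
    # The PART decision list, rule by rule.
    if "or" not in tokens and "to" not in tokens and "2" not in tokens:
        return "ham"
    if ("£1000" not in tokens and "FREE" not in tokens
            and "call" not in tokens and "Reply" not in tokens
            and "i" not in tokens and "all" not in tokens
            and "final" not in tokens and "50" not in tokens
            and "mobile" not in tokens and "ur" not in tokens
            and "text" not in tokens):
        return "ham"
    if "i" not in tokens and "all" not in tokens:
        return "spam"
    return "ham"  # default rule

print(classify({"go", "home"}))                   # ham (first rule)
print(classify({"call", "to", "win", "mobile"}))  # spam (third rule)
```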
<p>And the confusion matrix is the next one:</p>
<blockquote>
<p>=== Confusion Matrix ===</p>
<p>a b <-- classified as
<br/>
17 16 | a = spam
<br/>
12 155 | b = ham</p>
</blockquote>
<p>This means that the PART learner gets 17+155 correct classifications and makes 12+16 mistakes, which leads to an accuracy of 86%.</p>
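<p>The accuracy figure follows directly from the matrix, as this quick sketch shows:</p>

```python
def accuracy(matrix):
    # Accuracy = correct classifications (the diagonal) over all instances.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are actual classes (spam, ham); columns are predicted classes.
print(accuracy([[17, 16], [12, 155]]))  # 0.86
```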
<p style="TEXT-ALIGN: center"><strong><em>But we have done it wrong!</em></strong></p>
<p>Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training (making the learner try to generalize from a token that does not occur in the test collection). And when it is in the test fold, the learner should not even know about it! Moreover, what happens with attributes that are highly predictive for the full collection (according to their statistics when computing e.g. the <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a> metric)? They may have worse (or better) statistics when a subset of their occurrences is not seen, as those occurrences can be in the test collection!</p>
<p>The right way to perform a correct text classification experiment with cross-validation in WEKA is to feed the indexing process into the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that we index and train on every training subset of the cross-validation run. Thus, you have to use the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a> class provided by WEKA.</p>
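<p>The key point is that the vocabulary must be rebuilt from the training folds only, on every cross-validation run. A rough Python sketch of the idea (a plain illustration, not WEKA's FilteredClassifier):</p>

```python
def build_vocabulary(messages):
    # "Train" the filter: collect the tokens of the training messages only.
    return {tok for m in messages for tok in m.lower().split()}

def vectorize(messages, vocab):
    # Apply the trained filter: tokens unseen in training are simply dropped.
    return [{tok for tok in m.lower().split() if tok in vocab}
            for m in messages]

train_msgs = ["free entry to win", "see you later"]
test_msgs = ["win a free mobile"]  # "a" and "mobile" were never seen

vocab = build_vocabulary(train_msgs)  # fit on the training fold only
print(vectorize(test_msgs, vocab))    # only the known tokens survive
```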
<p>In fact, this is not that difficult. Let us go back to the original text collection, which features two attributes: the message (as a string) and the class. Then you can go to the Classify tab and choose the FilteredClassifier learner, which is available at "weka > classifiers > meta" and shown in the next picture:</p>
<p><img src="https://lh6.googleusercontent.com/-5IfFFabokhY/UQey-YCXzSI/AAAAAAAABkM/HtGGQzUMED4/s738/wekaexample06.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 298px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="298" width="400"/></p>
<p>Then you must choose the filter and the classifier you are going to apply to the collection, by clicking on the classifier name in the "Classifier" area. I choose StringToWordVector and PART with their default options:</p>
<p><img src="https://lh5.googleusercontent.com/-QtG8vqffTiA/UQey-5KnaXI/AAAAAAAABkM/4XSXu9Q_GJs/s465/wekaexample07.PNG" style="TEXT-ALIGN: center; WIDTH: 300px; DISPLAY: block; HEIGHT: 178px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="178" width="300"/></p>
<p>If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results:</p>
<pre>=== Confusion Matrix ===

  a   b   &lt;-- classified as
 13  20 |  a = spam
  7 160 |  b = ham</pre>
<p>This gives an accuracy of 86.5%, slightly better than the one obtained with the wrong setup. However, we catch four fewer spam messages, and the True Positive rate goes down from 0.515 to 0.394. This setup is more realistic, and it better mimics what happens in the real world, where we will find highly relevant but unseen events, and our statistics may change dramatically over time.</p>
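<p>The figures above can be checked with a little arithmetic. This sketch just recomputes accuracy and spam recall (the True Positive rate) from both confusion matrices:</p>

```python
def accuracy(tp, fn, fp, tn):
    """Proportion of correct classifications over all instances."""
    return (tp + tn) / (tp + fn + fp + tn)

def tp_rate(tp, fn):
    """Recall on the spam class: spam caught / total spam."""
    return tp / (tp + fn)

# Wrong setup: StringToWordVector applied before cross-validation.
wrong = dict(tp=17, fn=16, fp=12, tn=155)
# Right setup: FilteredClassifier, indexing inside each fold.
right = dict(tp=13, fn=20, fp=7, tn=160)

print(accuracy(**wrong), tp_rate(wrong["tp"], wrong["fn"]))  # 0.86, ~0.515
print(accuracy(**right), tp_rate(right["tp"], right["fn"]))  # 0.865, ~0.394
```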
<p>So now we can run our experiment safely, as no unseen events will be used in the classification. Moreover, if we apply any Information Theory-based filter, e.g. ranking the attributes according to their Information Gain value, the statistics will be correct, as they will be computed on the training set of each cross-validation run.</p>
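<p>For instance, the Information Gain of a single token can be computed from the counts of the training fold alone. Here is a minimal sketch of the generic textbook computation (not WEKA's implementation), on a made-up four-message fold:</p>

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a class distribution given as counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            h -= p * log2(p)
    return h

def information_gain(docs, token):
    """IG of a token for a binary class, computed on the docs given
    (which should be the training fold only, never the full corpus)."""
    with_tok = [(t, lab) for t, lab in docs if token in t.split()]
    without = [(t, lab) for t, lab in docs if token not in t.split()]
    def counts(subset):
        pos = sum(1 for _, lab in subset if lab == "spam")
        return pos, len(subset) - pos
    total_h = entropy(*counts(docs))
    n = len(docs)
    cond_h = sum(len(s) / n * entropy(*counts(s)) for s in (with_tok, without))
    return total_h - cond_h

train = [("cheap pills", "spam"), ("buy pills now", "spam"),
         ("meeting at noon", "ham"), ("lunch at noon", "ham")]
print(information_gain(train, "pills"))  # 1.0: perfectly predictive here
print(information_gain(train, "cheap"))  # ~0.31: occurs in one spam only
```

Run it on a different fold and "pills" may score quite differently, which is precisely why the ranking must be recomputed per training fold.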
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com23tag:blogger.com,1999:blog-36589303.post-29959290053824424322013-01-16T19:13:00.001+01:002013-05-02T09:49:11.731+02:00A note on WEKA limitations and big data<p style="TEXT-ALIGN: center"><img src="http://users.dsic.upv.es/~cferri/weka/weka.jpg" style="WIDTH: 283px; DISPLAY: inline; HEIGHT: 156px" height="30" width="28"/></p>
<p>I have loved <a href="http://en.wikipedia.org/wiki/Weka_(machine_learning)" target="_blank">WEKA</a> since it was first introduced to me by my friend <a href="http://orion.esp.uem.es/gsi/index.php/Enrique-Puertas.html" target="_blank">Enrique Puertas</a> back in 1999, when he used it to program a Usenet News client with spam filtering capabilities based on Machine Learning (what we would now call a <a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" target="_blank">bayesian spam filter</a>). I was impressed by its flexibility and functionality, and by how easy it was to experiment with WEKA and use it in my Java programs. I quickly got familiar with it and used it for <a href="https://www.aclweb.org/anthology-new/W/W00/W00-0719.pdf">my very first experiments on spam filtering</a>.</p>
<p>Over the years, WEKA has been updated, gaining more algorithms and making some tasks easier for text miners. For instance, the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector filter</a> lets you obtain a <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">Vector Space Model</a> (or bag-of-words) representation of your problem texts, a task I had to do manually (with my own programs or scripts) at the beginning. Another example: the <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html">Sparse ARFF</a> format provides a compact representation of your word vectors, instead of thousands of attribute values per instance, most of them being "0" or "no". Moreover, WEKA has attracted so much attention that other platforms have integrated it (e.g. <a href="http://gate.ac.uk/" target="_blank">GATE</a>) or provided environments that wrap and augment its functionality (e.g. <a href="http://www.rapidminer.com/" target="_blank">RapidMiner</a>).</p>
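<p>As a toy illustration of the difference (hypothetical attribute names, not from any real collection), here is the same instance in dense and in sparse form; in the sparse form, only the non-zero attributes are listed as {index value} pairs, with 0-based indices:</p>

```text
% Toy ARFF with two word-count attributes plus the class
@relation toy

@attribute cheap numeric
@attribute pills numeric
@attribute class {ham,spam}

@data
% Dense instance: one value per attribute, mostly zeros for word vectors
0,2,spam
% Equivalent sparse instance: only non-zero attributes as {index value}
{1 2, 2 spam}
```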
<p>However, our needs as researchers have evolved as well. One of the most important issues now is data size. While working with average computers was enough in my early experiments, given the size of standard collections (<a href="http://qwone.com/~jason/20Newsgroups/" target="_blank">20 Newsgroups</a>, <a href="http://www.daviddlewis.com/resources/testcollections/reuters21578/" target="_blank">Reuters-21578</a>, <a href="http://csmining.org/index.php/ling-spam-datasets.html" target="_blank">LingSpam</a>, etc., all of the order of tens of thousands of instances), now that is nearly impossible. Most of my experiments involve from hundreds of thousands to millions of instances. In those cases, WEKA can take days for a single learn-and-test cycle, or it can simply run out of memory; and not just on an average machine, but even on a really big server!</p>
<p>So, what now?</p>
<p>Before dealing with this question, I must say that I have been a heavy user of the WEKA <em>command line</em> and the <em><a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer GUI</a></em>. However, I have never considered or used the WEKA <em><a href="http://www.cse.yorku.ca/course_archive/2006-07/W/4412/doc/weka/ExperimenterTutorial-3.5.5.pdf" target="_blank">Experimenter GUI</a></em>. I know from friends and from skimming the documentation that the Experimenter allows distributing experiments over a number of machines. However, if I am going to distribute my experiments, why not use newer, less ad hoc and WEKA-dependent technologies that are fully standard and supported by cloud providers? Why not take advantage of elastic cloud capabilities (grow and pay as you need)?</p>
<p>That said, and keeping up with the latest news and trends in data and text mining, I see two options:</p>
<ul>
<li><strong>Going for <a href="http://www.r-project.org/" target="_blank">R</a></strong>. This language/platform has grown incredibly in recent years, and it has quickly become a standard language for data mining, present in many curricula and often listed as an absolute requirement in data science job offers. There are nice books about it as well, like "<a href="http://shop.oreilly.com/product/0636920022008.do" target="_blank">R in a Nutshell</a>", and other influential books recommend or use it (like "<a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/" target="_blank">The Elements of Statistical Learning</a>"). R supports map-reduce algorithms over <a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> for distributed experiments with tons of data. And R interfaces with Java as well.</li>
<li><strong>Choosing <a href="http://mahout.apache.org/" target="_blank">Mahout</a></strong> (plus <strong><a href="http://lucene.apache.org/solr/" target="_blank">Lucene/SOLR</a></strong>). This platform is Java-based and tightly integrated with Hadoop, and it makes use of Lucene for text representation tasks; Lucene can be considered a standard for deploying search engines nowadays. There are good books on Mahout and Lucene/SOLR as well ("<a href="http://manning.com/owen/" target="_blank">Mahout in Action</a>", "<a href="http://www.manning.com/hatcher3/" target="_blank">Lucene in Action</a>", "<a href="http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book" target="_blank">Apache SOLR Cookbook</a>").</li>
</ul>
<p>Still, I do not feel either option is clearly better than the other. Both are challenging and appealing, and I have not made a decision yet. And I am willing to hear your opinion, of course.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com2tag:blogger.com,1999:blog-36589303.post-66001133808034512002013-01-10T19:14:00.000+01:002013-10-08T20:20:36.574+02:00A list of datasets for opinion mining in Twitter<div style="TEXT-ALIGN: left" dir="ltr">
<div style="TEXT-ALIGN: left">In a recent thread at the <a href="http://tech.groups.yahoo.com/group/SentimentAI/" target="_blank">SentimentAI group (list)</a>, a number of links to datasets for training / testing opinion mining / sentiment classifiers over Twitter were contributed. I list them here in case somebody finds this information useful:</div>
<div style="TEXT-ALIGN: left">
<ul style="TEXT-ALIGN: left">
<li><a href="http://www.tweenator.com/index.php?page_id=8" target="_blank">Three datasets</a> provided by Hassan Saif, including an annotated subset of the <strong>Stanford Twitter Sentiment Corpus</strong>, and two for the specific topics of the <strong>Health Care Reform</strong> and the <strong>Obama-McCain Debate</strong>.</li>
<li>The <a href="http://help.sentiment140.com/for-students" target="_blank"><strong>Stanford Twitter Corpus</strong></a> itself, provided by Alec Go and others at <a href="http://www.sentiment140.com/" target="_blank">Sentiment140</a>. You can download the <a href="http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip" target="_blank">ST Corpus directly</a> (70Mb).</li>
<li>The <strong><a href="http://www.sananalytics.com/lab/twitter-sentiment/" target="_blank">Sanders Analytics Twitter Sentiment Corpus</a></strong>, provided by Niek Sanders.</li>
<li>The <strong><a href="http://nibir.me/projects/mejaj/datasets.html" target="_blank">mejaj datasets</a></strong>, provided by <a href="http://nibir.me/" target="_blank">Nibir Bora</a> and others.</li>
<li>The <strong><a href="http://www.cs.york.ac.uk/semeval-2013/task2/" target="_blank">SemEval-2013: Sentiment Analysis in Twitter</a></strong> evaluation campaign (or competition) dataset. <em>Note the competition is still active</em>, you can join it! Check the dates at the <a href="http://www.cs.york.ac.uk/semeval-2013/index.php?id=call-for-participation" target="_blank">SemEval-2013 website</a>.</li>
<li>The <a style="FONT-WEIGHT: bold" href="http://www.limosine-project.eu/events/replab2012#Profiling_task" target="_blank">RepLab 2012 Profiling task dataset</a>. The profiling task is a bit different from the standard sentiment classification task. For instance, factual tweets can imply bad reputation ("Lehmann Brothers goes bankrupt") and negative sentiment tweets can imply good reputation ("R.I.P. Michael Jackson. We'll miss you").</li>
<li><strong>UPDATE (8/10/2013)</strong>: Contributed by <a href="http://www.blogger.com/profile/12092678025880000860" target="_blank">Eugenio Martínez Cámara</a> (thanks!), the <a href="http://www.daedalus.es/TASS2013/corpus.php" target="_blank"><strong>Spanish-language dataset</strong></a> used in <a href="http://www.daedalus.es/TASS2013/about.php" target="_blank">the TASS workshop</a> organized at the annual meeting of the <a href="http://www.sepln.org/?lang=en" target="_blank">SEPLN</a>.</li>
</ul>
</div>
You can find the <a href="http://tech.groups.yahoo.com/group/SentimentAI/message/589" target="_blank">SentimentAI thread on Twitter datasets here</a>.</div>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com8tag:blogger.com,1999:blog-36589303.post-89621680054775658152013-01-08T16:55:00.000+01:002013-05-02T10:53:01.328+02:00Spam on LinkedIn, "Robin Sage" style<div style="TEXT-ALIGN: left" dir="ltr"><a href="http://es.linkedin.com/in/jmgomezh/">I myself</a>, and some of my contacts on <a href="http://www.linkedin.com/">LinkedIn</a>, have recently received a connection request from one "Elena Domínguez" (<a href="http://www.linkedin.com/pub/elena-domínguez/62/196/45">link</a>*). It is a somewhat strange profile: it is rather sparse (professional experience, education, etc.), yet it belongs to several engineering groups (she describes herself as an engineer) and has hundreds of highly heterogeneous ICT contacts. This is the profile image:
<br/>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHloVMxOfJCM4wMcKBCmvkKsk2CAyelOa3TuzvWF74UtVCsiIvzdI6LB-aNbsVXqj-tc-2B85rehGMZdgDtJpz1yOE6aALtc7EG-kvifqUBcDpnFoU_TULsEweauQbO7OwjXMDoA/s1600/Dibujo.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHloVMxOfJCM4wMcKBCmvkKsk2CAyelOa3TuzvWF74UtVCsiIvzdI6LB-aNbsVXqj-tc-2B85rehGMZdgDtJpz1yOE6aALtc7EG-kvifqUBcDpnFoU_TULsEweauQbO7OwjXMDoA/s1600/Dibujo.bmp" height="266" border="0" width="320"/></a></div>
<br/>
If you accept this "person", within a few days (or even hours) you will receive an email inviting you to join the LinkedIn group "<strong>International Master's in Theoretical & Practical Application of Finite Element Method</strong>" (<a href="http://www.linkedin.com/groups?home=&gid=3808981&trk=anet_ug_hm&goback=.con.npv_221408693_*1_*1_name_DGj4_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1">link</a>*). Although the master's degree promoted through this LinkedIn group seems reasonably legitimate, both the profile and the group appear to be spam.
<br/>
<br/>
One thing that is particularly striking is that <strong>her profile photo</strong> is rather odd, "too clean", almost artificial. We get additional evidence of spam when we run a reverse image search on Google, using this picture as the query. First we get the URL of the image:
<br/>
<br/>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1MeI4h5QaQcvbWEdFT_ooMolPsuJWWezqTR6V2r_STFd6psJsIT8CovFoiEjk33x1-6MBSnDn4MI-P85NCl2R-j-czWDQTbE4cONUnz22sz1XsIm2NPPDp2A3se-OH4OcT3yAFw/s1600/Dibujo2.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1MeI4h5QaQcvbWEdFT_ooMolPsuJWWezqTR6V2r_STFd6psJsIT8CovFoiEjk33x1-6MBSnDn4MI-P85NCl2R-j-czWDQTbE4cONUnz22sz1XsIm2NPPDp2A3se-OH4OcT3yAFw/s1600/Dibujo2.bmp" height="225" border="0" width="320"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator">Next, we search for the photo in Google Images, clicking on the camera button and entering the URL we obtained before:</div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator"><br/></div>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBtmBjAO4ErRNvfnsc5F6g2olnw02Qzb2H9xE46TbQ28w9Q4VbOrHAmKU2bhcxLxz44QkZXizqQGH4KkR5QRXk8N6NYVSBwao9OUwRGaEt32-lZQBXICsiFKczrd2yoYA3ulh8RQ/s1600/Dibujo3.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBtmBjAO4ErRNvfnsc5F6g2olnw02Qzb2H9xE46TbQ28w9Q4VbOrHAmKU2bhcxLxz44QkZXizqQGH4KkR5QRXk8N6NYVSBwao9OUwRGaEt32-lZQBXICsiFKczrd2yoYA3ulh8RQ/s1600/Dibujo3.bmp" height="141" border="0" width="320"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator">And these are the results:</div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator"><br/></div>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVGtpZRED38wZiRzIdR9hsQaYutIMa69SKB47-_PamcY43pftZwe-bXJw8iK5wGhpzu1AoBjw-cwIXduOO4oue1WamnGN3HSws1659uqugNARPX_nU8zqrAUjA5J4VuC4r0St-wQ/s1600/Dibujo4.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVGtpZRED38wZiRzIdR9hsQaYutIMa69SKB47-_PamcY43pftZwe-bXJw8iK5wGhpzu1AoBjw-cwIXduOO4oue1WamnGN3HSws1659uqugNARPX_nU8zqrAUjA5J4VuC4r0St-wQ/s1600/Dibujo4.bmp" height="320" border="0" width="297"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
From these results we can deduce with fair certainty that the photo is a "stock" picture, that is, a catalogue one, and that it appears in several catalogues as a studio archive image of a businesswoman with a neutral expression. Using a photo like this for one's profile on a network like LinkedIn is possible, but rather unlikely.
<br/>
<br/>
I therefore consider this photograph strong evidence which, together with the behaviour of the "user" (sending the invitation email for a group so focused on a single educational product) and the remarkably high number of contacts for such a sparse profile, leads me to think that this is a spam profile, albeit a real one, in the sense that it is not a social engineering experiment like the one carried out by <a href="http://www.thomasryan.net/"><strong>Thomas Ryan</strong></a> with the "<a href="http://www.networkworld.com/news/2010/070810-the-robin-sage-experiment-fake.html"><strong>Robin Sage</strong></a>" profile.
<br/>
<br/>
In conclusion, I think that even LinkedIn, which is one of the networks least exploited for spam, will be increasingly invaded by this phenomenon, with ever greater levels of personalization and sophistication.
<br/>
<br/>
(*) I do not link the profile or group names in order not to generate web spam.</div>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-30837790760990870032012-12-04T15:16:00.004+01:002012-12-04T15:19:33.082+01:00Report on ERA Course: Fighting Child Pornography on the Internet<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-jZD6Qn-DhZQ/UL3zdc73yeI/AAAAAAAABbM/W0ZHb3awEXw/s1600/424898_4622121118371_313664852_n.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-jZD6Qn-DhZQ/UL3zdc73yeI/AAAAAAAABbM/W0ZHb3awEXw/s1600/424898_4622121118371_313664852_n.jpg" height="320" width="320" /></a></div>
<div style="text-align: center;">
<br /></div>
I have had the pleasure of attending as a student the <a href="http://www.era.int/">European Academy of Law</a> course on "<a href="https://www.era.int/cgi-bin/cms?_SID=6520b7451e95482bd8da749563e3306207b9af0900219030656915&_sprache=en&_bereich=artikel&_aktion=detail&idartikel=123272" target="_blank">Fighting Child Pornography on the Internet</a>", held in Madrid, 29-30 November 2012. I was supported by the Spanish child protection NGO <a href="http://www.protegeles.com/" target="_blank">Protégeles</a>, as I work with them whenever I can to support their mission.<br />
<br />
It was a nice course, with good coverage of topics, including legal aspects and technical issues, both from the view of prosecuting sex offenders and from that of Web filtering. The speakers were excellent and provided a lot of useful hints and links. I also crafted a hashtag for the event on Twitter (<a href="https://twitter.com/search?q=#ERAChildPornCourse&src=hash" target="_blank">#ERAChildPornCourse</a>), but I am afraid that neither attendees nor speakers were very keen on Twitter (with rare exceptions). I collected some comments during the event, organized by topic:<br />
<br />
<strong>Legal issues</strong><br />
<ul>
<li>Are media that do not involve real children child porn? </li>
<li>The Internet and digital cameras have led to an explosion of child porn, now a home industry </li>
<li>There is a thousand-year history of child porn (e.g. paintings), but cameras imply children are really abused to get it recorded </li>
<li>What does child porn possession mean? What about cloud drives? And streaming? </li>
<li>The Internet is world-wide, so who has jurisdiction? Should anybody have it? </li>
<li>Eurojust helps coordinate child porn prosecution; examples of operations: "lost boy", "nanny", "dreamboard" </li>
<li>The Lanzarote Convention says accessing a child porn site, knowing it hosts that stuff, is illegal </li>
<li>Providing lists of links to web sites hosting child porn is illegal under the Lanzarote Convention</li>
</ul>
<strong>Protection, prosecution, technical issues</strong><br />
<ul>
<li>For preparing cases against child porn, prosecutors check the nature of the material, offender involvement and the number of images </li>
<li>10% of all photographs ever taken were taken during the last year (note: all kinds of pictures) </li>
<li>Groomers and child sex offenders play "the jailbait game" on video chat sites </li>
<li>Youngsters are extremely vulnerable to grooming: they accept nearly all friendship requests and have 3-4k+ contacts </li>
<li>Hebephilia is the sexual preference for individuals in the early years of puberty (generally 11-14) </li>
<li>LEAs make use of a plethora of image analysis tools to process suspect pics; Microsoft PhotoDNA is just one tool in the box </li>
<li>About 20% of child porn material is delivered through commercial platforms </li>
<li>Project HAVEN aims at stopping child abuse by EU citizens in foreign countries (Asia, South America...) </li>
<li>Law Enforcement Agencies cooperate and share an international Child Abuse database </li>
<li>Law Enforcement Agencies (e.g. Europol) are getting more and more focused on victim identification </li>
<li>INHOPE has no authority to release block lists of child porn sites </li>
</ul>
An additional note: after hearing Interpol and Europol, one feels proud of having such great professionals working against child porn.<br />
<br />
All in all, it has been a great course, and I am very happy to have been able to attend it.</div>
Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-70500472085493868742012-05-17T12:31:00.000+02:002012-05-17T12:31:00.112+02:00Article in Novática: compromising the security of reCAPTCHA<p>In issue 215 of <a href="http://www.ati.es/novatica/" target="_blank">Novática</a> we have published an article on the use of several image normalization techniques and Google's Tesseract OCR to perform text recognition attacks on two versions of reCAPTCHA. The reference of the article is:</p>
<blockquote>
<p>Noemí Carranza, Ricardo Palma Durán, Gonzalo Álvarez Marañón, <em>José María Gómez Hidalgo</em>, 2012. <strong><a href="http://www.ati.es/novatica/2012/215/nv215sum.html#art43" target="_blank">Análisis de la seguridad del sistema reCAPTCHA</a></strong>. <a href="http://www.ati.es/novatica/" target="_blank">Revista Novática</a> 215, January-February 2012, pp. 43-48.</p>
</blockquote>
<p>The abstract of the article is the following:</p>
<blockquote>
<p>In recent times, CAPTCHA systems have become extraordinarily popular. They protect Web services by presenting the user with a test intended to verify that they are a human being and not a robot, that is, an automatic system for sending spam or spreading malware. These systems are permanently exposed to spammers and hackers managing to compromise their security and abuse the underlying resources (email accounts, blogs, etc.) to carry out their illicit activities. It is therefore necessary to check their security periodically, using tools such as optical character recognition (OCR) systems, image analysis systems, and others. In this article we analyze the security of the reCAPTCHA system, probably the most widely used on the Internet today. To do so, we apply several image analysis techniques aimed at correcting the deformations and distortions the system introduces in the images shown to the user, together with the effective Tesseract OCR system. Two versions of the reCAPTCHA system have been analyzed, and we have found that the security of the system has probably increased in the second, more recent version, although it is still possible to compromise it given sufficient resources in the form of a medium-sized botnet (about 10,000 computers).</p>
</blockquote>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com2