Nihil Obstat: May 2013

On April 19 I gave a talk at the Universidad Europea de Madrid, titled "Menores y móviles: Usos, riesgos y controles parentales" (Minors and mobile phones: uses, risks and parental controls). The talk corresponds to research work I have carried out within the project "Protección de usuarios menores de edad de telefonía móvil inteligente" (Protection of underage smartphone users), led by Joaquin Pérez and funded by the Universidad Europea de Madrid (P2012 UEM14).

The abstract of the talk is available on the page of the MAVIR network (MA2VICMR: Mejorando el Acceso, el Análisis y la Visibilidad de la Información y los Contenidos Multilingüe y Multimedia en Red para la Comunidad de Madrid), and the presentation used during the talk is the following:

</p> <p style="TEXT-ALIGN: left">If the topic interests you, do not hesitate to ask any question or make any suggestion in the comments of this post.</p> <div style='clear: both;'></div> </div> <div class='post-footer'> <div class='post-footer-line post-footer-line-1'><span class='post-author vcard'> Posted by <span class='fn'> <a href='https://www.blogger.com/profile/17053588779560658723' rel='author' title='author profile'> Jose Maria Gomez Hidalgo </a> </span> </span> <span class='post-timestamp'> at <a class='timestamp-link' href='http://jmgomezhidalgo.blogspot.com/2013/05/presentacion-y-moviles-usos-riesgos-y.html' rel='bookmark' title='permanent link'><abbr class='published' title='2013-05-22T18:22:00+02:00'>6:22 p. m.</abbr></a> </span> </div> <div class='post-footer-line post-footer-line-2'><span class='post-labels'> Labels: <a href='http://jmgomezhidalgo.blogspot.com/search/label/Control%20parental' rel='tag'>Control parental</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Privacidad' rel='tag'>Privacidad</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Protecci%C3%B3n%20del%20menor' rel='tag'>Protección del menor</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Seguridad' rel='tag'>Seguridad</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Smartphone' rel='tag'>Smartphone</a> </span> </div> <div class='post-footer-line post-footer-line-3'><span 
class='reaction-buttons'> </span> </div> </div> </div> </div> </div></div> <div class="date-outer"> <h2 class='date-header'><span>20.5.13</span></h2> <div class="date-posts"> <div class='post-outer'> <div class='post hentry'> <a name='4590226879146293202'></a> <h3 class='post-title entry-title'> <a href='http://jmgomezhidalgo.blogspot.com/2013/05/language-identification-as-text.html'>Language Identification as Text Classification with WEKA</a> </h3> <div class='post-header'> <div class='post-header-line-1'></div> </div> <div class='post-body entry-content' id='post-body-4590226879146293202'> <p><a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Language Identification</a>, which consists of guessing the natural language in which a text is written (or an utterance is spoken), is not one of the hardest problems in <a href="http://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a>, and consequently I believe <em>it is a good starting point for learning about the text analysis capabilities available in WEKA</em>.</p> <p>Others have in fact tackled this problem, for example in this <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">tutorial on using LingPipe for Language Identification</a>, or <a href="http://blog.alejandronolla.com/" target="_blank">Alejandro Nolla</a> in his post on <a href="http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/" target="_blank">Detecting Text Language With Python and NLTK</a>. Moreover, you can find a large number of language identification programs, APIs and demos in the <a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Wikipedia article on Language Identification</a>. 
We may even consider this function a natural language commodity, as you can see from how <a href="http://translate.google.com/" target="_blank">Google Translate</a> does it by default in the next figure:</p> <p style="TEXT-ALIGN: center"><img height="159" src="http://www.esp.uem.es/jmgomez/blogimg/google.translate.langid.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 159px" width="400"/></p> <p>The most typical (and rather simple) approach to Language Identification is to store a list of the <em>most frequent character 3-grams</em> in each language and to check the overlap of the target text with each of the lists. Alternatively, you can use stop word lists. Of course, the accuracy depends on how you compute the overlap, but even simple distances can make it rather effective.</p> <p>However, I will not follow this approach here. Instead, I will show how to build a standard text classifier using <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> in order to show the options of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code> filter (and how to apply it), which is <em>the main tool for text analysis in WEKA</em>.</p> <p>The steps we have to follow are these:</p> <ol> <li>Collect data from different languages in order to build a basic dataset.</li> <li>Prepare the data for learning, which involves transforming it by using the <code>StringToWordVector</code> filter.</li> <li>Analyze the resulting dataset, and hopefully, improve it by using attribute selection.</li> <li>Test on an independent test collection, which will give us a robust estimate of the accuracy of the approaches on real examples.</li> <li>Learn the most accurate model found in the previous step, and use it in our classification program.</li> </ol> <p>So this will be a rather long post. 
Be prepared for it.</p> <p><strong>Collecting the data and Creating the Datasets</strong></p> <p>Following the <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">LingPipe Language ID Tutorial</a>, I collect the data from the <a href="http://corpora.uni-leipzig.de/" target="_blank">Leipzig Corpora Home Page</a>. In particular, I will address guessing among English (EN), French (FR) and Spanish (SP), so I have gone to <a href="http://corpora.uni-leipzig.de/download.html" target="_blank">the download page</a>, completed the CAPTCHA to get the list of available corpora, and downloaded:</p> <ul> <li>The <a href="http://corpora.uni-leipzig.de/downloads/eng_news_2005_10K-text.tar.gz" target="_blank">2005 English 10k corpus of news in text format</a>.</li> <li>The <a href="http://corpora.uni-leipzig.de/downloads/fra_news_2009_10K-text.tar.gz" target="_blank">2009 French 10k corpus of news in text format</a>.</li> <li>The 2001-2002 Spanish 10k corpus of news in text format -- which is no longer there as far as I can see.</li> </ul> <p>For your convenience, I have put these corpora <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in my LangID GitHub demo page</a>. The files have the following format:</p> <blockquote> <p><code>1 I didn't know it was police housing," officers quoted Tsuchida as saying. <br/> 2 You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off." <br/> 3 He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.</code></p> </blockquote> <p>So I have loaded them into an OpenOffice spreadsheet, and replaced the number column with the corresponding tag for each language: <code>EN</code>, <code>FR</code>, and <code>SP</code>. 
Then I have escaped the <code>"</code> and <code>'</code> characters, because they are string delimiters in the WEKA <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">Attribute-Relation File Format</a> (ARFF). In order to build the datasets, I have split the data, keeping the first 9K sentences of each language for training and the remaining 1K for testing. As some learning algorithms may be sensitive to the instance order, I have interleaved the instances in batches of 1K texts, so the first 1K sentences are in English, the next 1K sentences are in French, and so on. The training data has the following header and first instances:</p> <blockquote> <p><code>@relation langid_train <br/> <br/> @attribute language_class {EN,FR,SP} <br/> @attribute text String <br/> <br/> @data <br/> EN,'I didn\'t know it was police housing,\" officers quoted Tsuchida as saying.' <br/> EN,'You would be a great client for Southern Indiana Homeownership\'s credit counseling but you are saying to yourself \"Oh, we can pay that off.\"' <br/> EN,'He believes the 21st century will be the \"century of biology\" just as the 20th century was the century of IT.' <br/> ../..</code></p> </blockquote> <p>The ARFF files for training and testing are available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the demo</a> as well. 
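</p> <p>The spreadsheet steps described above (replacing the number column with a language tag and escaping the quote characters) could also be scripted. The following is a minimal Python sketch of that idea, not the procedure actually used for this post. The function names are hypothetical, and it assumes that each corpus line is a sentence number, some whitespace, and then the sentence:</p>

```python
# Header copied from the training file shown in the post.
ARFF_HEADER = (
    "@relation langid_train\n\n"
    "@attribute language_class {EN,FR,SP}\n"
    "@attribute text String\n\n"
    "@data\n"
)


def escape_arff(text):
    """Backslash-escape the characters that would break a quoted ARFF string."""
    # Escaping the backslash itself is an extra precaution added in this sketch.
    text = text.replace("\\", "\\\\")
    return text.replace("'", "\\'").replace('"', '\\"')


def to_arff_row(tag, corpus_line):
    """Turn one 'number sentence' corpus line into a TAG,'sentence' data row."""
    # Assumption: the sentence number is separated from the text by whitespace.
    _number, sentence = corpus_line.strip().split(None, 1)
    return "{},'{}'".format(tag, escape_arff(sentence))


def build_arff(tagged_lines):
    """Build the ARFF text from an iterable of (tag, corpus_line) pairs."""
    rows = [to_arff_row(tag, line) for tag, line in tagged_lines]
    return ARFF_HEADER + "\n".join(rows) + "\n"
```

<p>Applied to the first English sample line shown above, <code>to_arff_row("EN", line)</code> produces the first <code>@data</code> row of the training file.</p> <p>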
You can open the training file (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/langid.collection.train.arff" target="_blank">langid.collection.train.arff</a></code>) in the WEKA Explorer and, after setting the class to be the first attribute, you should get something like the following figure:</p> <p style="TEXT-ALIGN: center"><img height="336" src="http://www.esp.uem.es/jmgomez/blogimg/explorer.training.langid.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 336px" width="450"/></p> <p>So we have a training collection with 9K instances per class (language), and a test collection with 1K instances per class.</p> <p><strong>Data Transformation</strong></p> <p>As <a href="http://jmgomezhidalgo.blogspot.com/search/label/WEKA" target="_blank">in previous posts about text classification with WEKA</a>, we need to transform the text strings into term vectors to enable learning. This is done by applying the <code>StringToWordVector</code> filter, which is the most remarkable text mining function in WEKA. In previous posts, I have applied this filter with default options, but it offers a wide range of possibilities that can be seen when opening it in the WEKA Explorer. If you click on the <em>Filter</em> button and browse the tree to "<em>weka > filters > unsupervised > attribute > StringToWordVector</em>", and then click on the filter name, you get the next window:</p> <p style="TEXT-ALIGN: center"><img height="623" src="http://www.esp.uem.es/jmgomez/blogimg/explorer.stringtowordvector.png" style="WIDTH: 440px; DISPLAY: inline; HEIGHT: 623px" width="440"/></p> <p>Those are a lot of options, aren't they? So let us focus on the minimum set of options we need in order to be productive with this example of Language Identification. Those are:</p> <ul> <li><code>doNotOperateOnPerClassBasis</code> - we set this option to <code>True</code> in order to make the filter collect word tokens over the classes as a whole. 
This should be the standard setting in nearly all text classification problems.</li> <li><code>lowerCaseTokens</code> - we set this option to <code>True</code> because we are interested in the words regardless of whether they are written in upper or lower case. In other problems, e.g. when processing social network text, keeping the capitalization may be critical for getting good accuracy.</li> <li><code>tokenizer</code> - WEKA provides several tokenizers, intended to break the original texts into tokens according to a number of rules. The simplest tokenizer is the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code>, which splits the string into tokens by using a list of separators that can be set by clicking on the tokenizer name. It is a good idea to take a look at the texts we have before setting up the list of separating characters. In our case, we have several languages and the default punctuation symbols may not fit our problem -- we need to add opening question and exclamation marks, as well as symbols from HTML markup like &, among others. So our delimiters string will be " \r\n\t.,;:\"\'()?!-¿¡+*&#$%\\/=<>[]_`@" (backslash is escaped).</li> <li><code>wordsToKeep</code> - we set this option to keep as many words as we can, to include the full vocabulary of the dataset. An appropriate value is one million or more.</li> </ul> <p>We leave the rest of the options at their defaults. Most notably, we are not using <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">sophisticated weighting schemes (like TF or TF.IDF)</a>, nor <a href="http://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a> or <a href="http://en.wikipedia.org/wiki/Stemming" target="_blank">stemming</a>. 
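</p> <p>To make the effect of the options above more concrete, here is a rough Python approximation of what lowercasing plus a delimiter-based tokenizer does to a text, and of how a text becomes a vector of word presence flags. It is only an illustrative sketch, not WEKA's implementation: the function names are hypothetical, and the vocabulary is passed in by hand instead of being collected from the training data.</p>

```python
import re

# Delimiter characters taken from the WordTokenizer setting quoted in the post.
DELIMITERS = " \r\n\t.,;:\"'()?!-¿¡+*&#$%\\/=<>[]_`@"

# A character class that matches any run of one or more delimiter characters.
_SPLITTER = re.compile("[" + re.escape(DELIMITERS) + "]+")


def tokenize(text):
    """Lowercase the text and split it on the delimiter characters."""
    return [token for token in _SPLITTER.split(text.lower()) if token]


def to_presence_vector(text, vocabulary):
    """Return 1/0 flags telling which vocabulary words occur in the text."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]
```

<p>For example, <code>tokenize("¿Qué es esto? It's a TEST.")</code> returns <code>["qué", "es", "esto", "it", "s", "a", "test"]</code>: the opening question mark and the apostrophe act as separators, and the capitalization is lost.</p> <p>The sketch also leaves out weighting, stop word removal and stemming. 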
These options are very frequent in <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">Information Retrieval</a> systems like <a href="http://lucene.apache.org/solr/" target="_blank">Apache Lucene/SOLR</a>, and they often lead to nice accuracy improvements in search systems.</p> <p>We need to have the same vocabulary both in the training and the testing datasets, so we can apply this filter in the command line by using the batch (<code>-b</code>) option:</p> <blockquote> <p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_`@\"" -W 10000000 -b -i langid.collection.train.arff -o langid.collection.train.vector.arff -r langid.collection.test.arff -s langid.collection.test.vector.arff</code></p> </blockquote> <p>The options <code>-O</code>, <code>-L</code>, <code>-tokenizer</code> and <code>-W</code> correspond to the options above. The delimiter string is escaped because it is included in the specification of the tokenizer. The resulting files are also <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in the GitHub repository for the LangID example</a>, along with the script <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/stwv.sh" target="_blank">stwv.sh</a></code> (String To Word Vector) which includes this command.</p> <p><strong>Data Analysis and Improvement</strong></p> <p>If we take a quick look at the terms or tokens we have obtained, e.g.:</p> <blockquote> <p><code>@attribute archival numeric <br/> @attribute archivarlos numeric <br/> @attribute archivas numeric <br/> @attribute archives numeric <br/> @attribute archiving numeric <br/> @attribute archivo numeric <br/> @attribute archivos numeric</code></p> </blockquote> <p>We can imagine that most of them will be useless for Language Identification. 
This motivates making a more precise analysis of the tokens by using some kind of quality metric, like <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a>. In fact, I am applying the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a></code> filter as I did in my posts on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">selecting attributes by chaining filters</a> and on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">command line functions for text mining</a>. So I issue the following command:</p> <blockquote> <p><code>$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -b -i langid.collection.train.vector.arff -o langid.collection.train.vector.ig0.arff -r langid.collection.test.vector.arff -s langid.collection.test.vector.ig0.arff</code></p> </blockquote> <p>We apply the filter in batch mode as well, in order to get the same attributes both in the training and in the test collections. We also set up the first attribute as the class (with the option <code>-c</code>), and set the threshold for keeping attributes as <code>0.0</code> in the <code><a href="http://weka.sourceforge.net/doc.dev/weka/attributeSelection/Ranker.html" target="_blank">weka.attributeSelection.Ranker</a></code> search method. This means that we will keep only those attributes with Information Gain score over 0, and they will be sorted according to their score as well. 
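</p> <p>To see what an Information Gain score measures, the following minimal Python sketch computes it for a single word treated as a binary (present/absent) feature: the entropy of the class labels minus the entropy that remains once we know whether the word occurs. It is a simplified illustration with hypothetical function names; WEKA's evaluator has its own handling of attribute values, so its scores need not match this sketch exactly.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(presence, labels):
    """Information Gain of a binary feature: H(class) - H(class | feature).

    presence[i] is 1 if the word occurs in text i and 0 otherwise;
    labels[i] is the language tag of text i.
    """
    total = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [lab for flag, lab in zip(presence, labels) if flag == value]
        if subset:
            # Subtract the entropy of this branch, weighted by its share of texts.
            gain -= (len(subset) / total) * entropy(subset)
    return gain
```

<p>On a toy sample of four texts tagged <code>EN, EN, FR, SP</code>, a word that occurs exactly in the two English texts gets a gain of 1.0 bit, while a word that occurs in one English and one French text of a balanced <code>EN, EN, FR, FR</code> sample gets a gain of 0 and would be discarded.</p> <p>In the WEKA call above, the <code>-T 0.0</code> threshold of the <code>Ranker</code> is the cutoff applied to this kind of score. 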
This command is included in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/asig.sh" target="_blank">asig.sh</a></code> (Attribute Selection by Information Gain) script of <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">the GitHub repository for the LangID example</a>, along with the data files.</p> <p>From the original 65,429 word attributes we got in the previous step, we have kept only 16,840 (25.74% of the original ones). We can be more aggressive by setting the threshold to a bigger value (e.g. 0.2).</p> <p>The first twenty attributes are the following:</p> <p style="TEXT-ALIGN: center"><img height="163" src="http://www.esp.uem.es/jmgomez/blogimg/forty.top.ig.terms.langid.png" style="WIDTH: 300px; DISPLAY: inline; HEIGHT: 163px" width="300"/></p> <p>As we can see, all of them are very frequent words (in each language) that would be present in the stop lists for those languages. In consequence, our "pure" data mining approach is quite close to the traditional one based on stop words.</p> <p>It makes sense to learn a J48 tree to get an idea of the complexity of the term relations. The <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">weka.classifiers.trees.J48</a></code> algorithm implements <a href="http://en.wikipedia.org/wiki/C4.5_algorithm" target="_blank">Quinlan's popular C4.5 learner</a>, and as it outputs a decision tree, it can give us valuable insights into the term relations, e.g. which co-occurring terms are more predictive. We can train that classifier on our new training dataset with the following command:</p> <blockquote> <p><code>$> java weka.classifiers.trees.J48 -t langid.collection.train.vector.ig0.arff -no-cv</code></p> </blockquote> <p>We get a quite complex decision tree, with 273 nodes and 137 leaves. All the tests in the tree have the following look: "<code>word > 0</code>" or "<code>word <= 0</code>". 
This means that the algorithm induces that only the occurrence of words is important, not their weight. The root of the tree is obviously a test on "<code>the</code>", and the smallest side of the tree (its right hand side, with "<code>the > 0</code>") is the following one:</p> <blockquote> <p><code>the > 0 <br/> | de <= 0: EN (5945.0/8.0) <br/> | de > 0 <br/> | | el <= 0 <br/> | | | and <= 0 <br/> | | | | for <= 0 <br/> | | | | | to <= 0: FR (24.0/3.0) <br/> | | | | | to > 0: EN (2.0) <br/> | | | | for > 0: EN (3.0) <br/> | | | and > 0: EN (7.0) <br/> | | el > 0: SP (3.0)</code></p> </blockquote> <p>This means, for instance, that the word "<code>the</code>" is an excellent predictive feature, and if it occurs in a text and the word "<code>de</code>" (from French or Spanish) does not occur in the text, that text is most likely written in English (5,937 of the 5,945 training texts that reach this leaf are English, an estimated likelihood of about 99.87% on the training collection). The overall accuracy of J48 over the training collection is 98.3963%.</p> <p><strong>Training and then Evaluating on the Test Collection</strong></p> <p>Before we start training and evaluating, we have to decide which algorithms are most appropriate for the problem. In my experience with text learning, it is wise to test at least the following ones:</p> <ul> <li>The <em>Naive Bayes</em> probabilistic approach, which is quick and gives good results on average text learning problems. In WEKA, it is implemented in the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">weka.classifiers.bayes.NaiveBayes</a></code> class.</li> <li>The <em>rule learner PART</em>, which induces a list of rules by learning partial decision trees. It is a symbolic algorithm that produces rules which can be very valuable as they are easy to understand. 
This algorithm is implemented by the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a></code> class.</li> <li>Of course, the J48 algorithm because of its visualization capabilities.</li> <li>The lazy learner <em>k-Nearest Neighbors (kNN)</em>, which occasionally gives excellent results in text classification problems. The WEKA class that implements this algorithm is <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">weka.classifiers.lazy.IBk</a></code>.</li> <li>The <em>Support Vector Machines</em> algorithm, which is probably the most effective on text classification problems because of its ability to focus on the most relevant examples in order to separate the classes. It is a very good learning algorithm for sparse datasets, and it is implemented in WEKA via the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a></code> class or by the library <a href="http://weka.wikispaces.com/LibSVM" target="_blank">LibSVM</a>. I choose the Sequential Minimal Optimization implementation (SMO) embedded in WEKA.</li> </ul> <p>Also, when Naive Bayes or J48 are effective, I usually get anything from small to large accuracy improvements by using <a href="http://en.wikipedia.org/wiki/Boosting_(machine_learning)" target="_blank">boosting</a>, implemented by the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/AdaBoostM1.html" target="_blank">weka.classifiers.meta.AdaBoostM1</a></code> class in WEKA. Boosting takes a weak classifier as input, and builds a classifier committee by iteratively training that weak learner on those dataset subsets on which the previous learners are not effective. 
In this case, I will not apply boosting because the weak learners get rather high levels of accuracy, and it is most likely that boosting will only achieve a marginal improvement (if any) at the cost of a much longer training time.</p> <p>I have written a script named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test.sh" target="_blank">test.sh</a></code>, available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID demo</a>, to execute all these algorithms with default options. The results obtained by the algorithms are included in the repository as well, and summarized in the next table:</p> <p style="TEXT-ALIGN: center"><img height="136" src="http://www.esp.uem.es/jmgomez/blogimg/results.test.langid.png" style="WIDTH: 230px; DISPLAY: inline; HEIGHT: 136px" width="230"/></p> <p>The different versions of the lazy algorithm kNN tested here appear to be very weak. It is likely that we can improve their performance by changing the way the distance among examples is computed (from the Euclidean distance to one more appropriate for text, such as the cosine similarity), but their performance is so low that they are unlikely to score better than the rest of the algorithms.</p> <p>The top algorithms in this test are <em>Naive Bayes</em> and <em>Support Vector Machines</em>. There is a trade-off between both algorithms: SVMs are more effective (in fact, they are very effective) but they take quite a lot of time to train, while Naive Bayes is less effective but quicker to train. In terms of classification time, both algorithms are linear in the number of attributes.</p> <p>Even though we have used a large number of attributes, there are some examples with rather weak representations. 
For instance, let us check the following instances or texts:</p> <blockquote> <p><code>{58 1,94 1,313 1,1663 1} <br/> {119 1,361 1,2644 1,16840 FR} <br/> {2 1,16840 SP}</code></p> </blockquote> <p>The first example has only four occurring words and the second only three (the class value for the first text is <code>EN</code>, which is not written out in the sparse format used by WEKA in this example), and the third example has only one word ("<code>el</code>"). The attribute numbers in the first two examples (58 or over) mean that those attributes are not among the most informative ones, while in the third example we find a very informative word. If we apply a more aggressive selection using Information Gain, many examples will end up with null representations, thus making them fall to the most likely class. As the classes have a balanced distribution, the language chosen in that case will be <code>EN</code>, which is the default value for the class attribute.</p> <p><strong>Learning the Best Classifier and Using it Programmatically</strong></p> <p>After our experiments, we know that the best classifier in our tests is the SVM. So it is time to learn it and store the classifier into a file for further programmatic use. 
For this purpose, I have written a script that trains the classifier and stores the model into a file, using the following command-line call:</p> <blockquote> <p><code>$> java weka.classifiers.meta.FilteredClassifier -t langid.collection.train.arff -c first -no-cv -d smo.model.dat -v -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\r\\\\\\\n\\\\\\\t.,;:\\\\\\\\\\\\\\\"'()?!-¿¡+*&#$%/=<>[]_`@\\\\\\\"\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO</code></p> </blockquote> <p>This call is rather painful because of the repeatedly nested quotes. So I have pretty-printed it in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/learn.sh" target="_blank">learn.sh</a></code> script at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID example</a>. 
For dealing with nested quotes, follow the advice in <a href="http://en.wikipedia.org/wiki/Nested_quotation" target="_blank">the Wikipedia article about nested quotation</a>.</p> <p>With this call, we have stored a model in the file <code>smo.model.dat</code>, which chains the <code>StringToWordVector</code> filter, the <code>AttributeSelection</code> filter, and an <code>SMO</code> classifier by using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">weka.classifiers.meta.FilteredClassifier</a></code> and the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">weka.filters.MultiFilter</a></code> classes, as I have explained in the post on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">Command Line Functions for Text Mining in WEKA</a>.</p> <p>One good point of WEKA is that we can learn a model in the command line and then use it in a program. I have modified the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java" target="_blank">MyFilteredClassifier.java</a></code> program I used in my post describing <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html" target="_blank">A Simple Text Classifier in Java with WEKA</a>, and I have committed it to the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository</a> with the name <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/LanguageIdentifier.java" target="_blank">LanguageIdentifier.java</a></code>. 
I have created three sample test files as well, <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_en.txt" target="_blank">test_en.txt</a></code>, <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_fr.txt" target="_blank">test_fr.txt</a></code> and <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_sp.txt" target="_blank">test_sp.txt</a></code>. The operation of the program is the following one:</p> <blockquote> <p><code>$> javac LanguageIdentifier.java <br/> <br/> $> java LanguageIdentifier <br/> Usage: java LanguageIdentifier <fileData> <fileModel> <br/> $> java LanguageIdentifier test_en.txt smo.model.dat <br/> ===== Loaded text data: test_en.txt ===== <br/> This is a sample test for the language identifier demo. <br/> ===== Loaded model: smo.model.dat ===== <br/> ===== Instance created with reference dataset ===== <br/> @relation 'Test relation' <br/> @attribute language_class {EN,FR,SP} <br/> @attribute text string <br/> @data <br/> ?,' This is a sample test for the language identifier demo.' <br/> ===== Classified instance ===== <br/> Class predicted: EN <br/> <br/> $> java LanguageIdentifier test_fr.txt smo.model.dat <br/> ===== Loaded text data: test_fr.txt ===== <br/> Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue. <br/> ===== Loaded model: smo.model.dat ===== <br/> ===== Instance created with reference dataset ===== <br/> @relation 'Test relation' <br/> @attribute language_class {EN,FR,SP} <br/> @attribute text string <br/> @data <br/> ?,' Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.' <br/> ===== Classified instance ===== <br/> Class predicted: FR <br/> <br/> $> java LanguageIdentifier test_sp.txt smo.model.dat <br/> ===== Loaded text data: test_sp.txt ===== <br/> Esto es un texto de prueba para la demostración del identificador de idioma. 
<br/> ===== Loaded model: smo.model.dat ===== <br/> ===== Instance created with reference dataset ===== <br/> @relation 'Test relation' <br/> @attribute language_class {EN,FR,SP} <br/> @attribute text string <br/> @data <br/> ?,' Esto es un texto de prueba para la demostración del identificador de idioma.' <br/> ===== Classified instance ===== <br/> Class predicted: SP</code></p> </blockquote> <p>So the program classifies all three examples correctly. Remember that you have to learn the model before using the program. As a side note, since the program only uses a <code>FilteredClassifier</code> object, you can change the script to accommodate a different algorithm. For instance, you can just replace the text "<code>weka.classifiers.functions.SMO</code>" with "<code>weka.classifiers.bayes.NaiveBayes</code>" in the <code>learn.sh</code> script, and the program will work the same way, but with a different model.</p> <p><strong>Concluding Remarks</strong></p> <p>Although relatively simple, the Language Identification problem helps to identify the essential tasks we have to perform when building text classifiers with WEKA. It is a complete example in the sense that we have not only collected the dataset and learnt on it, but we have also dug a bit into the most suitable representation by playing with attribute selection and tentative classifiers to visualize the data. It also demonstrates some basic configurations of the <code>StringToWordVector</code> filter, which is the most remarkable tool in WEKA for text mining.</p> <p>If you have had the time to read this whole post, and even tried the program: thank you! I hope it has been a valuable time investment. 
I suggest that you modify the dataset to include more languages, as the problem I have addressed is relatively simple: only three, quite different languages.</p> <p>Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!</p> <div style='clear: both;'></div> </div> <div class='post-footer'> <div class='post-footer-line post-footer-line-1'><span class='post-author vcard'> Posted by <span class='fn'> <a href='https://www.blogger.com/profile/17053588779560658723' rel='author' title='author profile'> Jose Maria Gomez Hidalgo </a> </span> </span> <span class='post-timestamp'> at <a class='timestamp-link' href='http://jmgomezhidalgo.blogspot.com/2013/05/language-identification-as-text.html' rel='bookmark' title='permanent link'><abbr class='published' title='2013-05-20T21:28:00+02:00'>9:28 p.m.</abbr></a> </span> <span class='post-backlinks post-comment-link'> </span> </div> <div class='post-footer-line post-footer-line-2'><span class='post-labels'> Tags: <a href='http://jmgomezhidalgo.blogspot.com/search/label/English' rel='tag'>English</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Information%20Retrieval' rel='tag'>Information Retrieval</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Machine%20Learning' rel='tag'>Machine Learning</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/NLP' rel='tag'>NLP</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Opensource' rel='tag'>Opensource</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Text%20Mining' rel='tag'>Text Mining</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/WEKA' rel='tag'>WEKA</a> 
</span> <span class='post-comment-link'> <a class='comment-link' href='https://www.blogger.com/comment/fullpage/post/36589303/4590226879146293202' onclick='javascript:window.open(this.href, "bloggerPopup", "toolbar=0,location=0,statusbar=1,menubar=0,scrollbars=yes,width=640,height=500"); return false;'>7 comments</a> </span> </div> <div class='post-footer-line post-footer-line-3'><span class='reaction-buttons'> </span> </div> </div> </div> </div> </div></div> <div class="date-outer"> <h2 class='date-header'><span>2.5.13</span></h2> <div class="date-posts"> <div class='post-outer'> <div class='post hentry'> <a name='1659637270885805661'></a> <h3 class='post-title entry-title'> <a href='http://jmgomezhidalgo.blogspot.com/2013/05/mapping-vocabulary-from-train-to-test.html'>Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers</a> </h3> <div class='post-header'> <div class='post-header-line-1'></div> </div> <div class='post-body entry-content' id='post-body-1659637270885805661'> <p>There are several ways of evaluating a (text) classifier: <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">cross validation</a>, splitting your dataset into train and test subsets, or even evaluating the classifier on the training set itself (not recommended). I will not discuss the merits of each method; instead, I will focus on a train/test split evaluation.</p> <p>When you start to work with your train and test text datasets, you have two labelled text collections, such as those I make available at <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub project</a>: <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.train.arff" target="_blank"><code>smsspam.small.train.arff</code></a> and <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.test.arff" target="_blank"><code>smsspam.small.test.arff</code></a>. 
In this case, we have two collections that result from a 50% split of my original simple collection <a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank"><code>smsspam.small.arff</code></a>, which in turn is a subset of the original <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>. The files are in the <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF</a> format:</p> <blockquote> <p><code>@relation sms_test <br/> <br/> @attribute spamclass {spam,ham} <br/> @attribute text String <br/> <br/> @data <br/> ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...' <br/> spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s' <br/> ...</code></p> </blockquote> <p>That is, one text instance per line, the first attribute being the nominal class spam/ham, and the second attribute being the text itself.</p> <p>In text classification, you have to transform this original representation into a vector of terms/words/stems/etc. in order to allow the classifier to learn rules like: "if the word 'win' occurs in a text, then classify it as spam". In other words, you have to represent your texts as feature vectors, where the features are words and the values are, for example, binary weights, <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">TF weights, or TF.IDF weights</a>. 
In fact, WEKA provides the handy <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank"><code>StringToWordVector</code></a> filter for this purpose (thanks, WEKA!).</p> <p>However, the vocabulary used in your training set and in your test set is most likely not identical. For instance, if you apply the <code>StringToWordVector</code> filter directly to each of the previous files, you get slightly different results, summarized in the following table:</p> <p style="TEXT-ALIGN: center"><img height="185" src="http://www.esp.uem.es/jmgomez/blogimg/table.train.test.attributes.png" style="DISPLAY: inline" width="273"/></p> <p>Obviously, to enable learning you have to ensure that the representation of both datasets is the same. For instance, imagine that the root of the decision tree you have learnt on your training collection tests an attribute that does not exist in your test collection. What happens then?</p> <p>Fortunately, WEKA provides at least three ways of getting the same vocabulary in your train and test subcollections. 
Here they are:</p> <ol> <li>Using a <strong>batch filter</strong> that takes both the training and the test collections at the same time, deriving the attributes from the former and representing the latter with those attributes.</li> <li>Using a <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank"><strong>FilteredClassifier</strong></a></code> (which I have discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in previous posts</a>), which combines the filter and the classifier into a single classifier that takes the original class/text representation as input for both the training and the test sets.</li> <li>A more recent method: getting the representations separately and using an <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/misc/InputMappedClassifier.html" target="_blank"><strong>InputMappedClassifier</strong></a></code>, which acts as a wrapper around an underlying classifier and tries to map the attributes of the training collection onto the corresponding ones in the test subset.</li> </ol> <p>The first method is quite simple: it just makes use of the <code>-b</code> option of the WEKA filters. The corresponding command line calls are the following:</p> <blockquote> <p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -b -i smsspam.small.train.arff -o smsspam.small.train.vector.arff -r smsspam.small.test.arff -s smsspam.small.test.vector.arff <br/> $> java weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first <br/> ... 
<br/> === Confusion Matrix === <br/> a b <-- classified as <br/> 1 15 | a = spam <br/> 0 84 | b = ham</code></p> </blockquote> <p>The second method, discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in my previous post</a>, can be applied with the following call:</p> <blockquote> <p><code>$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.train.arff -T smsspam.small.test.arff -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.lazy.IBk -i -c first <br/> ... <br/> === Confusion Matrix === <br/> a b <-- classified as <br/> 1 15 | a = spam <br/> 0 84 | b = ham</code></p> </blockquote> <p>As the previous outputs show, both methods achieve the same results. In this case, I have opted for using <code>StringToWordVector</code> without parameters (default tokenization and term weights, no stemming, etc.) with the relatively weak classifier <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">IBk</a></code>, which implements a k-Nearest-Neighbor learner: instead of building a model from the training collection, it searches for the training instance closest to the test instance (<code>k</code> is 1 by default) and assigns its class to the test instance.</p> <p>However, the third method achieves different results, because in the mapping some attributes from the training collection are treated as missing, and new attributes in the test collection are ignored. 
It is invoked as follows:</p> <blockquote> <p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.train.arff -o smsspam.small.train.vector.arff <br/> $> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.test.arff -o smsspam.small.test.vector.arff <br/> $> java weka.classifiers.misc.InputMappedClassifier -W weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first <br/> Attribute mappings: <br/> Model attributes Incoming attributes <br/> ------------------------------ ---------------- <br/> (nominal) spamclass --> 1 (nominal) spamclass <br/> (numeric) #&gt --> 2 (numeric) #&gt <br/> (numeric) $1 --> - missing (no match) <br/> (numeric) &amp --> - missing (no match) <br/> (numeric) &lt --> 6 (numeric) &lt <br/> (numeric) *9 --> 7 (numeric) *9 <br/> (numeric) + --> - missing (no match) <br/> (numeric) - --> 8 (numeric) - <br/> ... <br/> === Confusion Matrix === <br/> a b <-- classified as <br/> 2 14 | a = spam <br/> 1 83 | b = ham</code></p> </blockquote> <p style="MARGIN-RIGHT: 0px">This time we detect one more spam message (2 out of 16) at the cost of one false positive, so the overall accuracy is exactly the same: 85%. You can see how some of the attributes are missing (they do not occur in the test dataset), like "<code>$1</code>", "<code>+</code>", etc. 
This certainly affects the performance of the classifier, so beware.</p> <p>With these options, my recommendation is to use the first method, as it allows you to fully examine the representation of the datasets (term weight vectors) and it decouples filtering from training, which may be convenient in terms of efficiency.</p> <p>Before ending this post, I have to thank Tiago Pasqualini Silva, <a href="http://www.dt.fee.unicamp.br/~tiago/index.html" target="_blank">Tiago Almeida</a> and <a href="http://paginaspersonales.deusto.es/isantos/en/about.shtml" target="_blank">Igor Santos</a> for our experiments with the SMS Spam Collection, and Tiago Pasqualini in particular for showing me the <code>InputMappedClassifier</code>.</p> <p>And last but not least, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!</p> <div style='clear: both;'></div> </div> <div class='post-footer'> <div class='post-footer-line post-footer-line-1'><span class='post-author vcard'> Posted by <span class='fn'> <a href='https://www.blogger.com/profile/17053588779560658723' rel='author' title='author profile'> Jose Maria Gomez Hidalgo </a> </span> </span> <span class='post-timestamp'> at <a class='timestamp-link' href='http://jmgomezhidalgo.blogspot.com/2013/05/mapping-vocabulary-from-train-to-test.html' rel='bookmark' title='permanent link'><abbr class='published' title='2013-05-02T01:41:00+02:00'>1:41 a.m.</abbr></a> </span> <span class='post-backlinks post-comment-link'> </span> </div> <div class='post-footer-line post-footer-line-2'><span class='post-labels'> Tags: <a href='http://jmgomezhidalgo.blogspot.com/search/label/English' rel='tag'>English</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Evaluation' rel='tag'>Evaluation</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Information%20Retrieval' rel='tag'>Information Retrieval</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Machine%20Learning' rel='tag'>Machine Learning</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/NLP' rel='tag'>NLP</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Spam' rel='tag'>Spam</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/Text%20Mining' rel='tag'>Text Mining</a>, <a href='http://jmgomezhidalgo.blogspot.com/search/label/WEKA' rel='tag'>WEKA</a> </span> <span class='post-comment-link'> <a class='comment-link' href='https://www.blogger.com/comment/fullpage/post/36589303/1659637270885805661' onclick='javascript:window.open(this.href, "bloggerPopup", "toolbar=0,location=0,statusbar=1,menubar=0,scrollbars=yes,width=640,height=500"); return false;'>5 comments</a> </span> </div> <div class='post-footer-line post-footer-line-3'><span class='reaction-buttons'> </span> </div> </div> </div> </div> </div></div> </div> <div class='blog-pager' id='blog-pager'> <span id='blog-pager-newer-link'> <a class='blog-pager-newer-link' href='http://jmgomezhidalgo.blogspot.com/search?updated-max=2013-08-23T11:47:00%2B02:00&max-results=7&reverse-paginate=true' id='Blog1_blog-pager-newer-link' title='Entradas más recientes'>Entradas más recientes</a> </span> <span id='blog-pager-older-link'> <a 
class='blog-pager-older-link' href='http://jmgomezhidalgo.blogspot.com/search?updated-max=2013-05-02T01:41:00%2B02:00&max-results=7' id='Blog1_blog-pager-older-link' title='Entradas antiguas'>Entradas antiguas</a> </span> <a class='home-link' href='http://jmgomezhidalgo.blogspot.com/'>Inicio</a> </div> <div class='clear'></div> <div class='blog-feeds'> <div class='feed-links'> Suscribirse a: <a class='feed-link' href='http://jmgomezhidalgo.blogspot.com/feeds/posts/default' target='_blank' type='application/atom+xml'>Entradas (Atom)</a> </div> </div> </div></div> </div> </div> <div class='column-left-outer'> <div class='column-left-inner'> <aside> </aside> </div> </div> <div class='column-right-outer'> <div class='column-right-inner'> <aside> <div class='sidebar section' id='sidebar-right-1'><div class='widget Profile' data-version='1' id='Profile1'> <h2>Datos personales / Personal Data</h2> <div class='widget-content'> <dl class='profile-datablock'> <dt class='profile-data'> <a class='profile-name-link g-profile' href='https://www.blogger.com/profile/17053588779560658723' rel='author' style='background-image: url(//www.blogger.com/img/logo-16.png);'> Jose Maria Gomez Hidalgo </a> </dt> </dl> <a class='profile-link' href='https://www.blogger.com/profile/17053588779560658723' rel='author'>Ver todo mi perfil</a> <div class='clear'></div> </div> </div><div class='widget HTML' data-version='1' id='HTML1'> <h2 class='title'>Buy books about WEKA / Text Mining</h2> <div class='widget-content'> <a href="http://amzn.to/2iesCUZ">Data Mining: Practical Machine Learning Tools and Techniques</a><br /> <br /> <a href="http://amzn.to/2hwEe8Q">Instant Weka How-to</a><br /> <br /> <a href="http://amzn.to/2hwFysd">Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications</a> </div> <div class='clear'></div> </div><div class='widget Subscribe' data-version='1' id='Subscribe2'> <div style='white-space:nowrap'> <h2 class='title'>Suscribirse a / Subscribe 
to</h2> <div class='widget-content'> <div class='subscribe-wrapper subscribe-type-POST'> <div class='subscribe expanded subscribe-type-POST' id='SW_READER_LIST_Subscribe2POST' style='display:none;'> <div class='top'> <span class='inner' onclick='return(_SW_toggleReaderList(event, "Subscribe2POST"));'> <img class='subscribe-dropdown-arrow' src='https://resources.blogblog.com/img/widgets/arrow_dropdown.gif'/> <img align='absmiddle' alt='' border='0' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Entradas </span> <div class='feed-reader-links'> <a class='feed-reader-link' href='https://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fjmgomezhidalgo.blogspot.com%2Ffeeds%2Fposts%2Fdefault' target='_blank'> <img src='https://resources.blogblog.com/img/widgets/subscribe-netvibes.png'/> </a> <a class='feed-reader-link' href='https://add.my.yahoo.com/content?url=http%3A%2F%2Fjmgomezhidalgo.blogspot.com%2Ffeeds%2Fposts%2Fdefault' target='_blank'> <img src='https://resources.blogblog.com/img/widgets/subscribe-yahoo.png'/> </a> <a class='feed-reader-link' href='http://jmgomezhidalgo.blogspot.com/feeds/posts/default' target='_blank'> <img align='absmiddle' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Atom </a> </div> </div> <div class='bottom'></div> </div> <div class='subscribe' id='SW_READER_LIST_CLOSED_Subscribe2POST' onclick='return(_SW_toggleReaderList(event, "Subscribe2POST"));'> <div class='top'> <span class='inner'> <img class='subscribe-dropdown-arrow' src='https://resources.blogblog.com/img/widgets/arrow_dropdown.gif'/> <span onclick='return(_SW_toggleReaderList(event, "Subscribe2POST"));'> <img align='absmiddle' alt='' border='0' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Entradas </span> </span> </div> <div class='bottom'></div> </div> </div> <div class='subscribe-wrapper subscribe-type-COMMENT'> <div class='subscribe expanded subscribe-type-COMMENT' 
id='SW_READER_LIST_Subscribe2COMMENT' style='display:none;'> <div class='top'> <span class='inner' onclick='return(_SW_toggleReaderList(event, "Subscribe2COMMENT"));'> <img class='subscribe-dropdown-arrow' src='https://resources.blogblog.com/img/widgets/arrow_dropdown.gif'/> <img align='absmiddle' alt='' border='0' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Comentarios </span> <div class='feed-reader-links'> <a class='feed-reader-link' href='https://www.netvibes.com/subscribe.php?url=http%3A%2F%2Fjmgomezhidalgo.blogspot.com%2Ffeeds%2Fcomments%2Fdefault' target='_blank'> <img src='https://resources.blogblog.com/img/widgets/subscribe-netvibes.png'/> </a> <a class='feed-reader-link' href='https://add.my.yahoo.com/content?url=http%3A%2F%2Fjmgomezhidalgo.blogspot.com%2Ffeeds%2Fcomments%2Fdefault' target='_blank'> <img src='https://resources.blogblog.com/img/widgets/subscribe-yahoo.png'/> </a> <a class='feed-reader-link' href='http://jmgomezhidalgo.blogspot.com/feeds/comments/default' target='_blank'> <img align='absmiddle' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Atom </a> </div> </div> <div class='bottom'></div> </div> <div class='subscribe' id='SW_READER_LIST_CLOSED_Subscribe2COMMENT' onclick='return(_SW_toggleReaderList(event, "Subscribe2COMMENT"));'> <div class='top'> <span class='inner'> <img class='subscribe-dropdown-arrow' src='https://resources.blogblog.com/img/widgets/arrow_dropdown.gif'/> <span onclick='return(_SW_toggleReaderList(event, "Subscribe2COMMENT"));'> <img align='absmiddle' alt='' border='0' class='feed-icon' src='https://resources.blogblog.com/img/icon_feed12.png'/> Comentarios </span> </span> </div> <div class='bottom'></div> </div> </div> <div style='clear:both'></div> </div> </div> <div class='clear'></div> </div><div class='widget BlogArchive' data-version='1' id='BlogArchive1'> <h2>Archivo / Archive</h2> <div class='widget-content'> <div id='ArchiveList'> <div 
id='BlogArchive1_ArchiveList'> <select id='BlogArchive1_ArchiveMenu'> <option value=''>Archivo / Archive</option> <option value='http://jmgomezhidalgo.blogspot.com/2014/10/'>octubre 2014 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2014/05/'>mayo 2014 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2014/01/'>enero 2014 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/08/'>agosto 2013 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/07/'>julio 2013 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/06/'>junio 2013 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/05/'>mayo 2013 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/04/'>abril 2013 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/02/'>febrero 2013 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2013/01/'>enero 2013 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2012/12/'>diciembre 2012 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2012/05/'>mayo 2012 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2012/02/'>febrero 2012 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/11/'>noviembre 2011 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/10/'>octubre 2011 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/06/'>junio 2011 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/05/'>mayo 2011 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/04/'>abril 2011 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/03/'>marzo 2011 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2011/01/'>enero 2011 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/12/'>diciembre 2010 (8)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/11/'>noviembre 2010 (4)</option> <option 
value='http://jmgomezhidalgo.blogspot.com/2010/10/'>octubre 2010 (5)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/09/'>septiembre 2010 (5)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/08/'>agosto 2010 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/05/'>mayo 2010 (9)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/04/'>abril 2010 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/03/'>marzo 2010 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/02/'>febrero 2010 (24)</option> <option value='http://jmgomezhidalgo.blogspot.com/2010/01/'>enero 2010 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/12/'>diciembre 2009 (11)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/11/'>noviembre 2009 (15)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/10/'>octubre 2009 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/09/'>septiembre 2009 (8)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/08/'>agosto 2009 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/07/'>julio 2009 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/05/'>mayo 2009 (4)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/04/'>abril 2009 (26)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/03/'>marzo 2009 (19)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/02/'>febrero 2009 (12)</option> <option value='http://jmgomezhidalgo.blogspot.com/2009/01/'>enero 2009 (30)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/12/'>diciembre 2008 (12)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/11/'>noviembre 2008 (6)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/10/'>octubre 2008 (9)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/09/'>septiembre 2008 (10)</option> <option 
value='http://jmgomezhidalgo.blogspot.com/2008/08/'>agosto 2008 (3)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/07/'>julio 2008 (12)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/06/'>junio 2008 (12)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/05/'>mayo 2008 (12)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/04/'>abril 2008 (17)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/03/'>marzo 2008 (16)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/02/'>febrero 2008 (8)</option> <option value='http://jmgomezhidalgo.blogspot.com/2008/01/'>enero 2008 (1)</option> <option value='http://jmgomezhidalgo.blogspot.com/2007/12/'>diciembre 2007 (2)</option> <option value='http://jmgomezhidalgo.blogspot.com/2007/11/'>noviembre 2007 (6)</option> <option value='http://jmgomezhidalgo.blogspot.com/2007/10/'>octubre 2007 (7)</option> <option value='http://jmgomezhidalgo.blogspot.com/2006/11/'>noviembre 2006 (6)</option> <option value='http://jmgomezhidalgo.blogspot.com/2006/10/'>octubre 2006 (2)</option> </select> </div> </div> <div class='clear'></div> </div> </div><div class='widget LinkList' data-version='1' id='LinkList1'> <h2>Mi perfil en / My profile at</h2> <div class='widget-content'> <ul> <li><a href='http://www.esp.uem.es/jmgomez/'>Página personal / Home Page</a></li> <li><a href='http://www.facebook.com/jmgomezh'>Facebook</a></li> <li><a href='http://www.linkedin.com/in/jmgomezh'>LinkedIn</a></li> <li><a href='http://twitter.com/jmgomez'>Twitter</a></li> <li><a href='http://picasaweb.google.es/Jose.Maria.Gomez.Hidalgo'>Picasa</a></li> <li><a href='http://www.slideshare.net/jmgomezh'>SlideShare</a></li> </ul> <div class='clear'></div> </div> </div><div class='widget Label' data-version='1' id='Label1'> <h2>Etiquetas / Tags</h2> <div class='widget-content cloud-label-widget-content'> <span class='label-size label-size-3'> <a dir='ltr' 
href='http://jmgomezhidalgo.blogspot.com/search/label/Biomedicine'>Biomedicine</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Captcha'>Captcha</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/CFP'>CFP</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Children%20Protection'>Children Protection</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Cloud%20Computing'>Cloud Computing</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Control%20parental'>Control parental</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Cultura'>Cultura</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Culture'>Culture</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Data%20Mining'>Data Mining</a> </span> <span class='label-size label-size-5'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/English'>English</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Evaluacion'>Evaluacion</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Evaluation'>Evaluation</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Eventos'>Eventos</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Events'>Events</a> </span> <span class='label-size label-size-3'> <a dir='ltr' 
href='http://jmgomezhidalgo.blogspot.com/search/label/Humor'>Humor</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Imagen'>Imagen</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Imaging'>Imaging</a> </span> <span class='label-size label-size-5'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Information%20Retrieval'>Information Retrieval</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Internet'>Internet</a> </span> <span class='label-size label-size-5'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Machine%20Learning'>Machine Learning</a> </span> <span class='label-size label-size-5'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/NLP'>NLP</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Online%20Advertising'>Online Advertising</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Opensource'>Opensource</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Opinion'>Opinion</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Opini%C3%B3n'>Opinión</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Opinion%20Mining'>Opinion Mining</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Papers'>Papers</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Parental%20Control'>Parental Control</a> </span> <span class='label-size label-size-4'> <a dir='ltr' 
href='http://jmgomezhidalgo.blogspot.com/search/label/Personal'>Personal</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Phishing'>Phishing</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Privacidad'>Privacidad</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Privacy'>Privacy</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Programming'>Programming</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Protecci%C3%B3n%20del%20menor'>Protección del menor</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Recommendation'>Recommendation</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Recommender%20Systems'>Recommender Systems</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Recuperaci%C3%B3n%20de%20Informaci%C3%B3n'>Recuperación de Información</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Redes%20Sociales'>Redes Sociales</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Resources'>Resources</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Rob%C3%B3tica'>Robótica</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Search%20Engines'>Search Engines</a> </span> <span class='label-size label-size-4'> <a dir='ltr' 
href='http://jmgomezhidalgo.blogspot.com/search/label/Security'>Security</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Security%20as%20a%20Service'>Security as a Service</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Seguridad'>Seguridad</a> </span> <span class='label-size label-size-1'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Smartphone'>Smartphone</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Social%20Networks'>Social Networks</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Software%20libre'>Software libre</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Spam'>Spam</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Tecnolog%C3%ADa'>Tecnología</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Text%20Mining'>Text Mining</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Trivia'>Trivia</a> </span> <span class='label-size label-size-2'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Tutorial'>Tutorial</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Virus'>Virus</a> </span> <span class='label-size label-size-4'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/Web%20Filtering'>Web Filtering</a> </span> <span class='label-size label-size-3'> <a dir='ltr' href='http://jmgomezhidalgo.blogspot.com/search/label/WEKA'>WEKA</a> </span> <div class='clear'></div> </div> </div></div> </aside> </div> 
</div> </div> <div style='clear: both'></div>  </div>  </div> </div> <div class='main-cap-bottom cap-bottom'> <div class='cap-left'></div> <div class='cap-right'></div> </div> </div> <footer> <div class='footer-outer'> <div class='footer-cap-top cap-top'> <div class='cap-left'></div> <div class='cap-right'></div> </div> <div class='fauxborder-left footer-fauxborder-left'> <div class='fauxborder-right footer-fauxborder-right'></div> <div class='region-inner footer-inner'> <div class='foot no-items section' id='footer-1'></div> <table border='0' cellpadding='0' cellspacing='0' class='section-columns columns-2'> <tbody> <tr> <td class='first columns-cell'> <div class='foot no-items section' id='footer-2-1'></div> </td> <td class='columns-cell'> <div class='foot no-items section' id='footer-2-2'></div> </td> </tr> </tbody> </table>  <div class='foot section' id='footer-3'><div class='widget Attribution' data-version='1' id='Attribution1'> <div class='widget-content' style='text-align: center;'> Con la tecnología de <a href='https://www.blogger.com' target='_blank'>Blogger</a>. 
</div> <div class='clear'></div> </div></div> </div> </div> <div class='footer-cap-bottom cap-bottom'> <div class='cap-left'></div> <div class='cap-right'></div> </div> </div> </footer>  </div> </div> <div class='content-cap-bottom cap-bottom'> <div class='cap-left'></div> <div class='cap-right'></div> </div> </div> </div> <script type='text/javascript'> window.setTimeout(function() { document.body.className = document.body.className.replace('loading', ''); }, 10); </script> <script type="text/javascript" src="https://www.blogger.com/static/v1/widgets/4157554182-widgets.js"></script> <script type='text/javascript'> window['__wavt'] = 'AOuZoY4VaqocwR09NbmlxJoFQzEaBDcEuw:1752608287756';_WidgetManager._Init('//www.blogger.com/rearrange?blogID\x3d36589303','//jmgomezhidalgo.blogspot.com/2013/05/','36589303'); _WidgetManager._SetDataContext([{'name': 'blog', 'data': {'blogId': '36589303', 'title': 'Nihil Obstat', 'url': 'http://jmgomezhidalgo.blogspot.com/2013/05/', 'canonicalUrl': 'http://jmgomezhidalgo.blogspot.com/2013/05/', 'homepageUrl': 'http://jmgomezhidalgo.blogspot.com/', 'searchUrl': 'http://jmgomezhidalgo.blogspot.com/search', 'canonicalHomepageUrl': 'http://jmgomezhidalgo.blogspot.com/', 'blogspotFaviconUrl': 'http://jmgomezhidalgo.blogspot.com/favicon.ico', 'bloggerUrl': 'https://www.blogger.com', 'hasCustomDomain': false, 'httpsEnabled': true, 'enabledCommentProfileImages': true, 'gPlusViewType': 'FILTERED_POSTMOD', 'adultContent': false, 'analyticsAccountNumber': '', 'encoding': 'UTF-8', 'locale': 'es', 'localeUnderscoreDelimited': 'es', 'languageDirection': 'ltr', 'isPrivate': false, 'isMobile': false, 'isMobileRequest': false, 'mobileClass': '', 'isPrivateBlog': false, 'isDynamicViewsAvailable': true, 'feedLinks': '\x3clink rel\x3d\x22alternate\x22 type\x3d\x22application/atom+xml\x22 title\x3d\x22Nihil Obstat - Atom\x22 href\x3d\x22http://jmgomezhidalgo.blogspot.com/feeds/posts/default\x22 /\x3e\n\x3clink rel\x3d\x22alternate\x22 
type\x3d\x22application/rss+xml\x22 title\x3d\x22Nihil Obstat - RSS\x22 href\x3d\x22http://jmgomezhidalgo.blogspot.com/feeds/posts/default?alt\x3drss\x22 /\x3e\n\x3clink rel\x3d\x22service.post\x22 type\x3d\x22application/atom+xml\x22 title\x3d\x22Nihil Obstat - Atom\x22 href\x3d\x22https://www.blogger.com/feeds/36589303/posts/default\x22 /\x3e\n', 'meTag': '', 'adsenseHostId': 'ca-host-pub-1556223355139109', 'adsenseHasAds': false, 'adsenseAutoAds': false, 'boqCommentIframeForm': true, 'loginRedirectParam': '', 'view': '', 'dynamicViewsCommentsSrc': '//www.blogblog.com/dynamicviews/4224c15c4e7c9321/js/comments.js', 'dynamicViewsScriptSrc': '//www.blogblog.com/dynamicviews/3f13309f54609467', 'plusOneApiSrc': 'https://apis.google.com/js/platform.js', 'disableGComments': true, 'interstitialAccepted': false, 'sharing': {'platforms': [{'name': 'Obtener enlace', 'key': 'link', 'shareMessage': 'Obtener enlace', 'target': ''}, {'name': 'Facebook', 'key': 'facebook', 'shareMessage': 'Compartir en Facebook', 'target': 'facebook'}, {'name': 'Escribe un blog', 'key': 'blogThis', 'shareMessage': 'Escribe un blog', 'target': 'blog'}, {'name': 'X', 'key': 'twitter', 'shareMessage': 'Compartir en X', 'target': 'twitter'}, {'name': 'Pinterest', 'key': 'pinterest', 'shareMessage': 'Compartir en Pinterest', 'target': 'pinterest'}, {'name': 'Correo electr\xf3nico', 'key': 'email', 'shareMessage': 'Correo electr\xf3nico', 'target': 'email'}], 'disableGooglePlus': true, 'googlePlusShareButtonWidth': 0, 'googlePlusBootstrap': '\x3cscript type\x3d\x22text/javascript\x22\x3ewindow.___gcfg \x3d {\x27lang\x27: \x27es\x27};\x3c/script\x3e'}, 'hasCustomJumpLinkMessage': false, 'jumpLinkMessage': 'Leer m\xe1s', 'pageType': 'archive', 'pageName': 'mayo 2013', 'pageTitle': 'Nihil Obstat: mayo 2013'}}, {'name': 'features', 'data': {}}, {'name': 'messages', 'data': {'edit': 'Editar', 'linkCopiedToClipboard': 'El enlace se ha copiado en el Portapapeles.', 'ok': 'Aceptar', 'postLink': 'Enlace de la 
entrada'}}, {'name': 'template', 'data': {'name': 'custom', 'localizedName': 'Personalizado', 'isResponsive': false, 'isAlternateRendering': false, 'isCustom': true}}, {'name': 'view', 'data': {'classic': {'name': 'classic', 'url': '?view\x3dclassic'}, 'flipcard': {'name': 'flipcard', 'url': '?view\x3dflipcard'}, 'magazine': {'name': 'magazine', 'url': '?view\x3dmagazine'}, 'mosaic': {'name': 'mosaic', 'url': '?view\x3dmosaic'}, 'sidebar': {'name': 'sidebar', 'url': '?view\x3dsidebar'}, 'snapshot': {'name': 'snapshot', 'url': '?view\x3dsnapshot'}, 'timeslide': {'name': 'timeslide', 'url': '?view\x3dtimeslide'}, 'isMobile': false, 'title': 'Nihil Obstat', 'description': '\x3cb\x3eBlog de/by Jos\xe9 Mar\xeda G\xf3mez Hidalgo\x3c/b\x3e\x3cbr\x3e\x3cbr\x3e\nMis reflexiones sobre tecnolog\xeda e Internet, seguridad e inteligencia artificial\x3cbr\x3e\nMy opinions about technology, Internet, security and Artificial Intelligence', 'url': 'http://jmgomezhidalgo.blogspot.com/2013/05/', 'type': 'feed', 'isSingleItem': false, 'isMultipleItems': true, 'isError': false, 'isPage': false, 'isPost': false, 'isHomepage': false, 'isArchive': true, 'isLabelSearch': false, 'archive': {'year': 2013, 'month': 5, 'rangeMessage': 'Mostrando entradas de mayo, 2013'}}}]); _WidgetManager._RegisterWidget('_NavbarView', new _WidgetInfo('Navbar1', 'navbar', document.getElementById('Navbar1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_HeaderView', new _WidgetInfo('Header1', 'header', document.getElementById('Header1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_BlogView', new _WidgetInfo('Blog1', 'main', document.getElementById('Blog1'), {'cmtInteractionsEnabled': false, 'lightboxEnabled': true, 'lightboxModuleUrl': 'https://www.blogger.com/static/v1/jsbin/3107444943-lbx__es.js', 'lightboxCssUrl': 'https://www.blogger.com/static/v1/v-css/123180807-lightbox_bundle.css'}, 'displayModeFull')); _WidgetManager._RegisterWidget('_ProfileView', new _WidgetInfo('Profile1', 
'sidebar-right-1', document.getElementById('Profile1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_HTMLView', new _WidgetInfo('HTML1', 'sidebar-right-1', document.getElementById('HTML1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_SubscribeView', new _WidgetInfo('Subscribe2', 'sidebar-right-1', document.getElementById('Subscribe2'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_BlogArchiveView', new _WidgetInfo('BlogArchive1', 'sidebar-right-1', document.getElementById('BlogArchive1'), {'languageDirection': 'ltr', 'loadingMessage': 'Cargando\x26hellip;'}, 'displayModeFull')); _WidgetManager._RegisterWidget('_LinkListView', new _WidgetInfo('LinkList1', 'sidebar-right-1', document.getElementById('LinkList1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_LabelView', new _WidgetInfo('Label1', 'sidebar-right-1', document.getElementById('Label1'), {}, 'displayModeFull')); _WidgetManager._RegisterWidget('_AttributionView', new _WidgetInfo('Attribution1', 'footer-3', document.getElementById('Attribution1'), {}, 'displayModeFull')); </script> </body> </html>

Nihil Obstat

23.5.13

Compilation of Resources for Text-based Age Detection

22.5.13

Presentation: "Menores y móviles: Usos, riesgos y controles parentales" (Minors and mobile phones: uses, risks and parental controls)