tag:blogger.com,1999:blog-365893032024-03-13T18:33:17.571+01:00Nihil Obstat<b>Blog de/by José María Gómez Hidalgo</b><br><br>
Mis reflexiones sobre tecnología e Internet, seguridad e inteligencia artificial<br>
My opinions about technology, Internet, security and Artificial IntelligenceJose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.comBlogger390125tag:blogger.com,1999:blog-36589303.post-38338332968815317712014-10-10T11:23:00.001+02:002014-10-10T11:23:49.102+02:00Carlos Laorden nominated for "Born To Be Discovery" for Negobot<a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a>, PhD in Information Systems from the University of Deusto, and a colleague and friend from <a href="http://www.deustotech.deusto.es/" target="_blank">DeustoTech</a>, has been nominated in the Science and Technology category of the "<a href="http://borntobediscovery.discoverymax.es/" target="_blank">Born to be Discovery</a>" awards for the anti-paedophile bot NEGOBOT. I have already voted for him. Will you?Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-42154637942777434342014-05-21T11:49:00.001+02:002014-05-21T11:58:16.678+02:00WEKA Text Mining Trick: Copying Options from the Explorer to the Command Line<p>From previous posts (especially <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">Command Line Functions for Text Mining in WEKA</a>), you may know that writing command-line calls to WEKA can be far from trivial, mostly because you may need to nest <tt><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></tt>, <tt><a href="http://weka.sourceforge.net/doc.dev/weka/attributeSelection/AttributeSelection.html" target="_blank">AttributeSelection</a></tt> and a classifier into a single command with plenty of options -- <em>and nested strings with escaped characters</em>.</p>
<p>For instance, consider the following need: I want to test the classifier J48 on the <tt><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></tt> file, which contains pairs of <tt>{class,text}</tt> lines. However, I want to:</p>
<ul>
<li>Apply <tt>StringToWordVector</tt> with specific options: lowercased tokens, specific string delimiters, etc.</li>
<li>Get only those words with Information Gain over zero, which implies using the filter <tt>AttributeSelection</tt> with <tt>InfoGainAttributeEval</tt> and <tt>Ranker</tt> with threshold <tt>0.0</tt>.</li>
<li>Make use of 10-fold cross-validation, which implies using <tt>FilteredClassifier</tt> so that the filters are applied within each fold; and as I have two filters (<tt>StringToWordVector</tt> and <tt>AttributeSelection</tt>), I need to make use of <tt>MultiFilter</tt> as well.</li>
</ul>
<p>With some experience, this is not too hard to do by hand. However, it is much easier to configure your test in the WEKA Explorer, make a quick test with a very small subset of your dataset, then copy the configuration to a text file and edit it to fully fit your needs. For this specific example, I start by loading the dataset at the Preprocess tab, and then I configure the classifier by:</p>
<ol>
<li>Choosing <tt>FilteredClassifier</tt>, and <tt>J48</tt> as the classifier.</li>
<li>Choosing <tt>MultiFilter</tt> as the filter, then deleting the default <tt>AllFilter</tt> and adding <tt>StringToWordVector</tt> and <tt>AttributeSelection</tt> filters to it.</li>
<li>Editing the <tt>StringToWordVector</tt> filter to specify lowercased tokens, no per-class operation, and my list of delimiters.</li>
<li>Editing the <tt>AttributeSelection</tt> filter to choose <tt>InfoGainAttributeEval</tt> as the evaluator, and <tt>Ranker</tt> with threshold <tt>0.0</tt> as the search method.</li>
</ol>
<p>Here is a screenshot taken in the middle of the process, while editing the <tt>StringToWordVector</tt> filter:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh5.googleusercontent.com/-Z2yipUR1JLg/U3xghaWvjwI/AAAAAAAACFE/YIyV4KjZS5U/w644-h595-no/weka.explorer.configure.process.png" style="WIDTH: 550px; DISPLAY: inline; HEIGHT: 507px" height="507" width="550"/></p>
<p>Then you can specify <tt>spamclass</tt> as the class and run it to get something like:</p>
<p><tt>=== Run information ===
<br/>
Scheme: weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2
<br/>
<br/>
Relation: sms_test
<br/>
Instances: 200
<br/>
Attributes: 2 spamclass text
<br/>
Test mode: 10-fold cross-validation
<br/>
(../..)
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
16 17 | a = spam
<br/>
6 161 | b = ham</tt></p>
<p>As you can see, the <tt>Scheme</tt> line gives us the exact command options we need to get that result! You can just copy and edit it (after saving the result buffer) to get what you want. Alternatively, you can right-click on the classifier configuration in the Explorer, as in the following picture:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh4.googleusercontent.com/-CwMOzx3uh38/U3xii8GgYeI/AAAAAAAACFc/jG9S8btfGho/w796-h595-no/weka.explorer.cut.options.png" style="WIDTH: 550px; DISPLAY: inline; HEIGHT: 411px" height="411" width="550"/></p>
<p>In any case, you get the following messy thing:</p>
<p><tt>weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2</tt></p>
<p>Then you can strip the options you do not need. For instance, some default options in <tt>StringToWordVector</tt> are <tt>-R first-last</tt>, <tt>-prune-rate -1.0</tt>, <tt>-N 0</tt>, the stemmer, etc. You can find the default options by issuing the help command:</p>
<p><tt>$>java weka.filters.unsupervised.attribute.StringToWordVector -h
<br/>
Help requested.
<br/>
<br/>
Filter options:
<br/>
-C
<br/>
Output word counts rather than boolean word presence.
<br/>
-R <index1,index2-index4,...>
<br/>
Specify list of string attributes to convert to words (as weka Range).
<br/>
(default: select all string attributes)
<br/>
...</tt></p>
<p>So after cleaning the default options (in all filters and the classifier), adding the dataset file and the class index (<tt>-t smsspam.small.arff -c 1</tt>), and with some pretty printing for clarity, you can easily build the following command:</p>
<p><tt>java weka.classifiers.meta.FilteredClassifier
<br/>
-c 1
<br/>
-t smsspam.small.arff
<br/>
-F "weka.filters.MultiFilter
<br/>
-F \"weka.filters.unsupervised.attribute.StringToWordVector
<br/>
-W 100000
<br/>
-L
<br/>
-O
<br/>
-tokenizer \\\"weka.core.tokenizers.WordTokenizer
<br/>
-delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\"
<br/>
-F \"weka.filters.supervised.attribute.AttributeSelection
<br/>
-E \\\"weka.attributeSelection.InfoGainAttributeEval \\\"
<br/>
-S \\\"weka.attributeSelection.Ranker -T 0.0 \\\"\""
<br/>
-W weka.classifiers.trees.J48</tt></p>
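<p>By the way, the nested quoting follows a mechanical rule: every time a scheme becomes an option value of an outer scheme, it is wrapped in double quotes and its existing quotes and backslashes are backslash-escaped, one extra level per nesting depth. Here is a hedged sketch in Python (the helpers <tt>weka_nest</tt> and <tt>weka_quote</tt> are illustrative names of mine, not part of WEKA) that rebuilds a simplified version of the command programmatically:</p>

```python
# Illustrative sketch (not a WEKA API): build a nested WEKA command string
# by applying one level of quote/backslash escaping per nesting depth.
def weka_nest(scheme, *options):
    """Join a scheme class name and its options into a single string."""
    return " ".join([scheme, *options])

def weka_quote(inner):
    """Wrap an inner scheme so it can be passed as one quoted option value
    of an outer scheme: escape backslashes first, then double quotes."""
    return '"' + inner.replace("\\", "\\\\").replace('"', '\\"') + '"'

s2wv = weka_nest("weka.filters.unsupervised.attribute.StringToWordVector",
                 "-W 100000", "-L", "-O")
attsel = weka_nest("weka.filters.supervised.attribute.AttributeSelection",
                   "-E", weka_quote("weka.attributeSelection.InfoGainAttributeEval"),
                   "-S", weka_quote("weka.attributeSelection.Ranker -T 0.0"))
multi = weka_nest("weka.filters.MultiFilter",
                  "-F", weka_quote(s2wv),
                  "-F", weka_quote(attsel))
cmd = weka_nest("java weka.classifiers.meta.FilteredClassifier",
                "-c 1", "-t smsspam.small.arff",
                "-F", weka_quote(multi),
                "-W weka.classifiers.trees.J48")
print(cmd)
```

<p>Schemes quoted at the first level come out as <tt>\"...\"</tt> and those three levels deep as <tt>\\\"...\\\"</tt>, which is exactly the escaping pattern you see in the <tt>Scheme</tt> line.</p>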
<p>Now you can change any other parameters you want, to test other text representations, classifiers, and so on, without having to work out the escaping of options and delimiters by hand.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-63130817043783542292014-01-30T12:43:00.001+01:002014-01-30T12:43:40.636+01:00CFP: Sixth International Conference on Social Informatics<p>The <a href="http://socinfo2014.org/" target="_blank">Sixth International Conference on Social Informatics</a> (SocInfo 2014) will take place in Barcelona, Spain, from November 10th to November 13th. The ultimate goal of Social Informatics is to create a better understanding of socially-centric platforms, not just as a technology, but also as a set of social phenomena. To that end, the organizers invite interdisciplinary papers on applying information technology in the study of social phenomena, on applying social concepts in the design of information systems, on applying methods from the social sciences in the study of social computing and information systems, on applying computational algorithms to facilitate the study of social systems and human social dynamics, and on designing information and communication technologies that consider social context.</p>
<p><strong>Important dates</strong></p>
<ul>
<li>
<div>Full paper submission: August 8, 2014 (23:59 Hawaii Standard Time)</div>
</li>
<li>
<div>Notification of acceptance: October 3, 2014</div>
</li>
<li>
<div>Submission of final version: October 10, 2014</div>
</li>
<li>
<div>Conference dates: November 10-13, 2014</div>
</li>
</ul>
<p><strong>Topics</strong></p>
<ul>
<li>
<div>New theories, methods and objectives in computational social science</div>
</li>
<li>
<div>Computational models of social phenomena and social simulation</div>
</li>
<li>
<div>Social behavior modeling</div>
</li>
<li>
<div>Social communities: discovery, evolution, analysis, and applications</div>
</li>
<li>
<div>Dynamics of social collaborative systems</div>
</li>
<li>
<div>Social network analysis and mining</div>
</li>
<li>
<div>Mining social big data</div>
</li>
<li>
<div>Social Influence and social contagion</div>
</li>
<li>
<div>Web mining and its social interpretations</div>
</li>
<li>
<div>Quantifying offline phenomena through online data</div>
</li>
<li>
<div>Rich representations of social ties</div>
</li>
<li>
<div>Security, privacy, trust, reputation, and incentive issues</div>
</li>
<li>
<div>Opinion mining and social media analytics</div>
</li>
<li>
<div>Credibility of online content</div>
</li>
<li>
<div>Algorithms and protocols inspired by human societies</div>
</li>
<li>
<div>Mechanisms for providing fairness in information systems</div>
</li>
<li>
<div>Social choice mechanisms in the e-society</div>
</li>
<li>
<div>Social applications of the semantic Web</div>
</li>
<li>
<div>Social system design and architectures</div>
</li>
<li>
<div>Virtual communities (e.g., open-source, multiplayer gaming, etc.)</div>
</li>
<li>
<div>Impact of technology on socio-economic, security, defense aspects</div>
</li>
<li>
<div>Real-time analysis or visualization of social phenomena and social graphs</div>
</li>
<li>
<div>Socio-economic systems and applications</div>
</li>
<li>
<div>Collective intelligence and social cognition</div>
</li>
</ul>
<p>My friend <a href="http://boldi.di.unimi.it/" target="_blank">Paolo Boldi</a> is on the organizing committee.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-56639434835643248422013-08-23T11:47:00.001+02:002013-08-23T18:58:38.084+02:00Data Mining for Political Elections, and Isaac Asimov<p>Using Data Mining, Data Science and Big Data is cool in political elections, and in political decision-making. Well, maybe not cool, but it has been a trending topic in Data Science in recent years.</p>
<p>Here are some examples:</p>
<ul>
<li>
<div><a href="http://www.computerworld.com/s/article/9232567/Campaign_2012_Mining_for_voters" target="_blank">Campaign 2012: Mining for voters. Data-driven campaigning goes mainstream</a>.</div>
</li>
<li>
<div><a href="http://analytics.blogspot.com.es/2013/08/obama-for-america-uses-google-analytics.html" target="_blank">Obama for America uses Google Analytics to democratize rapid, data-driven decision making</a>.</div>
</li>
<li>
<div><a href="http://www.computerworld.com/s/article/9233587/Barack_Obama_39_s_Big_Data_won_the_US_election" target="_blank">Barack Obama's Big Data won the US election</a>.</div>
</li>
</ul>
<p>From the research point of view, you can check, for instance, how Twitter information is used in political campaigns in this <a href="https://sites.google.com/site/twitterandtherealworld/home" target="_blank">Twitter and the Real World CIKM'13 Tutorial</a> by Ingmar Weber and Yelena Mejova. It includes an interesting list of references on several ways of using Twitter to predict user political orientation, general public trends, and more. On the opposite side, you can find an interesting paper which provides sound criticism of some of the research performed on Twitter and politics: <a href="http://arxiv.org/pdf/1204.6441v1.pdf" target="_blank">"I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper": A Balanced Survey on Election Prediction using Twitter Data</a>, by Daniel Gayo-Avello.</p>
<p>Anyway, it should be clear from multiple points of view that governments (e.g. the <a href="http://en.wikipedia.org/wiki/PRISM_(surveillance_program)" target="_blank">NSA PRISM case</a>) and politicians are collecting and using citizen data in order to predict their tastes and to guide their decisions and actions in political campaigns.</p>
<p>I will avoid the privacy discussion here, as I want to make a case for something different. My point is: <strong><em>Hey, if they can predict election results, then why vote?</em></strong></p>
<p>But my blog is not a political one; it should be a technical one - or at least, a technically-focused one. And like many computer geeks, I am a sci-fi fan. And since one of the greatest authors is <a href="http://en.wikipedia.org/wiki/Isaac_Asimov" target="_blank">Isaac Asimov</a>, I have read a lot of his work.</p>
<p><em><strong>What does Asimov have to do with data mining in politics?</strong></em> Well, <em><strong>he predicted it</strong></em>.</p>
<p>More precisely, he predicted <em><strong>how elections may evolve in the Era of Big Data</strong></em>. And he answered my question. <strong><em>You will not vote</em></strong>.</p>
<p>Asimov used to publish short stories in sci-fi magazines (as many others did, I know). In August 1955, he published a short story titled "<strong><a href="http://en.wikipedia.org/wiki/Franchise_(short_story)" target="_blank">Franchise</a></strong>" in the magazine "If: Worlds of Science Fiction". I read that story many years later, reprinted in one of his short story collections. I was young, and I liked the story, but not too much - there were others in the volume more appealing to my taste. However, I have revisited it recently, and in the light of my technical background, things have changed.</p>
<p>That is <em>real</em> scifi. He technically predicted the future. And it is happening.</p>
<p>The plot is simple; just let me quote the Wikipedia article:</p>
<blockquote>
<p>In the future, the United States has converted to an "electronic democracy" where the computer Multivac selects a single person to answer a number of questions. Multivac will then use the answers and other data to determine what the results of an election would be, avoiding the need for an actual election to be held.</p>
</blockquote>
<p>As the Big Data platform (the computer Multivac in the story) gets to know more and more about the citizens, it will need less and less information to accurately predict election results. The problem is reduced to asking a list of (quite <a href="http://en.wikipedia.org/wiki/Sentiment_analysis" target="_blank">Sentiment Analysis</a> related) questions to a single citizen, selected as representative, in order to refine some details, and that's it.</p>
<p><strong><em>Do not blame him, nor me. It is just happening.</em></strong></p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>
<p><strong>Update 1:</strong> Yet another example: <a href="http://www.newscientist.com/article/mg21929315.500-twitter-hashtags-predict-rising-tension-in-egypt.html" target="_blank">Twitter hashtags predict rising tension in Egypt</a>.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-6776958319527637272013-07-27T17:06:00.001+02:002013-07-27T17:06:43.681+02:00More Clever Tokenization of Spanish Text in Social Networks<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/spanish.tokenizer.header.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 252px" height="252" width="450"/></p>
<p>Text written by users in Social Networks is noisy: emoticons, chat codes, typos, grammar mistakes, and moreover, explicit noise created by users as a style, trend or fashion. Consider the following utterance, taken from a post in the social network <a href="https://www.tuenti.com/" target="_blank">Tuenti</a>:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"felicidadees!! k t lo pases muy bien!! =)Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :DFeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) (heart)"</p>
</blockquote>
<p>This is a real text. Its approximate translation to English would be something like:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"happybirthdaay!! njy it lot!! =)Haaapyyybirthdaaayyy !! (:Happybirthdayyy!!have a great timeee :DHappyyBiirtHdayY :D Enjy! ;) (heart)"</p>
</blockquote>
<p>The last word, in parentheses, is a Tuenti code that is displayed as a heart.</p>
<p>If you want to find more text like this out there, just point your browser to <a href="http://www.fotolog.com/" target="_blank">Fotolog</a>.</p>
<p>As you can imagine, just tokenizing this kind of text for further analysis is quite a headache. During our experiments for the project <a href="http://wendy.optenet.com/" target="_blank">WENDY</a> (link in Spanish), we have designed a relatively simple tokenization algorithm to deal with this kind of text for age prediction. Although the method is designed for Spanish, it is quite language-independent and may well be applied to other languages, although we have not tested this yet. The algorithm is as follows:</p>
<ol>
<li>Separate the initial string into candidate tokens using white spaces.</li>
<li>A candidate token can be:</li>
<li style="list-style: none">
<ol>
<li>A proper sequence of alphabetic characters (a potential word), or a proper sequence of punctuation symbols (a potential emoticon). In this case, the candidate token is already considered a token.</li>
<li>A mixed sequence of alphabetic characters and punctuation symbols. In this case, the character sequence is divided into sequences of alphabetic characters and sequences of punctuation symbols. For instance, "Hola:-)ketal" is further divided into "Hola", ":-)", and "ketal".</li>
</ol>
</li>
</ol>
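<p>The two steps above can be sketched in a few lines of Python. This is a minimal illustrative re-implementation of mine, not the actual WENDY code, using a Unicode-aware regular expression so that accented Spanish characters count as letters:</p>

```python
import re

# Step 1: split on white space. Step 2: break each mixed candidate token
# into runs of alphabetic characters (potential words), runs of
# punctuation (potential emoticons), or runs of digits.
def tokenize(text):
    tokens = []
    for candidate in text.split():
        # [^\W\d_]+ matches letter runs (Unicode-aware, so accented
        # characters count); [^\w\s]+ matches punctuation runs.
        tokens.extend(re.findall(r"[^\W\d_]+|[^\w\s]+|\d+", candidate))
    return tokens

print(tokenize("Hola:-)ketal"))  # → ['Hola', ':-)', 'ketal']
```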
<p>For instance, consider the following (real) text utterance:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>"Felicidades LauraHey, felicidades! ^^felicidiadeees;DFelicidades!Un beso! FELIZIDADESS LAURIIIIIIIIIIIIII (LL)felicidadeeeeeees! :D jajaja mira mi tablonme meo jajajajajjajate quiero(:,"</p>
</blockquote>
<p>The output of our algorithm is the list of tokens in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/spanish.tokenizer.example.png" style="WIDTH: 320px; DISPLAY: inline; HEIGHT: 224px" height="224" width="320"/></p>
<p>We have evaluated this algorithm both directly and indirectly. The direct evaluation consists of comparing how many hits we get with a space-only tokenizer and with our tokenizer, against a Spanish dictionary and an SMS-language dictionary. The more hits, the better the words are recognized. On average, per text utterance (comment), we find about 9.5 more words in the Spanish dictionary with our tokenizer, and about 1.13 more words in the SMS-language dictionary.</p>
<p>The indirect evaluation is performed by plugging the algorithm into the full pipeline of the WENDY age recognition system. The new tokenizer increases the accuracy of the age recognition system from 0.768 to 0.770, which may seem marginal except that it accounts for 206 new hits in our collection of Tuenti comments. The new tokenizer also provides relatively important gains in recall and precision for the most under-represented but most critical class, that of users under 14.</p>
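<p>The direct evaluation can be sketched as a toy comparison. This is an illustrative sketch only (the mini-dictionary and the tokenizers here are mine, not the real Spanish/SMS lexicons or the WENDY code): count dictionary hits for a whitespace-only tokenizer versus a tokenizer that splits mixed runs of letters and punctuation.</p>

```python
import re

# Toy dictionary standing in for the real Spanish lexicon (assumption).
DICTIONARY = {"hola", "que", "tal", "felicidades"}

def space_tokens(text):
    """Baseline: split on white space only."""
    return text.split()

def mixed_tokens(text):
    """Also break mixed runs into letter runs and punctuation runs."""
    out = []
    for chunk in text.split():
        out.extend(re.findall(r"[^\W\d_]+|[^\w\s]+|\d+", chunk))
    return out

def hits(tokens):
    """Count tokens found in the dictionary (case-insensitive)."""
    return sum(t.lower() in DICTIONARY for t in tokens)

text = "Hola:-)ketal felicidades!!"
print(hits(space_tokens(text)), hits(mixed_tokens(text)))  # → 0 2
```

<p>The better tokenizer recovers dictionary words that the space-only baseline misses because they are glued to emoticons or punctuation, which is exactly what the 9.5-words-per-comment difference above measures at scale.</p>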
<p>This is the reference of the paper which details the tokenizer, the experiments, and the context of the WENDY project, in Spanish:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>José María Gómez Hidalgo, Andrés Alfonso Caurcel Díaz, Yovan Iñiguez del Rio. <strong><a href="http://linguamatica.com/index.php/linguamatica/article/view/156" target="_blank">Un método de análisis de lenguaje tipo SMS para el castellano</a></strong>. <a href="http://linguamatica.com/index.php/linguamatica" target="_blank">Linguamatica</a>, Vol. 5, No. 1, pp. 31-39, July 2013.</p>
</blockquote>
<p>If you are interested in the first steps of text analysis (tokenization, text normalization, POS Tagging), then these two recent news may be useful for you:</p>
<ul>
<li>The <a href="http://komunitatea.elhuyar.org/tweet-norm/participation/#Results" target="_blank">results of the Tweet Normalization Workshop/Task</a> at <a href="http://www.sepln.org/?news=xxix-conference-of-the-sepln&lang=en" target="_blank">SEPLN 2013</a> have just been published, with interesting data and datasets.</li>
<li><a href="http://derczynski.com/sheffield/" target="_blank">Leon Derczynski</a> <em>et al.</em> have released a <a href="https://gate.ac.uk/wiki/twitter-postagger.html" target="_blank">GATE-based POS-Tagger for Twitter</a> with very good levels of accuracy.</li>
</ul>
<p>And you may want to <a href="http://jmgomezhidalgo.blogspot.com.es/2013/07/chat-or-what-approaching-text.html" target="_blank">take a look at my previous post on text normalization</a>.</p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-29318067815582725152013-07-22T16:54:00.001+02:002013-07-22T18:35:56.723+02:00Negobot is in the news!<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/negobot.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 298px" height="298" width="450"/></p>
<p>... And I must say, <em>it is quite popular out there</em>.</p>
<p>Negobot is a conversational agent posing as a 14-year-old girl, intended to detect paedophilic intentions and adapt to them. Negobot is based on Game Theory, and it is the result of an R&D project performed by the <a href="http://www.deustotech.deusto.es/" target="_blank">Deustotech</a> <a href="http://s3lab.deusto.es/index.php?lang=en" target="_blank">Laboratory for Smartness, Semantics and Security</a> (S3Lab) and <a href="http://www.optenet.com/" target="_blank">Optenet</a>. The members of the team are:</p>
<ul>
<li><a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a></li>
<li><a href="http://paginaspersonales.deusto.es/patxigg/es/inicio.shtml" target="_blank">Patxi Galán-García</a></li>
<li><a href="http://paginaspersonales.deusto.es/isantos/es/about.shtml" target="_blank">Igor Santos</a></li>
<li><a href="http://paginaspersonales.deusto.es/bosanz/es/index.html" target="_blank">Borja Sanz</a></li>
<li><a href="http://www.linkedin.com/in/pablogarciabringas" target="_blank">Pablo García-Bringas</a></li>
</ul>
<p>And myself. Its scientific approach is explained in the following paper:</p>
<blockquote>
<p>Laorden, C., Galán-García, P., Santos, I., Sanz, B., Gómez Hidalgo, J.M., García Bringas, P., 2012. <a href="http://rd.springer.com/chapter/10.1007/978-3-642-33018-6_27#" target="_blank"><strong>Negobot: A Conversational Agent Based on Game Theory for the Detection of Paedophile Behaviour</strong></a>. International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, Advances in Intelligent Systems and Computing, Vol. 189, Springer Berlin Heidelberg, pp. 261-270. (<a href="http://www.esp.uem.es/jmgomez/papers/cisis12.pdf" target="_blank">preprint</a>)</p>
</blockquote>
<p>My friend and colleague <strong><a href="http://www.carloslaorden.com/" target="_blank">Carlos Laorden</a></strong> was interviewed about the project by the <a href="http://www.agenciasinc.es/en/Who-are-we" target="_blank">SINC Agency</a> a few days ago, and the agency released a news story that quickly spread to a wide range of online and offline media: news agencies, newspapers, radio stations, news aggregators, blogs, etc. Here is the original news story in Spanish:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><a href="http://www.agenciasinc.es/Noticias/Una-Lolita-virtual-a-la-caza-de-pederastas" target="_blank"><strong>Una 'Lolita' virtual a la caza de pederastas
<br/></strong></a> SINC | 10 julio 2013 10:40</p>
</blockquote>
<p>The news story featured <a href="http://youtu.be/-RbPeiNhV-E" target="_blank">a video with the interview to Carlos</a>.</p>
<p>And in English, published by SINC at <a href="http://www.alphagalileo.org/" target="_blank">Alpha Galileo</a>:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><strong><a href="http://www.alphagalileo.org/ViewItem.aspx?ItemId=132829&CultureCode=en" target="_blank">A virtual 'Lolita' on the hunt for paedophiles
<br/></a></strong> 10 de julio de 2013 Plataforma SINC</p>
</blockquote>
<p>From there, to <strong>major English-language media</strong>:</p>
<table cellpadding="2" width="450" align="center" cellspacing="2">
<tbody>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/nbc.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 80px" height="80" width="100"/></td>
<td><a href="http://www.nbcnews.com/technology/controversial-lolita-chatbot-catches-online-predators-6C10622694" target="_blank">Controversial 'Lolita' chatbot catches online predators
<br/></a> <strong>NBC News</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/bbc.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 82px" height="82" width="100"/></td>
<td><a href="http://www.bbc.co.uk/news/technology-23268893" target="_blank">'Virtual Lolita' aims to trap chatroom paedophiles
<br/></a> <strong>BBC News Technology</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/huffingtonpost.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 93px" height="93" width="100"/></td>
<td><a href="http://www.huffingtonpost.com/2013/07/11/negobot-virtual-lolita-game-theory_n_3579716.html" target="_blank">Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms
<br/></a> <strong>Huffington Post</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/theindependent.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 69px" height="69" width="100"/></td>
<td><a href="http://www.independent.co.uk/news/science/virtual-lolita-poses-as-schoolgirl-aged-14-to-trap-online-paedophiles-8700920.html" target="_blank">Virtual Lolita poses as schoolgirl aged 14 to trap online paedophiles
<br/></a> <strong>The Independent</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/dailymail.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 84px" height="84" width="100"/></td>
<td><a href="http://www.dailymail.co.uk/sciencetech/article-2359499/How-Lolita-style-virtual-robots-posing-teenage-girls-used-uncover-paedophiles-social-network-sites.html" target="_blank">How 'Lolita style' virtual robots posing as teenage girls are being used to uncover paedophiles on social network sites
<br/></a> <strong>Daily Mail</strong></td>
</tr>
<tr>
<td><img src="http://www.esp.uem.es/jmgomez/blogimg/negobot/metro.png" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 68px" height="68" width="100"/></td>
<td><a href="http://metro.co.uk/2013/07/11/virtual-lolita-created-to-trap-paedophiles-in-online-chatrooms-3878742/" target="_blank">'Virtual Lolita' created to trap paedophiles in online chatrooms
<br/></a> <strong>METRO</strong></td>
</tr>
</tbody>
</table>
<p><strong>Major international blogs and news aggregators</strong> have also featured Negobot:</p>
<ul>
<li><strong>Engadget</strong>: <a href="http://www.engadget.com/2013/07/11/negobot-virtual-chat-agent-trap-pedophiles/" target="_blank">Negobot: a virtual chat agent engineered to trap pedophiles</a></li>
<li><strong>Ubergizmo</strong>: <a href="http://www.ubergizmo.com/2013/07/negobot-chatbot-to-trap-pedophiles/" target="_blank">Negobot Chatbot To Trap Pedophiles</a></li>
<li><strong>IO9</strong>: <a href="http://io9.com/sophisticated-chatbot-poses-as-teenage-girl-to-lure-ped-743260398" target="_blank">Sophisticated chatbot poses as teenage girl to lure pedophiles</a></li>
<li><strong>GigaOm</strong>: <a href="http://gigaom.com/2013/07/11/catching-pedophiles-with-text-mining-and-game-theory/" target="_blank">Catching pedophiles with text mining and game theory</a></li>
<li><strong>Gizmag</strong>: <a href="http://www.gizmag.com/negobot-pedophile-hunting-chatbot/28240/" target="_blank">Chatbot hunts for pedophiles</a></li>
<li><strong>BetaBeat</strong>: <a href="http://betabeat.com/2013/07/virtual-teen-can-lure-sexual-predators-with-the-blink-of-an-emoticon/" target="_blank">Virtual Teen Can Lure Sexual Predators With the Blink of an Emoticon</a></li>
<li><strong>Slashdot</strong>: <a href="http://yro.slashdot.org/story/13/07/11/1233215/spanish-chatbot-hunts-for-pedophiles" target="_blank">Spanish Chatbot Hunts For Pedophiles</a></li>
</ul>
<p>As of today, Negobot has got:</p>
<ul>
<li>181 comments in <a href="http://slashdot.org/" target="_blank">Slashdot</a>.</li>
<li>42 diggs in <a href="http://digg.com/" target="_blank">Digg</a>.</li>
<li>124 points and 49 comments in <a href="http://www.reddit.com/" target="_blank">Reddit</a>.</li>
</ul>
<p>Negobot has obtained <strong>worldwide coverage in the news</strong>:</p>
<div style="TEXT-ALIGN: center">
<table width="400" align="center" cellspacing="2">
<tbody>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ar-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Argentine Republic
<br/></strong> <a href="http://www.elintransigente.com/notas/2013/7/13/crearon-programa-informatico-para-atrapar-pedofilos-los-chats-redes-sociales-193644.asp" target="_blank">Crearon un programa informático para atrapar pedófilos en los chats y redes sociales
<br/></a> El Intransigente</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/bk-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Bosnia and Herzegovina
<br/></strong> <a href="http://www.vijesti.ba/magazin/zanimljivosti/156017-Sofisticirani-robot-Negobot-sluzi-namami-otkrije-pedofile.html" target="_blank">Sofisticirani robot "Negobot" služi da namami i otkrije pedofile</a>
<br/>
Vijesti</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/as-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Commonwealth of Australia
<br/></strong> <a href="http://www.news.com.au/technology/artificial-intelligence-poses-as-14yearoldgirl-to-detect-paedophiles-in-social-chatrooms/story-e6frfro0-1226677357656" target="_blank">Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
<br/></a> News Limited Network
<br/>
<a href="http://m.heraldsun.com.au/technology/news/artificial-intelligence-poses-as-14yearoldgirl-to-detect-paedophiles-in-social-chatrooms/story-fni0bzod-1226677357656" target="_blank">Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
<br/></a> Herald Sun, Melbourne</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ez-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Czech Republic
<br/></strong> <a href="http://pej.cz/Wirtualna-Lolita-czyli-czatbot-ktory-wskaze-pedofilow-a7008" target="_blank">"Wirtualna Lolita", czyli czatbot, który wskaże pedofilów
<br/></a> PEJ</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/fr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>French Republic
<br/></strong> <a href="http://www.marieclaire.fr/,negobot-l-adolescente-virtuelle-qui-piege-les-pedophiles-sur-internet,696016.asp" target="_blank">Negobot, l'adolescente virtuelle qui piège les pédophiles sur internet !
<br/></a> Marie Claire
<br/>
<a href="http://www.metronews.fr/info/espagne-negobot-une-lolita-virtuelle-traque-les-pedophiles-sur-internet/mmgo!KDWfEhp1jC02c/" target="_blank">Espagne : une lolita virtuelle traque les pédophiles sur Internet
<br/></a> Metro News
<br/>
<a href="http://www.lepoint.fr/societe/l-adolescente-virtuelle-qui-traquait-les-pedophiles-en-ligne-15-07-2013-1704989_23.php" target="_blank">L'adolescente virtuelle qui traquait les pédophiles en ligne
<br/></a> Le Point</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/gr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 64px" height="64" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Hellenic Republic
<br/></strong> <a href="http://www.naftemporiki.gr/story/674679" target="_blank">Τεχνητή νοημοσύνη- «κυνηγός» παιδόφιλων στο Ίντερνετ
<br/></a> Naftemporiki</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/it-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Italian Republic
<br/></strong> <a href="http://www.iltempo.it/hitech-games/2013/07/13/negobot-il-software-lolita-che-individua-i-pedofili-dialogando-1.1156254" target="_blank">Negobot, il software "Lolita" che individua i pedofili dialogando
<br/></a> Il Tempo
<br/>
<a href="http://www.repubblica.it/tecnologia/2013/07/11/news/robot_anti_pedofili-62788718/" target="_blank">Negobot, la lolita virtuale che stana i pedofili in rete
<br/></a> La Repubblica
<br/>
<a href="http://www.lastampa.it/2013/07/11/italia/cronache/negobot-la-lolita-virtuale-che-incastra-i-pedofili-in-rete-QCEBpn8A29n73FqW3fuVpK/pagina.html" target="_blank">Negobot, la Lolita virtuale che incastra i pedofili in Rete
<br/></a> La Stampa</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sp-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Kingdom of Spain
<br/></strong> <a href="http://www.abc.es/tecnologia/20130712/rc-negobot-contra-pedofilos-201307121320.html" target="_blank">Negobot contra los pedófilos
<br/></a> ABC Tecnología
<br/>
<a href="http://noticias.lainformacion.com/ciencia-y-tecnologia/tecnologia-general/negobot-contra-los-pedofilos_GC7T2ptMK1wNuQ87VJ6c9/" target="_blank">Negobot contra los pedófilos
<br/></a> La Información
<br/>
<a href="http://www.publico.es/458788/una-lolita-virtual-a-la-caza-de-pederastas" target="_blank">Una 'Lolita' virtual a la caza de pederastas
<br/></a> Publico
<br/>
<a href="http://www.lavozdegalicia.es/noticia/galicia/2013/07/11/idean-lolita-virtual-detectar-pedofilos-red/0003_201307G11P5992.htm" target="_blank">Idean una lolita virtual para detectar pedófilos en la Red
<br/></a> La Voz de Galicia
<br/>
<a href="http://www.elcorreogallego.es/tendencias/ecg/lolita-virtual-caza-pederastas/idEdicion-2013-07-11/idNoticia-816343/" target="_blank">Una 'Lolita' virtual para la caza de pederastas
<br/></a> El Correo Gallego
<br/>
<a href="http://www.elespectador.com/noticias/cultura/vivir/articulo-432934-trampa-los-pederastas-red" target="_blank">La trampa para los pederastas en la red
<br/></a> El Espectador
<br/>
<a href="http://ecodiario.eleconomista.es/ciencia/noticias/4981319/07/13/Nuevo-sistema-virtual-a-la-caza-de-posibles-pederastas.html" target="_blank">Nuevo sistema virtual a la caza de posibles pederastas
<br/></a> El Economista</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sw-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 63px" height="63" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Kingdom of Sweden
<br/></strong> <a href="http://nyheter24.se/nyheter/internet/749661-virtuell-lolita-ska-fa-fast-pedofiler-pa-natet" target="_blank">"Virtuell lolita" ska få fast pedofiler på nätet
<br/></a> Nyheter24</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/my-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Malaysia</strong>
<br/>
<a href="http://www.pikiran-rakyat.com/node/242358" target="_blank">Robot Virtual Gadis Remaja Digunakan untuk Menjebak Pedofil
<br/></a> Pikiran Rakyat</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/nl-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Netherlands</strong>
<br/>
<a href="http://www.pcmweb.nl/nieuws/digitale-pedolokker-imiteert-schoolmeisje.html" target="_blank">Digitale pedolokker imiteert schoolmeisje
<br/></a> PCM</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/uy-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 67px" height="67" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Oriental Republic of Uruguay
<br/></strong> <a href="http://www.lr21.com.uy/tecnologia/1117187-desarrollan-lolita-virtual-para-dar-caza-a-pederastas-y-corruptores-de-menores" target="_blank">Desarrollan "Lolita virtual" para dar caza a pederastas y corruptores de menores
<br/></a> La Red 21</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/po-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Portuguese Republic
<br/></strong> <a href="http://hypescience.com/a-adolescente-robotica-cacadora-de-pedofilos/" target="_blank">A adolescente robótica caçadora de pedófilos
<br/></a> Hype Science</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/au-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Austria
<br/></strong> <a href="http://www.style.at/contator/style/news.asp?nnr=61301" target="_blank">Negobot findet Pädophile
<br/></a> style.at Kurzmeldungen
<br/>
<a href="http://derstandard.at/1373512635374/Negobot-Chatprogramm-forscht-Paedophile-aus" target="_blank">"Negobot": Chatprogramm forscht Pädophile aus
<br/></a> Der Standard</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ci-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Chile
<br/></strong> <a href="http://www.24horas.cl/tendencias/mundodigital/nuevo-software-permite-detectar-pedofilos-en-la-red-742074" target="_blank">Nuevo software permite detectar pedófilos en la red
<br/></a> 24 Horas</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/hr-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Croatia
<br/></strong> <a href="http://www.radiosarajevo.ba/novost/118880/napravljen-robot-koji-pronalazi-pedofile" target="_blank">Napravljen robot koji pronalazi pedofile
<br/></a> Radio Sarajevo</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/in-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of India
<br/></strong> <a href="http://articles.timesofindia.indiatimes.com/2013-07-12/science/40535516_1_paedophiles-game-theory-police-force" target="_blank">A virtual Lolita on the hunt for paedophiles online
<br/></a> The Times of India</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/kz-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 49px" height="49" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Kazakhstan
<br/></strong> <a href="http://www.safekaznet.kz/en/bez-rubriki/bot-virtualnaya-lolita-ispolzuet-teoriyu-igr-dlya-raspoznaniya-ohotnikov-na-detey-v-internet-chatah" target="_blank">Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms</a>
<br/>
Safekaznet</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/pl-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 62px" height="62" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Poland
<br/></strong> <a href="http://www.autonom.pl/?p=6436" target="_blank">Negobot sieciową pułapką na pedofilów
<br/></a> Autonom</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ri-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 50px" height="50" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Republic of Serbia
<br/></strong> <a href="http://www.telegraf.rs/hi-tech/internet/781516-virtuelna-lolita-krece-u-lov-na-manijake" target="_blank">STOP PEDOFILIJI: Virtuelna Lolita kreće u lov na manijake!</a>
<br/>
Telegraf.rs
<br/>
<a href="http://www.novosti.rs/vesti/naslovna/tehnologije/aktuelno.236.html:443846-Virtuelna-Lolita-za-lov-na-pedofile" target="_blank">Virtuelna Lolita za lov na pedofile
<br/></a> Novosti</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/ro-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 67px" height="67" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Romania</strong>
<br/>
<a href="http://www.ziare.com/articole/inteligenta+artificiala+negobot+pedofili" target="_blank">Robotul care pozeaza in pustoaica de 14 ani - da de gol pedofilii
<br/></a> Ziare</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/rs-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Russian Federation
<br/></strong> <a href="http://korrespondent.net/business/web/1581297-poiskom-pedofilov-v-seti-zajmetsya-bot-vydayushchij-sebya-za-14-letnyuyu" target="_blank">Поиском педофилов в сети займется бот, выдающий себя за 14-летнюю
<br/></a> Корреспондент.net
<br/>
<a href="http://lenta.ru/news/2013/07/11/bot/" target="_blank">Вычисление педофилов в интернете поручат чат-боту
<br/></a> LENTA</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/vm-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Socialist Republic of Vietnam
<br/></strong> <a href="http://news.com.vn/hi-tech/more-hitech/113439-virtual-lolita-aims-to-trap-chatroom-paedophiles-.html" target="_blank">'Virtual Lolita' aims to trap chatroom paedophiles</a>
<br/>
Info VN</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/sz-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 100px" height="100" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Swiss Confederation
<br/></strong> <a href="http://www.ticinonews.ch/articolo.aspx?id=304888&rubrica=15" target="_blank">Spagna: ecco Negobot, 14enne virtuale che scova i pedofili in rete
<br/></a> Ticino News</p>
</td>
</tr>
<tr>
<td>
<p style="TEXT-ALIGN: left"><img src="https://www.cia.gov/library/publications/the-world-factbook/graphics/flags/large/up-lgflag.gif" style="WIDTH: 100px; DISPLAY: inline; HEIGHT: 66px" height="66" width="100"/></p>
</td>
<td>
<p style="TEXT-ALIGN: left"><strong>Ukraine</strong>
<br/>
<a href="http://ubr.ua/uk/tv/technologii/v-spanskih-nternet-chatah-pdltkv-vd-pedoflv-zahisha-negobot-240191" target="_blank">В іспанських інтернет-чатах підлітків від педофілів захищає Negobot</a>
<br/>
UBR</p>
</td>
</tr>
</tbody>
</table>
</div>
<p>Carlos Laorden has also been interviewed by <strong>Spanish newspapers and radio stations</strong>:</p>
<ul>
<li>Interview in <strong><a href="http://www.elmundo.es/" target="_blank">El Mundo</a></strong> (<a href="http://www.esp.uem.es/jmgomez/negobot/ElMundo.pdf" target="_blank">Spanish, PDF</a>).</li>
<li>Interview in <strong><a href="http://www.cope.es/programas/La-Noche/Inicio" target="_blank">La Noche de La Cope</a></strong> (<a href="http://www.esp.uem.es/jmgomez/negobot/LaNocheDeLaCope.mp3" target="_blank">Spanish, MP3</a>).</li>
<li>Interview in <a href="http://www.cope.es/programas/La-Manana/inicio" target="_blank"><strong>La Mañana de La Cope</strong></a> (<a href="http://www.esp.uem.es/jmgomez/negobot/LaMananaDeLaCope.mp3" target="_blank">Spanish, MP3</a>).</li>
</ul>
<p>And last but not least, <a href="http://youtu.be/S47IOaPbwXY" target="_blank">Negobot has received some criticism in the form of a (quite funny) video</a>.</p>
<p>You can keep tracking Negobot with <a href="https://www.google.com/search?q=negobot" target="_blank">Google Web Search</a> and <a href="https://www.google.com/search?q=negobot&tbm=nws" target="_blank">Google News</a>.</p>
<p>Finally, sorry for the <em>SSF</em>, and thanks for reading.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-48850311426473802722013-07-08T08:47:00.001+02:002013-07-08T08:47:32.111+02:00Performance Analysis of N-Gram Tokenizer in WEKA<p>The goal of this post is to analyze the <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> class <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">NGramTokenizer</a></code> in terms of performance, as it depends on the complexity of the <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">regular expression</a> used during the tokenization step. There is a potential trade-off between simpler regexes (which produce more tokens) and more complex regexes (which take more time to evaluate). This post provides experimental insights into this trade-off, in order to save you time when using this extremely useful class with the WEKA indexer <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code>.</p>
<p><strong>Motivation</strong></p>
<p>The WEKA <code>weka.core.tokenizers.NGramTokenizer</code> class is responsible for tokenizing a text into pieces which, depending on the configured n-gram size, can be token <a href="http://en.wikipedia.org/wiki/N-gram" target="_blank">unigrams, bigrams and so on</a>. This class relies on the method <code><a href="http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)" target="_blank">String[] split(String regex)</a></code> to break a text string into tokens, which are further combined into n-grams.</p>
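<p>To make that combination step concrete, here is a minimal sketch in plain Java (my own illustration, not the actual WEKA implementation) of how the tokens returned by <code>split()</code> can be assembled into unigrams and bigrams:</p>

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {

    // Build all n-grams of size 1..maxSize from an array of tokens,
    // joining consecutive tokens with a single space.
    static List<String> ngrams(String[] tokens, int maxSize) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxSize; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                StringBuilder sb = new StringBuilder(tokens[i]);
                for (int j = 1; j < n; j++) {
                    sb.append(' ').append(tokens[i + j]);
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "This is a text".split("\\W+");
        // Prints the 4 unigrams followed by the 3 bigrams
        System.out.println(ngrams(tokens, 2));
    }
}
```

<p>Note that any empty strings left in the token array by the splitting regex would end up inside the n-grams, which is precisely why the choice of regex matters here.</p>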
<p>This method, in turn, depends on the complexity of the regular expression used to split the text. For instance, let us examine this simple example:</p>
<blockquote>
<p><code>public class TextSplitTest {
<br/>
public static void main(String[] args) {
<br/>
String delimiters = "\\W";
<br/>
String s = "This is a text &$% string";
<br/>
System.out.println(s);
<br/>
String[] tokens = s.split(delimiters);
<br/>
System.out.println(tokens.length);
<br/>
for (int i = 0; i &lt; tokens.length; ++i)
<br/>
System.out.println("#"+tokens[i]+"#");
<br/>
}
<br/>
}</code></p>
</blockquote>
<p>In this call to the <code>split()</code> method, we are using the regex "<code>\\W</code>", which <a href="http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html" target="_blank">matches any single non-word character</a> (anything other than a letter, a digit or the underscore) as a delimiter. The output of executing this class is:</p>
<blockquote>
<p><code>$> java TextSplitTest
<br/>
This is a text &$% string
<br/>
9
<br/>
#This#
<br/>
#is#
<br/>
#a#
<br/>
#text#
<br/>
##
<br/>
##
<br/>
##
<br/>
##
<br/>
#string#</code></p>
</blockquote>
<p>This is because every individual non-word character is a match, and we have five delimiters between "<code>text</code>" and "<code>string</code>". As a consequence, we find four empty (but not null) strings among those matches. If we instead use the regex "<code>\\W+</code>" as the delimiter string, which matches sequences of one or more non-word characters, we get the following output:</p>
<blockquote>
<p><code>$> java TextSplitTest
<br/>
This is a text &$% string
<br/>
5
<br/>
#This#
<br/>
#is#
<br/>
#a#
<br/>
#text#
<br/>
#string#</code></p>
</blockquote>
<p>This is much closer to what we expected in the first place.</p>
<p>When tokenizing a text, it seems wise to avoid computing empty strings as potential tokens, because we have to invest some time to discard them -- and we may have thousands of instances! On the other hand, a more complex regular expression clearly takes more time to evaluate. So there is a trade-off between using a one-character delimiter pattern and using a more sophisticated regex that avoids empty strings. To what extent does this trade-off impact the <code>StringToWordVector</code>/<code>NGramTokenizer</code> classes?</p>
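<p>The trade-off can be explored in isolation, outside WEKA, with a micro-benchmark sketch like the one below (this is an illustration of the question, not the benchmark code used in the experiments; absolute timings will vary with the JVM and the machine):</p>

```java
public class RegexTradeoffDemo {

    // Split with the given regex and count the non-empty tokens,
    // i.e. do the extra filtering work that "\\W" forces on the caller.
    static int countNonEmpty(String text, String regex) {
        int count = 0;
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "This is a text &$% string";
        // Both regexes yield the same 5 non-empty tokens...
        System.out.println(countNonEmpty(s, "\\W"));   // 5
        System.out.println(countNonEmpty(s, "\\W+"));  // 5
        // ...so the open question is which one is faster overall
        long t0 = System.nanoTime();
        for (int i = 0; i < 100_000; i++) countNonEmpty(s, "\\W");
        long t1 = System.nanoTime();
        for (int i = 0; i < 100_000; i++) countNonEmpty(s, "\\W+");
        long t2 = System.nanoTime();
        System.out.println("\\W  took " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("\\W+ took " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

<p>The experiments below answer the same question at the scale of full collections, where the n-gram construction step also comes into play.</p>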
<p><strong>Experiment Setup</strong></p>
<p>I ran these experiments on my laptop (CPU: Intel Core2 Duo P8700 @ 2.53GHz; RAM: 2.90GB @ 1.59 GHz). For some of the tests, especially those involving a large number of n-grams, I needed to use the <code>-Xmx</code> option in order to increase the heap space.</p>
<p>I am using the class <code><a href="https://github.com/jmgomezh/tmweka/blob/master/WEKAExamples/IndexTest.java" target="_blank">IndexTest.java</a></code> available at <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub repository</a>. I have commented out all the output in order to retain only the computation time of the method <code>index()</code>, which creates the tokenizer and filter objects and performs the filtering process. This process actually indexes the documents, that is, it transforms the text string in each instance into a dictionary-based representation -- each instance becomes a sparse list of (token_number, weight) pairs, where the weight is a binary value. I have also modified the class to set lowercasing to false, in order to accumulate as many distinct tokens as possible.</p>
<p>I have performed experiments using the following two collections:</p>
<ul>
<li>The <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, which is a dataset of 5,568 short messages classified as spam/ham (not spam).</li>
<li>The classical <a href="http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html" target="_blank">Reuters-21578 text collection (ModApte split)</a>, which is a dataset of 21,578 relatively short news stories, classified according to a number of economic categories (acquisitions, earnings reports, products like rubber, tin or sugar, etc.). I have downloaded it from the <a href="http://nltk.org/nltk_data/" target="_blank">NLTK data directory</a>.</li>
</ul>
<p>I am comparing the strings "<code>\\W</code>" and "<code>\\W+</code>" as delimiters in the <code>NGramTokenizer</code> instance of the <code>index()</code> method, for unigrams, uni-to-bigrams and uni-to-trigrams. In the case of the SMS Spam Collection, I have divided the dataset into subsets of 20%, 40%, 60%, 80% and 100% in order to evaluate the effect of the collection size.</p>
<p>Finally, I have run the program 10 times per experiment, in order to average the results and get more stable numbers. All times are expressed in milliseconds.</p>
<p><strong>Results and Analysis</strong></p>
<p>We will first examine the results on the SMS Spam Collection. The results obtained for unigrams are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.unigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 314px" height="314" width="450"/></p>
<p>It is a bar chart that shows the time in milliseconds for each collection size (20%, 40%, etc.). The results for uni-to-bigrams are:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.bigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 310px" height="310" width="450"/></p>
<p>And the results for uni-to-trigrams on the SMS Spam Collection are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.spamsms.chart.trigrams.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 329px" height="329" width="450"/></p>
<p>As could be expected, the times grow very quickly from unigrams to uni-to-bigrams to uni-to-trigrams. While for unigrams the simple regex "<code>\\W</code>" is more efficient, the more sophisticated regex "<code>\\W+</code>" is more efficient for bigrams and trigrams. There is one anomalous point (at 60% on trigrams), but I believe it is an outlier. So it seems that the cost of using a more sophisticated regex does not pay off for unigrams, where matching the regex is more expensive than discarding empty strings. The opposite holds for uni-to-bigrams and uni-to-trigrams, where the empty strings seem to hurt the algorithm that builds the bi- and trigrams.</p>
<p>The results on the Reuters-21578 collection are the next ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/ngramtokenizer.reuters.chart.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 341px" height="341" width="450"/></p>
<p>These results are fully aligned with those obtained on the SMS Spam Collection, and the difference even widens in the case of uni-to-trigrams, as the number of distinct tokens in the Reuters-21578 collection is much bigger (there are more texts, and they are longer).</p>
<p>All in all, the biggest performance improvements obtained are 4.59% on the SMS Spam Collection (uni-to-trigrams, 40% subset) and 4.15% on the Reuters-21578 collection, which I consider marginal. In short, there is not a big difference between these two regexes after all.</p>
<p><strong>Conclusions</strong></p>
<p>In the potential trade-off between using a simple regular expression to recognize text tokens and using a more sophisticated regular expression that avoids spurious tokens, my simple experiment with the WEKA indexer classes shows that <em>both approaches are more or less equivalent in terms of performance</em>.</p>
<p>However, when using only unigrams, it is better to use the simple regular expression, because the extra time spent matching the more sophisticated one does not pay off.</p>
<p>On the other hand, the algorithm that builds the bi- and trigrams seems to be sensitive to the empty strings generated by the simple regex, and you can get around a 4% performance improvement by using the more sophisticated regular expression and avoiding those empty strings.</p>
<p>As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-64723401944841885432013-07-04T21:45:00.001+02:002013-07-05T06:33:59.555+02:00Chat or What: Approaching Text Normalization in Chats and Social Networks<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/text.normalization.header.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 124px" height="124" width="400"/></p>
<p>It is not strange that, with the overload of user-generated content, there is an increasing interest in processing chat/SMS-like language. Social Networks, virtual worlds, <a href="http://en.wikipedia.org/wiki/Massively_multiplayer_online_role-playing_game" target="_blank">MMORPGs</a> and chat rooms are plagued with emoticons, abbreviations, typos and channel codes that make the task of processing user-generated text a nightmare. In this post I list a number of resources and approaches that may be useful for researchers and practitioners of <a href="https://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">Natural Language Processing</a> regarding this problem, which, following the course by <a href="http://www.cslu.ogi.edu/~sproatr/" target="_blank">Richard Sproat</a> and <a href="http://www.bedrick.org/" target="_blank">Steven Bedrick</a>, I call <a href="http://www.cslu.ogi.edu/~sproatr/Courses/TextNorm/" target="_blank"><em>Text Normalization</em></a>.</p>
<p>Text Normalization can be seen as <em>translation from informal language to standard English-Spanish-whatever</em>. The simplest approach you can follow is a <em>word-by-word translation</em> using a dictionary. This approach is followed by online lingo translators like <a href="http://www.lingo2word.com/" target="_blank">Lingo2Word</a> and <a href="http://transl8it.com/" target="_blank">Transl8it!</a>. In fact, you can reproduce this work using <a href="http://www.lingo2word.com/dictionary.php" target="_blank">the Lingo2Word dictionary</a> (click on the header links). I have followed this approach as a baseline in several projects and works, like <a href="http://wendy.optenet.com/" target="_blank"><em>WENDY - WEb-access coNfidence for chilDren and Young</em></a> (web page in Spanish; the paper "<a href="http://www.clef-initiative.eu/documents/71612/271dd606-53d1-4cad-9852-fb5336e8587e" target="_blank"><em>Combining Predation Heuristics and Chat-Like Features in Sexual Predator Identification</em></a>" is in English).</p>
<p>Another knowledge-based alternative is manually coding normalization rules. An example is the tool <a href="https://code.google.com/p/deflog/" target="_blank">Deflog</a>, which is a program that decodes the usual expressions used in the picture-oriented social network <a href="http://www.fotolog.com/" target="_blank">Fotolog</a>. In this network, the majority of (Spanish-language) users make use of specific language codes like repeating vowels ("I liiiiiiiiiiiiike iiiiiiit"), alternating upper and lowercase ("YoU WiLL LiKe It"), and so on. The program encodes a number of functions that "correct" word tokens, each function for a particular code. While the functions mostly apply to Spanish and Fotolog, a linguist may derive their own rules for another domain (e.g. Twitter).</p>
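<p>Both baselines -- the word-by-word dictionary and the hand-coded rules -- can be combined in a few lines of Java. The sketch below is merely illustrative: the three dictionary entries are invented for the example (a real lingo dictionary such as Lingo2Word's is far larger), and the single rule only collapses runs of three or more repeated letters:</p>

```java
import java.util.Map;

public class NormalizeDemo {

    // Toy lingo dictionary -- illustrative entries only
    static final Map<String, String> LINGO =
            Map.of("u", "you", "gr8", "great", "2day", "today");

    static String normalize(String text) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase().split("\\s+")) {
            // Rule: collapse runs of 3+ repeated letters ("liiiike" -> "like")
            String t = token.replaceAll("(\\p{L})\\1{2,}", "$1");
            // Dictionary: word-by-word lookup, keeping unknown tokens as-is
            out.append(LINGO.getOrDefault(t, t)).append(' ');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("u r gr8 2day I liiiiike it"));
        // -> you r great today i like it
    }
}
```

<p>Anything beyond this quickly calls for statistical methods, which can handle unseen variants and context-dependent expansions.</p>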
<p>These are obviously baselines. There are much more sophisticated, mostly statistical, methods; I provide a list here that complements the reading list in the course by Sproat and Bedrick:</p>
<ul>
<li>Bo Han, Paul Cook and Timothy Baldwin, <a href="http://dl.acm.org/citation.cfm?id=2414430&CFID=332163528&CFTOKEN=34685198" target="_blank">Lexical Normalisation of Short Text Messages</a>, In ACM Transactions on Intelligent Systems and Technology (TIST) 4(1), pp. 5:1-5:27, 2013.</li>
<li>Tim Schlippe, Chenfei Zhu, Daniel Lemcke, and Tanja Schultz. <a href="http://csl.ira.uka.de/~schlippe/pubs/ICASSP2013-Schlippe_SMTTextNormalization.pdf" target="_blank">Statistical Machine Translation based Text Normalization with Crowdsourcing</a>. In Proceedings of The 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, Canada, 26-31 May 2013.</li>
<li>Bo Han, Paul Cook and Timothy Baldwin, <a href="http://aclweb.org/anthology/D/D12/D12-1039.pdf" target="_blank">Automatically Constructing a Normalisation Dictionary for Microblogs</a>, In EMNLP-CoNLL 2012, 421-432, Jeju, Republic of Korea.</li>
<li>Bo Han and Timothy Baldwin, <a href="http://aclweb.org/anthology/P/P11/P11-1038.pdf" target="_blank">Lexical normalisation of short text messages: Makn sens a #twitter</a>, In ACL 2011, 368-378, Portland, OR, USA.</li>
<li>Tim Schlippe, Chenfei Zhu, Jan Gebhardt, Tanja Schultz. <a href="http://csl.ira.uka.de/~schlippe/pubs/Interspeech2010-Schlippe_SMTNormalization.pdf" target="_blank">Text Normalization based on Statistical Machine Translation and Internet User Support</a>. In Proceedings of The 11th Annual Conference of the International Speech Communication Association (Interspeech 2010), Makuhari, Japan, 26-30 September 2010.</li>
<li>Carlos Henriquez, Adolfo Hernández H., <a href="http://www2009.eprints.org/255/3/Henriquez_Hernandez_CAW2009.pdf" target="_blank">A ngram-based statistical machine translation approach for text normalization on chat-speak style communications</a>. Proceedings of the CAW2 (Content Analysis in Web 2.0) Workshop, April 2009.</li>
</ul>
<p>You can get some more papers by tracking the referenced literature or by searching these papers for citations.</p>
<p>As a final note, remember that text normalization is not always a good idea. For some problems it may be better to keep the original abbreviations, emoticons and so on, as they can be representative of the style, the genre, an author or a particular age group.</p>
<p>I hope these works will suggest other methods for your problem at hand. As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-79746532251500991212013-06-23T01:18:00.001+02:002013-06-23T01:36:46.050+02:00Sample Code for Text Indexing with WEKA<p>Following the example in which <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html" target="_blank">I demonstrated how to develop your own classifier in Java based on WEKA</a>, I propose an additional example on <em>how to index a collection of texts in your Java code</em>. This post is inspired and supported by the WEKA <a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">"Use WEKA in your Java code"</a> wiki page. To index a text collection is to generate a mapping between documents and words (or other indexing units), as represented in the following graph:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/indexing.demo.index.graph.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 274px" height="274" width="400"/></p>
<p>The fundamental class for text indexing in WEKA is <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a></code>. This class provides an impressive range of indexing options that include using custom <a href="http://en.wikipedia.org/wiki/Tokenization" target="_blank">tokenizers</a>, <a href="http://en.wikipedia.org/wiki/Stemming" target="_blank">stemmers</a> and <a href="https://en.wikipedia.org/wiki/Stop_words" target="_blank">stoplists</a>; binary, <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">Term Frequency and TF.IDF</a> weights, etc. For some applications, its default options may be enough -- however, I recommend getting familiar with all of its options in order to take full advantage of it.</p>
<p>With the purpose of showing how to use <code>StringToWordVector</code> in your code, I have created a simple class named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/WEKAExamples/IndexTest.java" target="_blank">IndexTest.java</a></code>, stored <a href="https://github.com/jmgomezh/tmweka/tree/master/WEKAExamples" target="_blank">in my GitHub repository</a>. Apart from the relatively simple methods for loading and storing <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">Attribute-Relation File Format (ARFF)</a> files, the core of the class is the method <code>void index()</code>, which creates and employs a <code>StringToWordVector</code> object. The first piece of the code is the following one:</p>
<blockquote>
<p><code>// Set the tokenizer
<br/>
NGramTokenizer tokenizer = new NGramTokenizer();
<br/>
tokenizer.setNGramMinSize(1);
<br/>
tokenizer.setNGramMaxSize(1);
<br/>
tokenizer.setDelimiters("\\W");</code></p>
</blockquote>
<p>This snippet creates and configures a tokenizer, that is the object responsible for breaking the original text into individual strings named tokens, representing the indexing units (typically words). In this case I am using a <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">weka.core.tokenizers.NGramTokenizer</a></code>, which I find more useful than the usual <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code>, as I describe <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">in the post about sentiment analysis with WEKA</a>. This tokenizer is able to recognize <a href="http://en.wikipedia.org/wiki/N-gram" target="_blank">n-grams</a>, that is, sequences of tokens. Here I use the methods <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html#setNGramMaxSize(int)" target="_blank">void setNGramMaxSize(int value)</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html#setNGramMinSize(int)" target="_blank">void setNGramMinSize(int value)</a></code> to define the size of the n-grams as unigrams.</p>
<p>Another interesting aspect of the tokenizer part is that we set up the regular expression <code>"\\W"</code> as the delimiter. This regex specifies that any non-alphanumeric character is considered a delimiter; as a result, only alphanumeric character strings will be considered tokens. For a detailed reference on regular expressions in Java, check <a href="http://docs.oracle.com/javase/tutorial/essential/regex/" target="_blank">the lesson on the topic in the Java Tutorial</a>.</p>
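To see what this delimiter regex does in practice, here is a tiny standalone sketch (plain Java, no WEKA dependency) that splits a sentence on non-word characters, approximating what the tokenizer produces with these delimiters:

```java
import java.util.Arrays;

// Standalone illustration of the "\\W" delimiter behavior: any
// non-alphanumeric character separates tokens, so only alphanumeric
// strings survive. This mimics the tokenizer configuration above,
// but does not use WEKA itself.
public class DelimiterDemo {
    public static void main(String[] args) {
        String text = "Don't forget: WEKA 3.6 rocks!";
        // Split on one or more non-word characters
        String[] tokens = text.split("\\W+");
        System.out.println(Arrays.toString(tokens));
        // Note how "Don't" breaks into "Don" and "t",
        // and "3.6" into "3" and "6"
    }
}
```

This also shows a limitation of the approach: contractions and decimal numbers get split into separate tokens.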
<p>The second code snippet is the following one:</p>
<blockquote>
<p><code>// Set the filter
<br/>
StringToWordVector filter = new StringToWordVector();
<br/>
filter.setInputFormat(inputInstances);
<br/>
filter.setTokenizer(tokenizer);
<br/>
filter.setWordsToKeep(1000000);
<br/>
filter.setDoNotOperateOnPerClassBasis(true);
<br/>
filter.setLowerCaseTokens(true);</code></p>
</blockquote>
<p>This second snippet creates and configures the <code>StringToWordVector</code> object, which is a subclass of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/Filter.html" target="_blank">weka.filters.Filter</a></code> class. Any filter has to make reference to a dataset, which is the inputInstances dataset in this case, as done with the <code>filter.setInputFormat(inputInstances)</code> call.</p>
<p>We set up the tokenizer and some other options as an example. Both <code>DoNotOperateOnPerClassBasis</code> and <code>WordsToKeep</code> should be standard in most text classifiers. The first one tells the filter to extract the tokens from all classes as a whole, instead of doing it class by class (the default option); I simply fail to understand why one would want different indexing tokens per class in a text classification problem. The second option sets the number of words to keep, and I recommend using a big integer here in order to cover all possible tokens.</p>
<p>The third and last code snippet shows the invocation of the filter on the <code>inputInstances</code> reference:</p>
<blockquote>
<p><code>// Filter the input instances into the output ones
<br/>
outputInstances = Filter.useFilter(inputInstances,filter);</code></p>
</blockquote>
<p>This is the standard method for applying a filter, according to the "<a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">Use WEKA in your Java code</a>" wiki page. The output of calling this class on a simple dataset such as <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></code> is the following:</p>
<blockquote>
<p><code>$> javac IndexTest.java
<br/>
$>java IndexTest
<br/>
Usage: java IndexTest <fileInput> <fileOutput>
<br/>
$>java IndexTest smsspam.small.arff result.arff
<br/>
===== Loaded dataset: smsspam.small.arff =====
<br/>
Started indexing at: 1371939800703
<br/>
===== Filtering dataset done =====
<br/>
Finished indexing at: 1371939800812
<br/>
Total indexing time: 109
<br/>
===== Saved dataset: result.arff =====
<br/>
$>more result.arff
<br/>
@relation 'sms_test-weka.filters.unsupervised.attribute.StringToWordVector-R2-W1000000-prune-rate-1.0-N0-L-stemmerweka.core.stemmers.NullStemmer-M1-O-
<br/>
tokenizerweka.core.tokenizers.NGramTokenizer -delimiters "\\W" -max 1 -min 1'
<br/>
<br/></code><code>@attribute spamclass {spam,ham}
<br/>
@attribute 000 numeric
<br/>
@attribute 03 numeric
<br/>
@attribute 07046744435 numeric
<br/>
@attribute 07732584351 numeric
<br/>
../..</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">As a note, the name of the relation in the generated ARFF file (tag <code>@relation</code>) encodes the properties of the applied filter, including some default options I have not configured in it.</p>
<p>So that is all. More examples on these topics are coming in the next weeks. And as always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com5tag:blogger.com,1999:blog-36589303.post-25578626734757090622013-06-19T17:13:00.001+02:002013-06-19T17:30:45.752+02:00Comparing baselines of keyword and learning based sentiment analysis<p><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.keep-calm-and-stay-positive.png" style="TEXT-ALIGN: center; WIDTH: 300px; DISPLAY: block; HEIGHT: 349px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="349" width="300"/></p>
<p>In my previous post, I presented <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">a simple example of using WEKA for Sentiment Analysis (or Opinion Mining)</a>. As with most of <a href="http://jmgomezhidalgo.blogspot.com.es/search/label/WEKA" target="_blank">my blog posts on text mining with WEKA</a>, I approach interesting, hot or easy tasks as a way to present this package's capabilities for text mining -- in consequence, these posts are <em>tutorials</em> in essence.</p>
<p>In that particular post, I left <em>several open tasks</em> for anybody who may be interested in completing them, and I picked two for myself. One of the tasks left for the reader was <em>coding a class and training a model</em> to actually classify texts according to sentiment -- and as I have been asked for the code, I did it myself and <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">it is available at my GitHub repository</a>.</p>
<p>Another task I left pending, and picked for myself, was applying a keyword-based approach using <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet</a> to the same (<a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">SFU Review Corpus</a>) collection and comparing its accuracy to the learning (<a href="http://www.cs.waikato.ac.nz/~ml/weka/" target="_blank">WEKA</a>) approach. So this is the topic of this post.</p>
<p><strong>Goal</strong></p>
<p>The goal of this post is to build a simple keyword-based sentiment analysis program based on SentiWordNet and evaluate it on the SFU Review Corpus, in order to compare its accuracy with the one obtained via (WEKA) learning as described in my previous post "<a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">Baseline Sentiment Analysis with WEKA</a>".</p>
<p><strong>About SentiWordNet</strong></p>
<p>SentiWordNet is a collection of concepts (<a href="http://en.wikipedia.org/wiki/Synonym_ring" target="_blank">synonym sets, synsets</a>) from <a href="http://wordnet.princeton.edu/" target="_blank">WordNet</a> that have been evaluated from the point of view of their polarity (if they convey a positive or a negative feeling). Some interesting features include:</p>
<ul>
<li>As it is based on WordNet, only English and the four most significant parts of speech (nouns, adjectives, adverbs and verbs) are covered. Multi-word expressions are included, encoded with underscore (e.g. "too_bad", "at_large").</li>
<li>Each concept has attached polarity scores. For instance:</li>
</ul>
<blockquote>
<p><code># POS ID PosScore NegScore SynsetTerms Gloss
<br/>
a 01125429 0 0.625 bad#1 having undesirable or negative qualities; "a bad report card"; "his sloppy appearance made a bad impression"; "a bad little boy"; "clothes in bad shape"; "a bad cut"; "bad luck"; "the news was very bad"; "the reviews were bad"; "the pay is bad"; "it was a bad light for reading"; "the movie was a bad choice"
<br/>
a 01052038 0.222 0.778 too_bad#1 regrettable#1 deserving regret; "regrettable remarks"; "it's regrettable that she didn't go to college"; "it's too bad he had no feeling himself for church"</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">So SentiWordNet is in a tab-separated format: the first column is the <a href="http://en.wikipedia.org/wiki/Part_of_speech" target="_blank">Part Of Speech</a> (POS), the second and third are the polarity scores (between 0 and 1), the next column is the synset (synonym set, a list of synonyms tagged with their sense -- word#sense_number), and the last one is the WordNet gloss (roughly speaking, the definition).</p>
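As a quick illustration of this layout, the following self-contained snippet parses one such line into its fields. The <code>SwnEntry</code> class is my own illustrative structure, not part of the official <code>SWN3.java</code> helper:

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch of parsing one SentiWordNet 3.0 data line into its fields,
// following the tab-separated layout described above (POS, ID, PosScore,
// NegScore, SynsetTerms, Gloss). SwnEntry is illustrative only.
public class SwnLineDemo {
    static final class SwnEntry {
        final String pos;
        final double posScore, negScore;
        final List<String> terms;  // word#sense_number items
        SwnEntry(String pos, double posScore, double negScore, List<String> terms) {
            this.pos = pos; this.posScore = posScore;
            this.negScore = negScore; this.terms = terms;
        }
    }

    static SwnEntry parse(String line) {
        String[] f = line.split("\t");
        return new SwnEntry(f[0], Double.parseDouble(f[2]), Double.parseDouble(f[3]),
                            Arrays.asList(f[4].split(" ")));
    }

    public static void main(String[] args) {
        String line = "a\t01052038\t0.222\t0.778\ttoo_bad#1 regrettable#1\tdeserving regret";
        SwnEntry e = parse(line);
        System.out.println(e.pos + " " + e.negScore + " " + e.terms);
        // a 0.778 [too_bad#1, regrettable#1]
    }
}
```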
<p>Another interesting feature is that SentiWordNet researchers have provided us with a very basic Java class named <code><a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">SWN3.java</a></code> to query the database for a pair word/POS. This class loads the database and provides a function that outputs "<code>positive</code>", "<code>strong_positive</code>", "<code>negative</code>", "<code>strong_negative</code>" or "<code>neutral</code>" for a given pair according to the manual scores assigned to the synsets. It is very basic because it does not perform <a href="http://en.wikipedia.org/wiki/Word-sense_disambiguation" target="_blank">Word Sense Disambiguation</a> nor even <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" target="_blank">POS Tagging</a>, and the labels are heuristically defined (some other definitions are possible). However, we can take advantage of it in order to implement a very basic sentiment classifier, as described below.</p>
<p>In order to make use of the <code>SWN3.java</code> class, you have to:</p>
<ol>
<li><a href="http://sentiwordnet.isti.cnr.it/download.php" target="_blank">Download a copy of SentiWordNet</a>.</li>
<li>Rename the file to <code>SentiWordNet_3.0.0.txt</code> and put it in a <code>data</code> folder -- relative to the place you located your <code>SWN3.java</code> file. Alternatively, you can modify this class to use a different path or data file name.</li>
<li>Delete all lines starting with the symbol "<code>#</code>" from the <code>SentiWordNet_3.0.0.txt</code> file. HINT: The header and the last line of the file.</li>
</ol>
<p>And that's it.</p>
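Step 3 can be done with any editor, but as a sketch, this is what it amounts to programmatically (shown over an in-memory list of lines; for the real file you would read and rewrite <code>SentiWordNet_3.0.0.txt</code> the same way):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of step 3 above: dropping the comment lines (those starting
// with "#") from the SentiWordNet data.
public class StripComments {
    static List<String> stripComments(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.startsWith("#"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# SentiWordNet v3.0.0 header",
                "a\t01125429\t0\t0.625\tbad#1\thaving undesirable qualities",
                "#\t\t0\t0");
        System.out.println(stripComments(lines).size()); // 1
    }
}
```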
<p><strong>The Algorithm and Its Parameters/Heuristics</strong></p>
<p>I have sketched a very simple algorithm for sentiment classification using the querying function provided by the <code>SWN3.java</code> class. Given the output of its function <code>public String extract(String word, String pos)</code>, that is "positive" etc., the algorithm consists of:</p>
<ol>
<li>Tokenize the target text into alphanumeric strings (typically, words).</li>
<li>Start a polarity score at 0.</li>
<li>For each token, look it up with the extract function and add +1 (positive), +2 (strong_positive), -1 (negative), or -2 (strong_negative) to the score.</li>
<li>Return "<code>yes</code>" if the final polarity score is above 0, and "<code>no</code>" if it is below 0.</li>
</ol>
<p>Let me remind that the class tags used in the SFU Review Corpus are "<code>yes</code>" (positive) and "<code>no</code>" (negative).</p>
<p>That's all. No rocket science here.</p>
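The algorithm can be sketched in a few lines of plain Java. The lexicon map here is a toy stand-in for the lookup that <code>SWN3.java</code>'s extract function performs against SentiWordNet, so its entries are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the scoring algorithm above. The `lexicon` map is a toy
// stand-in for SWN3.extract(word, pos); real lookups would query
// SentiWordNet instead.
public class PolarityDemo {
    static final Map<String, String> lexicon = new HashMap<>();
    static {
        lexicon.put("good", "positive");
        lexicon.put("great", "strong_positive");
        lexicon.put("bad", "negative");
        lexicon.put("awful", "strong_negative");
    }

    static String classify(String text) {
        int score = 0;                                           // step 2
        for (String token : text.toLowerCase().split("\\W+")) {  // step 1
            String tag = lexicon.getOrDefault(token, "neutral"); // step 3
            switch (tag) {
                case "positive":        score += 1; break;
                case "strong_positive": score += 2; break;
                case "negative":        score -= 1; break;
                case "strong_negative": score -= 2; break;
            }
        }
        return score > 0 ? "yes" : "no";  // step 4; a tie (0) defaults to "no" here
    }

    public static void main(String[] args) {
        System.out.println(classify("A great camera, despite the bad manual."));
        // great(+2) + bad(-1) = +1 -> yes
    }
}
```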
<p>However, there are two basic parameters:</p>
<ul>
<li>What to do if you get a <em>neutral</em> score (0)? We can be positive (<code>Y</code>, return "<code>yes</code>" when the score is greater than or equal to 0), or negative (<code>N</code>, return "<code>no</code>" when the score is less than or equal to 0).</li>
<li>Which <em>Part of Speech</em> should we use in the SentiWordNet search? I have crafted two options: (1) looking up (and summing over) all available POS (<code>AllPOS</code>), and (2) looking up only adjectives (<code>ADJ</code>).</li>
</ul>
<p>So I have coded four methods, named <code>classifyAllPOSY()</code>, <code>classifyAllPOSN()</code>, <code>classifyADJY()</code> and <code>classifyADJN()</code> for the four possible combinations. These functions are available in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SentiWordNetDemo.java" target="_blank">SentiWordNetDemo.java</a></code> class <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">at the GitHub repository</a>. And these are the approaches I test below.</p>
<p>The <em>rationale for the first parameter</em> is that we have a 50% balance among the 400 reviews, so it is not clear which default we should prefer. In an imbalanced problem, we could choose the most populated class. An alternative is analyzing SentiWordNet to check whether it is positively or negatively biased (that is, whether it has more positive or negative words), or even refining this with an additional corpus (counting words and weighting according to the frequencies of positive/negative words).</p>
<p>The <em>rationale for the second parameter</em> is that adjectives tend to be less ambiguous (discarding sarcasm or irony), but it is easy to test with any other POS. Using all of them is incorrect (as every word has only one POS in context) but practical, and it will give more extreme scores (assuming that a negative word is negative with each of its possible POS).</p>
<p><strong>Results and Analysis</strong></p>
<p>So we are testing four approaches, and I will be using the same metrics as in the previous blog post on sentiment analysis with WEKA, namely averaged <a href="http://en.wikipedia.org/wiki/F1_score" target="_blank">F1</a> and accuracy (along with the <a href="http://en.wikipedia.org/wiki/Confusion_matrix" target="_blank">Confusion Matrix</a> itself). The test is performed over the 400 text documents in the dataset, as this algorithm requires no training. The following table shows the results I have obtained:</p>
<p><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.sentiwordnet.png" style="TEXT-ALIGN: center; DISPLAY: block; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="120" width="340"/></p>
<p>I have added to the table the two best performing configurations for a learning based classifier <a href="http://jmgomezhidalgo.blogspot.com.es/2013/06/baseline-sentiment-analysis-with-weka.html" target="_blank">as presented in the previous blog post</a>. However, the comparison is not 100% fair, as the learning approach has been evaluated by 10 fold <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">Cross Validation</a> -- which involves using the full dataset as test set, but in 10% size batches.</p>
<p>All in all, it seems that the keyword-based (using SentiWordNet) approach is competitive (it beats many learning-based classifiers in my previous experiment), getting its best results using only adjectives and outputting "<code>no</code>" in case of neutral scores. The effectiveness on the "<code>yes</code>" class is better than the SVMs with 1-to-3-grams, in terms of recall. I believe that, with some adjustments, the keyword-based approach can be very competitive in this case, and it has the additional advantage that it does not rely on the quality or amount of training data.</p>
<p>Comparing the parameters, the default "<code>no</code>" is consistently better than the default "<code>yes</code>". Using all POS is worse than using only adjectives, because even in the case of default "<code>yes</code>" (which is beaten by both ALL cases in terms of accuracy), we get more balanced decisions -- the ALL setup leads to extremely positive scores, and a clear bias to the "<code>yes</code>" class.</p>
<p><strong>Concluding Discussion</strong></p>
<p>As discussed above, I consider this test as <strong>a baseline</strong> because of the wide number of simple heuristics employed in the algorithm. Actually, there are a number of possible improvements to be done, although some of them are not trivial. I tag them as [<em>easy</em>|<em>hard</em>] according to my experience in text mining. For instance:</p>
<ul>
<li>Recognizing <strong>multiword expressions</strong> [<em>easy</em>]. This can be done by making simple searches for token n-grams in the SentiWordNet database, just modifying the <code>SentiWordNetDemo</code> class.</li>
<li>Using a validation dataset to <strong>optimize the score threshold</strong> [<em>easy</em>]. We have assumed that an overall score of 0 is neutral, and tested to classify it as positive or negative (being the second option better). We have general evidence that the database is positively oriented, so we can set a threshold over 0 (e.g. 10, 20...) for classifying a text as positive, in order to correct this effect. The most simple way of doing this is selecting a 10% of the corpus as a validation set, sorting the decisions according to the score, and defining a threshold that optimizes the accuracy (or F1).</li>
<li>Test <strong>different scoring models</strong>, e.g. modifying the <code>SWN3.java</code> program to output the original scores instead of tags [<em>easy</em>] and using those scores for the final polarity score calculation. Alternatively, we can play with different definitions of "strong_positive" etc. in terms of the weights [<em>easy</em>], or use different score thresholds for assigning polarity labels in the database [<em>easy</em>]. This can be more difficult to test, but we can use a validation set as in the previous point.</li>
<li>Performing <strong>POS Tagging</strong> by using the majority tag [<em>easy</em>], coding a POS Tagger based on learning [<em>hard</em>], or using an existing off-the-shelf POS Tagger (like e.g. <a href="http://nlp.lsi.upc.edu/freeling/" target="_blank">Freeling</a> or <a href="http://nlp.stanford.edu/software/corenlp.shtml" target="_blank">CoreNLP</a>) [<em>easy</em>]. After using a POS Tagger, the tags must be normalized or processed in order to retain the basic POS, as most of POS Taggers make use of sophisticated tag sets that represent morphology and so on. Obviously, the algorithm should be changed to perform only the search for the appropriate POS tag.</li>
<li>Performing <strong>Word Sense Disambiguation</strong> by using the first sense [<em>easy</em>], coding a WSD system based on learning using a dataset like <a href="http://www.cse.unt.edu/~rada/downloads.html#semcor" target="_blank">Semcor</a> [<em>hard</em>], coding a WSD system based on dictionaries -- e.g. using the WordNet glosses in the database itself [<em>easy</em>], or using an existing off-the-shelf WSD system such as <a href="http://www.cse.unt.edu/~rada/downloads.html#senselearner" target="_blank">SenseLearner</a> [<em>easy</em>]. You may need to perform data transformations, both in content and in format, if different database versions are used for WSD and sentiment analysis.</li>
</ul>
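The score-threshold optimization mentioned in the list above is straightforward to sketch: score the validation texts with the algorithm, then sweep candidate thresholds and keep the one with the best accuracy. The data below is invented for illustration:

```java
// Illustrative sketch of picking a polarity-score threshold on a validation
// set: try each candidate cut-off and keep the one with highest accuracy.
public class ThresholdDemo {
    static double accuracy(int[] scores, boolean[] positive, int threshold) {
        int hits = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predictedYes = scores[i] > threshold;
            if (predictedYes == positive[i]) hits++;
        }
        return (double) hits / scores.length;
    }

    public static void main(String[] args) {
        // Toy validation data: algorithm scores and gold labels (true = "yes")
        int[] scores = { -5, 3, 8, 1, 12, -2, 6, 4 };
        boolean[] gold = { false, false, true, false, true, false, true, true };
        int best = Integer.MIN_VALUE;
        double bestAcc = -1;
        for (int t = -20; t <= 20; t++) {
            double acc = accuracy(scores, gold, t);
            if (acc > bestAcc) { bestAcc = acc; best = t; }
        }
        System.out.println("best threshold = " + best + ", accuracy = " + bestAcc);
    }
}
```

On this toy data the positive reviews all score above 3, so the sweep finds a cut-off above the naive 0; the same positive bias is what we might expect from SentiWordNet itself.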
<p>As more exploratory work, I suggest the following:</p>
<ul>
<li>Test the algorithm on other datasets like the classical <a href="http://www.cs.cornell.edu/people/pabo/movie-review-data/" target="_blank">Movie Review Datasets</a> by <a href="http://research.yahoo.com/Bo_Pang" target="_blank">Bo Pang</a> and <a href="http://www.cs.cornell.edu/home/llee" target="_blank">Lilian Lee</a>, or with other semantic lexicons (opinionated word databases) like the <a href="http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon" target="_blank">Opinion Lexicon</a> by <a href="http://www.cs.uic.edu/~liub/" target="_blank">Bing Liu</a> <em>et al</em>. or the <a href="http://mpqa.cs.pitt.edu/lexicons/subj_lexicon/" target="_blank">Subjectivity Lexicon</a> by <a href="http://people.cs.pitt.edu/~wiebe/" target="_blank">Janyce Wiebe</a> <em>et al</em>..</li>
<li>Perform an exploratory analysis of the distribution of polarities at SentiWordNet and its implications on the basic algorithm.</li>
</ul>
<p>I am not sure if I will be making any other tests with the keyword-based approach to sentiment analysis, as I want to keep my focus on <a href="http://www.esp.uem.es/jmgomez/tmweka/" target="_blank">WEKA features for text mining</a>.</p>
<p>Anyway, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com10tag:blogger.com,1999:blog-36589303.post-46247701481452844112013-06-11T13:21:00.001+02:002013-06-11T13:21:32.147+02:00Baseline Sentiment Analysis with WEKA<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.headpic.png" style="DISPLAY: inline" height="143" width="447"/></p>
<p><a href="http://en.wikipedia.org/wiki/Sentiment_analysis" target="_blank">Sentiment Analysis (and/or Opinion Mining)</a> is one of the hottest topics in <a href="http://en.wikipedia.org/wiki/Natural_language_processing" target="_blank">Natural Language Processing</a> nowadays. The task, defined in a simplistic way, consists of determining the polarity of a text utterance according to the opinion or sentiment of the speaker or writer, as positive or negative. This task has multiple applications, including e.g. Customer Relationship Management or predicting political elections.</p>
<p>While initial results dating back to the early 2000s seem very promising, it is not such a simple task. We face problems ranging from <a href="http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf" target="_blank">the informal Twitter language</a> to the fact that <a href="http://times.cs.uiuc.edu/czhai/pub/www07-sent.pdf" target="_blank">opinions can be faceted</a> (for instance, I may like the software but not the hardware of a device), or <a href="http://www.cs.uic.edu/~liub/FBS/fake-reviews.html" target="_blank">opinion spam and fake reviews</a>, along with traditional and complex Natural Language Processing problems such as irony, sarcasm or negation. For a good overview of the task, please check <a href="http://www.cs.cornell.edu/home/llee/opinion-mining-sentiment-analysis-survey.html" target="_blank">the survey paper on opinion mining and sentiment analysis by Bo Pang and Lillian Lee</a>. A more practical overview is the <a href="http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html" target="_blank">Sentiment Tutorial with LingPipe by Alias-i</a>.</p>
<p>In general, there are two main approaches to this task:</p>
<ul>
<li>Counting and/or weighting sentiment-related words that have been evaluated and tagged by experts, conforming a lexical collection like <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet</a>.</li>
<li>Learning a text classifier on a previously labelled text collection, like e.g. the <a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">SFU Review Corpus</a>.</li>
</ul>
<p>The SentiWordNet home page offers <a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">a simple Java program that follows the first approach</a>. I will follow the second one in order to show how to use an essential WEKA text mining class (<code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/TextDirectoryLoader.html" target="_blank">weka.core.converters.TextDirectoryLoader</a></code>), and to provide another example of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a></code> class.</p>
<p>I will follow the process outlined in <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">the previous post about Language Identification using WEKA</a>.</p>
<p><strong>Data Collection and Preprocessing</strong></p>
<p>For this demonstration, I will make use of a relatively small but interesting dataset named <a href="http://www.sfu.ca/~mtaboada/research/SFU_Review_Corpus.html" target="_blank">the SFU Review Corpus</a>. This corpus consists of 400 reviews in English extracted from the <em>Epinions</em> website in 2004, divided into 25 positive and 25 negative reviews for each of 8 product categories (Books, Cars, Computers, etc.). It also contains 400 reviews in Spanish extracted from <em>Ciao.es</em>, divided into the same categories (except for the Cookware category in English, which --more or less-- maps to Lavadoras --Washing Machines-- in Spanish).</p>
<p>The original format of the collection is one directory per product category, each including 25 positive reviews with the word "yes" in the file name and 25 negative reviews with the word "no" in the file name. Unfortunately, this format does not allow WEKA to work with it directly, but a couple of handy scripts transform it into a new format: two directories, one including the positive reviews (directory <code>yes</code>), and the other one including the negative reviews (directory <code>no</code>). I have kept the category in the name of the files (with patterns like <code>bookyes1.txt</code>) in order to allow others to make a more detailed analysis per category.</p>
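The scripts themselves are not included in the post; a minimal sketch of the transformation in plain Java (the <code>yes</code>/<code>no</code> directory names and file-name convention follow the description above, while the category-prefix naming is an assumption) could be:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the layout transformation described above: files
// whose names contain "yes" go to a single yes/ directory, the rest to no/.
public class ReorganizeCorpus {
    public static void reorganize(Path source, Path target) throws IOException {
        Path yesDir = Files.createDirectories(target.resolve("yes"));
        Path noDir = Files.createDirectories(target.resolve("no"));
        try (DirectoryStream<Path> categories = Files.newDirectoryStream(source)) {
            for (Path category : categories) {
                if (!Files.isDirectory(category)) continue;
                try (DirectoryStream<Path> reviews = Files.newDirectoryStream(category)) {
                    for (Path review : reviews) {
                        Path dest = review.getFileName().toString().contains("yes")
                                ? yesDir : noDir;
                        // Keep the category in the file name, e.g. bookyes1.txt
                        Files.copy(review, dest.resolve(
                                category.getFileName() + review.getFileName().toString()));
                    }
                }
            }
        }
    }
}
```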
<p>Comparing the structure of the original and the new format of the text collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/structure.collections.sfu.opinion.mining.png" style="DISPLAY: inline" height="202" width="180"/></p>
<p>In order to construct an <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF</a> file from this structure, we can use the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/TextDirectoryLoader.html" target="_blank">weka.core.converters.TextDirectoryLoader</a></code> class, which is an evolution of a previously existing helper class named <code><a href="http://weka.wikispaces.com/Text+categorization+with+WEKA" target="_blank">TextDirectoryToArff.java</a></code> and available at <a href="http://weka.wikispaces.com/" target="_blank">WEKA Documentation at wikispaces</a>. Using this class is as simple as issuing the next command:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p><code>$> java weka.core.converters.TextDirectoryLoader -dir SFU_Review_Corpus_WEKA > SFU_Review_Corpus.arff</code></p>
</blockquote>
<p>You have to call this command at the parent directory of <code>SFU_Review_Corpus_WEKA</code>, and the parameter <code>-dir</code> sets up the input directory. This class expects to have a single directory containing a directory per class value (<code>yes</code> and <code>no</code> in our case), which in turn should contain a number of files pertaining to the corresponding classes. As the output of this command goes to the standard output, I have to redirect it to a file.</p>
<p>I have left the output of the execution of this command for both the English (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SFU_Review_Corpus.arff" target="_blank">SFU_Review_Corpus.arff</a></code>) and the Spanish (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/OpinionMining/SFU_Spanish_Review.arff" target="_blank">SFU_Spanish_Review.arff</a></code>) collections at <a href="https://github.com/jmgomezh/tmweka/tree/master/OpinionMining" target="_blank">the OpinionMining folder</a> of <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub repository</a>.</p>
<p><strong>Data Analysis</strong></p>
<p>Previous models in my blog posts have been based on a relatively simple representation of texts as sequences of words. However, a trivial analysis of the problem easily leads us to think that multi-word expressions (e.g. "very bad" vs. "bad", or "a must" vs. "I must") can be better predictors of user sentiment or opinion about an item. Because of this, we will compare word n-grams vs. single words (or unigrams). As a basic setup, I propose comparing word unigrams, 3-grams, and 1-to-3-grams. The latter representation includes uni- to 3-grams, with the hope of getting the best of all of them.</p>
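To make the three representations concrete, here is a standalone sketch (plain Java, independent of WEKA's <code>NGramTokenizer</code>) of word n-gram extraction; with min=1 and max=3 it yields the 1-to-3-gram vocabulary of a text:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of word n-gram extraction (e.g. 1-to-3-grams),
// analogous to what NGramTokenizer produces with -min 1 -max 3.
public class NGramDemo {
    static List<String> ngrams(String text, int min, int max) {
        String[] words = text.split("\\W+");
        List<String> result = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                StringBuilder sb = new StringBuilder(words[i]);
                for (int j = 1; j < n; j++) sb.append(' ').append(words[i + j]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("this camera is very bad", 1, 3));
        // includes unigrams like "bad", bigrams like "very bad",
        // and trigrams like "is very bad"
    }
}
```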
<p>Keeping in mind that capitalization may matter in this problem ("BAD" is worse than "bad"), and that we can rely on standard punctuation (for each of the languages) because the texts are long comments (several paragraphs each), I derive the following calls to the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">weka.filters.unsupervised.attribute.StringToWordVector</a> class:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 1" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.uni.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 3 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.tri.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \"\\\\W\" -min 1 -max 3" -W 10000000 -i SFU_Review_Corpus.arff -o SFU_Review_Corpus.vector.unitri.arff</code></p>
</blockquote>
<p>We follow the notation <code>vector.uni</code> to denote that the dataset is vectorized and that we are using word unigrams, and so on. The calls for the Spanish collection are similar to these ones.</p>
<p>The most important thing in these calls is that we are no longer using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code> class. Instead, we are using <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html" target="_blank">weka.core.tokenizers.NGramTokenizer</a></code>, which uses the options <code>-min</code> and <code>-max</code> to set the minimum and maximum size of the n-grams. The key point is that the two classes differ substantially in how they handle delimiters:</p>
<ul>
<li>The <code>weka.core.tokenizers.WordTokenizer</code> class uses the deprecated Java class <code><a href="http://docs.oracle.com/javase/6/docs/api/java/util/StringTokenizer.html" target="_blank">java.util.StringTokenizer</a></code> , even in the latest versions of the WEKA package (as of the day of this writing). In <code>StringTokenizer</code>, the delimiters are the characters used as "spaces" to tokenize the input string: white space, punctuation marks, etc. So you have to explicitly define which will be the "spaces" in your text.</li>
<li>The <code>weka.core.tokenizers.NGramTokenizer</code> class uses the recommended Java String method <code><a href="http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#split(java.lang.String)" target="_blank">String[] split(String regex)</a></code>, in which the argument (and thus the delimiters string) is a Java <a href="http://en.wikipedia.org/wiki/Regular_expression" target="_blank">Regular Expression</a> (regex). The text is split into tokens separated by substrings that match the regex, so you can use the full power of regexes, including special codes for characters. In this case I am using the code <code>\W</code>, which denotes any non-word character, in order to keep only alphanumeric character sequences.</li>
</ul>
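<p>The contrast between the two mechanisms is easy to reproduce in plain JDK code — the following sketch mimics the underlying behaviors (literal delimiter characters vs. a separator regex); it is my own illustration, not WEKA code:</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizerContrast {

    // WordTokenizer-style: delimiters is a literal list of "space" characters.
    public static List<String> byDelimiterChars(String text, String delims) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) tokens.add(st.nextToken());
        return tokens;
    }

    // NGramTokenizer-style: delimiters is a regex matched against separators.
    public static List<String> byRegex(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split(regex)) if (!t.isEmpty()) tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "good camera, ¿no?";
        // Literal delimiter list: any separator not listed leaks into a token
        // (here the opening question mark sticks to "no").
        System.out.println(byDelimiterChars(text, " .,;:?!"));
        // Regex: \W stands for any non-word character, nothing has to be listed.
        System.out.println(byRegex(text, "\\W+"));
    }
}
```

<p>This is why the regex-based tokenizer is more convenient when the texts contain symbols you did not anticipate in a delimiter list.</p>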
<p>After splitting the text into word n-grams (or more properly, after representing the texts as term-weight vectors in our Vector Space Model), we may want to examine which n-grams are most predictive. As <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">in the Language Identification post</a>, we make use of the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a></code> class:</p>
<blockquote>
<p><code>$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.uni.arff -o SFU_Review_Corpus.vector.uni.ig0.arff
<br/>
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.tri.arff -o SFU_Review_Corpus.vector.tri.ig0.arff
<br/>
$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -i SFU_Review_Corpus.vector.unitri.arff -o SFU_Review_Corpus.vector.unitri.ig0.arff</code></p>
</blockquote>
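<p>The <code>Ranker -T 0.0</code> setting discards attributes whose Information Gain score does not exceed the 0.0 threshold. As a reminder of what that score measures, here is a minimal, WEKA-independent sketch for a binary term and a binary class (the counts in <code>main</code> are invented for illustration):</p>

```java
public class InfoGain {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Binary entropy in bits.
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // Information gain of a binary term for a binary class, from counts:
    // posWith/negWith = positive/negative documents containing the term,
    // posTotal/negTotal = class sizes.
    public static double infoGain(int posWith, int negWith, int posTotal, int negTotal) {
        double n = posTotal + negTotal;
        double withTerm = posWith + negWith;
        double withoutTerm = n - withTerm;
        double prior = entropy(posTotal / n);
        double condWith = withTerm == 0 ? 0 : entropy(posWith / withTerm);
        double condWithout = withoutTerm == 0 ? 0 : entropy((posTotal - posWith) / withoutTerm);
        return prior - (withTerm / n) * condWith - (withoutTerm / n) * condWithout;
    }

    public static void main(String[] args) {
        // A term in 180 of 200 positive reviews but only 20 of 200 negative
        // ones is highly predictive...
        System.out.println(infoGain(180, 20, 200, 200));
        // ...while a term occurring evenly in both classes scores exactly 0,
        // and is therefore dropped by the 0.0 threshold.
        System.out.println(infoGain(100, 100, 200, 200));
    }
}
```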
<p>After the selection of the most predictive n-grams, we get the following statistics in the test collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.term.stats.png" style="DISPLAY: inline" height="171" width="240"/></p>
<p>The percentages in rows 3-6-9 measure the aggressiveness of the feature selection. Overall, both collections show comparable statistics (the same order of magnitude). The numbers of original unigrams are quite similar, but there are fewer bigrams and trigrams in Spanish (despite it having more isolated words -- unigrams). Selecting n-grams with Information Gain is a bit more aggressive in Spanish for unigrams and bigrams, but less so for trigrams.</p>
<p>Adding bigrams and trigrams to the representation substantially increases the number of predictive features (by 4 to 5 times). However, trigrams alone contribute only a small increment, so bigrams must be playing the main role here. The number of features is quite manageable, which allows us to run quick experiments.</p>
<p>As discussed in my previous post on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">setting up experiments with WEKA text classifiers and how to chain filters and classifiers</a>, note that these are not the final features if we configure a cross-validation experiment -- we have to chain the filters (<code>StringToWordVector</code> and <code>AttributeSelection</code>) and the classifier in order to perform a valid experiment, as the features for each fold should be different.</p>
<p><strong>Experiments and Results</strong></p>
<p>In order to simplify the example, and expecting to get good results, we will use <a href="http://jmgomezhidalgo.blogspot.com.es/2013/05/language-identification-as-text.html" target="_blank">the same algorithms we used in the Language Identification problem</a>. These are: Naive Bayes (NB, <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">weka.classifiers.bayes.NaiveBayes</a></code>), PART (<code><a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a></code>), J48 (<code><a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">weka.classifiers.trees.J48</a></code>), k-Nearest Neighbors (<code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">weka.classifiers.lazy.IBk</a></code>) with k = 1,3,5, and Support Vector Machines (<code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a></code>); all of them with the default options, except for kNN which uses 1, 3 and 5 neighbors. I am testing the three proposed representations (based on unigrams, trigrams and 1-3grams) by 10-fold cross-validation. An example experiment command line is the following one:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -tokenizer \\\"weka.core.tokenizers.NGramTokenizer -delimiters \\\\\\\"\\\\\\\W\\\\\\\" -min 1 -max 1\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.bayes.NaiveBayes -v -i -t SFU_Review_Corpus.arff > tests/uniNB.txt</code></p>
</blockquote>
<p>You can change the size of n-grams with the <code>-min</code> and <code>-max</code> parameters. Also, you can change the learning algorithm with the outermost <code>-W</code> option. I am storing the results in a <code>tests</code> folder, in files with the convention <code>&lt;rep&gt;&lt;alg&gt;.txt</code>. The results of this test for the English language collection are the following ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.english.png" style="DISPLAY: inline" height="376" width="338"/></p>
<p>Considering the class <code>yes</code> (positive sentiment) as the positive class, in each column we show the True Positives (hits on the <code>yes</code> class), False Positives (members of the <code>no</code> class mistakenly classified as <code>yes</code>), False Negatives (members of the <code>yes</code> class mistakenly classified as <code>no</code>) and True Negatives (hits on the <code>no</code> class); along with the <a href="http://datamin.ubbcluj.ro/wiki/index.php/Evaluation_methods_in_text_categorization" target="_blank">macro-averaged</a> <a href="http://en.wikipedia.org/wiki/F1_score" target="_blank">F1</a> (standard average F1 over both classes) and the general <a href="http://en.wikipedia.org/wiki/Accuracy_and_precision" target="_blank">accuracy</a>.</p>
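<p>The macro-averaged F1 shown in the tables is just the plain average of the per-class F1 scores. A minimal sketch of how those figures derive from the four confusion-matrix cells (the counts in <code>main</code> are invented for illustration, not taken from the actual results):</p>

```java
public class SentimentMetrics {

    // F1 for one class, treating it as the positive class.
    static double f1(int tp, int fp, int fn) {
        double precision = (tp + fp == 0) ? 0 : (double) tp / (tp + fp);
        double recall = (tp + fn == 0) ? 0 : (double) tp / (tp + fn);
        return (precision + recall == 0) ? 0 : 2 * precision * recall / (precision + recall);
    }

    // Macro-averaged F1 over the two classes: for the "no" class the roles of
    // the cells are swapped (its TP are the original TN, its FP the original FN).
    public static double macroF1(int tp, int fp, int fn, int tn) {
        return (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2.0;
    }

    public static double accuracy(int tp, int fp, int fn, int tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    public static void main(String[] args) {
        // Hypothetical confusion matrix: TP=140, FP=60, FN=55, TN=145.
        System.out.println(macroF1(140, 60, 55, 145));
        System.out.println(accuracy(140, 60, 55, 145));
    }
}
```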
<p>Additionally, the results for the Spanish language collection are the following ones:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.results.spanish.png" style="DISPLAY: inline" height="376" width="341"/></p>
<p>So these are the results. Let us start the analysis...</p>
<p><strong>Results Analysis</strong></p>
<p>We can perform an analysis regarding different aspects:</p>
<ul>
<li>What is the overall performance?</li>
<li>How does performance compare across the two languages?</li>
<li>Which are the best learning algorithms?</li>
<li>What effect do the different text representations have on classifier performance?</li>
</ul>
<p>All in all, and taking into account that the class balance is 50% (so a trivial acceptor, a trivial rejector, or a random classifier would reach 50% accuracy), most of the classifiers beat this baseline, but not by a wide margin; even the best combination among all algorithms, languages and representations (SVMs on English 1-to-3-grams) reaches only a modest 71% -- far from a satisfying 90% or more. Let me remind you that we are facing a relatively simple problem -- a few long texts and a binary classification. Most approaches in the literature get much better results in similar setups.</p>
<p>Results are better for English than for Spanish in one-to-one comparisons. To explain this, let us check the representations used for Spanish by listing the top 20 n-grams of each representation:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/opinion.mining.top.spanish.terms.png" style="DISPLAY: inline" height="339" width="290"/></p>
<p>Some of the n-grams (highlighted in <em>italics</em>) are just incorrect, because accented characters are not recognized due to the inappropriate pattern I have used in the tokenization step. The tokenizer uses the string "<code>\W</code>" to recognize alphanumeric strings -- which in Java does not include vowels with accents ("á", "é", "í", "ó", "ú") nor other language-specific symbols (e.g. "ñ"). Moreover, most of the n-grams are not opinionated words or expressions; they are either intensifiers (e.g. "muy" -- "very") or just contingent on the training collection (e.g. "en el taller" -- "in the garage"; "tarjeta de memoria" -- "memory card"). The clearly opinionated words, highlighted in <strong>boldface</strong>, are very few. Regarding this issue, we can conclude that the training collection is too small.</p>
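<p>The accent problem is easy to reproduce in plain Java, and it also hints at a possible fix: enabling the <code>UNICODE_CHARACTER_CLASS</code> mode (the <code>(?U)</code> inline flag, available since Java 7), which makes <code>\W</code> respect accented letters. Whether this particular flag can be passed through WEKA's <code>-delimiters</code> option is something I still have to verify:</p>

```java
import java.util.Arrays;

public class SpanishTokens {

    // Splits on non-word characters; unicodeAware toggles the (?U) flag.
    public static String[] tokens(String text, boolean unicodeAware) {
        return text.split(unicodeAware ? "(?U)\\W+" : "\\W+");
    }

    public static void main(String[] args) {
        String text = "la cámara es muy pequeña";
        // Default \W: á and ñ count as non-word characters, so words break apart.
        System.out.println(Arrays.toString(tokens(text, false)));
        // Unicode-aware \W keeps the Spanish words intact.
        System.out.println(Arrays.toString(tokens(text, true)));
    }
}
```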
<p>If we examine the performance of different classifiers, we can cluster them in three groups: top performers (SVMs, NB), medium performers (PART, J48) and losers for this problem (kNN). These groups are intuitive:</p>
<ul>
<li>Both SVMs and NB have often demonstrated their high performance in sparse datasets, and in text classification problems in particular. They both build a linear classifier with weights (or probabilities) for each of the features. Linear classifiers perform well here given that the dataset is built on representations that clearly promote over-fitting the dataset, as we have seen that many of the most predictive n-grams are collection-dependent.</li>
<li>Both PART and J48 (C4.5) are based on reducing error by progressively partitioning the dataset according to tests on the most predictive features. But the predictive features we have for such a small collection are not very good, indeed.</li>
<li>All versions of kNN perform very badly, most likely because the dataset is sparse and relatively small.</li>
</ul>
<p>However, we have to keep in mind that we have used the algorithms with their default configurations. For instance, kNN allows using the <a href="http://en.wikipedia.org/wiki/Cosine_similarity" target="_blank">cosine similarity</a> instead of the <a href="http://en.wikipedia.org/wiki/Euclidean_distance" target="_blank">Euclidean distance</a> -- and the cosine similarity is much better suited to text classification problems, as demonstrated by 50 years of research in <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">Information Retrieval</a>.</p>
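<p>The intuition behind that claim is easy to see in a toy sketch: cosine similarity compares the direction of term-weight vectors, so a short review and a long review with the same word proportions look identical, while Euclidean distance penalizes the length difference (the vectors below are invented for illustration):</p>

```java
public class Similarity {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two "documents" with the same word proportions but different lengths,
        // and a third one with a disjoint vocabulary.
        double[] shortDoc = {2, 1, 0, 0};
        double[] longDoc  = {8, 4, 0, 0};
        double[] otherDoc = {0, 0, 2, 1};
        // Cosine sees shortDoc and longDoc as identical in direction (1.0)...
        System.out.println(cosine(shortDoc, longDoc));
        // ...while Euclidean distance penalizes the length difference.
        System.out.println(euclidean(shortDoc, longDoc));
        // Disjoint vocabularies yield zero cosine similarity.
        System.out.println(cosine(shortDoc, otherDoc));
    }
}
```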
<p>And regarding dataset representations, the behavior is not uniform -- we do not systematically get better results with one representation in comparison with the others. In general, 1-to-3-grams perform better than the other representations in English, while unigrams are best in Spanish, and trigrams are most often the worst representation for both languages. If we focus on the top-performing classifiers (NB and SVMs), the latter observation always holds. In consequence, trigrams have -- to some extent -- demonstrated their value in English (as a complement to unigrams and bigrams), but not in Spanish (where we know the representation is flawed because of the character-encoding issue).</p>
<p><strong>Concluding Remarks</strong></p>
<p>So all in all, we have a baseline learning-based method for Sentiment Analysis in English (and probably in Spanish, after correcting the representation), which is -- not surprisingly -- based on 1-to-3-grams and Support Vector Machines. And it is a baseline because its performance is relatively poor (with an accuracy of 71%), and we have not taken full advantage of the configuration, text representation and other parameters yet.</p>
<p>After this long (again!) post, I propose the next steps -- some of them left for the reader as an exercise:</p>
<ul>
<li>Build a Java class that classifies text files according their sentiment, for English at least, taking my previous post on Language Identification as an example -- left for the reader.</li>
<li>Test other algorithms, and in particular: play with the SVM configuration, and add Boosting (using <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/AdaBoostM1.html" target="_blank">weka.classifiers.meta.AdaBoostM1</a></code>) to Naive Bayes -- left for the reader.</li>
<li>Check differences of accuracy in terms of product type -- cars, movies, etc. -- left for the reader.</li>
<li>Improve the Spanish language representation using the appropriate regex in the tokenizer to cover Spanish letters and accents -- I will take this one myself.</li>
<li>Check the accuracy of the <a href="http://sentiwordnet.isti.cnr.it/code/SWN3.java" target="_blank">basic keyword-based algorithm</a> available in the <a href="http://sentiwordnet.isti.cnr.it/" target="_blank">SentiWordNet page</a> -- I will take this one as well.</li>
</ul>
<p>So that is all for the moment. You can expect one or more posts from me on this hot topic. Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com13tag:blogger.com,1999:blog-36589303.post-76692288550493310962013-05-23T18:07:00.001+02:002013-05-23T18:07:30.427+02:00Compilation of Resources for Text-based Age Detection<p><em>Text-based age detection</em> consists of estimating the age of a user according to the kind of texts he/she writes. This task has been attracting some attention in recent years, as for instance it promises to add <em>one of the most interesting demographic features required in ad targeting</em>. There is even an online application, <a href="http://www.tweetgenie.nl/" target="_blank">TweetGenie</a>, which guesses the age of a Twitter user -- it works for Dutch and English.</p>
<p>Text-based age detection is a text classification task closely related to others like genre detection or authorship attribution, as it should rely on stylistic features (e.g. usage of capitalization, average word length, frequencies of prepositions, or even the usage of emoticons) instead of content-bearing words (mostly nouns and verbs), as used e.g. in topical text categorization. However, this does not mean that a purely word-based learner would not be effective.</p>
<p>A particular feature of this task is that <em>it can be approached as classification, if ages are divided into ranges, or as regression</em>, if we try to predict the exact age of the user.</p>
<p>There is a currently ongoing scientific competition on this topic, namely the <a href="http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-web/author-profiling.html" target="_blank">Author Profiling task</a> at the <a href="http://pan.webis.de/" target="_blank">9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013)</a>. With this competition adding new text collections, we have the following resources for trying and testing our approaches to text-based age detection:</p>
<ul>
<li>The <a href="http://www.uni-weimar.de/medien/webis/research/events/pan-13/pan13-web/author-profiling.html" target="_blank">PAN 2013 Training Corpus for Author Profiling Task</a>, consisting of a big number of posts and chats from three age ranges in Spanish and English.</li>
<li>The <a href="http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm" target="_blank">Blog Authorship Corpus</a>, referenced in PAN, consisting of a big number of blog posts from three age ranges in English.</li>
<li>The <a href="http://faculty.nps.edu/cmartell/NPSChat.htm" target="_blank">NPS Chat Corpus</a>, consisting of a relatively small number of chats from five age ranges in English (<a href="http://nltk.org/nltk_data/" target="_blank">download it from the NLTK corpora page</a> or purchase it from the LDC).</li>
</ul>
<p>For your comfort, I summarize some statistics about the collections:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/agedetection.corpora.statistics.png" style="WIDTH: 476px; DISPLAY: inline; HEIGHT: 204px" height="204" width="476"/></p>
<p>And some notes on the information available in each collection:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/agedetection.corpora.description.png" style="WIDTH: 447px; DISPLAY: inline; HEIGHT: 159px" height="159" width="447"/></p>
<p>The following papers may be of interest in order to avoid repeating others' work.</p>
<ul>
<li>J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). <strong><a href="http://www.cs.biu.ac.il/~schlerj/schler_springsymp06.pdf" target="_blank">Effects of Age and Gender on Blogging</a></strong>, Proceedings of 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.</li>
<li>S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009), <strong><a href="http://u.cs.biu.ac.il/~koppel/papers/AuthorshipProfiling-cacm-final.pdf" target="_blank">Automatically profiling the author of an anonymous text</a></strong>, Communications of the ACM 52 (2): 119-123.</li>
<li>M. Koppel, S. Argamon and A. Shimoni (2002), <strong><a href="http://u.cs.biu.ac.il/~koppel/papers/male-female-llc-final.pdf" target="_blank">Automatically categorizing written texts by author gender</a></strong>, Literary and Linguistic Computing 17(4), November 2002, pp. 401-412.</li>
<li>Jenny K. Tam (2009). <strong><a href="https://www.google.es/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDAQFjAA&url=http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA508858&ei=-_6cUYOdFvTT7Ab_GQ&usg=AFQjCNEw2YM65O_lL2kux4yZvNlwhJXosA&sig2=G3u0NRc-5gOWd1O5FkgeTA" target="_blank">Detecting Age in Online Chat</a></strong>, Master Thesis, Naval Postgraduate School.</li>
<li>Jane Lin (2007). <strong><a href="http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA467087" target="_blank">Automatic Author Profiling of Online Chat Logs</a></strong>, Master Thesis, Naval Postgraduate School.</li>
</ul>
<p>Please feel free to send me a message or comment below if you find any other resource that I should add to this post. Thanks for reading.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-38599562369703264902013-05-22T18:22:00.001+02:002013-05-22T18:22:38.213+02:00Presentación: "Menores y móviles: Usos, riesgos y controles parentales"<p>On April 19th I gave a talk at the Universidad Europea de Madrid, titled "<strong>Menores y móviles: Usos, riesgos y controles parentales</strong>" ("Minors and mobile phones: usage, risks and parental controls"). The talk corresponds to research I have carried out within the project "Protección de usuarios menores de edad de telefonía móvil inteligente" ("Protection of underage users of smart mobile phones"), led by <a href="http://joaquinpe.wordpress.com/" target="_blank">Joaquin Pérez</a> and funded by the <a href="http://www.uem.es/" target="_blank">Universidad Europea de Madrid</a> (P2012 UEM14).</p>
<p>The abstract of the talk <a href="http://www.mavir.net/talks/159-gomezhidalgo-abr2013" target="_blank">is available at the MAVIR network page</a> (<a href="http://www.mavir.net/que-es-mavir" target="_blank">MA2VICMR: Mejorando el Acceso, el Análisis y la Visibilidad de la Información y los Contenidos Multilingüe y Multimedia en Red para la Comunidad de Madrid</a>), and these are the slides I used during the talk:</p>
<p style="TEXT-ALIGN: center"><iframe src="http://www.slideshare.net/slideshow/embed_code/21686368" height="400" width="476" marginwidth="0" marginheight="0" scrolling="no" frameborder="0"/></p>
<p style="TEXT-ALIGN: left">If you are interested in this topic, do not hesitate to ask any question or make any suggestion in the comments of this post.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-45902268791462932022013-05-20T21:28:00.001+02:002013-05-20T21:28:19.615+02:00Language Identification as Text Classification with WEKA<p><a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Language Identification</a>, which consists of guessing the natural language in which a text is written (or an utterance is spoken), is not one of the hardest problems in <a href="http://en.wikipedia.org/wiki/Natural_language_processing">Natural Language Processing</a>, and in consequence I believe <em>it is a good starting point for learning about the text analysis capabilities available in WEKA</em>.</p>
<p>This problem has in fact been tackled by others, as in this <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">tutorial on using LingPipe for Language Identification</a>, or by <a href="http://blog.alejandronolla.com/" target="_blank">Alejandro Nolla</a> in his post on <a href="http://blog.alejandronolla.com/2013/05/15/detecting-text-language-with-python-and-nltk/" target="_blank">Detecting Text Language With Python and NLTK</a>. Moreover, you can find a wide range of language identification programs, APIs and demos in the <a href="http://en.wikipedia.org/wiki/Language_identification" target="_blank">Wikipedia article on Language Identification</a>. We may even consider this function a natural language commodity, as <a href="http://translate.google.com/" target="_blank">Google Translate</a> performs it by default, as shown in the next figure:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/google.translate.langid.png" style="WIDTH: 400px; DISPLAY: inline; HEIGHT: 159px" height="159" width="400"/></p>
<p>The most typical (and rather simple) approach to Language Identification is storing a list of the <em>most frequent character 3-grams</em> of each language and checking the overlap of the target text with each of the lists. Alternatively, you can use stop word lists. Of course, the accuracy depends on how you compute the overlap, but even simple distances make it rather effective.</p>
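<p>That simple overlap scheme fits in a few lines of plain Java. The profiles below are toy ones, built from single sentences and using sets rather than frequency-ranked lists; a real identifier would keep the most frequent 3-grams of large corpora per language:</p>

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TrigramGuesser {

    // Builds the set of character 3-grams of a text (lowercased).
    public static Set<String> trigrams(String text) {
        Set<String> grams = new HashSet<>();
        String t = text.toLowerCase();
        for (int i = 0; i + 3 <= t.length(); i++) grams.add(t.substring(i, i + 3));
        return grams;
    }

    // Guesses the language whose profile overlaps most with the target text.
    public static String guess(String text, Map<String, Set<String>> profiles) {
        Set<String> target = trigrams(text);
        String best = null;
        int bestOverlap = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> common = new HashSet<>(target);
            common.retainAll(e.getValue());       // set intersection = overlap
            if (common.size() > bestOverlap) {
                bestOverlap = common.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = new HashMap<>();
        profiles.put("EN", trigrams("the quick brown fox jumps over the lazy dog"));
        profiles.put("SP", trigrams("el rapido zorro marron salta sobre el perro perezoso"));
        System.out.println(guess("the dog jumps over", profiles));
    }
}
```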
<p>However, I will not follow this approach here. Instead, I will show how to build a standard text classifier using <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a>, in order to show the options of (and how to apply) the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code> filter, which is <em>the main tool for text analysis in WEKA</em>.</p>
<p>The steps we have to follow are the next ones:</p>
<ol>
<li>To collect data from different languages in order to build a basic dataset.</li>
<li>To prepare the data for learning, which involves transforming it by using the <code>StringToWordVector</code> filter.</li>
<li>To analyze the resulting dataset, and hopefully, to improve it by using attribute selection.</li>
<li>To test over an independent test collection, which will give us a robust estimation of the accuracy of the approaches on real examples.</li>
<li>To learn the most accurate model as obtained from the previous step, and to use it for our classification program.</li>
</ol>
<p>So this will be a rather long post. Be prepared for it.</p>
<p><strong>Collecting the data and Creating the Datasets</strong></p>
<p>Following the <a href="http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html" target="_blank">LingPipe Language ID Tutorial</a>, I collect the data from the <a href="http://corpora.uni-leipzig.de/" target="_blank">Leipzig Corpora Home Page</a>. In particular, I will address guessing among English (EN), French (FR) and Spanish (SP), so I have gone to <a href="http://corpora.uni-leipzig.de/download.html" target="_blank">the download page</a>, completed the CAPTCHA to get the list of available corpora, and downloaded:</p>
<ul>
<li>The <a href="http://corpora.uni-leipzig.de/downloads/eng_news_2005_10K-text.tar.gz" target="_blank">2005 English 10k corpus of news in text format</a>.</li>
<li>The <a href="http://corpora.uni-leipzig.de/downloads/fra_news_2009_10K-text.tar.gz" target="_blank">2009 French 10k corpus of news in text format</a>.</li>
<li>The 2001-2002 Spanish 10k corpus of news in text format -- which is no longer there as far as I can see.</li>
</ul>
<p>For your comfort, I have put these corpora <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in my LangID GITHub demo page</a>. The files have the following format:</p>
<blockquote>
<p><code>1 I didn't know it was police housing," officers quoted Tsuchida as saying.
<br/>
2 You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
<br/>
3 He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.</code></p>
</blockquote>
<p>So I have loaded them into an OpenOffice spreadsheet, and replaced the number columns by the corresponding tags for the different languages: <code>EN</code>, <code>FR</code>, and <code>SP</code>. Then I have escaped the <code>"</code> and <code>'</code> characters, because they are string delimiters in WEKA <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">Attribute-Relation File Format</a> (ARFF). In order to build the datasets, I have split the data keeping the first 9K sentences of each language for training, and the remaining 1K for testing. As some learning algorithms may be sensitive to the instance order, I have mixed the instances in batches of 1K texts, so the first 1K sentences are in English, the next 1K sentences are in French, and so on. The training data has the following header:</p>
<blockquote>
<p><code>@relation langid_train
<br/>
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text String
<br/>
<br/>
@data
<br/>
EN,'I didn\'t know it was police housing,\" officers quoted Tsuchida as saying.'
<br/>
EN,'You would be a great client for Southern Indiana Homeownership\'s credit counseling but you are saying to yourself \"Oh, we can pay that off.\"'
<br/>
EN,'He believes the 21st century will be the \"century of biology\" just as the 20th century was the century of IT.'
<br/>
../..</code></p>
</blockquote>
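<p>By the way, the quote escaping I did by hand in the spreadsheet can also be scripted. Here is a tiny helper of my own (not part of WEKA, just an illustration) that turns a raw sentence into an ARFF string value like the ones above:</p>

```java
public class ArffEscaper {

    // Wraps a text in single quotes, escaping the ARFF string delimiters.
    public static String toArffString(String text) {
        return "'" + text.replace("\\", "\\\\")   // escape backslashes first
                         .replace("'", "\\'")
                         .replace("\"", "\\\"") + "'";
    }

    public static void main(String[] args) {
        String sentence =
            "I didn't know it was police housing,\" officers quoted Tsuchida as saying.";
        // Produces an instance line like the first one in the header above.
        System.out.println("EN," + toArffString(sentence));
    }
}
```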
<p>The ARFF files for training and testing are available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GITHub repository for the demo</a> as well. You can open the training file (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/langid.collection.train.arff" target="_blank">langid.collection.train.arff</a></code>) in the WEKA Explorer, and setting the class to be the first attribute, you should be getting something like the following figure:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/explorer.training.langid.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 336px" height="336" width="450"/></p>
<p>So we have a training collection with 9K instances per class (language), and a test collection with 1K instances per class.</p>
<p><strong>Data Transformation</strong></p>
<p>As <a href="http://jmgomezhidalgo.blogspot.com/search/label/WEKA" target="_blank">in previous posts about text classification with WEKA</a>, we need to transform the text strings into term vector to enable learning. This is done by applying the <code>StringToWordVector</code> filter, that is the most remarkable text mining function in WEKA. In previous posts, I have applied this filter with default options, but it offers a wide range of possibilities that can be seen when opening it in the WEKA Explorer. If you click on the <em>Filter</em> button and browse the tree to "<em>weka > filters > unsupervised > attribute > StringToWordVector</em>", and then click on the filter name, you get the next window:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/explorer.stringtowordvector.png" style="WIDTH: 440px; DISPLAY: inline; HEIGHT: 623px" height="623" width="440"/></p>
<p>Those are a lot of options, aren't they? So let us focus on the minimum set of options needed to be productive with this Language Identification example. They are:</p>
<ul>
<li><code>doNotOperateOnPerClassBasis</code> - we set this option to <code>True</code> in order to make the filter collect word tokens over all classes as a whole. This should be the standard setting in nearly all text classification problems.</li>
<li><code>lowerCaseTokens</code> - we set this option to <code>True</code> because we are interested on the words independently of using upper or lower case. In other problems, like e.g. when processing Social Networks text, keeping the capitalization may be critical for getting a good accuracy.</li>
<li><code>tokenizer</code> - WEKA provides several tokenizers, intended to break the original texts into tokens according to a number of rules. The simplest tokenizer is the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/WordTokenizer.html" target="_blank">weka.core.tokenizers.WordTokenizer</a></code>, which splits the string into tokens using a list of separators that can be set by clicking on the tokenizer name. It is a good idea to take a look at the texts we have before setting up the list of separating characters. In our case, we have several languages, and the default punctuation symbols may not fit our problem -- we need to add opening question and exclamation marks, along with other symbols coming from the HTML format like &, and so on. So our delimiters string will be " \r\n\t.,;:\"\'()?!-¿¡+*&#$%\\/=<>[]_`@" (backslash is escaped).</li>
<li><code>wordsToKeep</code> - we set this option to keep as many words as we can, so as to include the full vocabulary of the dataset. An appropriate value may be one million.</li>
</ul>
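<p>As a side note, the effect of the <code>WordTokenizer</code> with this delimiter set (plus lowercasing) can be sketched in plain Java. This is an illustrative approximation, not WEKA's actual implementation:</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Sketch of WordTokenizer behaviour: split on every character of the
// delimiter set used above, then lowercase (the lowerCaseTokens option).
public class TokenizerSketch {
    static final String DELIMITERS = " \r\n\t.,;:\"'()?!-¿¡+*&#$%\\/=<>[]_`@";

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Opening marks (¿ ¡) and HTML leftovers (& < >) act as separators:
        System.out.println(tokenize("Hello, world! ¿Qué tal?"));
    }
}
```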
<p>We leave the rest of the options at their default values. Most notably, we are using neither <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">sophisticated weighting schemes (like TF or TF.IDF)</a>, nor <a href="http://en.wikipedia.org/wiki/Stop_words" target="_blank">stop words</a> nor <a href="http://en.wikipedia.org/wiki/Stemming" target="_blank">stemming</a>. These options are common in <a href="http://en.wikipedia.org/wiki/Information_retrieval" target="_blank">Information Retrieval</a> systems like <a href="http://lucene.apache.org/solr/" target="_blank">Apache Lucene/SOLR</a>, and they often lead to nice accuracy improvements in search systems.</p>
<p>We need to have the same vocabulary both in the training and the testing datasets, so we can apply this filter in the command line by using the batch (<code>-b</code>) option:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_`@\"" -W 10000000 -b -i langid.collection.train.arff -o langid.collection.train.vector.arff -r langid.collection.test.arff -s langid.collection.test.vector.arff</code></p>
</blockquote>
<p>The options <code>-O</code>, <code>-L</code>, <code>-tokenizer</code> and <code>-W</code> correspond to the options above. The delimiter string is escaped because it is included within the tokenizer specification. The resulting files are also <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">in the GitHub repository for the LangID example</a>, along with the script <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/stwv.sh" target="_blank">stwv.sh</a></code> (String To Word Vector), which includes this command.</p>
<p><strong>Data Analysis and Improvement</strong></p>
<p>If we take a quick look at the terms or tokens we have got, e.g.:</p>
<blockquote>
<p><code>@attribute archival numeric
<br/>
@attribute archivarlos numeric
<br/>
@attribute archivas numeric
<br/>
@attribute archives numeric
<br/>
@attribute archiving numeric
<br/>
@attribute archivo numeric
<br/>
@attribute archivos numeric</code></p>
</blockquote>
<p>We can imagine that most of them will be useless for Language Identification. This motivates making a more precise analysis of the tokens by using some kind of quality metric, like <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a>. In fact, I am applying the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a></code> filter as I did in my posts on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">selecting attributes by chaining filters</a> and on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">command line functions for text mining</a>. So I issue the following command:</p>
<blockquote>
<p><code>$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -b -i langid.collection.train.vector.arff -o langid.collection.train.vector.ig0.arff -r langid.collection.test.vector.arff -s langid.collection.test.vector.ig0.arff</code></p>
</blockquote>
<p>We apply the filter in batch mode as well, in order to get the same attributes in both the training and the test collections. We also set the first attribute as the class (with the <code>-c</code> option), and set the threshold for keeping attributes to <code>0.0</code> in the <code><a href="http://weka.sourceforge.net/doc.dev/weka/attributeSelection/Ranker.html" target="_blank">weka.attributeSelection.Ranker</a></code> search method. This means that we will keep only those attributes with an Information Gain score above 0, sorted according to their score as well. This command is included in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/asig.sh" target="_blank">asig.sh</a></code> (Attribute Selection by Information Gain) script of <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">the GitHub repository for the LangID example</a>, along with the data files.</p>
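<p>For intuition, the Information Gain score that the ranker uses to order the word attributes can be sketched as follows. The counts in <code>main</code> are hypothetical, for illustration only; this is not WEKA's actual code:</p>

```java
// Information Gain of a binary word attribute W for the class C:
// IG(C, W) = H(C) - H(C|W), computed from co-occurrence counts.
public class InfoGainSketch {
    // Entropy (in bits) of a count distribution.
    static double entropy(double... counts) {
        double total = 0, h = 0;
        for (double c : counts) total += c;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // withWord[i] = docs of class i containing the word,
    // withoutWord[i] = docs of class i not containing it.
    static double infoGain(double[] withWord, double[] withoutWord) {
        int k = withWord.length;
        double[] classTotals = new double[k];
        double nWith = 0, nWithout = 0, n = 0;
        for (int i = 0; i < k; i++) {
            classTotals[i] = withWord[i] + withoutWord[i];
            nWith += withWord[i];
            nWithout += withoutWord[i];
            n += classTotals[i];
        }
        double hCond = (nWith / n) * entropy(withWord)
                     + (nWithout / n) * entropy(withoutWord);
        return entropy(classTotals) - hCond;
    }

    public static void main(String[] args) {
        // Hypothetical counts for the word "the" over classes EN, FR, SP:
        double[] withWord = {8000, 300, 200};
        double[] withoutWord = {1000, 8700, 8800};
        System.out.printf("IG = %.4f%n", infoGain(withWord, withoutWord));
        // A word spread evenly across the classes scores near 0 and is
        // dropped by the Ranker threshold -T 0.0.
    }
}
```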
<p>From the original 65,429 word attributes we got in the previous step, we have kept only 16,840 (25.73% of the original ones). We can be more aggressive by setting the threshold to a higher value (e.g. 0.2).</p>
<p>The first twenty attributes are the following:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/forty.top.ig.terms.langid.png" style="WIDTH: 300px; DISPLAY: inline; HEIGHT: 163px" height="163" width="300"/></p>
<p>As we can see, all of them are very frequent words (in each language) that would be present in the corresponding stop word lists. In consequence, our "pure" data mining approach is quite close to the traditional one based on stop words.</p>
<p>It makes sense to learn a J48 tree to get an idea of the complexity of the term relations. The <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">weka.classifiers.trees.J48</a></code> algorithm implements <a href="http://en.wikipedia.org/wiki/C4.5_algorithm" target="_blank">Quinlan's popular C4.5 learner</a>, and as it outputs a decision tree, it can give us valuable insights into the term relations, e.g. which co-occurring terms are more predictive. We train that classifier on our new training dataset with the following command:</p>
<blockquote>
<p><code>$> java weka.classifiers.trees.J48 -t langid.collection.train.vector.ig0.arff -no-cv</code></p>
</blockquote>
<p>We get a rather complex decision tree, with 273 nodes and 137 leaves. All the tests in the tree look like "<code>word > 0</code>" or "<code>word <= 0</code>". This means that the algorithm induces that only the occurrence of a word matters, not its weight. The root of the tree is, unsurprisingly, a test on "<code>the</code>", and the smaller side of the tree (its right-hand side, with "<code>the > 0</code>") is the following one:</p>
<blockquote>
<p><code>the > 0
<br/>
| de <= 0: EN (5945.0/8.0)
<br/>
| de > 0
<br/>
| | el <= 0
<br/>
| | | and <= 0
<br/>
| | | | for <= 0
<br/>
| | | | | to <= 0: FR (24.0/3.0)
<br/>
| | | | | to > 0: EN (2.0)
<br/>
| | | | for > 0: EN (3.0)
<br/>
| | | and > 0: EN (7.0)
<br/>
| | el > 0: SP (3.0)</code></p>
</blockquote>
<p>This means, for instance, that the word "<code>the</code>" is an excellent predictive feature: if it occurs in a text and the word "<code>de</code>" (frequent in both French and Spanish) does not, then that text is most likely written in English (with an estimated likelihood of 99.86% on the training collection). The overall accuracy of J48 on the training collection is 98.3963%.</p>
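<p>As a quick check of the arithmetic: the J48 leaf annotation <code>(5945.0/8.0)</code> means that 5945 training instances reached that leaf and 8 of them were misclassified, so the quoted likelihood is simply 1 - 8/5945:</p>

```java
// Confidence of a J48 leaf annotated "(reached/misclassified)".
public class LeafConfidence {
    static double confidence(double reached, double misclassified) {
        return 1.0 - misclassified / reached;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%%%n", 100 * confidence(5945.0, 8.0)); // prints 99.8654%
    }
}
```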
<p><strong>Training and then Evaluating on the Test Collection</strong></p>
<p>Before starting to train and evaluate, we have to decide which algorithms are most appropriate for the problem. In my experience with text learning, it is wise to test at least the following ones:</p>
<ul>
<li>The <em>Naive Bayes</em> probabilistic approach, which is quick and gives good results in text learning on average problems. In WEKA, it is implemented by the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">weka.classifiers.bayes.NaiveBayes</a></code> class.</li>
<li>The <em>rule learner PART</em>, which induces a list of rules by learning partial decision trees. It is a symbolic algorithm that produces rules which can be very valuable as they are easy to understand. This algorithm is implemented by the <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a></code> class.</li>
<li>Of course, the J48 algorithm because of its visualization capabilities.</li>
<li>The lazy learner <em>k-Nearest Neighbors (kNN)</em>, which occasionally gives excellent results in text classification problems. The WEKA class that implements this algorithm is <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">weka.classifiers.lazy.IBk</a></code>.</li>
<li>The <em>Support Vector Machines</em> algorithm, which is probably the most effective one on text classification problems because of its ability to focus on the most relevant examples in order to separate the classes. It is a very good learning algorithm for sparse datasets, and it is implemented in WEKA by the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a></code> class or by the <a href="http://weka.wikispaces.com/LibSVM" target="_blank">LibSVM</a> library. I choose the Sequential Minimal Optimization (SMO) implementation embedded in WEKA.</li>
</ul>
<p>Also, when Naive Bayes or J48 are effective, I usually get from small to even big accuracy improvements by using <a href="http://en.wikipedia.org/wiki/Boosting_(machine_learning)" target="_blank">boosting</a>, implemented by the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/AdaBoostM1.html" target="_blank">weka.classifiers.meta.AdaBoostM1</a></code> class in WEKA. Boosting takes a weak classifier as input, and builds a classifier committee by iteratively training that weak learner on the dataset subsets on which the previous learners are not effective. In this case, I will not apply boosting, because the weak learners already reach rather high levels of accuracy, and it is most likely that boosting would achieve only a marginal improvement (if any) at the cost of a much longer training time.</p>
<p>I have written a script named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test.sh" target="_blank">test.sh</a></code> to execute all these algorithms with default options, available at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID demo</a>. The results obtained by the algorithms are included in the repository as well, and summarized in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/results.test.langid.png" style="WIDTH: 230px; DISPLAY: inline; HEIGHT: 136px" height="136" width="230"/></p>
<p>The different versions of the lazy kNN algorithm tested here appear to be very weak. We could probably improve their performance by changing the way the distance between examples is computed (from the Euclidean distance to one more appropriate for text, such as the cosine similarity), but their performance is so low that they would still not score better than the rest of the algorithms.</p>
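<p>For reference, the cosine similarity just mentioned can be sketched over sparse term vectors (maps from attribute index to weight). This is an illustrative sketch, not WEKA code:</p>

```java
import java.util.HashMap;
import java.util.Map;

// Cosine similarity between two sparse term vectors:
// dot(a, b) / (|a| * |b|), iterating only over non-zero entries.
public class CosineSimilarity {
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Two binary-weighted documents sharing 2 of their 3 terms:
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(58, 1.0); doc1.put(94, 1.0); doc1.put(313, 1.0);
        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(58, 1.0); doc2.put(94, 1.0); doc2.put(2644, 1.0);
        System.out.println(cosine(doc1, doc2));
    }
}
```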
<p>The top algorithms in this test are <em>Naive Bayes</em> and <em>Support Vector Machines</em>. There is a trade-off between the two: SVMs are more effective (in fact, they are very effective) but take quite a lot of time to train, while Naive Bayes is less effective but quicker to train. In terms of classification time, both algorithms are linear in the number of attributes.</p>
<p>Even though we have used a large number of attributes, there are some examples with rather weak representations. For instance, let us check the following instances or texts:</p>
<blockquote>
<p><code>{58 1,94 1,313 1,1663 1}
<br/>
{119 1,361 1,2644 1,16840 FR}
<br/>
{2 1,16840 SP}</code></p>
</blockquote>
<p>The first and second examples have only 3 occurring words each (the class value of the first text is <code>EN</code>, since in the sparse ARFF format used by WEKA in this example an omitted value defaults to the first one), and the third example has only one word ("<code>el</code>"). In the first two examples, the attribute numbers (58 and above) indicate that these attributes are not among the most informative ones, while in the third example we find a very informative word. If we apply a more aggressive selection using Information Gain, many examples will be left with null representations, thus falling to the most likely class. As the classes have a balanced distribution, the language chosen in that case will be <code>EN</code>, which is the default value for the class attribute.</p>
<p><strong>Learning the Best Classifier and Using it Programmatically</strong></p>
<p>So after our experiments, we know that the best classifier in our tests is the SVM. It is time to train it and store the classifier into a file for further programmatic use. For this purpose, I have written a script that trains the classifier and stores the model into a file, using the following command-line call:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -t langid.collection.train.arff -c first -no-cv -d smo.model.dat -v -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\r\\\\\\\n\\\\\\\t.,;:\\\\\\\\\\\\\\\"'()?!-¿¡+*&#$%/=<>[]_`@\\\\\\\"\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO</code></p>
</blockquote>
<p>This call is rather painful because of the nested, and nested, and nested, and nested quotes. So I have pretty-printed it in the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/learn.sh" target="_blank">learn.sh</a></code> script at the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository for the LangID example</a>. For dealing with nested quotes, follow the advice in <a href="http://en.wikipedia.org/wiki/Nested_quotation" target="_blank">the Wikipedia article about nested quotation</a>.</p>
<p>With this call, we have stored a model in the file <code>smo.model.dat</code>, which chains the <code>StringToWordVector</code> filter, the <code>AttributeSelection</code> filter, and an <code>SMO</code> classifier by using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">weka.classifiers.meta.FilteredClassifier</a></code> and the <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">weka.filters.MultiFilter</a></code> classes, as I have explained in the post on <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">Command Line Functions for Text Mining in WEKA</a>.</p>
<p>One good point of WEKA is that we can learn a model in the command line and use it in a program. I have modified the <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java" target="_blank">MyFilteredClassifier.java</a></code> program I used in my post describing <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/a-simple-text-classifier-in-java-with.html" target="_blank">A Simple Text Classifier in Java with WEKA</a>, and I have committed it to the <a href="https://github.com/jmgomezh/tmweka/tree/master/LangID" target="_blank">GitHub repository</a> with the name <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/LanguageIdentifier.java" target="_blank">LanguageIdentifier.java</a></code>. I have created three sample test files as well: <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_en.txt" target="_blank">test_en.txt</a></code>, <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_fr.txt" target="_blank">test_fr.txt</a></code> and <code><a href="https://github.com/jmgomezh/tmweka/blob/master/LangID/test_sp.txt" target="_blank">test_sp.txt</a></code>. The operation of the program is as follows:</p>
<blockquote>
<p><code>$> javac LanguageIdentifier.java
<br/>
<br/>
$> java LanguageIdentifier
<br/>
Usage: java LanguageIdentifier <fileData> <fileModel>
<br/>
$> java LanguageIdentifier test_en.txt smo.model.dat
<br/>
===== Loaded text data: test_en.txt =====
<br/>
This is a sample test for the language identifier demo.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' This is a sample test for the language identifier demo.'
<br/>
===== Classified instance =====
<br/>
Class predicted: EN
<br/>
<br/>
$> java LanguageIdentifier test_fr.txt smo.model.dat
<br/>
===== Loaded text data: test_fr.txt =====
<br/>
Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.'
<br/>
===== Classified instance =====
<br/>
Class predicted: FR
<br/>
<br/>
$> java LanguageIdentifier test_sp.txt smo.model.dat
<br/>
===== Loaded text data: test_sp.txt =====
<br/>
Esto es un texto de prueba para la demostración del identificador de idioma.
<br/>
===== Loaded model: smo.model.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
@attribute language_class {EN,FR,SP}
<br/>
@attribute text string
<br/>
@data
<br/>
?,' Esto es un texto de prueba para la demostración del identificador de idioma.'
<br/>
===== Classified instance =====
<br/>
Class predicted: SP</code></p>
</blockquote>
<p>So the program is correct on the three examples. Remember that you have to learn the model before using the program. As a side note, since the program only uses a <code>FilteredClassifier</code> object, you can change the script to accommodate a different algorithm. For instance, if you just replace the text "<code>weka.classifiers.functions.SMO</code>" with "<code>weka.classifiers.bayes.NaiveBayes</code>" in the <code>learn.sh</code> script, the program will work the same way -- but with a different model.</p>
<p><strong>Concluding Remarks</strong></p>
<p>While relatively simple, the Language Identification problem helps to identify the essential tasks we have to perform when building text classifiers with WEKA. It is a complete example in the sense that we have not only collected the dataset and learnt from it, but we have also dug a bit into the most suitable representation by playing with attribute selection and a tentative classifier to visualize the data. It also demonstrates some basic configurations of the <code>StringToWordVector</code> filter, which is the most remarkable tool in WEKA for text mining.</p>
<p>If you have had the time to read all of this post, and even tried the program: thank you! I hope it has been a valuable time investment. I am tempted to suggest that you modify the dataset to include more languages, as the problem I have addressed here is relatively simple -- only three, quite different languages.</p>
<p>Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com7tag:blogger.com,1999:blog-36589303.post-16596372708858056612013-05-02T01:41:00.001+02:002013-05-02T09:42:18.440+02:00Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers<p>There are several ways of evaluating a (text) classifier: <a href="http://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">cross validation</a>, splitting your dataset into train and test subsets, or even evaluating the classifier on the training set itself (not recommended). I will not discuss the merits of each method; instead, I will focus on a train/test split evaluation.</p>
<p>When you start to work with your train and test text datasets, you have got two labelled text collections like e.g. those I make available at <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub project</a>: <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.train.arff" target="_blank"><code>smsspam.small.train.arff</code></a> and <a href="https://github.com/jmgomezh/tmweka/blob/master/InputMappedClassifier/smsspam.small.test.arff" target="_blank"><code>smsspam.small.test.arff</code></a>. In this case, we have two collections that are a 50% split of my original simple collection <a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank"><code>smsspam.small.arff</code></a>, which in turn is a subset of the original <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>. The files are formatted according to the <a href="http://weka.sourceforge.net/" target="_blank">WEKA</a> <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF</a> format:</p>
<blockquote>
<p><code>@relation sms_test
<br/>
<br/>
@attribute spamclass {spam,ham}
<br/>
@attribute text String
<br/>
<br/>
@data
<br/>
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
<br/>
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
<br/>
...</code></p>
</blockquote>
<p>That is, one text instance per line, with the first attribute being the nominal class spam/ham and the second attribute being the text itself.</p>
<p>In text classification, you have to transform this original representation into a vector of terms/words/stems/etc. in order to allow the classifier to learn expressions like: "if the word "win" occurs in a text, then classify it as spam". In other words, you have to represent your texts as feature vectors, where the features are words and the values are e.g. binary weights, <a href="http://en.wikipedia.org/wiki/Tf–idf" target="_blank">TF weights, or TF.IDF weights</a>. In fact, WEKA provides the handy <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank"><code>StringToWordVector</code></a> filter for this purpose (Thanks, WEKA!).</p>
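<p>For intuition, the three weighting schemes just mentioned (binary, TF, TF.IDF) can be sketched as follows. Note that this is one common TF.IDF variant for illustration; the exact transforms applied by <code>StringToWordVector</code> may differ:</p>

```java
// Term weighting sketches for a term t in a document d:
// binary presence, raw term frequency, and a common TF.IDF variant.
public class TermWeights {
    static double binary(int tf) { return tf > 0 ? 1.0 : 0.0; }

    static double tf(int tf) { return tf; }

    // n = total number of documents, df = documents containing the term.
    // Rare terms (small df) get boosted; ubiquitous terms go to 0.
    static double tfIdf(int tf, int n, int df) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document, present in 10 of 1000 docs:
        System.out.println(binary(3));
        System.out.println(tf(3));
        System.out.println(tfIdf(3, 1000, 10));
    }
}
```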
<p>However, it is most likely that the vocabularies used in your training set and in your test set are not identical. For instance, if you directly apply the <code>StringToWordVector</code> filter to the previous files, you get slightly different results, summarized in the following table:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/table.train.test.attributes.png" style="DISPLAY: inline" height="185" width="273"/></p>
<p>Obviously, to enable learning you have to ensure that the representation of both datasets is the same. For instance, imagine that the root of the decision tree you have learnt on your training collection poses a test on an attribute that does not exist in your test collection -- then what happens?</p>
<p>Fortunately, WEKA provides at least three ways of getting the same vocabulary in your train and test subcollections. Here they are:</p>
<ol>
<li>Using a <strong>batch filter</strong> that takes both the training and the test collections at the same time, using the first to derive the attributes and representing the second with those attributes.</li>
<li>Using a <strong><code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></code></strong> (which I have discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in previous posts</a>), which feeds both the filter and the classifier into a single classifier that takes the original class/text representation as input for both the training and the test sets.</li>
<li>A more recent method: separately computing the representations and using an <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/misc/InputMappedClassifier.html" target="_blank"><strong>InputMappedClassifier</strong></a></code>, which acts as a wrapper around an underlying classifier and tries to map attributes from the training collection to the corresponding ones in the test subset.</li>
</ol>
<p>The first method is quite simple, and it just makes use of the <code>-b</code> option of the WEKA filters. The corresponding command line calls are the following:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -b -i smsspam.small.train.arff -o smsspam.small.train.vector.arff -r smsspam.small.test.arff -s smsspam.small.test.vector.arff
<br/>
$> java weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
1 15 | a = spam
<br/>
0 84 | b = ham</code></p>
</blockquote>
<p>The second method, conveniently discussed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in my previous post</a>, can be applied with the following call:</p>
<blockquote>
<p><code>$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.train.arff -T smsspam.small.test.arff -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.lazy.IBk -i -c first
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
1 15 | a = spam
<br/>
0 84 | b = ham</code></p>
</blockquote>
<p>As shown in the previous results, both methods achieve the same figures. In this case, I have opted for using <code>StringToWordVector</code> without parameters (default tokenization, term weights, no stemming, etc.) with the relatively weak classifier <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html" target="_blank">IBk</a></code>, which implements a k-Nearest-Neighbor learner: instead of building a model from the training collection, it searches for the training instance closest to the test instance (<code>k</code> is 1 by default) and assigns its class to the test instance.</p>
<p>However, the third method achieves different results, as the mapping involves some attributes from the training collection disappearing, and new attributes in the test collection being ignored. It is called the following way:</p>
<blockquote>
<p><code>$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.train.arff -o smsspam.small.train.vector.arff
<br/>
$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.test.arff -o smsspam.small.test.vector.arff
<br/>
$> java weka.classifiers.misc.InputMappedClassifier -W weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
<br/>
Attribute mappings:
<br/>
Model attributes Incoming attributes
<br/>
------------------------------ ----------------
<br/>
(nominal) spamclass --> 1 (nominal) spamclass
<br/>
(numeric) #&gt --> 2 (numeric) #&gt
<br/>
(numeric) $1 --> - missing (no match)
<br/>
(numeric) &amp --> - missing (no match)
<br/>
(numeric) &lt --> 6 (numeric) &lt
<br/>
(numeric) *9 --> 7 (numeric) *9
<br/>
(numeric) + --> - missing (no match)
<br/>
(numeric) - --> 8 (numeric) -
<br/>
...
<br/>
=== Confusion Matrix ===
<br/>
a b <-- classified as
<br/>
2 14 | a = spam
<br/>
1 83 | b = ham</code></p>
</blockquote>
<p style="MARGIN-RIGHT: 0px">In fact, this time we catch a bit more spam (2 hits instead of 1), at the cost of a false positive, although the overall accuracy is exactly the same: 85%. You can see how some of the attributes are missing (they do not occur in the test dataset), like "<code>$1</code>", "<code>+</code>", etc. This certainly affects the performance of the classifier, so beware.</p>
<p>With these options, my recommendation is to use the first method, as it allows you to fully examine the representation of the datasets (term weight vectors) and it decouples filtering from training, which may be convenient in terms of efficiency.</p>
<p>Before ending this post, I have to thank Tiago Pasqualini Silva, <a href="http://www.dt.fee.unicamp.br/~tiago/index.html" target="_blank">Tiago Almeida</a> and <a href="http://paginaspersonales.deusto.es/isantos/en/about.shtml" target="_blank">Igor Santos</a> for our experiments with the SMS Spam Collection, and to Tiago Pasqualini in particular because he showed me the <code>InputMappedClassifier</code>.</p>
<p>And last but not least, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topics!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com5tag:blogger.com,1999:blog-36589303.post-72289290153873052312013-04-26T00:46:00.001+02:002013-05-24T12:42:22.742+02:00URL Text Classification with WEKA, Part 1: Data Analysis<p>I have recently came across a website named <a href="http://squidblacklist.org/" target="_blank">SquidBlackList.org</a>, which features a number or URL lists for safe web browsing using the open source proxy <a href="http://www.squid-cache.org/" target="_blank">Squid</a>. In particular, it features a <a href="http://squidblacklist.org/downloads.html" target="_blank">quite big porn domains list</a>, so I wondered: <strong>Is it possible to make a text classification system with <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> to detect porn domains using the text in the URLs?</strong></p>
<p>Just to note that the SquidBlackList porn list (and most of the rest of the lists they provide) is licensed under a Creative Commons Attribution 3.0 Unported License: <span>Blacklists</span> (<a href="http://www.squidblacklist.org" rel="cc:attributionURL">Squidblacklist.org</a>) / <a href="http://creativecommons.org/licenses/by/3.0/" rel="license">CC BY 3.0</a></p>
<p><big><strong>The Filtering Problem</strong></big></p>
<p>Most <a href="http://en.wikipedia.org/wiki/Content-control_software" target="_blank">web filtering systems</a> work by using a database of URLs manually classified into a list of categories, which are used to define filtering profiles (e.g. block <em>porn</em> but allow <em>press</em>). The URL database must be manually maintained, and it has to be quite comprehensive with respect to user browsing behaviour. As (aggregated) web browsing follows a <a href="http://en.wikipedia.org/wiki/Zipf's_law" target="_blank">Zipfian distribution</a> (that is, relatively few URLs accumulate most of the traffic), you can provide a rather effective service by ensuring that your URL database covers the most popular URLs. URL-based filtering is rather efficient (if your database is well implemented), and it can easily cover around 95% of the web traffic (in terms of #requests, not in terms of #URLs).</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/zipf.distribution.url.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 342px" height="342" width="450"/></p>
<p>However, covering the remaining 5% requires performing some kind of analysis. My target here is dynamically classifying that 5% of web requests (which may account for millions of URLs or even just domains) into two classes: <em>notporn</em> and <em>porn</em>. This way, we can cover 100% of the traffic, and it is likely that our classification mistakes (which can occur in the URL database as well) are concentrated in that small 5% -- so our filter can be 98% effective or more.</p>
<p>Why analyze the URL text? As a matter of <strong>efficiency</strong> -- you do not have to go out to the Internet and fetch the actual <em>Web</em> content in order to analyze it, so all the processing is local to the proxy, and you avoid performing unnecessary Web requests from the proxy itself.</p>
<p><big><strong>Collecting the Dataset</strong></big></p>
<p>So we start with an 880k porn domain list, but although it is possible to learn only from positive examples, we may expect better effectiveness if we collect negative examples (not-porn domains) as well. A handy resource is the <a href="http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" target="_blank">Top 1M Sites</a> list by <a href="http://www.alexa.com/" target="_blank">Alexa</a>, a Web research company that provides this ranked list on a daily basis. Having 1M negative examples and 880k positive examples makes for a good class balance and a well-populated dataset -- nice for learning, especially when its instances are relatively short text sequences (e.g. <code>google.com</code> vs. <code>porn.com</code>).</p>
<p>First we have to make both lists comparable. The format of the Alexa list is <code><rank>,<domain></code>, while the format of the Squid black list is <code><dot><domain></code> (in order to match the Squid URL list format). A couple of <code>cut</code> and <code>sed</code> commands will do the trick.</p>
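<p>As a sketch, those commands could look like the following (the input file names and sample lines are invented for illustration; the real files are the full downloads mentioned above):</p>

```shell
# Toy versions of the two downloads (invented lines; the real files are much bigger).
printf '1,google.com\n2,youtube.com\n' > alexa-top1m.csv   # Alexa: <rank>,<domain>
printf '.pornhub.com\n.youporn.com\n' > squid-porn.acl     # Squid: <dot><domain>

# Keep only the domain field from the Alexa list.
cut -d ',' -f 2 alexa-top1m.csv > alexa.csv

# Strip the leading dot from the Squid blacklist entries.
sed 's/^\.//' squid-porn.acl > porn.csv
```

<p>After this step, both files contain one bare domain per line and can be compared directly.</p>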
<p>Then we can just add the class and mix the lists.</p>
<p><big><strong>Cleaning the Dataset, first step</strong></big></p>
<p>But... <em>Hey, Internet is for porn!</em> -- we should expect that some of the URLs in the Alexa ranking are pornographic. In fact, a simple search demonstrates it:</p>
<blockquote>
<p><code><code>$ grep porn alexa.csv | more
<br/>
pornhub.com
<br/>
youporn.com
<br/>
...
<br/>
$ grep porn alexa.csv | wc -l
<br/>
5719</code></code></p>
</blockquote>
<p>We can just subtract the porn list from the Alexa list with a handy grep:</p>
<blockquote>
<p><code><code>grep -f porn.csv -v alexa.csv > alexaclean.csv</code></code></p>
</blockquote>
<p>But it takes a loooooong time, so I prefer to sort the Alexa list, convert it to Unix format (as the original one has DOS line endings), and use <code>comm</code>:</p>
<blockquote>
<p><code><code>$ sort alexa.csv > alexasorted.csv
<br/>
$ fromdos alexasorted.csv
<br/>
$ comm -23 alexasorted.csv porn.csv > alexaclean.csv
<br/>
$ wc -l alexaclean.csv
<br/>
975088 alexaclean.csv</code></code></p>
</blockquote>
<p>So far so good: only about 25k URLs were pornographic... Well, let's check:</p>
<blockquote>
<p><code><code>$ grep porn alexaclean.csv | head
<br/>
001porno.com
<br/>
0dayporn.org
<br/>
1000porno.net
<br/>
...</code></code></p>
</blockquote>
<p>So we still have some porn in there.</p>
<p><big><strong>Cleaning the Dataset, second step</strong></big></p>
<p>Cleaning the Alexa list of porn is a bit more complex. How can we find those popular porn sites if they are not even in a list as comprehensive as the Squidblacklist one? Another resource comes to the rescue: the <a href="http://www.pornmd.com/" target="_blank">sex-related search engine PornMD</a>. This engine has recently published a list of popular porn searches in the form of a dynamic infographic named <a href="http://www.pornmd.com/sex-search" target="_blank">Global Internet Porn Habits</a>:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/infography.pornmd.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 319px" height="319" width="450"/></p>
<p>So, if you collect a list of the top searches in five of the biggest countries, you get:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/top.searches.porn.png" style="DISPLAY: inline" height="178" width="404"/></p>
<p>Removing duplicate words from the list, adding "porn", "sex" and "xxx" (as a rule of thumb), and computing the number of domains in which they occur in the (cleaned) Alexa and the Squidblacklist lists, we get:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/top.searches.distribution.png" style="DISPLAY: inline" height="338" width="227"/></p>
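<p>As an illustration of how per-keyword counts like the ones in the table above can be computed, here is a minimal sketch with toy files (the domains below are invented; a real run would use the full lists and the full keyword set):</p>

```shell
# Toy domain lists, one domain per line.
printf 'teenagemutant.com\nmilfordhotels.com\nnews.com\n' > alexaclean.csv
printf 'teensex.com\nmilfporn.com\nfreeporn.com\n' > porn.csv

# For each candidate keyword, count the domains it occurs in on each list.
for w in teen milf porn; do
  printf '%s %s %s\n' "$w" "$(grep -c "$w" porn.csv)" "$(grep -c "$w" alexaclean.csv)"
done
```

<p>Note that even this toy example shows the ambiguity problem: "teen" and "milf" hit innocent domains too, which is exactly why a ratio between the two counts is needed.</p>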
<p>Looking at the list, a relatively safe ratio between the number of occurrences in Squid's versus Alexa's (clean) list is 9 -- this way, we keep the most obvious words and remove the most ambiguous ones (although there are some borderline examples, such as "asian"). We can see the effects:</p>
<blockquote>
<p><code><code>$ grep "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv | wc -l
<br/>
17389
<br/>
$ grep "porn\|sex\|xxx" alexaclean.csv | wc -l
<br/>
12342
<br/>
$ grep -v "amateur\|anal\|asian\|creampie\|hentai\|lesbian\|mature\|milf\|squirt\|teen\|porn\|sex\|xxx" alexaclean.csv > alexacleanfinal.csv
<br/>
$ wc -l alexacleanfinal.csv
<br/>
964735 alexacleanfinal.csv</code></code></p>
</blockquote>
<p>You can see that just "porn", "sex" and "xxx" account for 70.97% of the matching domains, so there is some <strong>domain knowledge</strong> in the process. I must note that I could have used another, much more extensive list of porn-related searches, like the one featured on the <a href="http://www.pornmd.com/most-popular" target="_blank">PornMD Most Popular page</a>.</p>
<p><big><strong>Additional Analysis</strong></big></p>
<p>To get a feeling of how the previous porn-related keywords are distributed across the original Alexa ranking, I have computed the number of lines (domains) in which they occur per 100k-line interval, to get the following chart:</p>
<p style="TEXT-ALIGN: center"><img src="http://www.esp.uem.es/jmgomez/blogimg/distribution.keywords.intervals.png" style="WIDTH: 450px; DISPLAY: inline; HEIGHT: 287px" height="287" width="450"/></p>
<p>Where <code>#query1</code> represents the number of occurrences of "porn\|sex\|xxx" and <code>#query2</code> represents the full list of keywords. The growth is nearly linear, with an average of 1234.2 URLs per interval in <code>#query1</code>, and 1738.9 URLs per interval in <code>#query2</code>. The curves are smooth, and there are more domains in the first intervals (e.g. 1482 hits in the first 100k Alexa URLs for <code>#query1</code>) than in the last ones (e.g. 1077 hits in the last 100k Alexa URLs for <code>#query1</code>).</p>
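<p>One possible way of computing those per-interval counts is to split the ranking into fixed-size chunks and count hits in each one. Here is a toy sketch (with a 2-line interval instead of 100k, and invented domains):</p>

```shell
# Toy ranking; the real computation splits the Alexa list into 100000-line chunks.
printf 'a.com\nporn.com\nb.com\nsexsite.com\nc.com\nd.com\n' > ranking.csv
split -l 2 ranking.csv interval.

# Count keyword hits in each interval, in rank order.
# grep -c exits with status 1 when the count is 0, hence the || true guard.
for f in interval.*; do
  grep -c 'porn\|sex\|xxx' "$f" || true
done
```

<p>Plotting the cumulative sums of those counts gives a chart like the one above.</p>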
<p>There are other dataset statistics that may provide better insights regarding the classification problem, or in other words, that may be more informative or predictive in terms of classification accuracy. For instance:</p>
<ul>
<li>What is the length of an average domain name in each category?</li>
<li>How many dots and/or dashes do domains have on average per category?</li>
<li>What is the distribution of different TLDs (<a href="http://en.wikipedia.org/wiki/Top-level_domain" target="_blank">Top Level Domains</a>) across both categories?</li>
</ul>
<p>Can you imagine any other interesting statistics?</p>
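<p>As a hint, statistics like the ones above can be sketched with a few <code>awk</code> one-liners over each category's list (the sample domains below are invented; run the same commands on the real files to get the actual figures):</p>

```shell
# Toy list of domains for one category, one per line.
printf '0000free.com\nexample.co.uk\nporn.com\n' > domains.csv

# Average domain-name length for the category.
awk '{ total += length($0) } END { printf "%.2f\n", total / NR }' domains.csv

# Average number of dots per domain (gsub returns the number of matches).
awk '{ total += gsub(/\./, ".") } END { printf "%.2f\n", total / NR }' domains.csv

# Distribution of TLDs, most frequent first.
awk -F '.' '{ print $NF }' domains.csv | sort | uniq -c | sort -rn
```

<p>Running each one-liner on both the safe and the porn list and comparing the outputs gives a first feeling of how discriminative each statistic might be.</p>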
<p><big><strong>The Dataset</strong></big></p>
<p>Once we have got the original Squidblacklist and the cleaned Alexa list (after the subtraction and the removal of keyword-matching lines), we add some formatting to get a <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">WEKA ARFF</a> file. For instance, <code>0000free.com</code> must be transformed into <code>'0000free.com',safe</code>. A bit of <code>sed</code> trickery does the job, and then we mix the lists with the following command:</p>
<blockquote>
<p><code><code>$ paste -d '\n' alexacleanfinal.csv porn.csv > urllist.csv</code></code></p>
</blockquote>
<p>The rationale behind mixing the lists is that some learning algorithms are dependent on the order of examples, and for those algorithms it is wise not to present all the examples of one class first, and then all of the other class's. As the <code>paste</code> command adds blank lines when one of the lists finishes, we have to remove double newlines (<code>\n\n</code>) with another <code>sed</code> call, and we finally add the ARFF header to get a file starting the following way:</p>
<blockquote>
<p><code><code>@relation URLs
<br/>
<br/>
@attribute urltext String
<br/>
@attribute class {safe,porn}
<br/>
<br/>
@data
<br/>
'0000free.com',safe
<br/>
'0000000000000000000sex.com',porn
<br/>
'0000.jp',safe
<br/>
'000000000gratisporno.ontheweb.nl',porn
<br/>
...</code></code></p>
</blockquote>
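<p>The <code>sed</code> trickery mentioned above could look like the following sketch (the file contents are toy samples and the intermediate file names are my own invention; only the quoting and the class labels follow the format shown above):</p>

```shell
# Toy cleaned lists (stand-ins for the outputs of the previous steps).
printf '0000free.com\n0000.jp\n' > alexacleanfinal.csv
printf '0000000000000000000sex.com\n' > pornfinal.csv

# Quote each domain and append its class label (& is the whole matched line).
sed "s/.*/'&',safe/" alexacleanfinal.csv > safe.csv
sed "s/.*/'&',porn/" pornfinal.csv > pornlabelled.csv

# Interleave the lists; paste emits blank lines once the shorter file ends,
# so a final sed call removes them before prepending the ARFF header.
paste -d '\n' safe.csv pornlabelled.csv | sed '/^$/d' > urllist.csv
```

<p>Prepending the <code>@relation</code>/<code>@attribute</code>/<code>@data</code> header to <code>urllist.csv</code> then yields the ARFF file shown above.</p>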
<p>I have left that file, named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/URLAnalysis/urllist.arff" target="_blank">urllist.arff</a></code>, in <a href="https://github.com/jmgomezh/tmweka" target="_blank">my GitHub folder</a> for your convenience, so you can start playing with it. Beware, it is over 40 MB.</p>
<p>So that is all for the moment. Stay tuned for my next steps if you liked this post.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com3tag:blogger.com,1999:blog-36589303.post-50660941768136873872013-04-08T09:31:00.001+02:002013-06-28T06:58:56.924+02:00A Simple Text Classifier in Java with WEKA<p>In previous posts [<a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">1</a>, <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">2</a>, <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">3</a>], I have shown how to make use of the <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> classes <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a></code> in order to properly build and evaluate a text classifier using WEKA. For this purpose, I have made use of the <a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer</a> GUI provided by WEKA, and its <a href="http://jmgomezhidalgo.blogspot.com.es/2013/04/command-line-functions-for-text-mining.html" target="_blank">command-line interface</a>.</p>
<p>In my opinion, it is a good idea to get familiar with both the Explorer and the command-line interface if you want to get a feeling of the amazing power of this data mining library. However, where you can take full advantage of its power is in your own Java programs. Now it is time to deal with that.</p>
<p>Following <a href="http://dl.acm.org/citation.cfm?id=1095427" target="_blank">Salton</a>, and <a href="http://dl.acm.org/citation.cfm?id=138861" target="_blank">Belkin and Croft</a>, the process of text classification involves two main steps:</p>
<ul>
<li>Representing your text database in order to enable learning, and to train a classifier on it.</li>
<li>Using the classifier to predict text labels of new, unseen documents.</li>
</ul>
<p>The first step is a batch process, in the sense that you can do it periodically (as long as your labelled data set gets improved over time -- bigger sizes, new labels or categories, corrected predictions via user feedback). The second step is actually the moment in which you take advantage of the knowledge distilled by the learning process, and it is online in the sense that it is done on demand (when new documents arrive). This distinction is conceptual; modern text classifiers may retrain on the added documents as soon as they get them, in order to keep or improve accuracy over time.</p>
<p>In consequence, what we need to demonstrate the text classification process is <strong>two programs</strong>: one to <strong>learn</strong> from the text dataset, and another to use the learnt model to <strong>classify</strong> new documents. Let us start by showing a very simple text learner in Java, using WEKA. The class is named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredLearner.java" target="_blank">MyFilteredLearner.java</a></code>, and its <code>main()</code> method demonstrates its usage, which involves:</p>
<ol>
<li>Loading the text dataset.</li>
<li>Evaluating the classifier.</li>
<li>Training the classifier.</li>
<li>Storing the classifier.</li>
</ol>
<p>The most interesting parts of the process are:</p>
<ul>
<li>We read the dataset by simply using the method <code>getData()</code> of an <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/converters/ArffLoader.ArffReader.html" target="_blank">ArffReader</a></code> object that wraps a <code>BufferedReader</code>.</li>
<li>We programmatically create the classifier by combining a <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a></code> filter (in order to represent the texts as feature vectors) and a <code><a href="http://weka.sourceforge.net/doc/weka/classifiers/bayes/NaiveBayes.html" target="_blank">NaiveBayes</a></code> classifier (for learning), using the <code>FilteredClassifier</code> class discussed in previous posts.</li>
</ul>
<p>The process of creating the classifier is demonstrated in the next code snippet:</p>
<blockquote>
<p><code>trainData.setClassIndex(0);
<br/>
filter = new StringToWordVector();
<br/>
filter.setAttributeIndices("last");
<br/>
classifier = new FilteredClassifier();
<br/>
classifier.setFilter(filter);
<br/>
classifier.setClassifier(new NaiveBayes());</code></p>
</blockquote>
<p>So we set the class of the dataset to be the first attribute, then we create the filter and set the attribute to be transformed from text into a feature vector (the last one), and then we create the <code>FilteredClassifier</code> object and add the previous filter and a new <code>NaiveBayes</code> classifier to it. Given the setup above, the dataset has to have the class as the first attribute, and the text as the second (and last) one, as in my typical SMS spam subset example (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smsspam.small.arff" target="_blank">smsspam.small.arff</a></code>).</p>
<p>You can execute this class with the following commands to get the following output:</p>
<blockquote>
<p><code><code>$>javac MyFilteredLearner.java
<br/>
$>java MyFilteredLearner smsspam.small.arff myClassifier.dat
<br/>
===== Loaded dataset: smsspam.small.arff =====
<br/>
<br/>
Correctly Classified Instances 187 93.5 %
<br/>
Incorrectly Classified Instances 13 6.5 %
<br/>
Kappa statistic 0.7277
<br/>
Mean absolute error 0.0721
<br/>
Root mean squared error 0.2568
<br/>
Relative absolute error 25.8792 %
<br/>
Root relative squared error 69.1763 %
<br/>
Coverage of cases (0.95 level) 94 %
<br/>
Mean rel. region size (0.95 level) 51.75 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Detailed Accuracy By Class ===
<br/>
<br/>
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
<br/>
0,636 0,006 0,955 0,636 0,764 0,748 0,943 0,858 spam
<br/>
0,994 0,364 0,933 0,994 0,962 0,748 0,943 0,986 ham
<br/>
Weighted Avg. 0,935 0,305 0,936 0,935 0,930 0,748 0,943 0,965
<br/>
===== Evaluating on filtered (training) dataset done =====
<br/>
===== Training on filtered (training) dataset done =====
<br/>
===== Saved model: myClassifier.dat =====</code></code></p>
</blockquote>
<p>The evaluation has been performed with default values, except for the number of folds, which has been set to 4, as shown in the next code snippet:</p>
<blockquote>
<p><code><code>Evaluation eval = new Evaluation(trainData);
<br/>
eval.crossValidateModel(classifier, trainData, 4, new Random(1));
<br/>
System.out.println(eval.toSummaryString());</code></code></p>
</blockquote>
<p>In case you do not want to evaluate the classifier on the training data, you can omit the call to the <code>evaluate()</code> method.</p>
<p>Now let us deal with the classification program, which is more complex, but only in the process of creating an instance. The class is named <code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/MyFilteredClassifier.java" target="_blank">MyFilteredClassifier.java</a></code>, and its <code>main()</code> method demonstrates its usage, which involves:</p>
<ol>
<li>Reading the text to be classified from a file.</li>
<li>Reading the model or classifier from a file.</li>
<li>Creating the instance.</li>
<li>Classifying it.</li>
</ol>
<p>Creating the instance is performed in the <code>makeInstance()</code> method, and its code is the following one:</p>
<blockquote>
<p><code><code>// Create the attributes, class and text
<br/>
FastVector fvNominalVal = new FastVector(2);
<br/>
fvNominalVal.addElement("spam");
<br/>
fvNominalVal.addElement("ham");
<br/>
Attribute attribute1 = new Attribute("class", fvNominalVal);
<br/>
Attribute attribute2 = new Attribute("text",(FastVector) null);
<br/>
// Create list of instances with one element
<br/>
FastVector fvWekaAttributes = new FastVector(2);
<br/>
fvWekaAttributes.addElement(attribute1);
<br/>
fvWekaAttributes.addElement(attribute2);
<br/>
instances = new Instances("Test relation", fvWekaAttributes, 1);
<br/>
// Set class index
<br/>
instances.setClassIndex(0);
<br/>
// Create and add the instance
<br/>
DenseInstance instance = new DenseInstance(2);
<br/>
instance.setValue(attribute2, text);
<br/>
// instance.setValue((Attribute)fvWekaAttributes.elementAt(1), text);
<br/>
instances.add(instance);</code></code></p>
</blockquote>
<p>The classifier learnt with <code>MyFilteredLearner.java</code> expects that an instance has two attributes: the first one is the class, a nominal attribute with values <code>"spam"</code> or <code>"ham"</code>; the second one is a <code>String</code>, which is the text to be classified. Instead of creating one instance, we create a whole new dataset whose first instance is the one that we want to classify. This is required in order to let the classifier know the schema of the dataset, which is stored in the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/Instances.html" target="_blank">Instances</a></code> object (and not in each instance).</p>
<p>So first we create the attributes by using the <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/FastVector.html" target="_blank">FastVector</a></code> class provided by WEKA. The case of the nominal attribute (<code>"class"</code>) is relatively simple, but the case of the <code>String</code> one is a bit more complex because it requires the second argument of the constructor to be <code>null</code>, but cast to <code>FastVector</code>. Then we create an <code>Instances</code> object by using a <code>FastVector</code> to store the two previous attributes, and set the class index to 0 (which means that the first attribute will be the class). As a note, the <code>FastVector</code> class is deprecated in the WEKA development version.</p>
<p>The last step is to create an actual instance. I am using the WEKA development version in this code (as of the date of this post), so we have to use a <code><a href="http://weka.sourceforge.net/doc.dev/weka/core/DenseInstance.html" target="_blank">DenseInstance</a></code> object. However, if you make use of the stable version, then you can use <code><a href="http://weka.sourceforge.net/doc.stable/weka/core/Instance.html" target="_blank">Instance</a></code> (link to the stable version doc), and must change this code to:</p>
<blockquote>
<p><code><code>Instance instance = new Instance(2);</code></code></p>
</blockquote>
<p>As a note, I have commented in the code a different way of setting the value of the second attribute. I must note that we do not set the value of the first attribute, as it is unknown.</p>
<p>The rest of the methods are (more or less) straightforward if you follow the documentation (<a href="http://weka.wikispaces.com/Programmatic+Use" target="_blank">weka - Programmatic Use</a>, and <a href="http://weka.wikispaces.com/Use+WEKA+in+your+Java+code" target="_blank">weka - Use WEKA in your Java code</a>). You get the class prediction on your text with the following lines:</p>
<blockquote>
<p><code><code>double pred = classifier.classifyInstance(instances.instance(0));
<br/>
System.out.println("Class predicted: " + instances.classAttribute().value((int) pred));</code></code></p>
</blockquote>
<p>And if you feed this classifier with a file (<code><a href="https://github.com/jmgomezh/tmweka/blob/master/FilteredClassifier/smstest.txt" target="_blank">smstest.txt</a></code>) that stores the text <code>"this is spam or not, who knows?"</code>, and the model learnt with <code>MyFilteredLearner.java</code> (that is stored in <code>myClassifier.dat</code>), then you get the following result:</p>
<blockquote>
<p><code><code>$>javac MyFilteredClassifier.java
<br/>
$>java MyFilteredClassifier smstest.txt myClassifier.dat
<br/>
===== Loaded text data: smstest.txt =====
<br/>
this is spam or not, who knows?
<br/>
===== Loaded model: myClassifier.dat =====
<br/>
===== Instance created with reference dataset =====
<br/>
@relation 'Test relation'
<br/>
<br/>
@attribute class {spam,ham}
<br/>
@attribute text string
<br/>
<br/>
@data
<br/>
?,' this is spam or not, who knows?'
<br/>
===== Classified instance =====
<br/>
Class predicted: ham</code></code></p>
</blockquote>
<p>It is interesting to see that the class assigned to the instance before classifying it is <code>"?"</code>, which means <em>undefined</em> or <em>unknown</em>.</p>
<p>For those interested on using the classifiers discussed in my previous posts (I mean including <code><a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a></code>, and using <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/PART.html" target="_blank">PART</a></code> and <code><a href="http://weka.sourceforge.net/doc.dev/weka/classifiers/functions/SMO.html" target="_blank">SMO</a></code> as classifiers), the only part of this code that you have to change is the <code>learn()</code> and <code>evaluate()</code> methods in <code>MyFilteredLearner.java</code>. Just play with it, and have fun.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on this topic!</p>
<p><strong>UPDATE (June 26th, 2013):</strong> Since I wrote this post, I have moved <a href="https://github.com/jmgomezh/tmweka" target="_blank">my code examples and other stuff to a GitHub repository</a>. I have just updated the links.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com65tag:blogger.com,1999:blog-36589303.post-23818818920909030322013-04-01T18:21:00.000+02:002013-05-02T09:45:30.878+02:00Command Line Functions for Text Mining in WEKA<p>In previous posts I have explained <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">how to chain filters and classifiers in WEKA</a>, in order to avoid incorrect results when evaluating text classifiers by using cross-fold validation, and <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">how to integrate feature selection in the text classification process</a>. For this purpose, I have used the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a> and the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html">MultiFilter</a> in the <a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer</a> GUI provided by <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a>. Now it is time to do so in the command line.</p>
<p>WEKA essentially provides three usage modes:</p>
<ol>
<li>Using the Explorer, and other GUIs like the <a href="http://www.cse.yorku.ca/course_archive/2006-07/W/4412/doc/weka/ExperimenterTutorial-3.5.5.pdf" target="_blank">Experimenter</a>, which allow you to set up experiments and examine the results graphically.</li>
<li>Using the command-line functions, which allow you to set up filters, classifiers and clusterers with plenty of configuration options.</li>
<li>Using the classes programmatically, that is, in your own programs in Java.</li>
</ol>
<p>One major difference between modes 1 and 2 is that in the first mode, you spend some of the memory on the GUI, while in the second one, you do not. That can be a significant difference when you load big datasets. In both cases you can control the memory assigned to WEKA using Java command-line options like <code>-Xms</code>, <code>-Xmx</code> and so on, but it may be interesting to save the memory used by the graphic elements in order to be able to deal with bigger datasets.</p>
<p>I will deal with the usage of WEKA in your programs in the future; in this post I focus on the command line. Before trying the following examples, please ensure <code>weka.jar</code> is added to your <code>CLASSPATH</code>. The first thing we must know is that WEKA filters and classifiers can be called in the command line, and that the call without arguments will show their configuration options. For instance, when you call a rule learner like <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">PART</a> (which I used in my previous posts), you get the following options:</p>
<blockquote>
<p><code>$>java weka.classifiers.rules.PART
<br/>
Weka exception: No training file and no object input file given.
<br/>
General options:
<br/>
-h or -help
<br/>
Output help information.
<br/>
-synopsis or -info
<br/>
Output synopsis for classifier (use in conjunction with -h)
<br/>
-t <name of training file>
<br/>
Sets training file.
<br/>
-T <name of test file>
<br/>
Sets test file. If missing, a cross-validation will be performed
<br/>
on the training data.
<br/>
...
<br/>
Options specific to weka.classifiers.rules.PART:
<br/>
-C <pruning confidence>
<br/>
Set confidence threshold for pruning.
<br/>
(default 0.25)
<br/>
...</code></p>
</blockquote>
<p>I omit the full list of options. Options are divided into two groups: those that are accepted by any classifier, and those specific to the PART classifier. The general options include three usage modes:</p>
<ul>
<li>Evaluating the classifier on the training collection itself, possibly using cross-validation, or on a test collection.</li>
<li>Training a classifier and storing the model in a file for further use.</li>
<li>Training a classifier and getting its output (classification of instances) on a test collection.</li>
</ul>
<p>However, when calling a filter in the command line, the input file (the dataset) is read from the standard input, so you have to redirect the input from your file by using the appropriate operator (<code><</code>), or to use the option <code>-h</code> to get the options of the filter.</p>
<p>In my previous post on chaining filters and classifiers, I performed an experiment running a PART classifier on an ARFF-formatted subset of the <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, namely the <code>smsspam.small.arff</code> file. As every instance is of the form <code>[spam|ham],"message text"</code>, we have to transform the text of the message into a term weight vector by using the StringToWordVector filter. You can combine the filter and the classifier evaluation into one command by using the FilteredClassifier class as in the following command:</p>
<blockquote>
<p><code>$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.rules.PART</code></p>
</blockquote>
<p>To get the following output:</p>
<blockquote>
<p><code>=== Stratified cross-validation ===
<br/>
Correctly Classified Instances 173 86.5 %
<br/>
Incorrectly Classified Instances 27 13.5 %
<br/>
Kappa statistic 0.4181
<br/>
Mean absolute error 0.1625
<br/>
Root mean squared error 0.3523
<br/>
Relative absolute error 58.2872 %
<br/>
Root relative squared error 94.9031 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Confusion Matrix ===
<br/>
<br/>
a b <-- classified as
<br/>
13 20 | a = spam
<br/>
7 160 | b = ham</code></p>
</blockquote>
<p>Which is exactly the one I showed <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">in my previous post</a>. I have used the following general options:</p>
<ul>
<li><code>-t smsspam.small.arff</code> to specify the dataset to train (and on default, to evaluate on by using cross-validation).</li>
<li><code>-c 1</code> to specify the first attribute as the class.</li>
<li><code>-x 3</code> to specify that the number of folds to be used in the cross-validation evaluation is 3.</li>
<li><code>-v</code> and <code>-o</code> to avoid outputting the classifiers and statistics on the training collection, respectively.</li>
</ul>
<p>Plus the specific options of the FilteredClassifier <code>-F</code> to define the filter, and <code>-W</code> to define the classifier.</p>
<p>In my subsequent post on chaining filters, I proposed to make use of attribute selection to improve the representation of our learning problem. This can be done by issuing the following command:</p>
<blockquote>
<p><code><code>$>java weka.classifiers.meta.FilteredClassifier -t smsspam.small.arff -c 1 -x 3 -v -o -F "weka.filters.MultiFilter -F weka.filters.unsupervised.attribute.StringToWordVector -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.rules.PART</code></code></p>
</blockquote>
<p>To get the following output:</p>
<blockquote>
<p><code><code>=== Stratified cross-validation ===
<br/>
Correctly Classified Instances 167 83.5 %
<br/>
Incorrectly Classified Instances 33 16.5 %
<br/>
Kappa statistic 0.1959
<br/>
Mean absolute error 0.1967
<br/>
Root mean squared error 0.38
<br/>
Relative absolute error 70.53 %
<br/>
Root relative squared error 102.3794 %
<br/>
Total Number of Instances 200
<br/>
<br/>
=== Confusion Matrix ===
<br/>
<br/>
a b <-- classified as
<br/>
6 27 | a = spam
<br/>
6 161 | b = ham</code></code></p>
</blockquote>
<p>This, in turn, is the same result I got <a href="http://jmgomezhidalgo.blogspot.com.es/2013/02/text-mining-in-weka-revisited-selecting.html" target="_blank">in that post</a>. If we replace PART by the <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">SMO</a> implementation of Support Vector Machines included in WEKA (by changing <code>weka.classifiers.rules.PART</code> to <code>weka.classifiers.functions.SMO</code>), we get an accuracy figure of 91%, as described in the post.</p>
<p>While most of the options are the same as in the previous command, two things deserve special attention in this one:</p>
<ul>
<li>We chain the <a href="http://weka.sourceforge.net/doc/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a> and the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a> filters by using the MultiFilter described in the previous post. The order of calls is obviously relevant, as we first need to tokenize the messages into words, and then select the most informative words. Moreover, while we apply StringToWordVector with the default options, the AttributeSelection filter makes use of the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/InfoGainAttributeEval.html" target="_blank">InfoGainAttributeEval</a> function as the quality metric, and the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/Ranker.html" target="_blank">Ranker</a> class as the search method. The Ranker class is applied with the option <code>-T 0.0</code> in order to specify that the filter has to rank the attributes (words or tokens) according to the quality metric, but keep only those whose score is over the threshold defined by -T, that is, 0.0.</li>
<li>As the order of the options alone does not tell WEKA which class each option belongs to, it is required to link the options to the appropriate class by using the quotation mark symbol ("). Unfortunately, we have three nested expressions:</li>
<li style="LIST-STYLE-TYPE: none">
<ul class="noindent">
<li>The whole MultiFilter filter, enclosed by the isolated quotation marks (").</li>
<li>The AttributeSelection filter, enclosed by the escaped quotation mark (\").</li>
<li>The Ranker search method, enclosed by the double escaped quotation mark (\\\"). Here we escape the escape symbol itself (\) along with the quotation mark.</li>
</ul>
</li>
<li style="LIST-STYLE-TYPE: none">So many escaping symbols make the command a bit <em>dirty</em>, but it is still functional.</li>
</ul>
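<p>Conceptually, what the chained command does is simple, even if the quoting is not: MultiFilter just applies its filters in order. Here is a rough Python sketch of that idea (not WEKA code; the function names and the toy scores are mine):</p>

```python
def tokenize(messages):
    # First filter: break each message into a bag of lowercase tokens
    # (a crude stand-in for StringToWordVector).
    return [set(m.lower().split()) for m in messages]

def select_above_threshold(bags, scores, threshold=0.0):
    # Second filter: keep only tokens whose quality score is above the
    # threshold (a crude stand-in for AttributeSelection with Ranker -T 0.0).
    kept = {tok for tok, s in scores.items() if s > threshold}
    return [bag & kept for bag in bags]

def multi_filter(messages, scores):
    # Order matters: tokenize first, then select the informative tokens.
    return select_above_threshold(tokenize(messages), scores)

msgs = ["Free entry to win", "ok see you later"]
scores = {"free": 0.8, "win": 0.5, "ok": 0.0, "see": 0.0}
print(multi_filter(msgs, scores))  # only "free" and "win" survive
```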
<p>So I have shown how we can chain filters and classifiers in the command line, and how to apply several chained filters as well. In upcoming posts I will explain how to train, store and then evaluate a classifier by using the command line, and how to make use of WEKA filters and classifiers in your own Java programs.</p>
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>
<p>NOTE: You can find the collection I used in this post, along with other stuff related to WEKA and text mining in my <a href="http://www.esp.uem.es/jmgomez/tmweka/" target="_blank">Text Mining in WEKA</a> page.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-66524861400986988612013-02-11T10:50:00.001+01:002013-05-02T09:46:53.386+02:00Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters<p>Two weeks ago, I wrote <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">a post on how to chain filters and classifiers in WEKA</a>, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">N Fold Cross Validation</a> (CV) in your data, you should not apply the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a> (STWV) filter on the full data collection and then perform the CV evaluation on your data, because you would be using words that are present in your test subset (but not in your training subset) for each run. Moreover, the STWV filter can extract and use simple statistics to filter out the terms (e.g. minimum number of occurrences), but those statistics over the full collection are not valid because in each CV run you use only a subset of it.</p>
<p>Now I would like to deal with a more general setting in which you want to apply <strong><a href="http://en.wikipedia.org/wiki/Dimension_reduction" target="_blank">dimensionality reduction</a></strong> because, in general text classification tasks, the documents or examples are represented by hundreds (if not thousands) of tokens, which makes the classification problem very hard for many learners. In <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a>, this involves using the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">AttributeSelection</a> filter along with the STWV one. Before applying dimensionality reduction, though, we should reflect a bit on it.</p>
<p>Dimensionality reduction is a typical step in many data mining problems, which involves transforming our data representation (the schema of our table, the list of current attributes) into a shorter, more compact, and hopefully, more predictive one. Basically, this can be done in two ways:</p>
<ul>
<li>With <strong>feature reduction</strong>, which maps the original representation (list of attributes) onto a new and more compact one. The new attributes are synthetic, that is, they somehow combine the information from subsets of the original ones which share statistical properties. Typical feature reduction techniques include algebraic analysis methods like <a href="http://en.wikipedia.org/wiki/Principal_component_analysis" target="_blank">Principal Component Analysis</a> (PCA) and <a href="http://en.wikipedia.org/wiki/Singular_value_decomposition" target="_blank">Singular Value Decomposition</a> (SVD). In text analysis, the most popular method is, by far, <a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing" target="_blank">Latent Semantic Analysis</a>, which involves obtaining the principal components, or buckets, of the term-to-document sparse matrix.</li>
<li>With <strong>feature selection</strong>, which just selects a subset of the original representation attributes, according to some Information Theory quality metric like <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a> or <a href="http://en.wikipedia.org/wiki/Chi-squared_distribution" target="_blank">X^2 (Chi-Square)</a>. This method can be far simpler and less time-consuming than the previous one, as you only have to compute the value of the metric for each attribute and rank the attributes. Then you simply decide on a threshold for the metric (e.g. 0 for Information Gain) and keep the attributes with a value above it. Alternatively, you can choose a percentage of the original attributes (e.g. 1% and 10% are typical numbers in text classification) and just keep the top-ranking ones. However, there are other, more time-consuming alternatives, like exploring the predictive power of subsets of attributes using search algorithms.</li>
</ul>
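<p>To make the selection idea concrete, the Information Gain of a binary term for a binary class can be computed from four counts. The following is my own minimal sketch of the metric, not WEKA's implementation:</p>

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(n11, n10, n01, n00):
    # IG of a binary term for a binary class, from four counts:
    # n11: term present & spam, n10: present & ham,
    # n01: absent & spam,       n00: absent & ham.
    n = n11 + n10 + n01 + n00
    class_entropy = entropy([n11 + n01, n10 + n00])
    p_present = (n11 + n10) / n
    conditional = (p_present * entropy([n11, n10])
                   + (1 - p_present) * entropy([n01, n00]))
    return class_entropy - conditional

# A term concentrated in one class is informative; an evenly spread one is not.
print(information_gain(40, 10, 10, 140))  # clearly positive
print(information_gain(25, 75, 25, 75))   # 0.0
```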
<p>A major difference between the two methods is that feature reduction leads to <em>synthetic</em> attributes, while feature selection just keeps some of the original ones. This may affect the ability of the data scientist to understand the results, as synthetic attributes can be statistically relevant but meaningless. Another difference is that feature reduction does not make use of the <em>class information</em>, while feature selection does. In consequence, the second method is very likely to lead to a more predictive subset of attributes than the original one. But beware: more theoretical predictive power does not always mean more effectiveness. I recommend reading the old (?) but always helpful <a href="http://dl.acm.org/citation.cfm?id=657137" target="_blank">paper by Yiming Yang & Jan Pedersen</a> on the topic.</p>
<p>The WEKA package supports both methods, mainly with the <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/PrincipalComponents.html" target="_blank">weka.attributeSelection.PrincipalComponents</a> (feature reduction) and <a href="http://weka.sourceforge.net/doc.dev/weka/filters/supervised/attribute/AttributeSelection.html" target="_blank">weka.filters.supervised.attribute.AttributeSelection</a> (feature selection) filters. But an important question is: do you really need to perform dimensionality reduction in text analysis? There are two clear arguments against it:</p>
<ol>
<li>Some algorithms are not hurt by using all the features, even if there are really many of them and they are very sparse. For instance, Support Vector Machines excel in text classification problems precisely because of this: they are able to deal with thousands of attributes, and they get better results when no reduction is performed. A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering.</li>
<li>If it is a matter of computing time, as with symbolic learners like decision trees (C4.5) or rule learners (Ripper), then there is no need to worry: Big Data techniques come to help, as you can configure big, cheap clusters over e.g. Hadoop to perform your computations!</li>
</ol>
<p>But having the algorithms in my favourite data analysis package, and knowing that they sometimes lead to effectiveness improvements, why not use them?</p>
<p>For the reasons above, I will focus on feature selection. In consequence, I will deal with the AttributeSelection filter, leaving the PrincipalComponents one for another post. Let us start with the same text collection that I used in my previous post about chaining filters and classifiers in WEKA. It is a small subset of the <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>, made with the first 200 messages for brevity and simplicity.</p>
<p>Our goal is to perform a 3-fold CV experiment with any algorithm in WEKA. But in order to do it correctly, we know we must chain the STWV filter with the classifier by using the FilteredClassifier learner in WEKA. However, we want to perform feature selection as well, and the FilteredClassifier only allows us to chain a single filter with a single classifier. So, how can we combine both the STWV and AttributeSelection filters into a single one?</p>
<p>Let us start doing it manually. After loading the dataset into the WEKA Explorer, applying the STWV filter with the default settings, and setting the class attribute to the "spamclass" one, we get something like this:</p>
<p><img src="https://lh5.googleusercontent.com/-aVqAh2gsXS0/URirr3HJB5I/AAAAAAAABlw/Chtd-kGNXGs/s800/weka01.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>Now we can either go to the "Select attributes" tab, or just stay in the "Preprocess" tab and choose the AttributeSelection filter. I opt for the second way, so you can browse the filters folder by clicking on the "Choose" button at the "Filters" area. After selecting the "weka > filters > supervised > attribute > AttributeSelection", you can see the selected filter in the "Filters" area, as shown in the next picture:</p>
<p><img src="https://lh6.googleusercontent.com/-Ru7jWvVqFc8/URirsCXFAvI/AAAAAAAABl8/zZdeU8KMQgI/s800/weka02.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>In order to set up the filter, we can click on the name of the filter. The "weka.gui.GenericObjectEditor" window we get is a generic window that allows us to configure filters, classifiers, etc. according to a number of object-defined properties. In this case, it allows us to set up the AttributeSelection filter configuration options, which are:</p>
<ul>
<li>The <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/AttributeEvaluator.html" target="_blank">evaluator</a>, which is the quality metric we use to evaluate the predictive properties of an attribute or a set of them. There you can choose among a wide number of them (which depends on your WEKA version), including specially Chi Square (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/ChiSquaredAttributeEval.html" target="_blank">ChiSquaredAttributeEval</a>), Information Gain (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/InfoGainAttributeEval.html" target="_blank">InfoGainAttributeEval</a>), and Gain Ratio (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/GainRatioAttributeEval.html" target="_blank">GainRatioAttributeEval</a>).</li>
<li>The <a href="http://weka.sourceforge.net/doc/weka/attributeSelection/ASSearch.html" target="_blank">search algorithm</a>, which is the way we will select the remaining group of attributes, and includes very clever but time consuming group search algorithms, and my favourite one, the Ranker (<a href="http://weka.sourceforge.net/doc/weka/attributeSelection/Ranker.html" target="_blank">weka.attributeSelection.Ranker</a>). This one just ranks the attributes according to the chosen quality metric, and keeps those meeting some criterion (like e.g. having a value over a predefined threshold).</li>
</ul>
<p>In the next picture, you can see the AttributeSelection configuration window with the evaluator set up to Information Gain, and the search set up as Ranker, with the default options.</p>
<p><img src="https://lh5.googleusercontent.com/-T1b2VbGK7j8/URirsViyp5I/AAAAAAAABl0/kw5Up1j3vi4/s465/weka03.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 185px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="185" width="350"/></p>
<p>The Ranker search method has two main properties:</p>
<ul>
<li>The <em>numToSelect</em> property, which defines the number of attributes to keep, an integer that is -1 (meaning all) by default.</li>
<li>The <em>threshold</em> property, which defines the minimum value that an attribute has to score on the evaluator in order to be kept. The default value for this property is a huge negative number, so that no attribute is discarded by default.</li>
</ul>
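<p>Taken together, the two properties behave roughly like the following sketch (my own illustration; WEKA's exact tie-breaking and threshold semantics may differ slightly):</p>

```python
def ranker(scores, num_to_select=-1, threshold=float("-inf")):
    # Rank attributes by score (best first), drop those at or below the
    # threshold, then optionally keep only the top num_to_select (-1 = all).
    ranked = sorted(scores, key=scores.get, reverse=True)
    kept = [a for a in ranked if scores[a] > threshold]
    return kept if num_to_select < 0 else kept[:num_to_select]

# Hypothetical Information Gain scores for a few tokens:
ig = {"to": 0.12, "call": 0.09, "FREE": 0.07, "later": 0.0, "the": 0.0}
print(ranker(ig, threshold=0.0))    # only the attributes scoring above 0
print(ranker(ig, num_to_select=2))  # the two best attributes
```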
<p>In consequence, if we want to keep those attributes scoring over 0, we just have to write that number in the threshold field of the window we get when clicking on the Ranker in the previous window:</p>
<p><img src="https://lh5.googleusercontent.com/-i_NBv6nmAmI/URirs2mLekI/AAAAAAAABmE/rJ1zMMRhEz8/s435/weka04.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 247px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="247" width="350"/></p>
<p>By clicking OK on all the previous windows, we get a configuration of the AttributeSelection filter which involves keeping those attributes with Information Gain score over 0. If we apply that filter to our current collection, we get the following result:</p>
<p><img src="https://lh6.googleusercontent.com/-beT4BfclLfM/URirtSCg3VI/AAAAAAAABmI/DiGsDjn0l7U/s800/weka05.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top-scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), and many of them are spam ones, so it is quite predictive for that particular class. We can see as well that we keep only 5.93% of the original attributes (82 out of 1,382).</p>
<p>Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART") to be evaluated on the training collection itself ("Test options" area, "Use training set" option), getting the next result:</p>
<p><img src="https://lh3.googleusercontent.com/--5_RJkc4KcU/URirtrncGAI/AAAAAAAABmY/qKJN90Unj58/s800/weka06.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>We get an accuracy of 95.5%, much better than <a href="http://jmgomezhidalgo.blogspot.com.es/2013/01/text-mining-in-weka-chaining-filters.html" target="_blank">the results I reported in my previous post</a>. Of course, these results cannot be compared, because this quick experiment is a test on the training collection itself, not done with 3-fold CV and the FilteredClassifier. But if we want to run a CV experiment, how can we do it, given that we have two filters instead of one in our setup?</p>
<p>What we need now is to start with the original text collection in <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html" target="_blank">ARFF format</a> (no STWV yet), and to use the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/MultiFilter.html" target="_blank">MultiFilter</a> that WEKA provides for these situations. We start then with the original collection, and go to the "Classify" tab. If we try to choose any classic learner (<a href="http://weka.sourceforge.net/doc/weka/classifiers/trees/J48.html" target="_blank">J48 for the C4.5 decision tree learner</a>, <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">SMO for Support Vector Machines</a>, etc.), it will be impossible because we have just one attribute (the text of the SMS messages) along with the class, but we can use the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">weka.classifiers.meta.FilteredClassifier</a>. After selecting it, we will see something similar to the next picture:</p>
<p><img src="https://lh3.googleusercontent.com/-4afnYiVvy2I/URiruOiStAI/AAAAAAAABmU/n8yKF72A1H4/s800/weka07.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>If we click on the name of the classifier at the "Classifier" area and we select <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">weka.classifiers.rules.PART</a> as the classifier (with default options), we get the next set up in the FilteredClassifier editor window:</p>
<p><img src="https://lh5.googleusercontent.com/-4XZFdmv8zvs/URiruLNDKGI/AAAAAAAABmQ/cI6tfCaGkaY/s465/weka08.PNG" style="TEXT-ALIGN: center; WIDTH: 350px; DISPLAY: block; HEIGHT: 208px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="208" width="350"/></p>
<p>Then we can choose the weka.filters.MultiFilter in the filter area, which starts with a dummy AllFilter. It is time to set up our filter combining STWV and AttributeSelection. We click on the filter name area and get a new filter edition window with an area to define the filters to be applied. If we click on it, we get a new window that allows us to add, configure and delete filters. The selected filters will be applied in the order we add them, so we start by deleting the AllFilter and adding a STWV filter with the default options, getting something similar to the next picture:</p>
<p><img src="https://lh3.googleusercontent.com/-iFaiyK_72F0/URiruwTjqAI/AAAAAAAABms/4qwN0tY_rEU/s260/weka09.PNG" style="TEXT-ALIGN: center; WIDTH: 260px; DISPLAY: block; HEIGHT: 194px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="194" width="260"/></p>
<p>Filters are added by clicking on the "Choose" button to select them, and on the "Add" button to append them to the list. We can now add the AttributeSelection filter with the Information Gain evaluator and the Ranker search with threshold 0, by selecting the AttributeSelection filter in the list and clicking on the "Edit" button. If you manually resize the window, you can see a setup similar to this one:</p>
<p><img src="https://lh5.googleusercontent.com/-1vKcKDynrhE/URirvPPpFpI/AAAAAAAABmk/mtPvr2JHBO0/s629/weka10.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 146px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="146" width="400"/></p>
<p>The set up is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" button at the MultiFilter and FilteredClassifier windows. In the "Classify" tab at the explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select the class attribute as "spamclass". Having done this, we can just click on the "Start" button to get the next result:</p>
<p><img src="https://lh6.googleusercontent.com/-fpzIgIh2h04/URirvv9Db8I/AAAAAAAABmo/oHaVNTGT_RI/s800/weka11.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="300" width="400"/></p>
<p>So we get an accuracy of 83.5%, which is worse than the one we got without feature selection (86.5%). Oh oh, all this clever (?) setup to get a drop of 3 points in accuracy! :-(</p>
<p>But what happens if, instead of using a relatively weak learner on text problems like PART, we turn to Support Vector Machines? WEKA includes the <a href="http://weka.sourceforge.net/doc/weka/classifiers/functions/SMO.html" target="_blank">weka.classifiers.functions.SMO</a> classifier, which implements <a href="http://dl.acm.org/citation.cfm?id=299105" target="_blank">John Platt's sequential minimal optimization algorithm</a> for training a support vector classifier. If we choose this classifier with the default options, we get quite different results:</p>
<ul>
<li>Using only the STWV filter, we get an accuracy of 90.5% with 18 spam messages classified as legitimate ("ham"), and 1 false positive.</li>
<li>Using the MultiFilter with AttributeSelection in the same setup, we get an accuracy of 91% with 16 spam messages classified as ham, and 2 false positives.</li>
</ul>
<p>So we get an improvement in accuracy on a more accurate learner, which is nice. However, the difference is just 0.5% (1 message in our 200-instance collection), so it is moderate. Moreover, we get one more false positive, which is bad for this particular problem. In spam filtering, making a false positive (sending a legitimate message to the spam folder) is much worse than the opposite, because the user risks missing an important message. Check <a href="http://dl.acm.org/citation.cfm?id=508911" target="_blank">my paper on cost sensitive evaluation of spam filtering at ACM SAC 2002</a>.</p>
<p>But all in all, I hope this post shows the merits of feature selection in text classification problems, and how to do it with my favourite library, WEKA. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com29tag:blogger.com,1999:blog-36589303.post-35038799970740459232013-01-29T13:21:00.001+01:002013-05-02T09:48:01.472+02:00Text Mining in WEKA: Chaining Filters and Classifiers<p>One of the most interesting features of <a href="http://www.cs.waikato.ac.nz/ml/weka/" target="_blank">WEKA</a> is its flexibility for text classification. Over the years, I have had the chance to run a lot of experiments on text collections with WEKA, most of them in <a href="http://en.wikipedia.org/wiki/Supervised_learning" target="_blank">supervised tasks</a> commonly referred to as <a href="http://en.wikipedia.org/wiki/Document_classification" target="_blank">Text Categorization</a>, that is, classifying text segments (documents, paragraphs, collocations) into a set of predefined classes. Examples of Text Categorization tasks include assigning topic labels to news items, classifying email messages into folders, or, closer to my research, classifying messages as spam or not (<a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" target="_blank">Bayesian spam filters</a>) and web pages as inappropriate or not (e.g. pornographic content vs. educational resources).</p>
<p>WEKA's support for Text Categorization is <em>impressive</em>. A prominent feature is that this package supports breaking text utterances into indexing terms (word stems, collocations) and assigning them a weight in term vectors, a required step in nearly every text classification task. This tokenization and indexing process is achieved by using a super-flexible filter named <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector</a>. Let me show an example of how it works.</p>
<p>I will start with a simple text collection, which is a small sample of the publicly available <a href="http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/" target="_blank">SMS Spam Collection</a>. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters, and it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources. I will make use of a small subset in order to better show my points in this post. The subset is made with the first 200 messages, and it is the following one, formatted in the suitable WEKA ARFF format:</p>
<blockquote style="MARGIN-RIGHT: 0px" dir="ltr">
<p>@relation sms_test</p>
<p>@attribute spamclass {spam,ham}
<br/>
@attribute text String</p>
<p>@data
<br/>
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
<br/>
ham,'Ok lar... Joking wif u oni...'
<br/>
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
<br/>
ham,'U dun say so early hor... U c already then say...'
<br/>
ham,'Nah I don\'t think he goes to usf, he lives around here though'
<br/>
spam,'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv'
<br/>
...
<br/>
ham,'Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx'</p>
</blockquote>
<p>In the first 200 messages of the collection, 33 of them are spam and 167 are legitimate ("ham"). This collection can be loaded in the <a href="https://www.google.es/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CDAQFjAA&url=http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf&ei=NLwHUY2FBMSYhQed7oG4Cg&usg=AFQjCNGMB6VSKlDT54vaURKZUzpE84JzSA&sig2=XqaJy2aFRWyNEb8skoVbcw&bvm=bv.41524429,d.ZG4" target="_blank">WEKA Explorer</a>, showing something similar to the following window:</p>
<p style="TEXT-ALIGN: center"><img src="https://lh6.googleusercontent.com/-X1T58FONe78/UQey9KzvS_I/AAAAAAAABkM/nn6PVpXg9J4/s735/wekaexample01.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>The point is that messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules like:</p>
<blockquote>
<p><strong>if</strong> ("urgent" <strong>in</strong> message) <strong>then</strong> class(message) == spam</p>
</blockquote>
<p>Here is where the StringToWordVector filter comes to help. You can just select it by clicking the "Choose" button in the "Filter" area, and browsing the folders to "weka > filters > unsupervised > attribute" one. Once selected, you should be able to see something like this:</p>
<p><img src="https://lh6.googleusercontent.com/-gzV8Vf_venI/UQey9O3vPFI/AAAAAAAABkM/q3KG3PpAF6s/s735/wekaexample02.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>If you click on the name of the filter, you will get a lot of options, which I leave for another post. For my goals in this one, you can just apply this filter with the default options to get an indexed collection of 200 messages and 1,382 indexing tokens (plus the class attribute), shown in the next picture:</p>
<p><img src="https://lh5.googleusercontent.com/-t09zkp9O55c/UQey9LSa0pI/AAAAAAAABkM/CwWsNKVkvI0/s735/wekaexample03.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>If you want to see colors showing the distribution of attributes (tokens) according to the class, you can just select the "class" attribute as the class for the collection in the bottom-left area of the WEKA Explorer. So, you can see that the attribute "Available" occurs just in one message, which happens to be a legitimate (ham) one:</p>
<p><img src="https://lh3.googleusercontent.com/-35XJu0ccyLs/UQey955xiLI/AAAAAAAABkM/QAikmabxlU0/s735/wekaexample04.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>Now, we can make our experiments in the Classify tab. We can just select cross-validation using 3 folds (1), point to the appropriate attribute to be used as a class (which is the "spamclass" one) (2), and select a rule learner like <a href="http://weka.sourceforge.net/doc/weka/classifiers/rules/PART.html" target="_blank">PART</a> in the classifier area (3). You can find that classifier at the "weka > classifiers > rules" folder when clicking on the "Choose" button at the "Classifier" area. This setup is shown in the next figure:</p>
<p><img src="https://lh4.googleusercontent.com/-7EPITpS_vNo/UQey-cGsD-I/AAAAAAAABkM/jOlj3LUM2OU/s735/wekaexample05.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 299px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="299" width="400"/></p>
<p>The selected evaluation method, <a href="http://en.wikipedia.org/wiki/Cross-validation" target="_blank">cross-validation</a>, instructs WEKA to divide the training collection into 3 sub-collections (folds) and perform three experiments. Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. The sub-collections are sampled randomly, so that each instance belongs to only one of them, and the class distribution (16.5% spam in our example) is kept inside each fold.</p>
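<p>The stratified sampling idea can be sketched as follows (my own illustration of the concept, not WEKA's actual code):</p>

```python
import random

def stratified_folds(labels, k=3, seed=42):
    # Assign each instance to exactly one of k folds, dealing the instances
    # of each class round-robin so every fold keeps the class distribution.
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

labels = ["spam"] * 33 + ["ham"] * 167
folds = stratified_folds(labels)
print([len(f) for f in folds])  # roughly equal fold sizes
print([sum(labels[i] == "spam" for i in f) for f in folds])  # 11 spam each
```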
<p>So, if we click on the "Start" button, we will get the output of our experiment, featuring the classifier learnt over the full collection, and the values for the typical accuracy metrics averaged over the three experiments, along with the confusion matrix. The classifier learnt over the full collection is the following one:</p>
<blockquote>
<p>PART decision list
<br/>
------------------</p>
<p>or <= 0 AND
<br/>
to <= 0 AND
<br/>
2 <= 0: ham (119.0/3.0)</p>
<p>£1000 <= 0 AND
<br/>
FREE <= 0 AND
<br/>
call <= 0 AND
<br/>
Reply <= 0 AND
<br/>
i <= 0 AND
<br/>
all <= 0 AND
<br/>
final <= 0 AND
<br/>
50 <= 0 AND
<br/>
mobile <= 0 AND
<br/>
ur <= 0 AND
<br/>
text <= 0: ham (26.0/2.0)</p>
<p>i <= 0 AND
<br/>
all <= 0: spam (30.0/3.0)</p>
<p>: ham (25.0/1.0)</p>
<p>Number of Rules : 4</p>
</blockquote>
<p>This notation can be read as:</p>
<blockquote>
<p><strong>if</strong> (("or" <strong>not in</strong> message) <strong>and</strong> ("to" <strong>not in</strong> message) <strong>and</strong> ("2" <strong>not in</strong> message)) <strong>then</strong> class(message) == ham
<br/>
...
<br/>
<strong>otherwise</strong> class(message) == ham</p>
</blockquote>
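<p>Written out in full, the decision list above is equivalent to the following function, where each rule fires only when the previous ones did not (tokens are tested for presence in the message):</p>

```python
def classify(tokens):
    # The PART decision list, rule by rule.
    if "or" not in tokens and "to" not in tokens and "2" not in tokens:
        return "ham"
    if ("£1000" not in tokens and "FREE" not in tokens
            and "call" not in tokens and "Reply" not in tokens
            and "i" not in tokens and "all" not in tokens
            and "final" not in tokens and "50" not in tokens
            and "mobile" not in tokens and "ur" not in tokens
            and "text" not in tokens):
        return "ham"
    if "i" not in tokens and "all" not in tokens:
        return "spam"
    return "ham"  # default rule

print(classify({"go", "home"}))                   # ham (first rule)
print(classify({"call", "to", "win", "mobile"}))  # spam (third rule)
```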
<p>And the confusion matrix is the next one:</p>
<blockquote>
<p>=== Confusion Matrix ===</p>
<p>a b <-- classified as
<br/>
17 16 | a = spam
<br/>
12 155 | b = ham</p>
</blockquote>
<p>This means that the PART learner gets 17+155 correct classifications and makes 12+16 mistakes, which leads to an accuracy of 86%.</p>
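<p>The accuracy figure follows directly from the matrix, as this quick sketch shows:</p>

```python
def accuracy(matrix):
    # Accuracy = correct classifications (the diagonal) over all instances.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Rows are actual classes (spam, ham); columns are predicted classes.
print(accuracy([[17, 16], [12, 155]]))  # 0.86
```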
<p style="TEXT-ALIGN: center"><strong><em>But we have done it wrong!</em></strong></p>
<p>Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training (making the learner try to generalize from a token that does not occur in the test collection). And when it is in the test fold, the learner should not even know about it! Moreover, what happens with attributes that are highly predictive for the full collection (according to their statistics when computing e.g. the <a href="http://en.wikipedia.org/wiki/Information_gain_in_decision_trees" target="_blank">Information Gain</a> metric)? They may have worse (or better) statistics when a subset of their occurrences is not seen, as those occurrences can be in the test collection!</p>
<p>The right way to perform a correct text classification experiment with cross-validation in WEKA is to feed the indexing process into the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that we index and train on every training subset of the cross-validation run. Thus, you have to use the <a href="http://weka.sourceforge.net/doc/weka/classifiers/meta/FilteredClassifier.html" target="_blank">FilteredClassifier</a> class provided by WEKA.</p>
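<p>The key point is that the vocabulary must be rebuilt from the training folds only, on every cross-validation run. A rough Python sketch of the idea (a plain illustration, not WEKA's FilteredClassifier):</p>

```python
def build_vocabulary(messages):
    # "Train" the filter: collect the tokens of the training messages only.
    return {tok for m in messages for tok in m.lower().split()}

def vectorize(messages, vocab):
    # Apply the trained filter: tokens unseen in training are simply dropped.
    return [{tok for tok in m.lower().split() if tok in vocab}
            for m in messages]

train_msgs = ["free entry to win", "see you later"]
test_msgs = ["win a free mobile"]  # "a" and "mobile" were never seen

vocab = build_vocabulary(train_msgs)  # fit on the training fold only
print(vectorize(test_msgs, vocab))    # only the known tokens survive
```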
<p>In fact, this is not that difficult. Let us go back to the original text collection, which features two attributes: the message (as a string) and the class. Then you can go to the Classify tab and choose the FilteredClassifier learner, which is available at "weka > classifiers > meta" and shown in the next picture:</p>
<p><img src="https://lh6.googleusercontent.com/-5IfFFabokhY/UQey-YCXzSI/AAAAAAAABkM/HtGGQzUMED4/s738/wekaexample06.PNG" style="TEXT-ALIGN: center; WIDTH: 400px; DISPLAY: block; HEIGHT: 298px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="298" width="400"/></p>
<p>Then you must choose the filter and the classifier you are going to apply to the collection, by clicking on the classifier name in the "Classifier" area. I choose StringToWordVector and PART with their default options:</p>
<p><img src="https://lh5.googleusercontent.com/-QtG8vqffTiA/UQey-5KnaXI/AAAAAAAABkM/4XSXu9Q_GJs/s465/wekaexample07.PNG" style="TEXT-ALIGN: center; WIDTH: 300px; DISPLAY: block; HEIGHT: 178px; MARGIN-LEFT: auto; MARGIN-RIGHT: auto" height="178" width="300"/></p>
<p>If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results:</p>
<pre>=== Confusion Matrix ===

  a   b   &lt;-- classified as
 13  20 |  a = spam
  7 160 |  b = ham</pre>
<p>This gives an accuracy of 86.5%, slightly better than the one obtained with the wrong setup. However, we catch four fewer spam messages, and the True Positive rate goes down from 0.515 to 0.394. This setup is more realistic, and it better mimics what happens in the real world, where we will find highly relevant but unseen events, and our statistics may change dramatically over time.</p>
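<p>The figures above can be checked with a little arithmetic. This sketch just recomputes accuracy and spam recall (the True Positive rate) from both confusion matrices:</p>

```python
def accuracy(tp, fn, fp, tn):
    """Proportion of correct classifications over all instances."""
    return (tp + tn) / (tp + fn + fp + tn)

def tp_rate(tp, fn):
    """Recall on the spam class: spam caught / total spam."""
    return tp / (tp + fn)

# Wrong setup: StringToWordVector applied before cross-validation.
wrong = dict(tp=17, fn=16, fp=12, tn=155)
# Right setup: FilteredClassifier, indexing inside each fold.
right = dict(tp=13, fn=20, fp=7, tn=160)

print(accuracy(**wrong), tp_rate(wrong["tp"], wrong["fn"]))  # 0.86, ~0.515
print(accuracy(**right), tp_rate(right["tp"], right["fn"]))  # 0.865, ~0.394
```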
<p>So now we can run our experiment safely, as no unseen events will be used in the classification. Moreover, if we apply any Information Theory-based filter, e.g. ranking the attributes according to their Information Gain value, the statistics will be correct, as they will be computed on the training set of each cross-validation run.</p>
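<p>For instance, the Information Gain of a single token can be computed from the counts of the training fold alone. Here is a minimal sketch of the generic textbook computation (not WEKA's implementation), on a made-up four-message fold:</p>

```python
from math import log2

def entropy(pos, neg):
    """Binary entropy of a class distribution given as counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            h -= p * log2(p)
    return h

def information_gain(docs, token):
    """IG of a token for a binary class, computed on the docs given
    (which should be the training fold only, never the full corpus)."""
    with_tok = [(t, lab) for t, lab in docs if token in t.split()]
    without = [(t, lab) for t, lab in docs if token not in t.split()]
    def counts(subset):
        pos = sum(1 for _, lab in subset if lab == "spam")
        return pos, len(subset) - pos
    total_h = entropy(*counts(docs))
    n = len(docs)
    cond_h = sum(len(s) / n * entropy(*counts(s)) for s in (with_tok, without))
    return total_h - cond_h

train = [("cheap pills", "spam"), ("buy pills now", "spam"),
         ("meeting at noon", "ham"), ("lunch at noon", "ham")]
print(information_gain(train, "pills"))  # 1.0: perfectly predictive here
print(information_gain(train, "cheap"))  # ~0.31: occurs in one spam only
```

Run it on a different fold and "pills" may score quite differently, which is precisely why the ranking must be recomputed per training fold.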
<p>Thanks for reading, and please feel free to leave a comment if you think I can improve this article!</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com23tag:blogger.com,1999:blog-36589303.post-29959290053824424322013-01-16T19:13:00.001+01:002013-05-02T09:49:11.731+02:00A note on WEKA limitations and big data<p style="TEXT-ALIGN: center"><img src="http://users.dsic.upv.es/~cferri/weka/weka.jpg" style="WIDTH: 283px; DISPLAY: inline; HEIGHT: 156px" height="30" width="28"/></p>
<p>I have loved <a href="http://en.wikipedia.org/wiki/Weka_(machine_learning)" target="_blank">WEKA</a> since it was first introduced to me by my friend <a href="http://orion.esp.uem.es/gsi/index.php/Enrique-Puertas.html" target="_blank">Enrique Puertas</a> back in 1999, when he used it to program a Usenet News client with spam filtering capabilities based on Machine Learning (what we would now call a <a href="http://en.wikipedia.org/wiki/Bayesian_spam_filtering" target="_blank">bayesian spam filter</a>). I was impressed by its flexibility and functionality, and by how easy it was to experiment with WEKA and use it in my Java programs. I quickly got familiar with it and used it for <a href="https://www.aclweb.org/anthology-new/W/W00/W00-0719.pdf">my very first experiments on spam filtering</a>.</p>
<p>Over the years, WEKA has been updated, gaining more algorithms and making some tasks easier for text miners. For instance, the <a href="http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html" target="_blank">StringToWordVector filter</a> lets you obtain a <a href="http://en.wikipedia.org/wiki/Vector_space_model" target="_blank">Vector Space Model</a> (or bag-of-words) representation of your problem texts, a task I had to do manually (with my own programs or scripts) at the beginning. Another example: the <a href="http://www.cs.waikato.ac.nz/ml/weka/arff.html">Sparse ARFF</a> format provides a compact representation of your word vectors, instead of thousands of attribute values per instance, most of them being "0" or "no". Moreover, WEKA has attracted so much attention that other platforms have integrated it (e.g. <a href="http://gate.ac.uk/" target="_blank">GATE</a>) or provided environments that wrap and augment its functionality (e.g. <a href="http://www.rapidminer.com/" target="_blank">RapidMiner</a>).</p>
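<p>As a toy illustration of the difference (hypothetical attribute names, not from any real collection), here is the same instance in dense and in sparse form; in the sparse form, only the non-zero attributes are listed as {index value} pairs, with 0-based indices:</p>

```text
% Toy ARFF with two word-count attributes plus the class
@relation toy

@attribute cheap numeric
@attribute pills numeric
@attribute class {ham,spam}

@data
% Dense instance: one value per attribute, mostly zeros for word vectors
0,2,spam
% Equivalent sparse instance: only non-zero attributes as {index value}
{1 2, 2 spam}
```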
<p>However, our needs as researchers have evolved as well. One of the most important issues now is data size. While working with average computers was enough in my early experiments, given the size of standard collections (<a href="http://qwone.com/~jason/20Newsgroups/" target="_blank">20 Newsgroups</a>, <a href="http://www.daviddlewis.com/resources/testcollections/reuters21578/" target="_blank">Reuters-21578</a>, <a href="http://csmining.org/index.php/ling-spam-datasets.html" target="_blank">LingSpam</a>, etc., all of the order of tens of thousands of instances), now that is nearly impossible. Most of my experiments involve from hundreds of thousands to millions of instances. In those cases, WEKA can take days for a single learn-and-test cycle, or it can simply run out of memory; and not just on an average machine, but even on a really big server!</p>
<p>So, what now?</p>
<p>Before dealing with this question, I must say that I have been a heavy user of the WEKA <em>command line</em> and the <em><a href="http://www.cse.yorku.ca/course_archive/2008-09/W/4412/ExplorerGuide.pdf" target="_blank">Explorer GUI</a></em>. However, I have never considered or used the WEKA <em><a href="http://www.cse.yorku.ca/course_archive/2006-07/W/4412/doc/weka/ExperimenterTutorial-3.5.5.pdf" target="_blank">Experimenter GUI</a></em>. I know from friends and from skimming the documentation that the Experimenter allows distributing experiments over a number of machines. However, if I am going to distribute my experiments, why not use newer, less ad hoc and WEKA-dependent technologies that are fully standard and supported by cloud providers? Why not take advantage of elastic cloud capabilities (grow and pay as you need)?</p>
<p>That said, and keeping up with the latest news and trends in data and text mining, I see two options:</p>
<ul>
<li><strong>Going for <a href="http://www.r-project.org/" target="_blank">R</a></strong>. This language/platform has grown incredibly in recent years, and it has quickly become a standard language for data mining, present in many curricula and often listed as an absolute requirement in data science job offers. There are nice books about it as well, like "<a href="http://shop.oreilly.com/product/0636920022008.do" target="_blank">R in a Nutshell</a>", and other influential books recommend or use it (like "<a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/" target="_blank">The Elements of Statistical Learning</a>"). R supports map-reduce algorithms over <a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> for distributed experiments with tons of data. And R interfaces with Java as well.</li>
<li><strong>Choosing <a href="http://mahout.apache.org/" target="_blank">Mahout</a></strong> (plus <strong><a href="http://lucene.apache.org/solr/" target="_blank">Lucene/SOLR</a></strong>). This platform is Java-based and tightly integrated with Hadoop, and it makes use of Lucene for text representation tasks; Lucene can be considered a standard for deploying search engines nowadays. There are good books on Mahout and Lucene/SOLR as well ("<a href="http://manning.com/owen/" target="_blank">Mahout in Action</a>", "<a href="http://www.manning.com/hatcher3/" target="_blank">Lucene in Action</a>", "<a href="http://www.packtpub.com/solr-3-1-enterprise-search-server-cookbook/book" target="_blank">Apache SOLR Cookbook</a>").</li>
</ul>
<p>Still, I do not feel either option is clearly better than the other. Both are challenging and appealing, and I have not made a decision yet. And I am willing to hear your opinion, of course.</p>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com2tag:blogger.com,1999:blog-36589303.post-66001133808034512002013-01-10T19:14:00.000+01:002013-10-08T20:20:36.574+02:00A list of datasets for opinion mining in Twitter<div style="TEXT-ALIGN: left" dir="ltr">
<div style="TEXT-ALIGN: left">In a recent thread at the <a href="http://tech.groups.yahoo.com/group/SentimentAI/" target="_blank">SentimentAI group (list)</a>, a number of links to datasets for training / testing opinion mining / sentiment classifiers over Twitter were contributed. I list them here in case somebody finds this information useful:</div>
<div style="TEXT-ALIGN: left">
<ul style="TEXT-ALIGN: left">
<li><a href="http://www.tweenator.com/index.php?page_id=8" target="_blank">Three datasets</a> provided by Hassan Saif, including an annotated subset of the <strong>Stanford Twitter Sentiment Corpus</strong>, and two for the specific topics of the <strong>Health Care Reform</strong> and the <strong>Obama-McCain Debate</strong>.</li>
<li>The <a href="http://help.sentiment140.com/for-students" target="_blank"><strong>Stanford Twitter Corpus</strong></a> itself, provided by Alec Go and others at <a href="http://www.sentiment140.com/" target="_blank">Sentiment140</a>. You can download the <a href="http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip" target="_blank">ST Corpus directly</a> (70Mb).</li>
<li>The <strong><a href="http://www.sananalytics.com/lab/twitter-sentiment/" target="_blank">Sanders Analytics Twitter Sentiment Corpus</a></strong>, provided by Niek Sanders.</li>
<li>The <strong><a href="http://nibir.me/projects/mejaj/datasets.html" target="_blank">mejaj datasets</a></strong>, provided by <a href="http://nibir.me/" target="_blank">Nibir Bora</a> and others.</li>
<li>The <strong><a href="http://www.cs.york.ac.uk/semeval-2013/task2/" target="_blank">SemEval-2013: Sentiment Analysis in Twitter</a></strong> evaluation campaign (or competition) dataset. <em>Note the competition is still active</em>, you can join it! Check the dates at the <a href="http://www.cs.york.ac.uk/semeval-2013/index.php?id=call-for-participation" target="_blank">SemEval-2013 website</a>.</li>
<li>The <a style="FONT-WEIGHT: bold" href="http://www.limosine-project.eu/events/replab2012#Profiling_task" target="_blank">RepLab 2012 Profiling task dataset</a>. The profiling task is a bit different from the standard sentiment classification task. For instance, factual tweets can imply bad reputation ("Lehmann Brothers goes bankrupt") and negative sentiment tweets can imply good reputation ("R.I.P. Michael Jackson. We'll miss you").</li>
<li><strong>UPDATE (8/10/2013)</strong>: Contributed by <a href="http://www.blogger.com/profile/12092678025880000860" target="_blank">Eugenio Martínez Cámara</a> (thanks!), the <a href="http://www.daedalus.es/TASS2013/corpus.php" target="_blank"><strong>Spanish-language dataset</strong></a> used in <a href="http://www.daedalus.es/TASS2013/about.php" target="_blank">the TASS workshop</a> organized at the annual meeting of the <a href="http://www.sepln.org/?lang=en" target="_blank">SEPLN</a>.</li>
</ul>
</div>
You can find the <a href="http://tech.groups.yahoo.com/group/SentimentAI/message/589" target="_blank">SentimentAI thread on Twitter datasets here</a>.</div>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com8tag:blogger.com,1999:blog-36589303.post-89621680054775658152013-01-08T16:55:00.000+01:002013-05-02T10:53:01.328+02:00Spam on LinkedIn, "Robin Sage" style<div style="TEXT-ALIGN: left" dir="ltr"><a href="http://es.linkedin.com/in/jmgomezh/">I myself</a>, and some of my contacts on <a href="http://www.linkedin.com/">LinkedIn</a>, have recently received a connection request from one "Elena Domínguez" (<a href="http://www.linkedin.com/pub/elena-domínguez/62/196/45">link</a>*). It is a somewhat strange profile: it is rather sparse (professional experience, education, etc.), yet it belongs to several engineering groups (she describes herself as an engineer) and has hundreds of highly heterogeneous ICT contacts. This is the profile image:
<br/>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHloVMxOfJCM4wMcKBCmvkKsk2CAyelOa3TuzvWF74UtVCsiIvzdI6LB-aNbsVXqj-tc-2B85rehGMZdgDtJpz1yOE6aALtc7EG-kvifqUBcDpnFoU_TULsEweauQbO7OwjXMDoA/s1600/Dibujo.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgHloVMxOfJCM4wMcKBCmvkKsk2CAyelOa3TuzvWF74UtVCsiIvzdI6LB-aNbsVXqj-tc-2B85rehGMZdgDtJpz1yOE6aALtc7EG-kvifqUBcDpnFoU_TULsEweauQbO7OwjXMDoA/s1600/Dibujo.bmp" height="266" border="0" width="320"/></a></div>
<br/>
If you accept this "person", within a few days (or even hours) you will receive an email inviting you to join the LinkedIn group "<strong>International Master's in Theoretical & Practical Application of Finite Element Method</strong>" (<a href="http://www.linkedin.com/groups?home=&gid=3808981&trk=anet_ug_hm&goback=.con.npv_221408693_*1_*1_name_DGj4_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1">link</a>*). Although the master's degree promoted through this LinkedIn group seems reasonably legitimate, both the profile and the group appear to be spam.
<br/>
<br/>
One thing that is particularly striking is that <strong>her profile photo</strong> is rather odd, "too clean", almost artificial. We get additional evidence of spam when we run a reverse image search on Google, using this picture as the query. First we get the URL of the image:
<br/>
<br/>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1MeI4h5QaQcvbWEdFT_ooMolPsuJWWezqTR6V2r_STFd6psJsIT8CovFoiEjk33x1-6MBSnDn4MI-P85NCl2R-j-czWDQTbE4cONUnz22sz1XsIm2NPPDp2A3se-OH4OcT3yAFw/s1600/Dibujo2.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi1MeI4h5QaQcvbWEdFT_ooMolPsuJWWezqTR6V2r_STFd6psJsIT8CovFoiEjk33x1-6MBSnDn4MI-P85NCl2R-j-czWDQTbE4cONUnz22sz1XsIm2NPPDp2A3se-OH4OcT3yAFw/s1600/Dibujo2.bmp" height="225" border="0" width="320"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator">Next, we search for the photo in Google Images, clicking on the camera button and entering the URL we obtained before:</div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator"><br/></div>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBtmBjAO4ErRNvfnsc5F6g2olnw02Qzb2H9xE46TbQ28w9Q4VbOrHAmKU2bhcxLxz44QkZXizqQGH4KkR5QRXk8N6NYVSBwao9OUwRGaEt32-lZQBXICsiFKczrd2yoYA3ulh8RQ/s1600/Dibujo3.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhBtmBjAO4ErRNvfnsc5F6g2olnw02Qzb2H9xE46TbQ28w9Q4VbOrHAmKU2bhcxLxz44QkZXizqQGH4KkR5QRXk8N6NYVSBwao9OUwRGaEt32-lZQBXICsiFKczrd2yoYA3ulh8RQ/s1600/Dibujo3.bmp" height="141" border="0" width="320"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator">And these are the results:</div>
<div style="TEXT-ALIGN: left; CLEAR: both" class="separator"><br/></div>
<br/>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><a style="MARGIN-LEFT: 1em; MARGIN-RIGHT: 1em" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVGtpZRED38wZiRzIdR9hsQaYutIMa69SKB47-_PamcY43pftZwe-bXJw8iK5wGhpzu1AoBjw-cwIXduOO4oue1WamnGN3HSws1659uqugNARPX_nU8zqrAUjA5J4VuC4r0St-wQ/s1600/Dibujo4.bmp"><img src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgVGtpZRED38wZiRzIdR9hsQaYutIMa69SKB47-_PamcY43pftZwe-bXJw8iK5wGhpzu1AoBjw-cwIXduOO4oue1WamnGN3HSws1659uqugNARPX_nU8zqrAUjA5J4VuC4r0St-wQ/s1600/Dibujo4.bmp" height="320" border="0" width="297"/></a></div>
<div style="TEXT-ALIGN: center; CLEAR: both" class="separator"><br/></div>
From these results we can deduce with fair certainty that the photo is a "stock" picture, that is, a catalogue one, and that it appears in several catalogues as a studio archive image of a businesswoman with a neutral expression. Using a photo like this for one's profile on a network like LinkedIn is possible, but rather unlikely.
<br/>
<br/>
I therefore consider this photograph strong evidence which, together with the behaviour of the "user" (sending the invitation email for a group so focused on a single educational product) and the remarkably high number of contacts for such a sparse profile, leads me to think that this is a spam profile, albeit a real one, in the sense that it is not a social engineering experiment like the one carried out by <a href="http://www.thomasryan.net/"><strong>Thomas Ryan</strong></a> with the "<a href="http://www.networkworld.com/news/2010/070810-the-robin-sage-experiment-fake.html"><strong>Robin Sage</strong></a>" profile.
<br/>
<br/>
In conclusion, I think that even LinkedIn, which is one of the networks least exploited for spam, will be increasingly invaded by this phenomenon, with ever greater levels of personalization and sophistication.
<br/>
<br/>
(*) I do not link the profile or group names in order not to generate web spam.</div>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-30837790760990870032012-12-04T15:16:00.004+01:002012-12-04T15:19:33.082+01:00Report on ERA Course: Fighting Child Pornography on the Internet<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-jZD6Qn-DhZQ/UL3zdc73yeI/AAAAAAAABbM/W0ZHb3awEXw/s1600/424898_4622121118371_313664852_n.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-jZD6Qn-DhZQ/UL3zdc73yeI/AAAAAAAABbM/W0ZHb3awEXw/s1600/424898_4622121118371_313664852_n.jpg" height="320" width="320" /></a></div>
<div style="text-align: center;">
<br /></div>
I have had the pleasure of attending as a student the <a href="http://www.era.int/">European Academy of Law</a> course on "<a href="https://www.era.int/cgi-bin/cms?_SID=6520b7451e95482bd8da749563e3306207b9af0900219030656915&_sprache=en&_bereich=artikel&_aktion=detail&idartikel=123272" target="_blank">Fighting Child Pornography on the Internet</a>", held in Madrid, 29-30 November 2012. I was supported by the Spanish child protection NGO <a href="http://www.protegeles.com/" target="_blank">Protégeles</a>, as I work with them whenever I can to support their mission.<br />
<br />
It was a nice course, with good coverage of topics, including legal aspects and technical issues, both from the view of prosecuting sex offenders and from that of Web filtering. The speakers were excellent and provided a lot of useful hints and links. I also crafted a hashtag for the event on Twitter (<a href="https://twitter.com/search?q=#ERAChildPornCourse&src=hash" target="_blank">#ERAChildPornCourse</a>), but I am afraid that neither attendees nor speakers were very keen on Twitter (with rare exceptions). I collected some comments during the event, organized by topic:<br />
<br />
<strong>Legal issues</strong><br />
<ul>
<li>Are media that do not involve real children child porn? </li>
<li>The Internet and digital cameras have led to an explosion of child porn, now a home industry </li>
<li>There is a thousand-year history of child porn (e.g. paintings), but cameras imply children are really abused to get it recorded </li>
<li>What does child porn possession mean? What about cloud drives? And streaming? </li>
<li>The Internet is world-wide, so who has jurisdiction? Should anybody have it? </li>
<li>Eurojust helps coordinate child porn prosecution; examples of operations: "lost boy", "nanny", "dreamboard" </li>
<li>The Lanzarote Convention says accessing a child porn site, knowing it hosts that stuff, is illegal </li>
<li>Providing lists of links to web sites hosting child porn is illegal under the Lanzarote Convention</li>
</ul>
<strong>Protection, prosecution, technical issues</strong><br />
<ul>
<li>For preparing cases against child porn, prosecutors check the nature of the material, offender involvement and the number of images </li>
<li>10% of all photographs ever taken were taken during the last year (note: all kinds of pictures) </li>
<li>Groomers and child sex offenders play "the jailbait game" on video chat sites </li>
<li>Youngsters are extremely vulnerable to grooming: they accept nearly all friendship requests and have 3-4k+ contacts </li>
<li>Hebephilia is the sexual preference for individuals in the early years of puberty (generally 11-14) </li>
<li>LEAs make use of a plethora of image analysis tools to process suspect pics; Microsoft PhotoDNA is just one tool in the box </li>
<li>About 20% of child porn material is delivered through commercial platforms </li>
<li>Project HAVEN aims at stopping child abuse by EU citizens in foreign countries (Asia, South America...) </li>
<li>Law Enforcement Agencies cooperate and share an international Child Abuse database </li>
<li>Law Enforcement Agencies (e.g. Europol) are getting more and more focused on victim identification </li>
<li>INHOPE has no authority to release block lists of child porn sites </li>
</ul>
An additional note: after hearing Interpol and Europol, one feels proud of having such great professionals working against child porn.<br />
<br />
All in all, it has been a great course, and I am very happy to have been able to attend it.</div>
Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com0tag:blogger.com,1999:blog-36589303.post-70500472085493868742012-05-17T12:31:00.000+02:002012-05-17T12:31:00.112+02:00Article in Novática: compromising the security of reCAPTCHA<p>In issue 215 of <a href="http://www.ati.es/novatica/" target="_blank">Novática</a> we have published an article on the use of several image normalization techniques and Google's Tesseract OCR to perform text recognition attacks on two versions of reCAPTCHA. The reference of the article is:</p>
<blockquote>
<p>Noemí Carranza, Ricardo Palma Durán, Gonzalo Álvarez Marañón, <em>José María Gómez Hidalgo</em>, 2012. <strong><a href="http://www.ati.es/novatica/2012/215/nv215sum.html#art43" target="_blank">Análisis de la seguridad del sistema reCAPTCHA</a></strong>. <a href="http://www.ati.es/novatica/" target="_blank">Revista Novática</a> 215, January-February 2012, pp. 43-48.</p>
</blockquote>
<p>The abstract of the article is the following:</p>
<blockquote>
<p>In recent times, CAPTCHA systems have become extraordinarily popular. They protect Web services by presenting the user with a test intended to verify that they are a human being and not a robot, that is, an automatic system for sending spam or spreading malware. These systems are permanently exposed to spammers and hackers managing to compromise their security and abuse the underlying resources (email accounts, blogs, etc.) to carry out their illicit activities. It is therefore necessary to check their security periodically, using tools such as optical character recognition (OCR) systems, image analysis systems, and others. In this article we analyze the security of the reCAPTCHA system, probably the most widely used on the Internet today. To do so, we apply several image analysis techniques aimed at correcting the deformations and distortions the system introduces in the images shown to the user, together with the effective Tesseract OCR system. Two versions of the reCAPTCHA system have been analyzed, and we have found that the security of the system has probably increased in the second, more recent version, although it is still possible to compromise it given sufficient resources in the form of a medium-sized botnet (about 10,000 computers).</p>
</blockquote>Jose Maria Gomez Hidalgohttp://www.blogger.com/profile/17053588779560658723noreply@blogger.com2