11.2.13

Text Mining in WEKA Revisited: Selecting Attributes by Chaining Filters

Two weeks ago, I wrote a post on how to chain filters and classifiers in WEKA, in order to avoid misleading results when performing experiments with text collections. The issue was that, when using N-fold cross-validation (CV) on your data, you should not apply the StringToWordVector (STWV) filter to the full collection and then run the CV evaluation, because each run would make use of words that occur in the test subset but not in the training subset. Moreover, the STWV filter can compute simple statistics to filter out terms (e.g. a minimum number of occurrences), but statistics computed over the full collection are not valid, because each CV run uses only a subset of it.

Now I would like to deal with a more general setting in which you want to apply dimensionality reduction because, in typical text classification tasks, the documents or examples are represented by hundreds (if not thousands) of tokens, which makes the classification problem very hard for many learners. In WEKA, this involves using the AttributeSelection filter along with the STWV one. Before applying any dimensionality reduction, though, we should reflect a bit on it.

Dimensionality reduction is a typical step in many data mining problems: it transforms our data representation (the schema of our table, the list of current attributes) into a shorter, more compact and, hopefully, more predictive one. Basically, this can be done in two ways:

  • With feature reduction, which maps the original representation (list of attributes) onto a new and more compact one. The new attributes are synthetic, that is, they somehow combine the information from subsets of the original attributes that share statistical properties. Typical feature reduction techniques include algebraic analysis methods like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD). In text analysis, the most popular method is, by far, Latent Semantic Analysis, which essentially applies SVD to the sparse term-to-document matrix in order to obtain its principal components or "topics".
  • With feature selection, which just selects a subset of the original attributes according to some quality metric from Information Theory, like Information Gain or X^2 (Chi-Square). This method can be far simpler and less time-consuming than the previous one, as you only have to compute the value of the metric for each attribute and rank the attributes. Then you simply decide on a threshold for the metric (e.g. 0 for Information Gain) and keep the attributes with a value over it. Alternatively, you can choose a percentage of the original attributes (e.g. 1% and 10% are typical numbers in text classification) and just keep the top-ranking ones. There are also more time-consuming alternatives, like exploring the predictive power of subsets of attributes using search algorithms. (A minimal code sketch of the ranking approach follows this list.)
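To make the ranking approach concrete, here is a minimal sketch of it using WEKA's Java API, assuming an already vectorized dataset stored in a hypothetical vectorized.arff file with the class as its first attribute:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainRanking {
    public static void main(String[] args) throws Exception {
        // Load an already indexed (StringToWordVector-ed) dataset
        Instances data = DataSource.read("vectorized.arff");
        data.setClassIndex(0);

        // Score every attribute (except the class) with Information Gain
        InfoGainAttributeEval ig = new InfoGainAttributeEval();
        ig.buildEvaluator(data);
        for (int i = 0; i < data.numAttributes(); i++) {
            if (i == data.classIndex()) continue;
            double score = ig.evaluateAttribute(i);
            // Keep only the attributes scoring over the chosen threshold (0 here)
            if (score > 0)
                System.out.println(data.attribute(i).name() + "\t" + score);
        }
    }
}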

A major difference between the two methods is that feature reduction produces synthetic attributes, while feature selection just keeps some of the original ones. This may affect the ability of the data scientist to understand the results, as synthetic attributes can be statistically relevant yet meaningless. Another difference is that feature reduction (as described above) does not make use of the class information, while feature selection does. In consequence, the latter is very likely to lead to a more predictive subset of attributes than the former. But beware: more theoretical predictive power does not always mean more effectiveness. I recommend reading the old (?) but always helpful paper by Yiming Yang & Jan Pedersen on the topic.

The WEKA package supports both methods, mainly through the weka.attributeSelection.PrincipalComponents (feature reduction) and weka.filters.supervised.attribute.AttributeSelection (feature selection) filters. But an important question is: do you really need dimensionality reduction in text analysis? There are two clear arguments against it:

  1. Some algorithms are not hurt by using all the features, even when there are very many of them and they are very sparse. For instance, Support Vector Machines excel in text classification problems precisely for this reason: they are able to deal with thousands of attributes, and they often get better results when no reduction is performed. A typical text classification problem in which dimensionality reduction can be a big mistake is spam filtering.
  2. If it is a matter of computing time, as with symbolic learners like decision trees (C4.5) or rule learners (Ripper), there is no need to worry: Big Data techniques come to help, as you can configure cheap and big clusters over e.g. Hadoop to perform your computations!

But having the algorithms in my favourite data analysis package, and knowing that they sometimes lead to effectiveness improvements, why not use them?

For the reasons above, I will focus on feature selection, and thus on the AttributeSelection filter, leaving the PrincipalComponents one for another post. Let us start with the same text collection that I used in my previous post about chaining filters and classifiers in WEKA: a small subset of the SMS Spam Collection, made with the first 200 messages for brevity and simplicity.

Our goal is to perform a 3-fold CV experiment with any algorithm in WEKA. In order to do it correctly, we know we must chain the STWV filter with the classifier by using the FilteredClassifier learner. However, we want to perform feature selection as well, and the FilteredClassifier only allows us to chain a single filter with a single classifier. So, how can we combine the STWV and AttributeSelection filters into a single one?

Let us start by doing it manually. After loading the dataset into the WEKA Explorer, applying the STWV filter with the default settings, and setting the class attribute to "spamclass", we get something like this:

Now we can either go to the "Select attributes" tab, or stay in the "Preprocess" tab and choose the AttributeSelection filter. I opt for the second way, so we browse the filter folders by clicking on the "Choose" button in the "Filter" area. After selecting "weka > filters > supervised > attribute > AttributeSelection", you can see the selected filter in the "Filter" area, as shown in the next picture:

In order to set up the filter, we can click on its name. The "weka.gui.GenericObjectEditor" window we get is a generic window that allows us to configure filters, classifiers, etc. according to a number of object-defined properties. In this case, it lets us set up the AttributeSelection filter configuration options, which are:

  • The evaluator, which is the quality metric used to evaluate the predictive properties of an attribute or a set of them. You can choose among a wide range of metrics (depending on your WEKA version), notably Chi Square (ChiSquaredAttributeEval), Information Gain (InfoGainAttributeEval), and Gain Ratio (GainRatioAttributeEval).
  • The search algorithm, which is the way the remaining group of attributes is selected. It includes very clever but time-consuming group search algorithms, and my favourite one, the Ranker (weka.attributeSelection.Ranker). The Ranker just ranks the attributes according to the chosen quality metric and keeps those meeting some criterion (e.g. having a value over a predefined threshold). (The same configuration can be written in code, as sketched right after this list.)
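For readers who prefer code over the GUI, the same configuration can be built programmatically; a minimal sketch of the filter wiring:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.filters.supervised.attribute.AttributeSelection;

// The filter version of attribute selection: an evaluator plus a search method
AttributeSelection attSel = new AttributeSelection();
attSel.setEvaluator(new InfoGainAttributeEval());
attSel.setSearch(new Ranker()); // default Ranker options; tuned below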

In the next picture, you can see the AttributeSelection configuration window with the evaluator set to Information Gain and the search set to Ranker, with the default options.

The Ranker search method has two main properties:

  • The numToSelect property, which defines the number of attributes to keep; an integer that is -1 (meaning all) by default.
  • The threshold property, which defines the minimum score an attribute has to get from the evaluator in order to be kept. The default value is the most negative double value representable in Java, so that no attribute is discarded by default.

In consequence, if we want to keep the attributes scoring over 0, we just have to write that number in the threshold field of the window we get when clicking on the Ranker in the previous window:
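In code, the same Ranker setup and the application of the whole filter would look like this sketch, assuming attSel from the sketch above and an STWV-processed dataset in a variable named data:

import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.filters.Filter;

// Keep every attribute whose Information Gain score is over 0
Ranker ranker = new Ranker();
ranker.setThreshold(0.0);
ranker.setNumToSelect(-1); // -1 means no fixed limit on the number kept
attSel.setSearch(ranker);

// Apply the configured filter to the vectorized collection
attSel.setInputFormat(data);
Instances reduced = Filter.useFilter(data, attSel);
System.out.println(reduced.numAttributes() + " attributes kept");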

By clicking OK in all the previous windows, we get a configuration of the AttributeSelection filter that keeps the attributes with an Information Gain score over 0. If we apply this filter to our collection, we get the following result:

As you can see, we get a ranked list of 82 attributes (plus the class one), in which the top-scoring attribute is the token "to". This attribute occurs in 69 messages (value 1), many of them spam, so it is quite predictive of this particular class. We can also see that we keep only 5.93% of the original attributes (82 out of 1,382).

Now we can go to the "Classify" tab and select the rule learner PART ("weka > classifiers > rules > PART"), to be evaluated on the training collection itself ("Test options" area, "Use training set" option), getting the next result:

We get an accuracy of 95.5%, much better than the results I reported in my previous post. Of course, these results are not comparable, because this quick experiment is a test on the training collection, not a 3-fold CV with the FilteredClassifier. But if we want to run a CV experiment, how do we do it, now that we have two filters instead of one in our setup?

What we need now is to start with the original text collection in ARFF format (no STWV yet), and to use the MultiFilter that WEKA provides for these situations. We load the original collection and go to the "Classify" tab. If we try to choose any classic learner (J48 for the C4.5 decision tree learner, SMO for Support Vector Machines, etc.), it will not be possible, because we have just one attribute (the text of the SMS messages) along with the class; but we can use the weka.classifiers.meta.FilteredClassifier. After selecting it, we will see something similar to the next picture:

If we click on the name of the classifier in the "Classifier" area and select weka.classifiers.rules.PART as the classifier (with default options), we get the next setup in the FilteredClassifier editor window:

Then we can choose weka.filters.MultiFilter in the filter area; it starts with a dummy AllFilter. Time to set up our filter combining STWV and AttributeSelection. We click on the filter name and get a new filter editor window with an area to define the filters to be applied. If we click on it, we get a new window that allows us to add, configure and delete filters. The selected filters are applied in the order we add them, so we start by deleting the AllFilter and adding a STWV filter with the default options, getting something similar to the next picture:

Filters are added by clicking on the "Choose" button to select them, and then on the "Add" button to append them to the list. We can now add the AttributeSelection filter with the Information Gain evaluator and the Ranker search with threshold 0, editing the filter by clicking on the "Edit" button with the AttributeSelection filter selected in the list. If you manually resize the window, you can see a setup similar to this one:

The setup is nearly finished. We close this window by clicking on the "X" button, and click on the "OK" buttons in the MultiFilter and FilteredClassifier windows. In the "Classify" tab of the Explorer, we select "Cross-validation" in the "Test options" area, entering 3 as the number of folds, and we select "spamclass" as the class attribute. Having done this, we just click on the "Start" button to get the next result:

So we get an accuracy of 83.5%, which is worse than the one we got without feature selection (86.5%). Oh oh, all this clever (?) setup to get a drop of 3 points in accuracy! :-(
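For completeness, the whole GUI setup above can be replicated with a few lines of Java; a minimal sketch, assuming the 200-message subset is stored in a hypothetical smsspam.small.arff file with the class as its first attribute:

import java.util.Random;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.PART;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ChainedFiltersExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("smsspam.small.arff");
        data.setClassIndex(0); // "spamclass"

        // Attribute selection: Information Gain scores, keep those over 0
        AttributeSelection attSel = new AttributeSelection();
        attSel.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);
        attSel.setSearch(ranker);

        // Combine STWV and AttributeSelection into a single filter
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] { new StringToWordVector(), attSel });

        // The FilteredClassifier re-applies the chain on the training
        // folds only, in every cross-validation run
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(multi);
        fc.setClassifier(new PART());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 3, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}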

But what happens if, instead of using a relatively weak learner for text problems like PART, we turn to Support Vector Machines? WEKA includes the weka.classifiers.functions.SMO classifier, which implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. If we choose this classifier with default options, we get quite different results:

  • Using only the STWV filter, we get an accuracy of 90.5%, with 18 spam messages classified as legitimate ("ham") and 1 false positive.
  • Using the MultiFilter with AttributeSelection in the same setup, we get an accuracy of 91%, with 16 spam messages classified as ham and 2 false positives.

So we get an accuracy improvement on a more accurate learner, which is nice. However, the difference is just 0.5% (1 message in our 200-instance collection), so it is modest. Moreover, we get one more false positive, which is bad for this particular problem: in spam filtering, a false positive (sending a legitimate message to the spam folder) is much worse than the opposite mistake, because the user risks missing an important message. Check my paper on cost-sensitive evaluation of spam filtering at ACM SAC 2002.
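Reproducing the SMO runs only requires changing the classifier in the sketch above; for instance:

import weka.classifiers.functions.SMO;

// Same filter chain, different learner: a support vector machine instead of PART
fc.setClassifier(new SMO());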

All in all, I hope this post shows the merits of feature selection in text classification problems, and how to perform it with my favourite library, WEKA. Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

29.1.13

Text Mining in WEKA: Chaining Filters and Classifiers

One of the most interesting features of WEKA is its flexibility for text classification. Over the years, I have had the chance to run a lot of experiments on text collections with WEKA, most of them in supervised tasks commonly referred to as Text Categorization, that is, classifying text segments (documents, paragraphs, collocations) into a set of predefined classes. Examples of Text Categorization tasks include assigning topic labels to news items, classifying email messages into folders, or, closer to my research, classifying messages as spam or not (Bayesian spam filters) and web pages as inappropriate or not (e.g. pornographic content vs. educational resources).

WEKA's support for Text Categorization is impressive. A prominent feature is that the package supports breaking text utterances into indexing terms (word stems, collocations) and assigning them weights in term vectors, a required step in nearly every text classification task. This tokenization and indexing process is achieved with a super-flexible filter named StringToWordVector. Let me show an example of how it works.

I will start with a simple text collection: a small sample of the publicly available SMS Spam Collection. Some colleagues and I built this collection for experimenting with Bayesian SMS spam filters; it contains 4,827 legitimate messages and 747 mobile spam messages, for a total of 5,574 short messages collected from several sources. I will use a small subset, made with the first 200 messages, in order to better illustrate my points in this post. This is what it looks like, formatted in the suitable WEKA ARFF format:

@relation sms_test

@attribute spamclass {spam,ham}
@attribute text String

@data
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
ham,'Ok lar... Joking wif u oni...'
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'
ham,'U dun say so early hor... U c already then say...'
ham,'Nah I don\'t think he goes to usf, he lives around here though'
spam,'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv'
...
ham,'Hi its Kate how is your evening? I hope i can see you tomorrow for a bit but i have to bloody babyjontet! Txt back if u can. :) xxx'

Of the first 200 messages of the collection, 33 are spam and 167 are legitimate ("ham"). This collection can be loaded in the WEKA Explorer, showing something similar to the following window:

The point is that the messages are represented as string attributes, so you have to break them into words in order to allow learning algorithms to induce classifiers with rules like:

if ("urgent" in message) then class(message) == spam

Here is where the StringToWordVector filter comes to help. You can select it by clicking the "Choose" button in the "Filter" area, and browsing the folders to "weka > filters > unsupervised > attribute". Once selected, you should see something like this:

If you click on the name of the filter, you will get a lot of options, which I leave for another post. For my goals in this post, you can just apply the filter with the default options to get an indexed collection of 200 messages and 1,382 indexing tokens (plus the class attribute), shown in the next picture:
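If you prefer to script this step instead of using the Explorer, a minimal sketch in Java (the file name is a placeholder for wherever you saved the ARFF above):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Load the raw SMS collection and set "spamclass" (first attribute) as the class
Instances raw = DataSource.read("smsspam.small.arff");
raw.setClassIndex(0);

// Tokenize the string attribute into word features, with default options
StringToWordVector stwv = new StringToWordVector();
stwv.setInputFormat(raw);
Instances vectorized = Filter.useFilter(raw, stwv);
System.out.println(vectorized.numAttributes() + " attributes (tokens plus class)");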

If you want to see colors showing the distribution of attributes (tokens) according to the class, just select the class attribute in the bottom-left area of the WEKA Explorer. For instance, you can see that the attribute "Available" occurs in just one message, which happens to be a legitimate (ham) one:

Now we can run our experiments in the "Classify" tab. We select cross-validation with 3 folds (1), point to the appropriate attribute to be used as the class, which is "spamclass" (2), and select a rule learner like PART in the classifier area (3). You can find that classifier in the "weka > classifiers > rules" folder when clicking on the "Choose" button in the "Classifier" area. This setup is shown in the next figure:

The selected evaluation method, cross-validation, instructs WEKA to divide the training collection into 3 sub-collections (folds) and perform three experiments. Each experiment uses two of the folds for training and the remaining one for testing the learnt classifier. The sub-collections are sampled randomly, in such a way that each instance belongs to exactly one of them, and the class distribution (16.5% spam in our example) is approximately preserved inside each fold.
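Programmatically, this experiment (which, as we will see shortly, is flawed) amounts to cross-validating PART on the already vectorized data from the sketch above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.PART;

// Cross-validate PART on data that was vectorized over the FULL collection
Evaluation eval = new Evaluation(vectorized);
eval.crossValidateModel(new PART(), vectorized, 3, new Random(1));
System.out.println(eval.toSummaryString());
System.out.println(eval.toMatrixString());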

So, if we click on the "Start" button, we will get the output of our experiment, featuring the classifier learnt over the full collection, and the values for the typical accuracy metrics averaged over the three experiments, along with the confusion matrix. The classifier learnt over the full collection is the following one:

PART decision list
------------------

or <= 0 AND
to <= 0 AND
2 <= 0: ham (119.0/3.0)

£1000 <= 0 AND
FREE <= 0 AND
call <= 0 AND
Reply <= 0 AND
i <= 0 AND
all <= 0 AND
final <= 0 AND
50 <= 0 AND
mobile <= 0 AND
ur <= 0 AND
text <= 0: ham (26.0/2.0)

i <= 0 AND
all <= 0: spam (30.0/3.0)

: ham (25.0/1.0)

Number of Rules : 4

This notation can be read as:

if (("or" not in message) and ("to" not in message) and ("2" not in message)) then class(message) == ham
...
otherwise class(message) == ham

And the confusion matrix is the next one:

=== Confusion Matrix ===

  a   b   <-- classified as
 17  16 |  a = spam
 12 155 |  b = ham

This means that the PART learner gets 17+155 = 172 correct classifications and makes 12+16 = 28 mistakes, which leads to an accuracy of 86%.

But we have done it wrong!

Do you remember the "Available" token, which occurs in only one of the messages? In which fold is it? When it is in a training fold, we are using it for training (making the learner try to generalize from a token that does not occur in the test fold). And when it is in the test fold, the learner should not even know about it! Moreover, what happens with attributes that are highly predictive over the full collection (according to their statistics when computing e.g. the Information Gain metric)? They may have worse (or better) statistics when a subset of their occurrences is not seen, because those occurrences are in the test fold!

The right way to perform a correct text classification experiment with cross-validation in WEKA is to feed the indexing process into the classifier itself, that is, to chain the indexing filter (StringToWordVector) and the learner, so that we index and train on the training folds only in every cross-validation run. For this, you have to use the FilteredClassifier class provided by WEKA.

In fact, this is not that difficult. Let us go back to the original text collection, which features two attributes: the message (as a string) and the class. Then go to the "Classify" tab and choose the FilteredClassifier learner, which is available at "weka > classifiers > meta" and shown in the next picture:

Then you must choose the filter and the classifier to apply to the collection, by clicking on the classifier name in the "Classifier" area. I choose StringToWordVector and PART with their default options:
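The equivalent Java sketch, starting from the raw (not yet vectorized) collection loaded earlier:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.rules.PART;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Chain the indexing filter and the learner, so that indexing
// is recomputed on the training folds of every CV run
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(new StringToWordVector());
fc.setClassifier(new PART());

// "raw" is the unfiltered string collection loaded in the sketch above
Evaluation eval = new Evaluation(raw);
eval.crossValidateModel(fc, raw, 3, new Random(1));
System.out.println(eval.toMatrixString());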

If we now run our experiment with 3-fold cross-validation and the filtered classifier we have just configured, we get different results:

=== Confusion Matrix ===

  a   b   <-- classified as
 13  20 |  a = spam
  7 160 |  b = ham

This yields an accuracy of 86.5%, a bit better than the one obtained with the wrong setup. However, we catch four fewer spam messages, and the True Positive ratio goes down from 0.515 to 0.394. This setup is more realistic and better mimics what happens in the real world, where we find highly relevant but unseen events, and our statistics may change dramatically over time.

So now we can run our experiment safely, as no unseen events will be used in classification. Moreover, if we apply any kind of Information Theory based filter, e.g. ranking the attributes according to their Information Gain value, the statistics will be correct, as they will be computed on the training folds of each cross-validation run.

Thanks for reading, and please feel free to leave a comment if you think I can improve this article!

16.1.13

A note on WEKA limitations and big data

I have loved WEKA since it was first introduced to me by my friend Enrique Puertas back in 1999, when he used it to program a Usenet news client with spam filtering capabilities based on Machine Learning (what we usually call a Bayesian spam filter now). I was impressed by its flexibility and functionality, and by the ease of experimenting with WEKA and using it in my Java programs. I quickly got familiar with it and used it for my very first experiments on spam filtering.

Over the years, WEKA has been updated, getting more algorithms and making some tasks easier for text miners. For instance, the StringToWordVector filter allows you to get a Vector Space Model (or bag-of-words) representation of your texts, a task that I had to do manually (with my own programs or scripts) at the beginning. Another example: the Sparse ARFF format provides a compact representation of your word vectors, instead of storing thousands of attribute values per instance, most of them being "0" or "no". Moreover, WEKA has attracted so much attention that other platforms have integrated it (e.g. GATE) or provided covering environments that augment its functionality (e.g. RapidMiner).
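To illustrate the savings, here is a toy example of the sparse notation, in which only non-zero values are listed along with their 0-based attribute indices. Note that an omitted entry defaults to zero, which for a nominal attribute means the first declared value (like the class "spam" here), so "ham" must always be written explicitly:

@relation sparse_example

@attribute spamclass {spam,ham}
@attribute free numeric
@attribute call numeric
@attribute meeting numeric

@data
% dense equivalent: spam,1,2,0
{0 spam, 1 1, 2 2}
% dense equivalent: ham,0,0,1
{0 ham, 3 1}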

However, our needs as researchers have evolved as well. One of the most important issues now is data size. Working with average computers was enough in my early experiments, given the size of the standard collections (20 Newsgroups, Reuters-21578, LingSpam, etc., all on the order of tens of thousands of instances), but now that is nearly impossible: most of my experiments involve from hundreds of thousands to millions of instances. In those cases, WEKA can spend days on a single learn-and-test cycle, or it can simply run out of memory, and not just on an average machine, but even on a really big server!

So now, what?

Before dealing with this question, I must say that I have been a heavy user of the WEKA command line and the Explorer GUI. However, I have never considered or used the WEKA Experimenter GUI. I know from friends and from skimming the documentation that the Experimenter allows you to distribute experiments over a number of machines. However, if I am going to distribute my experiments, why not use newer technologies (less ad hoc and WEKA-dependent) that are 100% standard and supported by cloud providers? Why not take advantage of elastic cloud capabilities (grow and pay as you need)?

Having said this, and keeping up with the latest news and trends in data and text mining, I see two options:

  • Going for R. This language/platform has grown incredibly in recent years, and it has quickly become a standard language for data mining, present in many curricula and often listed as an absolute requirement in data science job offers. There are nice books about it as well, like "R in a Nutshell", and other influential books recommend or use it (like "The Elements of Statistical Learning"). R supports MapReduce algorithms over Hadoop for distributed experiments with tons of data. And R interfaces with Java as well.
  • Choosing Mahout (plus Lucene/SOLR). This platform is Java-based and tightly integrated with Hadoop, and it makes use of Lucene for text representation tasks; Lucene can be considered a standard for deploying search engines nowadays. There are good books on Mahout and Lucene/SOLR as well ("Mahout in Action", "Lucene in Action", "Apache SOLR Cookbook").

But still, I do not feel either option is better than the other. Both are challenging and appealing, and I have not made a decision yet. I am willing to hear your opinion, of course.

10.1.13

A list of datasets for opinion mining in Twitter

In a recent thread at the SentimentAI group (list), a number of links to datasets for training and testing opinion mining / sentiment classifiers over Twitter were contributed. In case somebody finds this information useful: you can find the SentimentAI thread on Twitter datasets here.

8.1.13

Spam on LinkedIn, "Robin Sage" Style

I myself, and some of my LinkedIn contacts, have recently received a connection request from a certain "Elena Domínguez" (link*). It is a somewhat strange profile: it is rather sparse on detail (professional experience, education, etc.), yet it belongs to several engineering groups (she describes herself as an engineer) and has hundreds of extremely heterogeneous contacts from ICT fields. This is the profile image:


If you accept this "person", within a few days (or hours) you will receive an email inviting you to join the LinkedIn group "International Master's in Theoretical & Practical Application of Finite Element Method" (link*). Although the master's degree promoted through this LinkedIn group seems reasonably legitimate, both the profile and the group appear to be spam.

One particularly striking thing is that the profile photo is rather odd, "too aseptic", almost artificial. We obtain additional evidence of spam when we run an image search on Google, using this picture as the query. First we get the URL of the image:



Next, we search for the photo in Google Images, clicking on the camera button and entering the URL we obtained before:



And the results are the following:



From these results, we can deduce with considerable certainty that the photo is a stock one, that is, a catalogue picture, and that it appears in several catalogues as a stock image of a businesswoman with a neutral expression, taken in a studio. Using a photo like this for a profile on a network like LinkedIn is possible, but rather unlikely.

Therefore, I consider this photograph strong evidence which, together with the behaviour of the "user" (sending the email inviting people to a group so focused on a single educational product) and the surprisingly high number of contacts for such a sparse profile, leads me to think that this is a spam profile, but a real one in the sense that it is not a social engineering experiment like the one Thomas Ryan performed with the "Robin Sage" profile.

In conclusion, I think that even LinkedIn, which is one of the networks least exploited for spam, will be increasingly invaded by this phenomenon, with ever higher levels of personalization and sophistication.

(*) I do not link the profile or group names, in order to avoid generating web spam.

4.12.12

Report on ERA Course: Fighting Child Pornography on the Internet



I have had the pleasure of attending, as a student, the European Academy of Law course on "Fighting Child Pornography on the Internet", in Madrid, 29-30 November 2012. I was supported by the Spanish child protection NGO Protégeles, as I work with them whenever I can in order to push their mission forward.

It was a nice course, with good coverage of topics, including legal aspects and technical issues, both from the perspective of prosecuting sex offenders and from that of Web filtering. The speakers were excellent and provided a lot of useful hints and links. I also crafted a backlog hashtag for the event on Twitter (#ERAChildPornCourse), but I am afraid that neither attendees nor speakers were very keen on Twitter (with scarce exceptions). I collected some comments during the event, organized by topic:

Legal issues
  • Are media types that do not involve real children child porn?
  • The Internet and digital cameras have led to an explosion of child porn, now a home industry
  • There is a thousand-year history of child porn (e.g. paintings), but cameras imply children are really abused to get it recorded
  • What does child porn possession mean? What about cloud drives? And streaming?
  • The Internet is world-wide, so who has jurisdiction? Should anybody have it?
  • Eurojust helps coordinate child porn prosecution; examples of operations: "Lost Boy", "Nanny", "Dreamboard"
  • The Lanzarote Convention says that accessing a child porn site, knowing it hosts that material, is illegal
  • Providing lists of links to web sites hosting child porn is illegal under the Lanzarote Convention
Protection, prosecution, technical issues
  • When preparing cases against child porn, prosecutors check the nature of the material, offender involvement and the number of images
  • 10% of all photographs ever taken were taken during the last year (note: all kinds of pictures)
  • Groomers and child sex offenders play "the jailbait game" on video chat sites
  • Youngsters are extremely vulnerable to grooming: they accept nearly all friendship requests and have 3-4k+ contacts
  • Hebephilia is the sexual preference for individuals in the early years of puberty (generally 11-14)
  • LEAs make use of a plethora of image analysis tools to process suspect pictures; Microsoft PhotoDNA is just one in the box
  • About 20% of child porn material is delivered through commercial platforms
  • Project HAVEN aims at stopping child abuse by EU citizens in foreign countries (Asia, South America...)
  • Law Enforcement Agencies cooperate and share an international child abuse database
  • Law Enforcement Agencies (e.g. Europol) are getting more and more focused on victim identification
  • INHOPE has no authority to release block lists of child porn sites
An additional observation: after hearing Interpol and Europol, one gets proud of having such great professionals working against child porn.

All in all, it was a great course and I am very happy to have been able to attend it.

17.5.12

Article in Novática: Compromising the Security of reCAPTCHA

In issue 215 of Novática we have published an article about using several image normalization techniques and Google's Tesseract OCR to perform text recognition attacks on two versions of reCAPTCHA. The reference of the article is:

Noemí Carranza, Ricardo Palma Durán, Gonzalo Álvarez Marañón, José María Gómez Hidalgo, 2012. Análisis de la seguridad del sistema reCAPTCHA. Revista Novática 215, January-February 2012, pp. 43-48.

The abstract of the article is the following:

CAPTCHA systems, which protect Web services by presenting the user with a test intended to verify that the user is a human being and not a robot (an automatic system for sending spam or spreading malware), have become extraordinarily popular in recent times. These systems are permanently exposed to spammers and hackers compromising their security and abusing the underlying resources (email accounts, blogs, etc.) for their illicit activities. It is therefore necessary to test their security periodically, using tools such as optical character recognition (OCR) systems, image analysis systems, and others. In this article we analyze the security of the reCAPTCHA system, which is probably the most widely used on the Internet today. To do so, we use several image analysis techniques aimed at correcting the deformations and distortions applied by the system to the images it shows to the user, together with the effective Tesseract OCR system. Two versions of the reCAPTCHA system have been analyzed, and we have verified that the security of the system has probably increased in the second, more recent version, although it is still possible to compromise it given sufficient resources in the form of a medium-sized botnet (some 10,000 computers).