10.10.14

Carlos Laorden nominated for a "Born To Be Discovery" award for Negobot

Carlos Laorden, who holds a PhD in Information Systems from the Universidad de Deusto, and a colleague and friend from DeustoTech, has been nominated in the Science and Technology category of the "Born to be Discovery" awards for the anti-paedophile bot NEGOBOT. I have already voted for him. Will you?

21.5.14

WEKA Text Mining Trick: Copying Options from the Explorer to the Command Line

From previous posts (especially from Command Line Functions for Text Mining in WEKA), you may know that writing command-line calls to WEKA can be far from trivial, mostly because you may need to nest FilteredClassifier, MultiFilter, StringToWordVector, AttributeSelection and a classifier into a single command with plenty of options -- and nested strings with escaped characters.

For instance, consider the following need: I want to test the classifier J48 on the smsspam.small.arff file, which contains pairs of {class, text} lines. However, I want to:

  • Apply StringToWordVector with specific options: lowercased tokens, specific string delimiters, etc.
  • Get only those words with Information Gain over zero, which implies using the filter AttributeSelection with InfoGainAttributeEval and Ranker with threshold 0.0.
  • Make use of 10-fold cross validation, which implies using FilteredClassifier; and since I have two filters (StringToWordVector and AttributeSelection), I need to make use of MultiFilter as well.

With some experience, this is not too hard to do by hand. However, it is much easier to configure your test in the WEKA Explorer, make a quick test with a very small subset of your dataset, then copy the configuration to a text file and edit it to fully fit your needs. For this specific example, I start by loading the dataset at the Preprocess tab, and then I configure the classifier by:

  1. Choosing FilteredClassifier, and J48 as the classifier.
  2. Choosing MultiFilter as the filter, then deleting the default AllFilter and adding StringToWordVector and AttributeSelection filters to it.
  3. Editing the StringToWordVector filter to specify lowercased tokens, not to operate per class, and my list of delimiters.
  4. Editing the AttributeSelection filter to choose InfoGainAttributeEval as the evaluator, and Ranker with threshold 0.0 as the search method.

Here is a screenshot taken in the middle of the process, just when editing the StringToWordVector filter:

Then you can specify spamclass as the class and run it to get something like:

=== Run information ===
Scheme: weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2

Relation: sms_test
Instances: 200
Attributes: 2 spamclass text
Test mode: 10-fold cross-validation
(../..)
=== Confusion Matrix ===
a b <-- classified as
16 17 | a = spam
6 161 | b = ham

As you can see, the Scheme line gives us the exact command options we need to get that result! You can just copy and edit it (after saving the result buffer) to get what you want. Alternatively, you can right click on the command at the Explorer, like in the following picture:

In any case, you get the following messy thing:

weka.classifiers.meta.FilteredClassifier -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 100000 -prune-rate -1.0 -N 0 -L -stemmer weka.core.stemmers.NullStemmer -M 1 -O -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\" -F \"weka.filters.supervised.attribute.AttributeSelection -E \\\"weka.attributeSelection.InfoGainAttributeEval \\\" -S \\\"weka.attributeSelection.Ranker -T 0.0 -N -1\\\"\"" -W weka.classifiers.trees.J48 -- -C 0.25 -M 2

Then you can strip the options you do not need. For instance, some default options in StringToWordVector are -R first-last, -prune-rate -1.0, -N 0, the stemmer, etc. You can find out the default options by issuing the help command:

$>java weka.filters.unsupervised.attribute.StringToWordVector -h
Help requested.

Filter options:
-C
Output word counts rather than boolean word presence.
-R <index1,index2-index4,...>
Specify list of string attributes to convert to words (as weka Range).
(default: select all string attributes)
...

So after cleaning the default options (in all filters and the classifier), adding the dataset file and the class index (-t smsspam.small.arff -c 1), and with some pretty printing for clarity, you can easily build the following command:

java weka.classifiers.meta.FilteredClassifier
-c 1
-t smsspam.small.arff
-F "weka.filters.MultiFilter
-F \"weka.filters.unsupervised.attribute.StringToWordVector
-W 100000
-L
-O
-tokenizer \\\"weka.core.tokenizers.WordTokenizer
-delimiters \\\\\\\" \\\\\\\\r \\\\\\\\t.,;:\\\\\\\\\\\\\\\'\\\\\\\\\\\\\\\"()?!\\\\\\\\\\\\\\\%-/<>#@+*£&\\\\\\\"\\\"\"
-F \"weka.filters.supervised.attribute.AttributeSelection
-E \\\"weka.attributeSelection.InfoGainAttributeEval \\\"
-S \\\"weka.attributeSelection.Ranker -T 0.0 \\\"\""
-W weka.classifiers.trees.J48

So now you can change other parameters if you want, in order to test other text representations, classifiers, etc., without having to deal with escaping options, delimiters and the like.
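By the way, if you prefer to avoid the command-line escaping altogether, the same nesting can be set up programmatically through the WEKA Java API. Here is a minimal sketch of that configuration (the class name, the random seed and the simplified delimiter list are my own choices for illustration, not taken from the Explorer setup above):

import java.util.Random;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.MultiFilter;
import weka.filters.supervised.attribute.AttributeSelection;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NestedFiltersTest {
    public static void main(String[] args) throws Exception {
        // Load the dataset and set spamclass (the first attribute) as the class
        Instances data = DataSource.read("smsspam.small.arff");
        data.setClassIndex(0);

        // StringToWordVector: lowercased tokens, no per-class operation, custom delimiters
        StringToWordVector stwv = new StringToWordVector();
        stwv.setLowerCaseTokens(true);
        stwv.setDoNotOperateOnPerClassBasis(true);
        stwv.setWordsToKeep(100000);
        WordTokenizer tokenizer = new WordTokenizer();
        tokenizer.setDelimiters(" \r\t.,;:'\"()?!-/<>#@+*&");
        stwv.setTokenizer(tokenizer);

        // AttributeSelection: keep only words with Information Gain over zero
        AttributeSelection attSel = new AttributeSelection();
        attSel.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);
        attSel.setSearch(ranker);

        // Chain both filters with MultiFilter, then wrap everything in a FilteredClassifier
        MultiFilter multi = new MultiFilter();
        multi.setFilters(new Filter[] { stwv, attSel });
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(multi);
        fc.setClassifier(new J48());

        // 10-fold cross-validation, as in the Explorer
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(fc, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}

Note that wrapping the MultiFilter inside FilteredClassifier is what guarantees that both filters are re-applied within each cross-validation fold, exactly as the Explorer does.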

30.1.14

CFP: Sixth International Conference on Social Informatics

The Sixth International Conference on Social Informatics (SocInfo 2014) will take place in Barcelona, Spain, from November 10th to November 13th. The ultimate goal of Social Informatics is to create a better understanding of socially-centric platforms not just as a technology, but also as a set of social phenomena. To that end, the organizers are inviting interdisciplinary papers on applying information technology in the study of social phenomena, on applying social concepts in the design of information systems, on applying methods from the social sciences in the study of social computing and information systems, on applying computational algorithms to facilitate the study of social systems and human social dynamics, and on designing information and communication technologies that consider social context.

Important dates

  • Full paper submission: August 8, 2014 (23:59 Hawaii Standard Time)
  • Notification of acceptance: October 3, 2014
  • Submission of final version: October 10, 2014
  • Conference dates: November 10-13, 2014

Topics

  • New theories, methods and objectives in computational social science
  • Computational models of social phenomena and social simulation
  • Social behavior modeling
  • Social communities: discovery, evolution, analysis, and applications
  • Dynamics of social collaborative systems
  • Social network analysis and mining
  • Mining social big data
  • Social Influence and social contagion
  • Web mining and its social interpretations
  • Quantifying offline phenomena through online data
  • Rich representations of social ties
  • Security, privacy, trust, reputation, and incentive issues
  • Opinion mining and social media analytics
  • Credibility of online content
  • Algorithms and protocols inspired by human societies
  • Mechanisms for providing fairness in information systems
  • Social choice mechanisms in the e-society
  • Social applications of the semantic Web
  • Social system design and architectures
  • Virtual communities (e.g., open-source, multiplayer gaming, etc.)
  • Impact of technology on socio-economic, security, defense aspects
  • Real-time analysis or visualization of social phenomena and social graphs
  • Socio-economic systems and applications
  • Collective intelligence and social cognition

My friend Paolo Boldi is on the organizing committee.

23.8.13

Data Mining for Political Elections, and Isaac Asimov

Using Data Mining, Data Science and Big Data in political elections and political decision-making is cool. Well, I am not sure about cool, but it has been a trending topic in Data Science in recent years.

Here are some examples:

From the research point of view, you can check, for instance, how Twitter information is used in political campaigns in the Twitter and the Real World CIKM'13 Tutorial by Ingmar Weber and Yelena Mejova. It includes an interesting list of references on several ways of using Twitter to predict users' political orientation, general public trends, and more. On the other side, you can find an interesting paper which provides sound criticism of some of the research on Twitter and politics: "I Wanted to Predict Elections with Twitter and all I got was this Lousy Paper": A Balanced Survey on Election Prediction using Twitter Data, by Daniel Gayo-Avello.

Anyway, it should be clear from multiple points of view that governments (e.g. the NSA PRISM case) and politicians are collecting and using citizen data in order to predict their tastes and to guide their decisions and actions in political campaigns.

I will avoid the privacy discussion here, as I want to make a case for something different. My point is: hey, if they can predict election results, then why vote at all?

But my blog is not a political one; it is meant to be a technical one, or at least a technically-focused one. Like many computer geeks, I am a scifi fan, and since Isaac Asimov is one of the greatest scifi authors, I have read a lot of his work.

What does Asimov have to do with data mining in politics? Well, he predicted it.

More precisely, he predicted how elections may evolve in the Era of Big Data. And he answered my question: you will not vote.

Asimov used to publish short stories in scifi magazines (as many others did, I know). In August 1955, he published a short story titled "Franchise" in the magazine "If: Worlds of Science Fiction". I read that story many years later, reprinted in one of his short story collections. I was young, and I liked the story, but not too much; there were others in the volume more appealing to my taste. However, I have revisited it recently, and in the light of my technical background, things have changed.

That is real scifi. He technically predicted the future. And it is happening.

The plot is simple; just let me quote the Wikipedia article:

In the future, the United States has converted to an "electronic democracy" where the computer Multivac selects a single person to answer a number of questions. Multivac will then use the answers and other data to determine what the results of an election would be, avoiding the need for an actual election to be held.

As the Big Data platform (the computer Multivac in the story) gets to know more and more about the citizens, it needs less and less input to accurately predict election results. The problem is reduced to asking a list of (quite Sentiment Analysis related) questions to a single citizen, selected as representative, in order to refine some details, and that's it.

Do not blame him, nor me. It is just happening.

As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

Update 1: Yet another example: Twitter hashtags predict rising tension in Egypt.

27.7.13

More Clever Tokenization of Spanish Text in Social Networks

Text written by users in Social Networks is noisy: emoticons, chat codes, typos, grammar mistakes and, moreover, explicit noise created by users as a matter of style, trend or fashion. Consider the next utterance, taken from a post in the social network Tuenti:

"felicidadees!! k t lo pases muy bien!! =)Feeeliiciidaaadeeess !! (:Felicidadesss!!pasatelo genialll :DFeliicCiidaDesS! :D Q tte Lo0 paseS bN! ;) (heart)"

This is a real text. Its approximate translation to English would be something like:

"happybirthdaay!! njy it lot!! =)Haaapyyybirthdaaayyy !! (:Happybirthdayyy!!have a great timeee :DHappyyBiirtHdayY :D Enjy! ;) (heart)"

The last word, in parentheses, is a Tuenti code that is rendered as a heart.

If you want to find more text like this out there, just point your browser to Fotolog.

As you can imagine, just tokenizing this kind of text for further analysis is quite a headache. During our experiments for the project WENDY (link in Spanish), we have designed a relatively simple tokenization algorithm to deal with this kind of text for age prediction. Although the method is designed for the Spanish language, it is quite language-independent and may well be applied to other languages (not yet tested). The algorithm is the following (a minimal code sketch is given right after the list):

  1. Separate the initial string into candidate tokens using white spaces.
  2. A candidate token can be:
    1. A proper sequence of alphabetic characters (a potential word), or a proper sequence of punctuation symbols (a potential emoticon). In this case, the candidate token is already considered a token.
    2. A mixed sequence of alphabetic characters and punctuation symbols. In this case, the character sequence is further divided into sequences of alphabetic characters and sequences of punctuation symbols. For instance, "Hola:-)ketal" is further divided into "Hola", ":-)", and "ketal".
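
A minimal Java sketch of this algorithm could look as follows (the class name and the exact character classes are my own choices; the actual WENDY implementation may differ in details such as how accented characters or digits are handled):

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NoisyTokenizer {

    // A run of letters (a potential word) or a run of non-letter,
    // non-whitespace characters (a potential emoticon or chat code)
    private static final Pattern RUNS = Pattern.compile("\\p{L}+|[^\\p{L}\\s]+");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Step 1: separate the initial string into candidate tokens using white spaces
        for (String candidate : text.trim().split("\\s+")) {
            // Step 2: split each candidate into runs of alphabetic characters
            // and runs of punctuation symbols
            Matcher matcher = RUNS.matcher(candidate);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Hola, :-), ketal]
        System.out.println(tokenize("Hola:-)ketal"));
    }
}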

For instance, consider the next (real) text utterance:

"Felicidades LauraHey, felicidades! ^^felicidiadeees;DFelicidades!Un beso! FELIZIDADESS LAURIIIIIIIIIIIIII (LL)felicidadeeeeeees! :D jajaja mira mi tablonme meo jajajajajjajate quiero(:,"

The output of our algorithm is the list of tokens in the next table:

We have evaluated this algorithm both directly and indirectly. The direct evaluation consists of comparing how many hits we get with a space-only tokenizer and with our tokenizer, against both a Spanish dictionary and an SMS-language dictionary. The more hits you get, the better words are recognized. Per text utterance (comment), we find on average about 9.5 more words in the Spanish dictionary and about 1.13 more words in the SMS-language dictionary with our tokenizer.
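
As a toy illustration of this direct evaluation, the following sketch counts dictionary hits for a space-only tokenizer and for the NoisyTokenizer sketched above (the class name, the toy dictionary and the sample comment are mine, purely for illustration; the real evaluation used a full Spanish dictionary and an SMS-language dictionary):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DictionaryHitCounter {

    // Counts how many tokens of a comment are found in a dictionary
    public static int countHits(List<String> tokens, Set<String> dictionary) {
        int hits = 0;
        for (String token : tokens) {
            if (dictionary.contains(token.toLowerCase())) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Toy Spanish dictionary, just for the example
        Set<String> spanishDict = new HashSet<String>(
                Arrays.asList("hola", "que", "tal", "felicidades"));

        String comment = "Hola:-)ketal felicidades!!";
        List<String> spaceOnlyTokens = Arrays.asList(comment.split("\\s+"));
        List<String> improvedTokens = NoisyTokenizer.tokenize(comment);

        System.out.println("Space-only tokenizer hits: " + countHits(spaceOnlyTokens, spanishDict)); // 0
        System.out.println("Improved tokenizer hits: " + countHits(improvedTokens, spanishDict));    // 2
    }
}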

The indirect evaluation is performed by plugging the algorithm into the full pipeline of the WENDY age recognition system. The new tokenizer increases the accuracy of the age recognition system from 0.768 to 0.770, which may seem marginal except for the fact that it accounts for 206 new hits in our collection of Tuenti comments. The new tokenizer also provides relatively important increases in recall and precision for the most under-represented but most critical class, that is, users under 14.

This is the reference of the paper which details the tokenizer, the experiments, and the context of the WENDY project, in Spanish:

José María Gómez Hidalgo, Andrés Alfonso Caurcel Díaz, Yovan Iñiguez del Rio. Un método de análisis de lenguaje tipo SMS para el castellano. Linguamatica, Vol. 5, No. 1, pp. 31-39, July 2013.

If you are interested in the first steps of text analysis (tokenization, text normalization, POS tagging), then these two recent news items may be useful for you:

And you may want to take a look at my previous post on text normalization.

As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!

22.7.13

Negobot is in the news!

... And I must say, it is quite popular out there.

Negobot is a conversational agent posing as a 14-year-old girl, intended to detect paedophilic intentions and adapt to them. Negobot is based on Game Theory, and it is the result of an R&D project carried out by the DeustoTech Laboratory for Smartness, Semantics and Security (S3Lab) and Optenet. The members of the team are:

And myself. Its scientific approach is explained in the following paper:

Laorden, C., Galán-García, P., Santos, I., Sanz, B., Gómez Hidalgo, J.M., García Bringas, P., 2012. Negobot: A Conversational Agent Based on Game Theory for the Detection of Paedophile Behaviour. International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions, Advances in Intelligent Systems and Computing, Vol. 189, Springer Berlin Heidelberg, pp. 261-270. (preprint)

My friend and colleague Carlos Laorden was interviewed by the SINC Agency about the project some days ago, and the agency released a news story that quickly spread to a wide range of online and offline media: agencies, newspapers, radio stations, news aggregators, blogs, etc. Here is the original news story in Spanish:

Una 'Lolita' virtual a la caza de pederastas
SINC | 10 julio 2013 10:40

The news story featured a video with the interview to Carlos.

And in English, published by SINC at Alpha Galileo:

A virtual 'Lolita' on the hunt for paedophiles
10 de julio de 2013 Plataforma SINC

From there, to major English-language media:

Controversial 'Lolita' chatbot catches online predators
NBC News
'Virtual Lolita' aims to trap chatroom paedophiles
BBC News Technology
Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms
Huffington Post
Virtual Lolita poses as schoolgirl aged 14 to trap online paedophiles
The Independent
How 'Lolita style' virtual robots posing as teenage girls are being used to uncover paedophiles on social network sites
Daily Mail
'Virtual Lolita' created to trap paedophiles in online chatrooms
METRO

Major international blogs and news aggregators have also featured Negobot:

As of today, Negobot has got:

Negobot has obtained worldwide coverage in the news:

Argentine Republic
Crearon un programa informático para atrapar pedófilos en los chats y redes sociales
El Intransigente

Bosnia and Herzegovina
Sofisticirani robot "Negobot" služi da namami i otkrije pedofile
Vijesti

Commonwealth of Australia
Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
News Limited Network
Artificial intelligence poses as 14-year-old-girl to detect paedophiles in social chatrooms
Herald Sun, Melbourne

Czech Republic
"Wirtualna Lolita", czyli czatbot, który wskaże pedofilów
PEJ

French Republic
Negobot, l'adolescente virtuelle qui piège les pédophiles sur internet !
Marie Claire
Espagne : une lolita virtuelle traque les pédophiles sur Internet
Metro News
L'adolescente virtuelle qui traquait les pédophiles en ligne
Le Point

Hellenic Republic
Τεχνητή νοημοσύνη- «κυνηγός» παιδόφιλων στο Ίντερνετ
Naftemporiki

Italian Republic
Negobot, il software "Lolita" che individua i pedofili dialogando
Il Tempo
Negobot, la lolita virtuale che stana i pedofili in rete
La Republica
Negobot, la Lolita virtuale che incastra i pedofili in Rete
La Stampa

Kingdom of Spain
Negobot contra los pedófilos
ABC Tecnología
Negobot contra los pedófilos
La Información
Una 'Lolita' virtual a la caza de pederastas
Publico
Idean una lolita virtual para detectar pedófilos en la Red
La Voz de Galicia
Una 'Lolita' virtual para la caza de pederastas
El Correo Gallego
La trampa para los pederastas en la red
El Espectador
Nuevo sistema virtual a la caza de posibles pederastas
El Economista

Kingdom of Sweden
"Virtuell lolita" ska få fast pedofiler på nätet
Nyheter24

Malaysia
Robot Virtual Gadis Remaja Digunakan untuk Menjebak Pedofil
Pikiran Rakyat

Netherlands
Digitale pedolokker imiteert schoolmeisje
PCM

Oriental Republic of Uruguay
Desarrollan "Lolita virtual" para dar caza a pederastas y corruptores de menores
La Red 21

Portuguese Republic
A adolescente robótica caçadora de pedófilos
Hype Science

Republic of Austria
Negobot findet Pädophile
style.at Kurzmeldungen
"Negobot": Chatprogramm forscht Pädophile aus
Der Standard

Republic of Chile
Nuevo software permite detectar pedófilos en la red
24 Horas

Republic of Croatia
Napravljen robot koji pronalazi pedofile
Radio Sarajevo

Republic of India
A virtual Lolita on the hunt for paedophiles online
The Times of India

Republic of Kazakhstan
Negobot, 'Virtual Lolita,' Uses Game Theory To Bust Child Predators In Internet Chat Rooms
Safekaznet

Republic of Poland
Negobot sieciową pułapką na pedofilów
Autonom

Republic of Serbia
STOP PEDOFILIJI: Virtuelna Lolita kreće u lov na manijake!
Telegraf.rs
Virtuelna Lolita za lov na pedofile
Novosti

Romania
Robotul care pozeaza in pustoaica de 14 ani - da de gol pedofilii
Ziare

Russian Federation
Поиском педофилов в сети займется бот, выдающий себя за 14-летнюю
Корреспондент.net
Вычисление педофилов в интернете поручат чат-боту
LENTA

Socialist Republic of Vietnam
'Virtual Lolita' aims to trap chatroom paedophiles
Info VN

Swiss Confederation
Spagna: ecco Negobot, 14enne virtuale che scova i pedofili in rete
Ticino News

Ukraine
В іспанських інтернет-чатах підлітків від педофілів захищає Negobot
UBR

Carlos Laorden has also been interviewed by Spanish newspapers and radio stations:

And last but not least, Negobot has got some criticism in the form of a (quite funny) video.

You can keep tracking the coverage with Google Search, both on web pages and in the news.

Finally, sorry for the SSF, and thanks for reading.

8.7.13

Performance Analysis of N-Gram Tokenizer in WEKA

The goal of this post is to analyze the WEKA class NGramTokenizer in terms of performance, since it depends on the complexity of the regular expression used during the tokenization step. There is a potential trade-off between simpler regexes (which lead to more tokens) and more complex regexes (which take more time to evaluate). This post intends to provide experimental insights on this trade-off, in order to save your time when using this extremely useful class with the WEKA indexer StringToWordVector.

Motivation

The WEKA weka.core.tokenizers.NGramTokenizer class is responsible for tokenizing a text into pieces which, depending on the configured size, can be token unigrams, bigrams and so on. This class relies on the method String[] split(String regex) for splitting a text string into tokens, which are further combined into n-grams.

This method, in turn, depends on the complexity of the regular expression used to split the text. For instance, let us examine this simple example:

public class TextSplitTest {
    public static void main(String[] args) {
        // Split on any single non-word character
        String delimiters = "\\W";
        String s = "This is a text &$% string";
        System.out.println(s);
        String[] tokens = s.split(delimiters);
        // Print the number of tokens and each token between # marks
        System.out.println(tokens.length);
        for (int i = 0; i < tokens.length; ++i)
            System.out.println("#" + tokens[i] + "#");
    }
}

In this call to the split() method, we are using the regex "\\W", which treats any single non-word character (anything other than a letter, a digit or the underscore) as a delimiter. The output of this class execution is:

$> java TextSplitTest
This is a text &$% string
9
#This#
#is#
#a#
#text#
##
##
##
##
#string#

This is because every individual non-word character is a match, and there are five delimiter characters between "text" and "string". As a consequence, we find four empty (but not null) strings between those matches. If we use the regex "\\W+" as the delimiters string, which matches sequences of one or more non-word characters, we get the following output:

$> java TextSplitTest
This is a text &$% string
5
#This#
#is#
#a#
#text#
#string#

This is closer to what we expected in the first place.

When tokenizing a text, it seems wise to avoid computing empty strings as potential tokens, because we have to invest some time to discard them -- and we can have thousands of instances! On the other hand, it is clear that a more complex regular expression leads to more computation time. So there is a trade-off between using a one-character delimiter versus using a more sophisticated regex to avoid empty strings. To what extent does this trade-off impact the StringToWordVector/NGramTokenizer classes?

Experiment Setup

I ran these experiments on my laptop, with: CPU - Intel Core2 Duo, P8700 @ 2.53GHz; RAM: 2.90GB (1.59 GHz). For some of the tests, especially those involving a large number of n-grams, I had to make use of the -Xmx option in order to increase the heap space.

I am using the class IndexText.java available at my GitHub repository. I have commented out all the output to retain only the computation time of the method index(), which creates the tokenizer and the filter objects and performs the filtering process. This process actually indexes the documents, that is, it transforms the text strings in each instance into a dictionary-based representation -- each instance becomes a sparse list of pairs (token_number, weight), where the weight is binary-numeric. I have also modified the class to set lowercasing to false, in order to accumulate as many tokens as possible.

I have performed experiments on the following two collections: the SMS Spam Collection and the Reuters-21578 test collection.

I am comparing using the strings "\\W" and "\\W+" as delimiters in the NGramTokenizer instance of the index() method, for unigrams, uni-to-bigrams, and uni-to-trigrams. In the case of the SMS Spam Collection, I have divided the dataset into pieces of 20%, 40%, 60%, 80% and 100% in order to evaluate the effect of the collection size.

Finally, I have run the program 10 times per experiment, in order to average and get more stable results. All numbers are expressed in milliseconds.
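
For reference, here is a minimal sketch of what the timed index() step does (the class name, the loops and the timing harness are my own simplification; the numbers reported below were obtained with the original IndexText.java class, not with this sketch):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.tokenizers.NGramTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class NGramTimingTest {

    // Creates the tokenizer and the filter, indexes the collection,
    // and returns the elapsed time in milliseconds
    static long index(Instances data, String delimiters, int maxNGramSize) throws Exception {
        long start = System.currentTimeMillis();

        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setDelimiters(delimiters);
        tokenizer.setNGramMinSize(1);
        tokenizer.setNGramMaxSize(maxNGramSize);

        StringToWordVector filter = new StringToWordVector();
        filter.setTokenizer(tokenizer);
        filter.setLowerCaseTokens(false); // keep as many distinct tokens as possible
        filter.setInputFormat(data);
        Filter.useFilter(data, filter);   // the actual indexing step

        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read(args[0]);
        data.setClassIndex(0);
        for (String delimiters : new String[] { "\\W", "\\W+" }) {
            for (int n = 1; n <= 3; n++) {
                System.out.println(delimiters + ", max n-gram size " + n + ": "
                        + index(data, delimiters, n) + " ms");
            }
        }
    }
}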

Results and Analysis

We will examine first the results on the SMS Spam Collection. The results obtained for unigrams are the following:

It is a bar diagram showing the time in milliseconds for each collection size (20%, 40%, etc.). The results for uni-to-bigrams are:

And the results for uni-to-trigrams on the SMS Spam Collection are the following:

So the times for unigrams, uni-to-bigrams and uni-to-trigrams are exponentially higher (as can be expected). While on unigrams using the simple regex "\\W" is more efficient, the more sophisticated regex "\\W+" is more efficient for bigrams and trigrams. There is one anomalous point (at 60% on trigrams), but I believe it is an outlier. So it seems that the cost of using a more sophisticated regex does not pay off for unigrams, where matching this regex is more expensive than discarding the empty strings. However, it is the opposite in the case of uni-to-bigrams and uni-to-trigrams, where the empty strings seem to hurt the algorithm for building the bi- and trigrams.

The results on the Reuters-21578 collection are the following:

These results are fully aligned with those obtained on the SMS Spam Collection, with the difference being even larger in the case of uni-to-trigrams, as the number of distinct tokens in the Reuters-21578 test collection is much bigger (there are more texts, and they are longer).

All in all, the biggest performance gains we get are 4.59% on the SMS Spam Collection (uni-to-trigrams, 40% sub-collection) and 4.15% on the Reuters-21578 collection, which I consider marginal. So there is not a big difference between using these two regexes after all.

Conclusions

Regarding the potential trade-off between using a simple regular expression to recognize text tokens and using a more sophisticated regular expression to avoid spurious tokens in the WEKA indexer classes, my simple experiment shows that both approaches are more or less equivalent in terms of performance.

However, when using only unigrams, it is better to use the simple regular expression, because the extra time spent matching a more sophisticated regex does not pay off.

On the other hand, the algorithm for building bi- and trigrams seems to be sensitive to the empty strings generated by the simple regex, and you can get around a 4% performance increase by using the more sophisticated regular expression and avoiding those empty strings.

As always, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or you have questions or suggestions for further articles on these topics!