Compilation of Resources for Text-based Age Detection

Text-based age detection consists of estimating the age of a user from the kind of texts he or she writes. The task has attracted some attention in recent years because, for instance, it promises to supply one of the most interesting demographic features required in ad targeting. There is even an online application, TweetGenie, which guesses the age of a Twitter user -- it works for Dutch and English.

Text-based age detection is a text classification task closely related to others such as genre detection or authorship attribution, because it should rely on stylistic features (e.g. usage of capitalization, average word length, frequencies of prepositions, or even the usage of emoticons) rather than on content-bearing words (mostly nouns and verbs), as is the case in topical text categorization. However, this does not mean that purely word-based learning would not be effective.

A particular feature of this task is that it can be approached as classification if ages are divided into ranges, or as regression if we try to estimate the exact age of the user.

There is an ongoing scientific competition on this topic, namely the Author Profiling task at the 9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013). With this competition adding new text collections, we have the following resources for trying and testing our approaches to text-based age detection:

For your convenience, I summarize some statistics about the collections:

And some notes on the information available in each collection:

The following papers can be of interest in order to avoid repeating others' work.

Please feel free to send me a message or comment below if you find any other resource that I should add to this post. Thanks for reading.


Presentation: "Menores y móviles: Usos, riesgos y controles parentales" (Minors and Mobile Phones: Uses, Risks and Parental Controls)

On April 19 I gave a talk at the Universidad Europea de Madrid, titled "Menores y móviles: Usos, riesgos y controles parentales" (Minors and Mobile Phones: Uses, Risks and Parental Controls). The talk corresponds to research work I have carried out within the project "Protección de usuarios menores de edad de telefonía móvil inteligente" (Protection of Underage Smartphone Users), led by Joaquin Pérez and funded by the Universidad Europea de Madrid (P2012 UEM14).

The abstract of the talk is available on the page of the MAVIR network (MA2VICMR: Mejorando el Acceso, el Análisis y la Visibilidad de la Información y los Contenidos Multilingüe y Multimedia en Red para la Comunidad de Madrid), and the slides used during the talk are the following:


Language Identification as Text Classification with WEKA

Language Identification, which consists of guessing the natural language in which a text is written (or an utterance is spoken), is not one of the hardest problems in Natural Language Processing. For that reason, I believe it is a good starting point for learning about the text analysis capabilities available in WEKA.

Others have in fact tackled this problem, as in this tutorial on using LingPipe for Language Identification, or in Alejandro Nolla's post on Detecting Text Language With Python and NLTK. Moreover, you can find a large number of language identification programs, APIs and demos in the Wikipedia article on Language Identification. We may even consider this function a natural language commodity, as you can see from how Google Translate does it by default in the next figure:

The most typical (and rather simple) approach to Language Identification is to store a list of the most frequent character 3-grams in each language and to check how much the target text overlaps with each of the lists. Alternatively, you can use stop word lists. Of course, the accuracy depends on how you compute the overlap, but even simple distances can make it rather effective.
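To make the 3-gram idea concrete, here is a minimal sketch in plain Java (no WEKA involved). The class and method names are my own, the profiles are built from toy sentences instead of real frequency lists, and the overlap measure (share of the text's trigrams found in the profile) is just one simple choice among many.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of WEKA or LingPipe.
final class TrigramLangId {

    // Collect the distinct character 3-grams of a lowercased text.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Fraction of the text's trigrams that also appear in the language profile.
    static double overlap(Set<String> textGrams, Set<String> profile) {
        if (textGrams.isEmpty()) {
            return 0.0;
        }
        int shared = 0;
        for (String g : textGrams) {
            if (profile.contains(g)) {
                shared++;
            }
        }
        return (double) shared / textGrams.size();
    }

    // Return the tag of the profile with the highest overlap.
    static String guess(String text, Map<String, Set<String>> profiles) {
        Set<String> grams = trigrams(text);
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            double score = overlap(grams, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy profiles; real ones would keep the most frequent trigrams of a large corpus.
        Map<String, Set<String>> profiles = new LinkedHashMap<>();
        profiles.put("EN", trigrams("the cat is on the table and the dog is in the garden"));
        profiles.put("FR", trigrams("le chat est sur la table et le chien est dans le jardin"));
        profiles.put("SP", trigrams("el gato en la mesa y el perro en el parque"));
        System.out.println(guess("the officers quoted the police", profiles));
    }
}
```

With real data you would rank trigrams by frequency and keep only the top few hundred per language, which is where most of the accuracy comes from.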

However, I will not follow this approach here. Instead, I will show how to build a standard text classifier using WEKA in order to show the options of the StringToWordVector filter (and how to apply it), which is the main tool for text analysis in WEKA.

The steps we have to follow are:

  1. To collect data from different languages in order to build a basic dataset.
  2. To prepare the data for learning, which involves transforming it by using the StringToWordVector filter.
  3. To analyze the resulting dataset, and hopefully, to improve it by using attribute selection.
  4. To test over an independent test collection, which will give us a robust estimation of the accuracy of the approaches on real examples.
  5. To learn the most accurate model as obtained from the previous step, and to use it for our classification program.

So this will be a rather long post. Be prepared for it.

Collecting the data and Creating the Datasets

Following the LingPipe Language ID Tutorial, I collect the data from the Leipzig Corpora Home Page. In particular, I will address guessing among English (EN), French (FR) and Spanish (SP), so I have gone to the download page, completed the CAPTCHA to get the list of available corpora, and downloaded:

For your convenience, I have put these corpora in my LangID GitHub demo page. The files have the following format:

1 I didn't know it was police housing," officers quoted Tsuchida as saying.
2 You would be a great client for Southern Indiana Homeownership's credit counseling but you are saying to yourself "Oh, we can pay that off."
3 He believes the 21st century will be the "century of biology" just as the 20th century was the century of IT.

So I have loaded them into an OpenOffice spreadsheet and replaced the number column with the corresponding tags for the different languages: EN, FR, and SP. Then I have escaped the " and ' characters, because they are string delimiters in the WEKA Attribute-Relation File Format (ARFF). To build the datasets, I have split the data, keeping the first 9K sentences of each language for training and the remaining 1K for testing. As some learning algorithms may be sensitive to the instance order, I have interleaved the instances in batches of 1K texts, so the first 1K sentences are in English, the next 1K sentences are in French, and so on. The training data has the following header:

@relation langid_train

@attribute language_class {EN,FR,SP}
@attribute text String

@data
EN,'I didn\'t know it was police housing,\" officers quoted Tsuchida as saying.'
EN,'You would be a great client for Southern Indiana Homeownership\'s credit counseling but you are saying to yourself \"Oh, we can pay that off.\"'
EN,'He believes the 21st century will be the \"century of biology\" just as the 20th century was the century of IT.'
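If you prefer to skip the spreadsheet, the escaping step can be sketched in a few lines of Java. This is only an illustration with names of my own choosing; it also escapes backslashes, which is an assumption on my side and not something I did by hand for the files above.

```java
// Hypothetical helper for writing ARFF data rows; not part of WEKA.
final class ArffEscape {

    // Backslash-escape backslashes and both quote characters, then wrap in single quotes.
    static String quote(String text) {
        String escaped = text
                .replace("\\", "\\\\")
                .replace("'", "\\'")
                .replace("\"", "\\\"");
        return "'" + escaped + "'";
    }

    // Build one data row: class tag, comma, quoted text.
    static String dataRow(String languageTag, String sentence) {
        return languageTag + "," + quote(sentence);
    }

    public static void main(String[] args) {
        System.out.println(dataRow("EN",
                "I didn't know it was police housing,\" officers quoted Tsuchida as saying."));
    }
}
```

The backslash must be escaped first, otherwise the backslashes added for the quotes would be escaped again.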

The ARFF files for training and testing are available at the GitHub repository for the demo as well. If you open the training file (langid.collection.train.arff) in the WEKA Explorer and set the class to be the first attribute, you should get something like the following figure:

So we have a training collection with 9K instances per class (language), and a test collection with 1K instances per class.

Data Transformation

As in previous posts about text classification with WEKA, we need to transform the text strings into term vectors to enable learning. This is done by applying the StringToWordVector filter, which is the most remarkable text mining function in WEKA. In previous posts I have applied this filter with default options, but it offers a wide range of possibilities that can be seen when opening it in the WEKA Explorer. If you click on the Filter button, browse the tree to "weka > filters > unsupervised > attribute > StringToWordVector", and then click on the filter name, you get the next window:

Those are a lot of options, aren't they? So let us focus on the minimum set of options needed to be productive with this example of Language Identification. Those are:

  • doNotOperateOnPerClassBasis - we set this option to True in order to make the filter collect word tokens over the classes as a whole. This should be the standard setting in nearly all text classification problems.
  • lowerCaseTokens - we set this option to True because we are interested in the words regardless of whether they are written in upper or lower case. In other problems, e.g. when processing Social Networks text, keeping the capitalization may be critical for getting a good accuracy.
  • tokenizer - WEKA provides several tokenizers, intended to break the original texts into tokens according to a number of rules. The simplest tokenizer is the weka.core.tokenizers.WordTokenizer, which splits the string into tokens by using a list of separators that can be set by clicking on the tokenizer name. It is a good idea to take a look at the texts we have before setting up the list of separating characters. In our case, we have several languages and the default punctuation symbols may not fit our problem -- we need to add opening question and exclamation marks, as well as symbols coming from HTML markup like &, and a few others. So our delimiters string will be " \r\n\t.,;:\"\'()?!-¿¡+*&#$%\\/=<>[]_`@" (backslash is escaped).
  • wordsToKeep - we set this option to keep as many words as we can, to include the full vocabulary of the dataset. Any sufficiently large value works; the commands in this post use ten million.
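To get a feeling for what the lowercasing and the delimiter list do to a text, here is a small stand-alone sketch based on java.util.StringTokenizer. It only imitates the effect of these two options under my own class and method names; it is not the WEKA tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical stand-in for the tokenizer settings discussed above; not a WEKA class.
final class DelimiterTokenizer {

    // Same delimiter characters as in the post; \u00bf and \u00a1 are the opening
    // question and exclamation marks, written as escapes to keep the source ASCII-only.
    static final String DELIMITERS = " \r\n\t.,;:\"'()?!-\u00bf\u00a1+*&#$%\\/=<>[]_`@";

    // Lowercase the text and split it on any of the delimiter characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text.toLowerCase(Locale.ROOT), DELIMITERS);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("He believes the 21st century will be the \"century of biology\"."));
    }
}
```

Trying a few sentences from each corpus this way is a quick check that no unwanted symbol survives as part of a token.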

So we leave the rest of the options at their defaults. Most notably, we are not using sophisticated weighting schemes (like TF or TF.IDF), stop word lists or stemming. These options are very frequent in Information Retrieval systems like Apache Lucene/SOLR, and they often lead to nice accuracy improvements in search systems.

We need to have the same vocabulary both in the training and the testing datasets, so we can apply this filter in the command line by using the batch (-b) option:

$> java weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\"\\'()?!-¿¡+*&#$%\\\\/=<>[]_`@\"" -W 10000000 -b -i langid.collection.train.arff -o langid.collection.train.vector.arff -r langid.collection.test.arff -s langid.collection.test.vector.arff

The options -O, -L, -tokenizer and -W correspond to the options above. The delimiter string is escaped because it is included in the specification of the tokenizer. The resulting files are also in the GitHub repository for the LangID example, along with the script stwv.sh (String To Word Vector) which includes this command.

Data Analysis and Improvement

If we take a quick look at the terms or tokens we have got, e.g.:

@attribute archival numeric
@attribute archivarlos numeric
@attribute archivas numeric
@attribute archives numeric
@attribute archiving numeric
@attribute archivo numeric
@attribute archivos numeric

We can imagine that most of them will be useless for Language Identification. This motivates making a more precise analysis of the tokens by using some kind of quality metric, like Information Gain. In fact, I am applying the weka.filters.supervised.attribute.AttributeSelection filter as I did in my posts on selecting attributes by chaining filters and on command line functions for text mining. So I issue the following command:

$> java weka.filters.supervised.attribute.AttributeSelection -c 1 -E weka.attributeSelection.InfoGainAttributeEval -S "weka.attributeSelection.Ranker -T 0.0" -b -i langid.collection.train.vector.arff -o langid.collection.train.vector.ig0.arff -r langid.collection.test.vector.arff -s langid.collection.test.vector.ig0.arff

We apply the filter in batch mode as well, in order to get the same attributes both in the training and in the test collections. We also set up the first attribute as the class (with the option -c), and set the threshold for keeping attributes to 0.0 in the weka.attributeSelection.Ranker search method. This means that we will keep only those attributes with an Information Gain score over 0, and they will be sorted according to their score as well. This command is included in the asig.sh (Attribute Selection by Information Gain) script of the GitHub repository for the LangID example, along with the data files.

From the original 65,429 word attributes we got in the previous step, we have kept only 16,840 (about 25.74% of the original ones). We can be more aggressive by setting the threshold to a bigger value (e.g. 0.2).
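For readers who want to see what the Information Gain score measures, the following is a minimal sketch for a single word treated as present or absent. It is my own simplification with hypothetical names: the class distribution is given as counts per language, and I do not claim it reproduces WEKA's InfoGainAttributeEval step by step.

```java
// Hypothetical illustration of Information Gain for one binary (present/absent) term.
final class InfoGain {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Entropy, in bits, of a class distribution given as counts per class.
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * log2(p);
            }
        }
        return h;
    }

    // IG = H(class) - P(term) * H(class | term) - P(no term) * H(class | no term).
    // withTerm[i] and withoutTerm[i] count the texts of class i with and without the term.
    static double infoGain(int[] withTerm, int[] withoutTerm) {
        int[] all = new int[withTerm.length];
        int present = 0;
        int absent = 0;
        for (int i = 0; i < withTerm.length; i++) {
            all[i] = withTerm[i] + withoutTerm[i];
            present += withTerm[i];
            absent += withoutTerm[i];
        }
        int total = present + absent;
        if (total == 0) {
            return 0.0;
        }
        return entropy(all)
                - ((double) present / total) * entropy(withTerm)
                - ((double) absent / total) * entropy(withoutTerm);
    }

    public static void main(String[] args) {
        // Made-up counts for three classes (EN, FR, SP), only to exercise the function.
        System.out.println(infoGain(new int[] {900, 10, 5}, new int[] {100, 990, 995}));
    }
}
```

A word that appears equally often in all languages scores 0 and is dropped by the 0.0 threshold, while a word that occurs in only one language scores high.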

The first twenty attributes are the following:

As we can see, all of them are very frequent words (in each language) that would be present in the stop lists for them. In consequence, our "pure" data mining approach is quite close to the traditional one based on stop words.

It makes sense to learn a J48 tree to get an idea of the complexity of the term relations. The weka.classifiers.trees.J48 algorithm implements Quinlan's popular C4.5 learner, and as it outputs a decision tree, it can give us valuable insights into the term relations, e.g. which co-occurring terms are more predictive. We train that classifier on our new training dataset with the following command:

$> java weka.classifiers.trees.J48 -t langid.collection.train.vector.ig0.arff -no-cv

We get a quite complex decision tree populated with 273 nodes and 137 leaves. All the tests in the tree look like "word > 0" or "word <= 0". This means that the algorithm has learnt that only the occurrence of a word matters, not its weight. The root of the tree is, unsurprisingly, a test on "the", and the smallest side of the tree (its right-hand side, with "the > 0") is the following one:

the > 0
| de <= 0: EN (5945.0/8.0)
| de > 0
| | el <= 0
| | | and <= 0
| | | | for <= 0
| | | | | to <= 0: FR (24.0/3.0)
| | | | | to > 0: EN (2.0)
| | | | for > 0: EN (3.0)
| | | and > 0: EN (7.0)
| | el > 0: SP (3.0)

This means, for instance, that the word "the" is an excellent predictive feature: if it occurs in a text and the word "de" (from French or Spanish) does not, that text is most likely written in English (with an estimated likelihood of about 99.87% on the training collection, since only 8 of the 5,945 training texts reaching that leaf are not English). The overall accuracy of J48 over the training collection is 98.3963%.
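One nice property of decision trees is that they translate directly into code. As an illustration, the branch printed above can be written as nested tests in Java; the class and method names are mine, and the sketch deliberately covers only the side of the tree where "the" occurs.

```java
import java.util.Set;

// Hypothetical transcription of the "the > 0" branch of the J48 tree shown above.
final class TheSubtree {

    // tokens is the set of lowercased words occurring in the text.
    static String classify(Set<String> tokens) {
        if (!tokens.contains("the")) {
            throw new IllegalArgumentException("this sketch only covers texts containing \"the\"");
        }
        if (!tokens.contains("de")) {
            return "EN";
        }
        if (tokens.contains("el")) {
            return "SP";
        }
        if (tokens.contains("and") || tokens.contains("for")) {
            return "EN";
        }
        return tokens.contains("to") ? "EN" : "FR";
    }

    // Share of the training texts reaching a leaf that carry the leaf's label,
    // computed from the two numbers J48 prints in parentheses (reached/misclassified).
    static double leafPurity(double reached, double misclassified) {
        return (reached - misclassified) / reached;
    }

    public static void main(String[] args) {
        System.out.println(classify(Set.of("the", "century", "of", "biology")));
        System.out.printf("%.4f%n", leafPurity(5945.0, 8.0));
    }
}
```

The leafPurity helper shows where the likelihood quoted for the first leaf comes from.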

Training and then Evaluating on the Test Collection

Before we start training and evaluating, we have to decide which algorithms are most appropriate for the problem. In my experience with text learning, it is wise to test at least the following ones:

  • The Naive Bayes probabilistic approach, which is quick and gives good results on average text learning problems. In WEKA, it is implemented by the weka.classifiers.bayes.NaiveBayes class.
  • The rule learner PART, which induces a list of rules by learning partial decision trees. It is a symbolic algorithm that produces rules which can be very valuable as they are easy to understand. This algorithm is implemented by the weka.classifiers.rules.PART class.
  • Of course, the J48 algorithm because of its visualization capabilities.
  • The lazy learner k-Nearest Neighbors (kNN), which occasionally gives excellent results in text classification problems. The WEKA class that implements this algorithm is weka.classifiers.lazy.IBk.
  • The Support Vector Machines algorithm, which is probably the most effective on text classification problems because of its ability to focus on the most relevant examples in order to separate the classes. It is a very good learning algorithm for sparse datasets, and it is implemented in WEKA via the weka.classifiers.functions.SMO class or by the library LibSVM. I choose the Sequential Minimal Optimization implementation (SMO) embedded in WEKA.

Also, when Naive Bayes or J48 are effective, I usually get small to even big accuracy improvements by using boosting, implemented by the weka.classifiers.meta.AdaBoostM1 class in WEKA. Boosting takes a weak classifier as input and builds a classifier committee by iteratively training that weak learner on those dataset subsets on which the previous learners are not effective. In this case, I will not apply boosting because the weak learners already reach rather high levels of accuracy, and it is most likely that boosting will only achieve a marginal improvement (if any) at the cost of a much longer training time.

I have written a script named test.sh to execute all these algorithms with default options, available at the GitHub repository for the LangID demo. The results obtained by the algorithms are included in the repository as well, and summarized in the next table:

The different versions of the lazy algorithm kNN tested here appear to be very weak. We could likely improve their performance by changing the way the distance among examples is computed (from the Euclidean distance to one more appropriate for text, such as the cosine similarity), but their performance is so low that they are unlikely to score better than the rest of the algorithms.
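The difference between both measures is easy to see on sparse term vectors. The following sketch (plain Java, names of my own, vectors stored as index-to-weight maps) computes both; it is only meant to illustrate the idea and says nothing about how IBk computes distances internally.

```java
import java.util.Map;

// Hypothetical comparison of cosine similarity and Euclidean distance on sparse vectors.
final class CosineSimilarity {

    // Dot product over the indices present in both sparse vectors.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Cosine of the angle between the vectors; 0 when either vector is empty.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double normA = Math.sqrt(dot(a, a));
        double normB = Math.sqrt(dot(b, b));
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot(a, b) / (normA * normB);
    }

    // Euclidean distance, using |a - b|^2 = a.a + b.b - 2 a.b.
    static double euclidean(Map<Integer, Double> a, Map<Integer, Double> b) {
        return Math.sqrt(Math.max(0.0, dot(a, a) + dot(b, b) - 2.0 * dot(a, b)));
    }

    public static void main(String[] args) {
        // Same words, but the second text repeats each of them three times.
        Map<Integer, Double> shortText = Map.of(1, 1.0, 2, 1.0);
        Map<Integer, Double> longText = Map.of(1, 3.0, 2, 3.0);
        System.out.println(cosine(shortText, longText));
        System.out.println(euclidean(shortText, longText));
    }
}
```

Cosine similarity ignores text length, which is why it is usually preferred for texts: a short and a long text using the same words in the same proportions are treated as identical, while the Euclidean distance sets them apart.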

The top algorithms in this test are Naive Bayes and Support Vector Machines. There is a trade-off between both algorithms: SVMs are more effective (in fact, they are very effective) but they take quite a lot of time to train, while Naive Bayes is less effective but quicker to train. In terms of classification time, both algorithms are linear in the number of attributes.

Even though we have used a big number of attributes, there are some examples with rather weak representations. For instance, let us check the following instances or texts:

{58 1,94 1,313 1,1663 1}
{119 1,361 1,2644 1,16840 FR}
{2 1,16840 SP}

The first example has only four occurring words and the second only three, and the third example has only one word ("el"). The class value of the first text is EN: it is not shown because EN is the first value of the class attribute, and the sparse format used by WEKA in this example omits such values. The attribute numbers in the first two examples (58 or over) mean that their attributes are not the most informative ones, while in the third example we find a very informative word. If we apply a more aggressive selection using Information Gain, many examples will end up with null representations, which makes them fall into the most likely class. As the classes have a balanced distribution, the language chosen in that case will be EN, which is the default value for the class attribute.
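Since the sparse format can be confusing the first time you see it, here is a small sketch that reads one of these lines. The names are hypothetical, it splits naively on commas (so it would break on quoted string values containing commas), and it assumes that the class is the attribute with index 16840, as in the instances above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for one sparse ARFF data line such as "{2 1,16840 SP}".
final class SparseArffLine {

    // Parse "{index value,index value,...}" into an index -> value map.
    // Indices are 0-based; anything not listed holds the attribute's zero value.
    static Map<Integer, String> parse(String line) {
        String body = line.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the braces
        Map<Integer, String> values = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return values;
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+", 2);
            values.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return values;
    }

    // For a nominal attribute the omitted value is its first declared label.
    static String classOf(Map<Integer, String> values, int classIndex, String firstLabel) {
        return values.getOrDefault(classIndex, firstLabel);
    }

    public static void main(String[] args) {
        Map<Integer, String> first = parse("{58 1,94 1,313 1,1663 1}");
        System.out.println(classOf(first, 16840, "EN"));
    }
}
```

Reading the instances this way makes it clear why the first example shows no class at all.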

Learning the Best Classifier and Using it Programmatically

So after our experiments, we know the best classifier in our tests is SVMs. It is time to learn it and store the classifier in a file for further programmatic use. For this purpose, I have written a script that trains the classifier and stores the model in a file, using the following command-line call:

$> java weka.classifiers.meta.FilteredClassifier -t langid.collection.train.arff -c first -no-cv -d smo.model.dat -v -F "weka.filters.MultiFilter -F \"weka.filters.unsupervised.attribute.StringToWordVector -O -L -tokenizer \\\"weka.core.tokenizers.WordTokenizer -delimiters \\\\\\\" \\\\\\\r\\\\\\\n\\\\\\\t.,;:\\\\\\\\\\\\\\\"'()?!-¿¡+*&#$%/=<>[]_`@\\\\\\\"\\\" -W 10000000\" -F \"weka.filters.supervised.attribute.AttributeSelection -E weka.attributeSelection.InfoGainAttributeEval -S \\\"weka.attributeSelection.Ranker -T 0.0\\\"\"" -W weka.classifiers.functions.SMO

This call is rather painful because of the nested, and nested, and nested, and nested quotes. So I have pretty-printed it in the learn.sh script at the GitHub repository for the LangID example. For dealing with nested quotes, follow the advice in the Wikipedia article about nested quotation.

With this call, we have stored a model in the file smo.model.dat, which chains the StringToWordVector filter, the AttributeSelection filter, and an SMO classifier by using the weka.classifiers.meta.FilteredClassifier and the weka.filters.MultiFilter classes, as I have explained in the post on Command Line Functions for Text Mining in WEKA.

One good point of WEKA is that we can learn a model in the command line and use it in a program. I have modified the MyFilteredClassifier.java program I used in my post describing A Simple Text Classifier in Java with WEKA, and I have committed it to the GitHub repository with the name LanguageIdentifier.java. I have created three sample test files as well: test_en.txt, test_fr.txt and test_sp.txt. The program works as follows:

$> javac LanguageIdentifier.java

$> java LanguageIdentifier
Usage: java LanguageIdentifier <fileData> <fileModel>
$> java LanguageIdentifier test_en.txt smo.model.dat
===== Loaded text data: test_en.txt =====
This is a sample test for the language identifier demo.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
?,' This is a sample test for the language identifier demo.'
===== Classified instance =====
Class predicted: EN

$> java LanguageIdentifier test_fr.txt smo.model.dat
===== Loaded text data: test_fr.txt =====
Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
?,' Ceci est un test de l'échantillon pour la démonstration de l'identificateur de langue.'
===== Classified instance =====
Class predicted: FR

$> java LanguageIdentifier test_sp.txt smo.model.dat
===== Loaded text data: test_sp.txt =====
Esto es un texto de prueba para la demostración del identificador de idioma.
===== Loaded model: smo.model.dat =====
===== Instance created with reference dataset =====
@relation 'Test relation'
@attribute language_class {EN,FR,SP}
@attribute text string
?,' Esto es un texto de prueba para la demostración del identificador de idioma.'
===== Classified instance =====
Class predicted: SP

So the program is correct on the three examples. Remember that you have to learn the model before using the program. As a side note, as the program only uses a FilteredClassifier object, you can change the script to accommodate a different algorithm. For instance, you can just replace the text "weka.classifiers.functions.SMO" with "weka.classifiers.bayes.NaiveBayes" in the learn.sh script, and the program will work the same way -- but with a different model.

Concluding Remarks

While relatively simple, the Language Identification problem helps to identify the essential tasks we have to perform when building text classifiers with WEKA. It is a complete example in the sense that we have not only collected the dataset and learnt on it, but we have also dug a bit into the most suitable representation by playing with attribute selection and with a tentative classifier to visualize the data. It also demonstrates some basic configurations of the StringToWordVector filter, which is the most remarkable tool in WEKA for text mining.

If you have had the time to read this whole post, and even tried the program: thank you! I hope it has been a valuable time investment. I am tempted to suggest that you modify the dataset to include more languages, as the problem I have addressed is relatively simple -- only three, quite different languages.

Finally, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!


Mapping Vocabulary from Train to Test Datasets in WEKA Text Classifiers

There are several ways of evaluating a (text) classifier: cross validation, splitting your dataset into train and test subsets, or even evaluating the classifier on the training set itself (not recommended). I will not discuss the merits of each method; instead, I will focus on a train/test split evaluation.

When you start to work with your train and test text datasets, you have two labelled text collections, e.g. those I make available at my GitHub project: smsspam.small.train.arff and smsspam.small.test.arff. In this case, the two collections are a 50% split of my original simple collection smsspam.small.arff, which in turn is a subset of the original SMS Spam Collection. The files are formatted according to the WEKA ARFF:

@relation sms_test

@attribute spamclass {spam,ham}
@attribute text String

@data
ham,'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
spam,'Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C\'s apply 08452810075over18\'s'

That is, one text instance per line, the first attribute being the nominal class spam/ham, and the second attribute being the text itself.

In text classification, you have to transform this original representation into a vector of terms/words/stems/etc. in order to allow the classifier to learn expressions like: "if the word "win" occurs in a text, then classify it as spam". In other words, you have to represent your texts as feature vectors, where the features are words and the values are e.g. binary weights, TF weights, or TF.IDF weights. In fact, WEKA provides the handy StringToWordVector filter for this purpose (Thanks, WEKA!).

However, it is most likely that the vocabulary used in your training set and in your test set is not identical. For instance, if you directly apply the StringToWordVector filter to the previous files, you get slightly different results, summarized in the following table:

Obviously, to enable learning you have to ensure that the representation of both datasets is the same. For instance, imagine that the root of the decision tree you have learnt on your training collection poses a test on an attribute that does not exist on your test collection, then what happens?

Fortunately, WEKA provides at least three ways of getting the same vocabulary in your train and test subcollections. Here they are:

  1. Using a batch filter that takes both the training and the test collections at the same time, using the former to get the attributes and representing the latter with those attributes.
  2. Using a FilteredClassifier (which I have discussed in previous posts), which combines the filter and the classifier into a single classifier that takes the original class/text representation as input for both the training and the test sets.
  3. A more recent method, which consists of getting the representations separately and using an InputMappedClassifier that acts as a wrapper of an underlying classifier and tries to match attributes from the training collection to the corresponding ones of the test subset.
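The idea behind the first option (fix the vocabulary on the training texts, then express the test texts with exactly those attributes) can be sketched without WEKA in a few lines. The class and method names are hypothetical, and the whitespace tokenization and binary weights are simplifications of mine.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of mapping test texts onto a vocabulary fixed at training time.
final class SharedVocabulary {

    // Assign a column index to every distinct training token, in first-seen order.
    static Map<String, Integer> build(List<String> trainingTexts) {
        Map<String, Integer> vocabulary = new LinkedHashMap<>();
        for (String text : trainingTexts) {
            for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    vocabulary.putIfAbsent(token, vocabulary.size());
                }
            }
        }
        return vocabulary;
    }

    // Binary vector over the training vocabulary; tokens unseen in training are dropped.
    static int[] vectorize(String text, Map<String, Integer> vocabulary) {
        int[] vector = new int[vocabulary.size()];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                vector[index] = 1;
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = build(List.of("win a prize", "see you a bit later"));
        // "cash" never occurred in training, so it has no column in the test vector.
        System.out.println(java.util.Arrays.toString(vectorize("win cash later", vocabulary)));
    }
}
```

Both the training and the test vectors end up with the same columns in the same order, which is all the classifier needs.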

The first method is quite simple, and it just makes use of the -b option of the WEKA filters. The corresponding command line calls are the following:

$> java weka.filters.unsupervised.attribute.StringToWordVector -b -i smsspam.small.train.arff -o smsspam.small.train.vector.arff -r smsspam.small.test.arff -s smsspam.small.test.vector.arff
$> java weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
=== Confusion Matrix ===
a b <-- classified as
1 15 | a = spam
0 84 | b = ham

The second method, conveniently discussed in my previous post, can be applied with the following call:

$> java weka.classifiers.meta.FilteredClassifier -t smsspam.small.train.arff -T smsspam.small.test.arff -F weka.filters.unsupervised.attribute.StringToWordVector -W weka.classifiers.lazy.IBk -i -c first
=== Confusion Matrix ===
a b <-- classified as
1 15 | a = spam
0 84 | b = ham

As the previous results show, both methods achieve the same results. In this case, I have opted for using StringToWordVector without parameters (default tokenization, term weights, no stemming, etc.) with the relatively weak classifier IBk, which implements a k-Nearest-Neighbor learner: instead of building a model from the training collection, it searches for the training instance closest to the test instance (k is 1 by default) and assigns its class to the test instance.
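For those not familiar with lazy learners, the following sketch shows the nearest neighbor rule with k = 1 on dense vectors. It uses a plain squared Euclidean distance and names of my own; IBk offers more options (other values of k, distance weighting, etc.) that are not covered here.

```java
// Hypothetical 1-nearest-neighbor classifier over dense numeric vectors.
final class OneNearestNeighbor {

    // Squared Euclidean distance; the square root is not needed to find the minimum.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Return the label of the training vector closest to the query (ties keep the earliest).
    static String classify(double[][] train, String[] labels, double[] query) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < train.length; i++) {
            double distance = squaredDistance(train[i], query);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return labels[best];
    }

    public static void main(String[] args) {
        // Toy binary term vectors over three made-up words.
        double[][] train = {{1, 0, 0}, {0, 1, 1}};
        String[] labels = {"spam", "ham"};
        System.out.println(classify(train, labels, new double[] {1, 0, 1}));
    }
}
```

There is no training step at all: the whole cost is paid at classification time, when the query is compared with every stored instance.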

However, the third method achieves different results, as the mapping involves some attributes from the training collection disappearing, and ignoring new attributes in the test collection. It is called the following way:

$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.train.arff -o smsspam.small.train.vector.arff
$> java weka.filters.unsupervised.attribute.StringToWordVector -i smsspam.small.test.arff -o smsspam.small.test.vector.arff
$> java weka.classifiers.misc.InputMappedClassifier -W weka.classifiers.lazy.IBk -t smsspam.small.train.vector.arff -T smsspam.small.test.vector.arff -i -c first
Attribute mappings:
Model attributes Incoming attributes
------------------------------ ----------------
(nominal) spamclass --> 1 (nominal) spamclass
(numeric) #&gt --> 2 (numeric) #&gt
(numeric) $1 --> - missing (no match)
(numeric) &amp --> - missing (no match)
(numeric) &lt --> 6 (numeric) &lt
(numeric) *9 --> 7 (numeric) *9
(numeric) + --> - missing (no match)
(numeric) - --> 8 (numeric) -
=== Confusion Matrix ===
a b <-- classified as
2 14 | a = spam
1 83 | b = ham

In fact, this time we catch a bit more spam (2 out of 16 spam messages, instead of 1) at the cost of one false positive, so the overall accuracy is exactly the same: 85%. You can see how some of the attributes are missing (they do not occur in the test dataset), like "$1", "+", etc. This certainly affects the performance of the classifier, so beware.
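The figures above can be checked directly from the confusion matrices. Here is a minimal sketch with hypothetical names, reading the matrix as WEKA prints it (rows are the actual classes, columns the predicted ones).

```java
// Hypothetical helper for reading accuracy and per-class recall off a confusion matrix.
final class ConfusionMatrix {

    // Correctly classified instances (the diagonal) divided by all instances.
    static double accuracy(int[][] matrix) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                total += matrix[i][j];
                if (i == j) {
                    correct += matrix[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Share of the actual instances of a class that were predicted as that class.
    static double recall(int[][] matrix, int classIndex) {
        int rowTotal = 0;
        for (int value : matrix[classIndex]) {
            rowTotal += value;
        }
        return rowTotal == 0 ? 0.0 : (double) matrix[classIndex][classIndex] / rowTotal;
    }

    public static void main(String[] args) {
        int[][] batchFilter = {{1, 15}, {0, 84}};  // methods 1 and 2
        int[][] inputMapped = {{2, 14}, {1, 83}};  // method 3
        System.out.println(accuracy(batchFilter) + " " + recall(batchFilter, 0));
        System.out.println(accuracy(inputMapped) + " " + recall(inputMapped, 0));
    }
}
```

The same accuracy can hide quite different behaviour on the spam class, which is why it pays to look at the matrix and not only at the summary figure.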

Given these options, my recommendation is to use the first method, as it allows you to fully examine the representation of the datasets (term weight vectors) and it decouples filtering from training, which may be convenient in terms of efficiency.

Before ending this post, I have to thank Tiago Pasqualini Silva, Tiago Almeida and Igor Santos for our experiments with the SMS Spam Collection, and Tiago Pasqualini in particular because he showed me the InputMappedClassifier.

And last but not least, thanks for reading, and please feel free to leave a comment if you think I can improve this article, or if you have questions or suggestions for further articles on these topics!