Compilation of Resources for Text-based Age Detection

Text-based age detection consists of estimating the age of a user from the kind of text he/she writes. The task has attracted some attention in recent years because, for instance, it promises to supply one of the most interesting demographic features required in ad targeting. There is even an online application, TweetGenie, which guesses the age of a Twitter user; it works for Dutch and English.

Text-based age detection is a text classification task closely related to others such as genre detection and authorship attribution, because it should rely on stylistic features (e.g. use of capitalization, average word length, frequencies of prepositions, or even the use of emoticons) rather than on content-bearing words (mostly nouns and verbs), as topical text categorization does. However, this does not mean that purely word-based learning would not be effective.
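To make the idea of stylistic features more concrete, here is a minimal Python sketch that computes the four examples mentioned above for a single text. The preposition list, the emoticon pattern, and the function name stylistic_features are my own assumptions for illustration; they are not taken from any of the collections or papers listed in this post, and a real system would need much richer resources.

```python
import re

# Assumption: a deliberately tiny list of English prepositions, for illustration only.
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from", "about"}

# Assumption: a simple pattern covering a few Western-style emoticons such as :) ;-) =(
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")

# Words are runs of letters, optionally with one internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def stylistic_features(text):
    """Return a dict of simple stylistic features for one text."""
    words = WORD_RE.findall(text)
    letters = [c for c in text if c.isalpha()]
    n_words = len(words) or 1      # avoid division by zero on empty input
    n_letters = len(letters) or 1
    return {
        # Share of letters that are upper case.
        "capital_ratio": sum(c.isupper() for c in letters) / n_letters,
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        # Share of words that appear in the preposition list.
        "preposition_ratio": sum(w.lower() in PREPOSITIONS for w in words) / n_words,
        # Raw number of emoticon matches.
        "emoticon_count": len(EMOTICON_RE.findall(text)),
    }


if __name__ == "__main__":
    print(stylistic_features("I am at HOME :) see you in town ;-)"))
```

The resulting dictionary can be fed to any learner as a feature vector, either on its own or next to word-based features.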

A particular feature of this task is that it can be approached as classification if ages are divided into ranges, or as regression if we try to predict the exact age of the user.
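The two formulations differ mainly in how the target is encoded and how predictions are scored. The following Python sketch shows one possible way to do both. The range boundaries and labels are hypothetical, chosen only for illustration, and do not correspond to any particular collection.

```python
import bisect

# Assumption: illustrative age boundaries; each value starts a new range.
BOUNDARIES = [20, 30, 40]
LABELS = ["under 20", "20-29", "30-39", "40+"]


def age_to_range(age):
    """Classification target: map an exact age to its range label."""
    return LABELS[bisect.bisect_right(BOUNDARIES, age)]


def accuracy(true_ages, predicted_labels):
    """Classification score: share of users placed in the correct range."""
    hits = sum(age_to_range(a) == p for a, p in zip(true_ages, predicted_labels))
    return hits / len(true_ages)


def mean_absolute_error(true_ages, predicted_ages):
    """Regression score: average distance in years from the exact age."""
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)


if __name__ == "__main__":
    ages = [19, 25, 40]
    print([age_to_range(a) for a in ages])
    print(accuracy(ages, ["under 20", "30-39", "40+"]))
    print(mean_absolute_error(ages, [21, 27, 35]))
```

Note the trade-off: with ranges, predicting 29 for a 30-year-old counts as a full error, while the regression score penalizes it by a single year.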

There is an ongoing scientific competition on this topic, namely the Author Profiling task at the 9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013). With this competition adding new text collections, we have the following resources for trying and testing our approaches to text-based age detection:

For your convenience, I summarize some statistics about the collections:

And some notes on the information available in each collection:

The following papers may be of interest, to avoid repeating others' work.

Please feel free to send me a message or leave a comment below if you find any other resource that I should add to this post. Thanks for reading.