Nihil Obstat: Compilation of Resources for Text-based Age Detection

Text-based age detection consists of estimating the age of a user from the kind of texts they write. The task has been attracting attention in recent years because, for instance, it promises to supply one of the most valuable demographic features for ad targeting. There is even an online application, TweetGenie, which guesses the age of a Twitter user; it works for Dutch and English.

Text-based age detection is a text classification task closely related to others such as genre detection and authorship attribution, in that it should rely on stylistic features (e.g. use of capitalization, average word length, frequencies of prepositions, or even the use of emoticons) rather than on content-bearing words (mostly nouns and verbs), as is done in topical text categorization. However, this does not mean that purely word-based learning would not be effective.
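To make the idea of stylistic features more concrete, here is a minimal sketch in Python that computes the four kinds of features mentioned above for a single text. The function name, the short preposition list and the emoticon pattern are my own illustrative assumptions; they are not taken from any of the systems or papers mentioned in this post, and a real system would need far better coverage.

```python
import re

# Illustrative and incomplete list of English prepositions (an assumption).
PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "with", "by", "from", "about"}

# Very rough pattern for a few western-style emoticons such as :) ;-( =D
# (an assumption; it can also match ordinary text like "time:Please").
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")


def stylistic_features(text):
    """Return a dict of simple style-based features for one text."""
    emoticon_count = len(EMOTICON_RE.findall(text))
    # Strip emoticons before tokenizing so that ":D" is not counted as the word "D".
    cleaned = EMOTICON_RE.sub(" ", text)
    words = re.findall(r"[A-Za-z']+", cleaned)
    letters = [c for c in cleaned if c.isalpha()]
    # Use 1 as the denominator for empty input to avoid dividing by zero.
    n_words = max(len(words), 1)
    n_letters = max(len(letters), 1)
    return {
        "capital_ratio": sum(c.isupper() for c in letters) / n_letters,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "preposition_ratio": sum(w.lower() in PREPOSITIONS for w in words) / n_words,
        "emoticons_per_word": emoticon_count / n_words,
    }


if __name__ == "__main__":
    print(stylistic_features("I am at home :)"))
```

The resulting dictionary can be turned into a feature vector for any standard learning algorithm, alone or alongside word-based features.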

A particular feature of this task is that it can be approached as classification if ages are divided into ranges, or as regression if we try to predict the exact age of the user.
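The difference between the two formulations lies only in how the target variable is built. The following Python sketch shows one way to do it; the range boundaries and labels are hypothetical and chosen just for illustration, so they should be replaced with the ranges actually used in the collection at hand.

```python
from bisect import bisect_right

# Hypothetical range boundaries (an assumption, not taken from any corpus):
# ages below 18, ages 18 to 29, and ages 30 and above.
BOUNDARIES = [18, 30]
LABELS = ["under-18", "18-29", "30-plus"]


def age_to_range(age):
    """Map an exact age to the label of the range it falls into."""
    return LABELS[bisect_right(BOUNDARIES, age)]


def build_targets(ages, as_classification=True):
    """Build the target list for either formulation of the task.

    Classification: one range label per author.
    Regression: the exact age as a float.
    """
    if as_classification:
        return [age_to_range(a) for a in ages]
    return [float(a) for a in ages]


if __name__ == "__main__":
    ages = [15, 25, 40]
    print(build_targets(ages))                           # range labels
    print(build_targets(ages, as_classification=False))  # exact ages
```

Note that a collection annotated only with age ranges supports the classification setting directly, while the regression setting needs the exact age of each author.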

There is an ongoing scientific competition on this topic, namely the Author Profiling task at the 9th evaluation lab on uncovering plagiarism, authorship, and social software misuse (PAN 2013). With this competition contributing new text collections, the following resources are available for developing and testing approaches to text-based age detection:

The PAN 2013 Training Corpus for the Author Profiling Task, consisting of a large number of posts and chats from three age ranges in Spanish and English.
The Blog Authorship Corpus, referenced in PAN, consisting of a large number of blog posts from three age ranges in English.
The NPS Chat Corpus, consisting of a relatively small number of chats from five age ranges in English (download it from the NLTK corpora page or purchase it from the LDC).

For your convenience, I summarize some statistics about the collections:

And some notes on the information available in each collection:

The following papers may be of interest in order to avoid repeating others' work:

J. Schler, M. Koppel, S. Argamon and J. Pennebaker (2006). Effects of Age and Gender on Blogging. Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs.
S. Argamon, M. Koppel, J. Pennebaker and J. Schler (2009). Automatically profiling the author of an anonymous text. Communications of the ACM 52(2), pp. 119-123.
M. Koppel, S. Argamon and A. Shimoni (2002). Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4), pp. 401-412.
Jenny K. Tam (2009). Detecting Age in Online Chat. Master's Thesis, Naval Postgraduate School.
Jane Lin (2007). Automatic Author Profiling of Online Chat Logs. Master's Thesis, Naval Postgraduate School.

Please feel free to send me a message or comment below if you find any other resource that I should add to this post. Thanks for reading.

23.5.13
