The PAN Labs (Uncovering Plagiarism, Authorship, and Social Software Misuse) are a series of scientific competitions held in recent years, focused on applying automated text analysis to plagiarism detection, authorship attribution, and related tasks. Unlike more traditional text classification tasks such as Text Categorization, these problems are modeled using style attributes (instead of content words), such as frequencies of particular syntactic tags, specific collocations, approximate string matching, etc. As in other scientific competitions, the organizers provide a labeled training set of texts so that participants can refine both the input/output format and their algorithms, and the participants are then required to run their software on a test set with unknown labels.
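To make the idea of style attributes more concrete, here is a minimal Python sketch using only the standard library. It is not the method of any PAN participant or organizer; the tiny function-word list, the feature names, and the helper functions are all hypothetical choices made for illustration. It computes relative frequencies of a few function words, counts the most frequent word bigrams as a crude stand-in for collocations, and uses `difflib.SequenceMatcher` for approximate string matching.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration); real stylometric systems typically use far larger lists.
FUNCTION_WORDS = ("the", "a", "of", "and", "to", "in", "i", "you")


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(text, top_bigrams=3):
    """Return function-word relative frequencies plus top bigram counts."""
    tokens = tokenize(text)
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)
    features = {f"fw:{w}": counts[w] / total for w in FUNCTION_WORDS}
    # Adjacent word pairs serve as a rough proxy for collocations.
    bigrams = Counter(zip(tokens, tokens[1:]))
    for (first, second), n in bigrams.most_common(top_bigrams):
        features[f"bigram:{first}_{second}"] = n
    return features


def approx_similarity(a, b):
    """Approximate string matching: similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


if __name__ == "__main__":
    print(style_features("I think you and I should talk, and you know it"))
    print(approx_similarity("wanna", "want to"))
```

Feature dictionaries like these could then be fed to any standard classifier; which features actually discriminate authors or predators is an empirical question that the training data would have to answer.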
The PAN 2012 Lab will be held in Rome in September in conjunction with the CLEF 2012 conference. It features three tasks, one of which is Author Identification. This task focuses on identifying sexual predators in chat logs and on authorship verification. The training data will be released on Mar 16, 2012.
For those willing to participate in this competition, I provide a series of resources that may help them.
First, the Perverted Justice website, run by the Perverted Justice Foundation Inc., features a large number of English-language chat logs from real sexual predators talking to volunteers posing as young girls. These archives are public and legal under US law. There is no danger in using them for research purposes.
Secondly, here is a list of papers that may be of interest to those preparing an algorithm or system for the PAN 2012 author identification task on detecting sexual predators. Not all of them deal with sexual predators; some address other child safety problems on the Internet, such as cyberbullying:
Myriam Munezero, Tuomo Kakkonen and Calkin Montero, "Towards automatic detection of antisocial behavior from texts", IJCNLP 2011 Proceedings of the Workshop on Sentiment Analysis where AI meets Psychology (SAAIP). http://www.ijcnlp2011.org/proceeding/workshop/WS8_SAAIP/SAAIP-2011.pdf
McGhee, India, Jennifer Bayzick, April Kontostathis, Lynne Edwards, Alexandra McBride, and Emma Jakubowski. (2011). Learning to Identify Internet Sexual Predation. International Journal of Electronic Commerce. Volume 15, Number 3. Spring 2011.
Karthik Dinakar, Birago Jones, Catherine Havasi, Henry Lieberman, Rosalind Picard, "TimeOut: Commonsense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying", ACM Transactions on Interactive Intelligent Systems, 2011, http://web.media.mit.edu/~lieber/Publications/Bullying-TiiS.pdf
Tibor Bosse and Sven Stam, "A Normative Agent System to Prevent Cyberbullying", In IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology 2011, http://www.cs.vu.nl/~tbosse/papers/IAT11-cyberbullying.pdf
Jennifer Bayzick, April Kontostathis and Lynne Edwards, "Detecting the Presence of Cyberbullying Using Computer Software", Poster presentation at WebSci11, June 14-17, 2011, Koblenz, Germany. http://www.websci11.org/fileadmin/websci/Posters/63_paper.pdf
Dinakar K., Reichart R., Lieberman H., "Modeling the detection of textual cyberbullying", International Conference on Weblog and Social Media - Social Mobile Web Workshop, Barcelona, Spain 2011. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/download/3841/4384
Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, and Yoshio Momouchi, "In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis", International Journal of Computational Linguistics Research, Vol. 1, Issue 3, pp. 135-154, 2010. http://arakilab.media.eng.hokudai.ac.jp/~ptaszynski/data/Ptaszynski_IJCLR2010-Cyberbullying_2011.02.23.pdf
Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka and Kenji Araki, "Machine Learning and Affect Analysis Against Cyber-Bullying", In Proceedings of The Thirty Sixth Annual Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour (AISB'10), 29th March - 1st April 2010, De Montfort University, Leicester, UK, pp. 7-16, 2010. http://arakilab.media.eng.hokudai.ac.jp/~ptaszynski/data/AISB2010_Cyberbullying_paper.pdf
Kontostathis, April, Lynne Edwards, and Amanda Leatherman. (2009). Text Mining and Cybercrime. In Text Mining: Applications and Theory. Michael W. Berry and Jacob Kogan, Eds., John Wiley & Sons, Ltd. 2009.
Kontostathis, April, Lynne Edwards, Jen Bayzick, India McGhee, Amanda Leatherman and Kristina Moore. (2009). Comparison of Rule-based to Human Analysis of Chat Logs. 1st International Workshop on Mining Social Media (MSM09). Seville, Spain. Nov 2009.
Kontostathis, April, Lynne Edwards, and Amanda Leatherman. (2009). ChatCoder: Toward the Tracking and Categorization of Internet Predators. In Proc. Text Mining Workshop 2009 held in conjunction with the Ninth SIAM International Conference on Data Mining (SDM 2009). Sparks, NV. May 2009.
D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards, "Detection of Harassment on Web 2.0", In CAW 2.0 '09: Proceedings of the 1st Content Analysis in Web 2.0 Workshop, Madrid, Spain, 2009. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.151.8839&rep=rep1&type=pdf