New dataset released: SMS Spam Collection v.1

The SMS Spam Collection v.1 is a public set of labeled SMS messages collected for mobile phone spam research. It consists of a single dataset of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.
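
For readers who want to work with the ham/spam tags programmatically, here is a minimal Python sketch. It assumes the collection is distributed as a plain-text file with one message per line: the label (ham or spam) first, then a tab, then the raw message text. That layout is an assumption for illustration and is not stated in this announcement, so check the downloaded file first. The helper names parse_line, load_collection and label_counts are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Labels used by the collection, as described in the announcement.
VALID_LABELS = {"ham", "spam"}


def parse_line(line: str) -> Tuple[str, str]:
    """Split one 'label<TAB>message' line into (label, message).

    The tab-separated layout is an assumption, not a documented format.
    """
    label, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep or label not in VALID_LABELS:
        raise ValueError(f"unexpected line format: {line!r}")
    return label, text


def load_collection(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse every non-blank line; accepts an open file or a list of strings."""
    return [parse_line(line) for line in lines if line.strip()]


def label_counts(messages: List[Tuple[str, str]]) -> Counter:
    """Count how many messages carry each label."""
    return Counter(label for label, _ in messages)
```

Because load_collection accepts any iterable of lines, it can be passed an open file handle directly. The file name and its text encoding would also have to be confirmed against the actual download.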

The collection is free for all purposes, and it is publicly available at: http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

This corpus has been collected from sources on the Internet that are free or free for research, including the Grumbletext Web site, the NUS SMS Corpus, Caroline Tagg's PhD thesis, and a smaller previous collection (SMS Spam Corpus v.0.1: http://www.esp.uem.es/jmgomez/smsspamcorpus/, available for historical comparison).

A comprehensive study of this corpus can be found in the following paper, which reports statistics, analyses, and baseline results for several machine learning methods:

Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. Contributions to the study of SMS Spam Filtering: New Collection and Results. Proceedings of the 2011 ACM Symposium on Document Engineering (ACM DOCENG'11), Mountain View, CA, USA, 2011. (Accepted)