8.5.08

The ECML/PKDD Discovery Challenge: blog spam detection

If I had been asked to pick the topic of the next ECML/PKDD Discovery Challenge, I would have chosen this one.

The guys organizing the challenge have access to data from Bibsonomy, a very interesting social networking site for sharing bookmarks and lists of literature. Like many other sites, it has caught the attention of spammers. According to the dataset statistics, spammers have passed over those strange things called BibTeX records and focused on tags and bookmarks instead.

  1. Number of legitimate tag assignments: 816,197 / Number of spam tag assignments: 13,258,759
  2. Number of legitimate bookmarks: 181,833 / Number of spam bookmarks: 2,059,991
  3. Number of legitimate BibTeX records: 219,417 / Number of spam BibTeX records: 716
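
To put those counts in perspective, here is a quick sketch (mine, not part of the challenge material) that turns each pair of numbers into a spam share. The names `counts` and `spam_share` are just illustrative; the figures are copied from the list above.

```python
# Counts copied from the dataset statistics listed above: (legitimate, spam).
counts = {
    "tag assignments": (816_197, 13_258_759),
    "bookmarks": (181_833, 2_059_991),
    "BibTeX records": (219_417, 716),
}


def spam_share(legitimate, spam):
    """Return the fraction of all items of one kind that are labelled spam."""
    return spam / (legitimate + spam)


# Spam share for each kind of item in the dataset.
shares = {kind: spam_share(legit, spam) for kind, (legit, spam) in counts.items()}

for kind, share in shares.items():
    print(f"{kind}: {share:.1%} spam")
```

More than nine out of ten tag assignments and bookmarks are spam, while spam BibTeX records stay well under 1%.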

Added to the Web Spam collections available from the AIRWeb series, these data make up a pretty interesting arena for those who like playing with data for security. Like me :-)

The challenge will be held in Antwerp, Belgium, on 15 Sept. 2008.