13.3.08

A note on blog comment spam

Blog comment spam plagues blogs in search of links that may increase the PageRank of a Web site, and thus its rank for popular queries on a search engine like Google. In the light of the recent study by Bowei Xi et al. on adversarial classification, there is a fundamental difference between email spam and blog comment spam. (I resist calling it blog spam, since blog spam, or splog, consists of building blogs automatically to get links. Wikipedia calls these Spam Blogs and the former Spam in Blogs; I do not like that naming.)

It is a matter of cost. In email spam, the adversary (the spammer) tries to disguise his message in order to get it past the spam filter. But the better the disguise, the harder it is for the message to deliver its payload. For instance, if a link is hidden in an image, the user cannot click on it, so fewer users will reach the target website and buy fake Cialis. A blog comment, however, is not intended for human reading but for getting a link. If it is made very similar to other comments (e.g. by automatically copying text from them), it is successfully disguised and still delivers the whole payload!

As a consequence, content-based filtering is much harder in the case of comment spam. An important point is that in email spam, the spammer has no access to the user's legitimate email (although he probably builds a sample using the Enron collection, messages from public lists, etc.). In comment spam he does: he is able to read the post and its related comments.
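
To make the point concrete, here is a minimal sketch in Python. It is only an illustration, not a real filter: the function names, the sample texts and the scoring are all made up for this example. It measures how much of a message's vocabulary also appears in the legitimate comments of a thread. A typical email-style pitch shares almost no words with the thread, while a comment stitched together from the thread's own words is indistinguishable by content alone.

```python
import re


def tokens(text):
    """Return the lowercased word tokens of a text."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_overlap(candidate, legitimate_texts):
    """Fraction of the candidate's tokens that also occur in legitimate text."""
    known = set()
    for text in legitimate_texts:
        known.update(tokens(text))
    cand = tokens(candidate)
    if not cand:
        return 0.0
    return sum(1 for t in cand if t in known) / len(cand)


# Hypothetical thread that the spammer is able to read.
thread = [
    "Great post, I had the same problem with my spam filter last week.",
    "Thanks for the analysis, the point about cost is very clear.",
]

# Email-style spam: the payload has to be readable, so its words give it away.
email_spam = "Buy cheap pills now, click here for a discount"

# Comment spam: body copied from the thread. The link is assumed to travel
# separately, e.g. in the commenter's URL field (an assumption of this sketch).
comment_spam = "Thanks for the analysis, I had the same problem last week."

email_score = vocabulary_overlap(email_spam, thread)
comment_score = vocabulary_overlap(comment_spam, thread)

print(f"email-style spam overlap: {email_score:.2f}")
print(f"copied comment overlap:   {comment_score:.2f}")
```

Any filter that relies on the words of the comment will see the copied comment as perfectly legitimate, which is exactly the advantage the spammer gets from reading the thread.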

Instead of content filtering, I believe that captchas are more suitable for this problem. Justin Mason, well known for his wonderful SpamAssassin, has posted an analysis of how Google captchas have been broken, and I think it is very accurate. Captchas are still hard for blog comment spammers, and still a trustworthy mechanism.