Some Success Blocking Comment Spam with Blacklists

Comment spam is definitely on the rise, a new apocalypse-sized plague, an unprecedented evil that threatens to engulf the Earth in its dark clutches. Perhaps not, but I am seeing a lot of people struggling with it and talking about it.

I think the most important thing to realize about comment spam is that it isn’t email spam. The style is different, the tools are different, and most importantly the goals are different. I’ve tripped over a number of people saying, “Why spam my little site? I don’t get any traffic, no one is going to follow those links.” It isn’t about click-throughs. Unlike email spam, where the spammer hopes to convince you of something, comment spam is trying to convince Google of something. Your comments are just a convenient place to seed yet more links in the battle for PageRank.

An Old Technique, Fresh Again

This is great news, actually. In response to the rise of spam filters, email spammers have started mutating their emails: replacing letters with numbers or misspelled equivalents, adding chunks of random characters, anything to make each message look different from its previous incarnation. People trying to spam Google’s PageRank can’t do this. They have to use the real URL, or it doesn’t work. Spammers in all their multiplicity have a huge number of domains available to them, but the number is finite. Also, a few well-chosen wildcards can do wonders.
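To make the idea concrete, here is a minimal sketch in Python of checking a comment against a domain blacklist plus a few wildcards. It is not my actual implementation; the names BLOCKED_DOMAINS, WILDCARDS, extract_hosts, and is_spam, and the example domains, are all made up for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical blacklist entries, for illustration only.
BLOCKED_DOMAINS = {"cheap-pills.example", "spam-casino.example"}
# Hypothetical wildcard words that should never appear in a real comment.
WILDCARDS = ["viagra", "phentermine"]

# Rough pattern for http(s) URLs embedded in comment text or HTML.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def extract_hosts(text):
    """Return the lowercased host of every http(s) URL found in text."""
    hosts = []
    for url in URL_PATTERN.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal); skip it.
            continue
        if host:
            hosts.append(host.rstrip(".").lower())
    return hosts


def is_spam(comment):
    """True if a linked host is blacklisted or a wildcard word appears."""
    for host in extract_hosts(comment):
        # Match the blacklisted domain itself and any subdomain of it.
        if any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS):
            return True
    lowered = comment.lower()
    return any(word in lowered for word in WILDCARDS)
```

The point of the sketch is that the spammer can disguise the text all they like, but the host in the link has to stay real, so that is the part worth matching exactly.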

MT Blacklists

Using my blacklist implementation, seeded with Simon’s blacklist plus a few wildcards (viagra, phentermine, casion, all words which have never appeared legitimately in my comments), I’ve had 3 spam comments in my first week of use, down from about 30 last week. It would be interesting to log failed attempts to know how well the filters are actually working, but that is a project for another day (or another person).
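For that other day (or other person), logging rejected attempts could be as simple as tallying which rule caught each comment. A rough Python sketch, assuming plain substring rules; RULES, first_matching_rule, and filter_comments are hypothetical names, not part of any existing plugin.

```python
from collections import Counter

# Hypothetical rules, for illustration only.
RULES = ["viagra", "phentermine", "spam-casino.example"]


def first_matching_rule(comment, rules=RULES):
    """Return the first rule found in the comment, or None if it is clean."""
    lowered = comment.lower()
    for rule in rules:
        if rule in lowered:
            return rule
    return None


def filter_comments(comments, rules=RULES):
    """Split comments into accepted ones and a tally of rejections per rule."""
    accepted = []
    rejected = Counter()
    for comment in comments:
        rule = first_matching_rule(comment, rules)
        if rule is None:
            accepted.append(comment)
        else:
            rejected[rule] += 1
    return accepted, rejected
```

A tally like this would show which entries are earning their keep and which never match anything.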

It Isn’t a Bayesian Shaped Nail

There have been a number of calls (I probably made a few myself) to apply successful tools like SpamAssassin, SpamSieve, and the more general idea of Bayesian filtering to solving the comment spam problem. I have serious doubts about whether this would work. Email is rich with clues: headers, MIME types, lots of content. Blog comments have none of these. In fact, blog comments and spam comments look remarkably similar. Even with my fast (if currently groggy) neural net, I occasionally have trouble distinguishing between spam and legitimate comments, so I have little faith that Thomas B. will come riding to our rescue on this one.

Distributed Checks

Whatever happened to Razor (and DCC)? You never hear anything about them anymore. Did the problem of finding the similarity of different emails prove to be too difficult? Were they simply not as successful as Bayesian filtering? Or did the insistence on keeping the servers proprietary kill the enthusiasm? Kalsey proposed a similar distributed spam prevention solution. It would be interesting to learn from Razor.