Dealing with comment spam

This is a message I sent to the WordPress hackers mailing list today:

Subject: New(?) anti-spam technique

I’ve seen an increase in comment spam the last couple of days. This particular species looks a lot like legitimate comments in that it isn’t riddled with links, the comment text is not advertising, and it isn’t anonymous. Of course, the email address is fake and the spam comes from dozens, probably hundreds, of different IP addresses. Flood control was the only thing keeping it in check. The blacklist worked for a little while but then the spam would change and I’d have to update the blacklist again. So.

The first thing I did was check my logs, where I noticed that the bot was posting directly to wp-comments-post.php with no referrer. So, this took care of the problem immediately (in wp-comments-post.php):

    // Anti-spam
    // Prevent posting directly to this page without having come from another page
    // on this site.
    $siteurl = strtolower(get_settings('siteurl'));
    // A direct post sends no Referer header, so guard against an undefined index.
    $referer = isset($_SERVER['HTTP_REFERER']) ? strtolower($_SERVER['HTTP_REFERER']) : '';
    if ( strpos($referer, $siteurl) === false ) {
        // I chose this action because I don't want to give a spammer any clues.
        die();
    }
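One caveat with a substring test like the one above: it passes whenever the site URL appears anywhere in the referrer, including inside another site's query string. The following is a minimal sketch of a stricter prefix test, written in JavaScript only for illustration. The function name `refererIsFromSite` and the example URLs are hypothetical, and this is not how WordPress itself checks referrers.

```javascript
// Hypothetical helper: true only when the Referer begins with the site URL,
// not when the site URL merely appears somewhere inside it.
function refererIsFromSite(referer, siteUrl) {
  if (typeof referer !== 'string' || referer === '') {
    return false; // no Referer at all: treat as a direct post
  }
  const site = siteUrl.toLowerCase().replace(/\/+$/, ''); // drop trailing slashes
  const ref = referer.toLowerCase();
  if (!ref.startsWith(site)) {
    return false;
  }
  // The character after the site URL must end the URL or begin a path,
  // query, or fragment, so "http://example.com/blogger" does not match
  // a site URL of "http://example.com/blog".
  const rest = ref.slice(site.length);
  return rest === '' || '/?#'.includes(rest[0]);
}
```

A referrer such as `http://evil.test/?x=http://example.com/blog` contains the site URL but does not start with it, so this check rejects it. It does nothing against a bot that sends a fully forged Referer header.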

This works for now but since HTTP_REFERER is ridiculously easy to spoof it probably won’t work for long. I’m a little surprised that spam bots aren’t sophisticated enough to do this already.

So, that took care of the immediate problem but I wanted a more long-term solution. Captchas came to mind immediately but I dislike captchas because they require legitimate commenters to jump through hoops. I remembered the discussion Kitty started last month about putting a unique hash in the page that must be posted back when the form is submitted. This is a good idea, and I implemented it, but it is also easy for a bot programmer to circumvent since all of the information you need to circumvent it is right there in the form. I made it a little more complicated by requiring the client to compute an md5 hash of the comment + post title and post that back as well (Javascript can do this nicely). But, while that makes a spammer jump through more hoops, once they’ve done that we’re back where we started again.

Unsatisfied, I tried “hashcash”. Hashcash is a clever technique that requires the client to expend a certain amount of work computing a “stamp”. It requires 2-3 seconds to compute a hashcash stamp, it can’t be faked, and it takes a trivial amount of time to verify. This delay is no problem for a reader submitting a single comment but is a huge drain on a spammer who wants to post hundreds of comments a second. And if you use date, post_id, server address, and remote address as the stamp seed then the spammer has to compute a new stamp for every post on every server every day. I implemented this as well, in Javascript, but unfortunately Javascript is too slow to compute stamps in a reasonable amount of time. I found that a 16-bit collision takes much too long to compute in Javascript, but it only takes a fraction of a second in C, which is probably more like what spammers would use (even 12-bit collisions are barely possible with Javascript). I chose Javascript because I wanted the requirements burden on the client to be as low as possible (Java, Flash, ActiveX are all unacceptable requirements for posting comments). Here’s something like the algorithm I implemented:

The form onsubmit action would call something like this:

    function hashcash() {
        var stampseed1 = 'some combination of date + post id + server address + remote address';
        var stampseed2 = 0;
        while (true) {
            var hash = hex_md5(stampseed1 + stampseed2);
            if (hash.substr(0, 4) == "0000") { // Look for a 16-bit collision
                document.getElementById('commentform').stamp.value = stampseed2;
                break;
            }
            stampseed2++;
        }
    }

And then wp-comments-post.php would just need to check that substr(md5(some combination of date + post id + server address + remote address + $_POST['stamp']), 0, 4) == "0000" to verify the stamp.
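The asymmetry between minting and verifying a stamp can be sketched as follows. This is a rough illustration that assumes a Node.js runtime and its built-in `crypto` module instead of the browser-side `hex_md5` function. The names `buildSeed`, `mintStamp`, and `verifyStamp`, and the `|` separator in the seed, are hypothetical choices made for this sketch.

```javascript
const crypto = require('crypto');

// Hex MD5 digest of a string.
function md5Hex(text) {
  return crypto.createHash('md5').update(text, 'utf8').digest('hex');
}

// Hypothetical seed format: the four values named above, joined with "|".
function buildSeed(date, postId, serverAddr, remoteAddr) {
  return [date, postId, serverAddr, remoteAddr].join('|');
}

// Client side: try counters until the hash starts with `zeros` hex zeros.
// Each hex digit is 4 bits, so 4 zeros is the 16-bit case.
function mintStamp(seed, zeros) {
  const prefix = '0'.repeat(zeros);
  for (let counter = 0; ; counter++) {
    if (md5Hex(seed + counter).startsWith(prefix)) {
      return counter;
    }
  }
}

// Server side: a single hash is enough to check the posted stamp.
function verifyStamp(seed, stamp, zeros) {
  if (!/^\d+$/.test(String(stamp))) {
    return false; // stamps minted above are non-negative integers
  }
  return md5Hex(seed + stamp).startsWith('0'.repeat(zeros));
}
```

Minting needs about 16^zeros attempts on average (65,536 for four zeros), while verifying always costs one hash. That gap is what makes the stamp expensive for a bulk poster and cheap for the server.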

I still like this idea the best, even though I abandoned it. If Javascript could compute these faster then it would be a viable solution.

Then it occurred to me that one way to tell humans from bots is that humans are much slower. This is what I came up with:

When a client hits the comment form, start a timer. Start a unique timer for each client (by IP address or some other method).
When the client posts the form, check the timer. If the time elapsed is less than 3 seconds or more than 5 minutes (the "window"), moderate the comment.
Delete the timer.

Here’s how I implemented it:

[1] At the top of wp-comments.php:

    $_SESSION['CommentTimer'.$_SERVER['REMOTE_ADDR']] = time();

[2] In wp-comments-post.php after check_comment():

    if (!isset($_SESSION['CommentTimer'.$_SERVER['REMOTE_ADDR']])
        || time() - $_SESSION['CommentTimer'.$_SERVER['REMOTE_ADDR']] < 3
        || time() - $_SESSION['CommentTimer'.$_SERVER['REMOTE_ADDR']] > 300) {
        // No timer, too fast, or too slow: hold the comment for moderation.
        $approved = 0;
        $comment = "** COMMENT TIMER SPAM **\n".$comment;
    }
    unset($_SESSION['CommentTimer'.$_SERVER['REMOTE_ADDR']]);

[3] In wp-config.php:

    session_start();

This is simple, completely server-side, and transparent to readers. It has zero client-side requirements other than that you take a couple of seconds to enter your comment. It busts spambots that spoof IPs on every request, that don’t use the form, and that don’t wait. I used $_SESSION for this quick implementation but there is no reason why you couldn’t store the timers in a database table. You’d just need to clean out old timers occasionally (perhaps automatically each time the admin went to comment moderation). Also, these could probably be wrapped in existing function calls so templates wouldn’t have to be modified (for example, embedding [2] in check_comment()).
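The same start/check/delete logic, with the occasional cleanup of old timers, can be sketched independently of $_SESSION. This is an illustrative in-memory version in JavaScript, not the code running on this site. The class name `CommentTimers` and its methods are hypothetical, a real store would be a database table as suggested above, and times are passed in as plain seconds so the logic is easy to follow.

```javascript
// Hypothetical in-memory timer store, keyed by client (for example, IP address).
class CommentTimers {
  constructor(minSeconds = 3, maxSeconds = 300) {
    this.minSeconds = minSeconds; // faster than this looks like a bot
    this.maxSeconds = maxSeconds; // slower than this falls outside the window
    this.started = new Map();     // client key -> start time in seconds
  }

  // Call when the comment form is served to a client.
  start(client, now) {
    this.started.set(client, now);
  }

  // Call when the form is posted. Returns true if the comment may be
  // approved, false if it should be held for moderation.
  check(client, now) {
    const startedAt = this.started.get(client);
    this.started.delete(client); // each timer is good for one post only
    if (startedAt === undefined) {
      return false; // the client never loaded the form
    }
    const elapsed = now - startedAt;
    return elapsed >= this.minSeconds && elapsed <= this.maxSeconds;
  }

  // Drop timers that are already past the window.
  purge(now) {
    for (const [client, startedAt] of this.started) {
      if (now - startedAt > this.maxSeconds) {
        this.started.delete(client);
      }
    }
  }
}
```

Calling `purge` from an admin page, as suggested above, keeps the store from growing when clients load the form and never post.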

So far it has stopped my spam completely. The main drawback that comes to mind is that it requires the client to have a stable IP address during a session, so it might be a problem for people with certain ISPs (AOL?). This can be gotten around by using cookie-based identification but then it doesn’t work for people who won’t accept cookies. This method is partially defeatable too. Defeating this method requires a bot to:

Hit/Wait/Post cycle:

Hit hundreds of comment forms on different sites.
Hit them all again within the window to post the comment.
Repeat.

I say partially because, while the above spamming technique will post the spam, it should drastically slow down the spammer and so reduce the number of sites he can cover, which in turn reduces the spread of spam. If this is widely implemented, you will be less likely to even encounter spam because fewer spammers will be able to reach your site in their lists.

Assume:

The spambot has 100,000 wp-comments-post.php entries in its index
Each of those blogs implements a 10-second flood control
The spammer can currently post to them all in 10 seconds

A naive spam bot would Hit/Wait/Post each post reducing its output from 100,000 posts/10 seconds to 1 post/3 seconds. Even if he has a network of zombie machines doing his bidding, each zombie doing naive Hit/Wait/Post would be reduced to 1 post/3 seconds. 100 zombies in aggregate would be able to do 100 posts/3 seconds.

A more sophisticated circumvention would be to hit them all once (10 seconds), then hit them all again (10 seconds), reducing the output to 100,000 spams every 20 seconds. But using this technique, if a spambot has millions of posts in its index, then by the time it gets around to the second hit the timers from the first may have fallen outside of the window.

Another technique would be to start multiple threads on the same spamming machine that do Hit/Wait/Post. A machine that could start 100,000 threads in 10 seconds would be reduced to 100,000 posts/13 seconds. But I don’t think 100,000 threads on a single machine in 10 seconds is feasible (is it?). If he could start 100 threads then he would be reduced to 100 posts/3 seconds. If a zombie network using this technique split up the 100,000 posts among 100 machines and each machine could do 100,000 posts/10 seconds then they could do all 100,000 posts in just over 3 seconds (vs. 0.1 seconds without the comment timer).

Someone check my math? :-)
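The figures above can be rechecked with a few lines of arithmetic. This sketch uses the stated assumptions (100,000 forms, one full sweep in 10 seconds, a 3-second minimum wait, 100 zombies) and ignores the time of the requests themselves in the naive cases. The function name `spamSweepTimes` and its field names are hypothetical, and "chase" stands for a second pass that trails the first by the minimum wait.

```javascript
// Seconds needed to spam every form once under each strategy.
function spamSweepTimes({ targets, sweepSeconds, minWait, zombies }) {
  return {
    // One machine, Hit/Wait/Post one form at a time: one post per wait.
    naive: targets * minWait,
    // The same, with the list split evenly across the zombies.
    naiveZombies: (targets / zombies) * minWait,
    // Hit every form, then hit every form again to post.
    twoPass: 2 * sweepSeconds,
    // A second pass that starts minWait seconds behind the first.
    chase: sweepSeconds + minWait,
    // Chasing passes, with the list split evenly across the zombies.
    zombieChase: sweepSeconds / zombies + minWait,
    // Zombies with no comment timer in place at all.
    zombiesNoTimer: sweepSeconds / zombies,
  };
}

const times = spamSweepTimes({
  targets: 100000,
  sweepSeconds: 10,
  minWait: 3,
  zombies: 100,
});
console.log(times);
```

With these inputs the numbers in the post hold up: 20 seconds for two passes, and just over 3 seconds versus 0.1 seconds for the zombie network. The naive single machine needs 300,000 seconds for the whole list, which is about three and a half days.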

But is this the general case? This spammer appeared to be coming from hundreds of different IP addresses, but I wonder if he really had access to a zombie network or if he was just using IP spoofing from a single machine.

Comments

Mithrandir on 2004-12-09 15:28:49 wrote: [Ed: I accidentally deleted this comment from Mithrandir because it had the word “spam” in it. This was all I could recover of it. The gist of his comment was that current kernels can do 100,000 threads but that you don’t need to anyway. You can just use two threads, one chasing three seconds behind the first one, which is an improvement on the “hit them all once/hit them all again” circumvention I mention in the post. It would allow you to spam the list in 13 seconds using the numbers above.] The 100,000 threads thing is easy to get past. First off, recent Linux kernels can handle hundreds of thousands…
duplicacion serigrafiado on 2005-02-14 02:36:46 wrote: *Three rules for the spam game: 1) you can not win. 2) you can not draw. 3) you can not leave the play. Greetings, Antonio, from Malaga (Spain) *
Maggie on 2006-05-18 15:59:13 wrote: J: In English words, as few as possible please, will you please tell a non-code writing mother of four little kids, how to her new job as comment spam moderator? Pretty please? M