Forcing Bytespider and ClaudeBot back from Cloud Containers

We’re stopping two bots, or spiders, that crawl the web for AI training data from hammering Cloud Container servers too hard.

In recent weeks we saw an increase in bot traffic hitting websites on Cloud Container servers. This wasn’t just one bot, and it wasn’t from a particular IP range. But it was escalating quickly: in a single 15-minute period one of the offenders, ClaudeBot, hit multiple servers in our data centres thousands of times each.

In response we have implemented new rate limiting on all Cloud Container servers. In our usual cautious way, we’ve started by targeting two bots in particular and limiting them to 60 requests a minute. 

So if you noticed worrying traffic spikes earlier this month, things ought to have calmed down a lot by now. And if you’d like to know a bit more about how we went about this work, and why AI is to blame for this disruption, read on.

Choosing the right targets

Bot traffic picked up across the Cloud Container platform towards the end of May. Multiple servers in different regions were affected. So we took a close look and identified about a dozen main actors. 

Some bots are incredibly useful - Googlebot is welcome on the overwhelming majority of sites, for example - so this initial list needed a bit more analysis before we suppressed anything.

In the end, two bots stuck out as being particularly active but not offering anything useful to the sites that they visited. These were Bytespider and ClaudeBot.

  • Bytespider comes from ByteDance, best known as the company behind TikTok. It’s no surprise that ByteDance is working hard on AI projects including LLMs, which need mountains of training data. Bytespider is believed to supply a lot of that training data by scraping whichever bits of the web it can reach. An ungracious visitor, Bytespider doesn’t respect robots.txt, so site owners and developers can’t do much to stop it.

  • ClaudeBot is similar, scraping the web for training data that its owner Anthropic can use to train Claude, their AI-powered “assistant” (a word which sometimes means “glorified chatbot”, but that’s a discussion for another time).
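
For context, this is roughly what asking a bot to stay away looks like. The sketch below uses Python's standard urllib.robotparser to evaluate a sample robots.txt the way a crawler that honours the file would. The rules and the example.com URL are illustrative assumptions, not taken from any real site. A bot that ignores robots.txt simply never performs this check, which is why the file alone can't stop it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: ask two named bots to stay away, allow everyone else.
SAMPLE_ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = SAMPLE_ROBOTS_TXT) -> bool:
    """Return True if a crawler that honours robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Show the verdict for each bot against a hypothetical page.
for bot in ("Bytespider", "ClaudeBot", "Googlebot"):
    print(bot, is_allowed(bot, "https://example.com/blog/"))
```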

TikTok: It's all fun and games until someone's parent company slows down your server with an unfriendly, voracious bot. (Photo by Mart Production on Pexels)

While these two data miners are the most obvious candidates for our attention today, it’s more likely than not that we’ll add more bots to this list in the future. Bot traffic can fluctuate a lot, and it can change fast. It’s only a few days since we slowed down Bytespider and ClaudeBot, so we are still watching to gauge the overall effect on Cloud Container servers.

How, and how much, to restrict these bots

Knowing who we want to keep from hammering our servers, the next question is how to stop them - or, at least, slow them down.

In early 2023 we saw how tricky it is to use IP-blocking effectively, when we took on a bot that cycled through IP addresses very quickly. The blunt solution that we used then - cutting off millions of IP addresses in blocks - wasn’t ideal, but it worked because we knew where to look (addresses controlled by Azure).

This time around we’re not taking the IP address route. There’s no useful pattern to follow, and there’s a better way to identify Bytespider and ClaudeBot: by focusing on user agents.

A user agent is a short string, sent in the User-Agent header of an HTTP request, that identifies the software making the request. It’s not always a perfect way to identify traffic sources - user agents can be spoofed or faked - but it works well enough in this case.
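
To make that concrete, here is a minimal sketch of user-agent matching in Python. It is not our production configuration: the function name, the bot list, and the sample user-agent strings are all assumptions chosen for illustration. It just looks for a known bot name anywhere in the string, ignoring case.

```python
from typing import Optional

# Hypothetical list of bot names to look for; these are the two named in this post.
LIMITED_BOTS = ("Bytespider", "ClaudeBot")


def match_limited_bot(user_agent: Optional[str]) -> Optional[str]:
    """Return the name of the limited bot found in the user agent, or None."""
    ua = (user_agent or "").lower()
    for name in LIMITED_BOTS:
        # Case-insensitive substring match on the bot's name.
        if name.lower() in ua:
            return name
    return None


# Illustrative user-agent strings (simplified, not copied from real traffic).
samples = [
    "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
    "Mozilla/5.0 (compatible; Bytespider)",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
]
for ua in samples:
    print(ua, "->", match_limited_bot(ua))
```

Because this is a plain substring test, anything that claims one of these names is treated as that bot, and a bot that lies about its name slips through. That is the spoofing trade-off mentioned above.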

Rather than put up a hard block, we’ve rate-limited traffic from the relevant user agents. Where we were seeing thousands of requests pile in almost simultaneously, we now only accept 60 a minute. That’s low enough to stop these bots from draining server resources and affecting legitimate site visitors.
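
The idea of "60 a minute" can be sketched as a sliding-window limiter. This is an illustration under stated assumptions, not a description of what runs on our servers: the class name is hypothetical, the limiter is keyed by bot name, and state lives in memory in a single process.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per key in any `window_seconds` span."""

    def __init__(self, limit=60, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Forget accepted requests that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # over the limit: refuse this request
        hits.append(now)
        return True


# A burst of 1,000 near-simultaneous requests from one bot.
limiter = SlidingWindowLimiter(limit=60, window_seconds=60.0)
accepted = sum(limiter.allow("ClaudeBot") for _ in range(1000))
print("accepted:", accepted)
```

Requests with a different key are counted separately, so limiting one bot in this sketch doesn't affect any other visitor.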

Small-scale testing

Like any change, it made sense to roll this out with a bit of caution. So rather than update rate-limiting rules on every Cloud Container server we started small, picking one hardware node in one of our less busy regions.

After 5 days of this rate-limiting pilot, around 130,000 requests had been affected. We checked for any negative feedback from customers, and found none. So it seemed unlikely that any of those refused requests had been legitimate traffic.

Armed with that data, we applied the new rate-limiting rules across the Cloud Container fleet. The full roll-out happened on June 18. 

See for yourself

The recently released Cloud Container Metrics screens let you see the load that your containers and servers are under. To see whether you’ve been inundated with bots, and whether our recent changes have helped, look at your CPU usage up until June 18. Only some websites were affected, so if there are no unwelcome spikes on that graph you’ll know you were untouched.

New Cloud Container metrics show you a lot more data than ever before.

These metrics come from the same toolset that we use in-house when we monitor servers for bot traffic, amongst many other signs of health.

If you’re technical enough, you can also check your sites’ access logs for any user agents that you weren’t expecting. Tools like grep and awk can collate those logs into parseable data about user agents.
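
If you would rather script it, the same tally can be sketched in Python. This assumes your logs use the common "combined" format, where the user agent is the last double-quoted field on each line. Your log format may differ, and the sample lines and function name below are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# In the combined log format the user agent is the final double-quoted field.
# Assumption: user agents containing escaped quotes are not handled.
_UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_user_agents(lines: Iterable[str]) -> Counter:
    """Count requests per user-agent string across access-log lines."""
    counts = Counter()
    for line in lines:
        match = _UA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Made-up log lines; with a real file, pass the open file object instead.
sample_log = [
    '203.0.113.5 - - [10/Jun/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.5 - - [10/Jun/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [10/Jun/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]
for agent, hits in count_user_agents(sample_log).most_common(10):
    print(hits, agent)
```

Sorting by count puts the heaviest visitors at the top, which is usually where an unexpected bot shows up.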

A nice period of calm, we hope

As we’ve already said, this definitely won’t be the last time that bots cause problems. But we’re hopeful that we’ll at least get a bit of respite. With an eye on those other bots that we haven’t limited (yet), and on overall traffic levels amongst much else, you can be assured that we’re not declaring victory and walking off.

As always, if you spot anything unusual in your own Cloud Containers, we can help investigate. The first step is to get in touch.


Main photo by Karolina Kaboompics on Pexels