We blocked thousands of IPs from Azure, all because of one bot

A single bot with a cunning method of hiding behind multiple IP addresses from Azure has led us to go big on defence. Our Technical Director explains.


Last month, we made an unprecedented announcement on our status page. We reported a long list of IP prefixes, all from Microsoft’s Azure cloud, that we were taking action against. It’s not every day that we block or rate-limit thousands of connections from one of the world’s biggest cloud providers. So we asked our Technical Director, Quintin, what went on and why we acted so drastically.


What’s going on here, Quintin?

It’s hard to say exactly, but it looks like a badly developed bot that’s utilising cloud resources. This bot is particularly aggressive with the rate at which it wants to go through websites. It isn’t running with its own user agent, either. It's masking itself as a browser - a slightly older version of Safari.

You can think of it as a different form of DDoS attack: a very active campaign using a distinctive user agent from Azure IP addresses. It does a small number of requests from one IP, then disappears.

We haven't looked into our data forensically yet, so right now I would describe our investigation as more reactive than scientific. Our focus has been on the reliability of our infrastructure.

Can you explain why our reaction has covered so many IP addresses?

For sure. A single IP address might show up on a single website in our Cloud Container fleet, and make 8 or 10 requests, then stop. And we never see that IP address again. To my knowledge, we’ve only seen each IP address on a single server for a single website. And then it goes away.

So it’s like a game of Whac-A-Mole. IP addresses are popping up, and you don't know which one's going to come up next. All we know is that it will come from particular ranges of IPs that Azure has under their control. And the game slows down and then speeds up. 

Initially we blocked big blocks of Azure IPs. Hundreds of thousands of them, or maybe even millions. If we’d blocked individual IPs we would have stayed five seconds behind the bot.
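To picture what blocking at the prefix level looks like, here is a rough sketch in Python. The prefixes and IP addresses below are illustrative stand-ins, not our production rules; the real Azure ranges come from Microsoft's published IP lists and are far longer.

```python
import ipaddress

# Hypothetical example prefixes standing in for Microsoft's published Azure ranges.
AZURE_PREFIXES = [
    ipaddress.ip_network("20.0.0.0/11"),
    ipaddress.ip_network("40.64.0.0/10"),
]

def prefixes_to_block(offending_ips):
    """Map each offending IP back to the announced prefix that contains it,
    so the block covers the whole range rather than one short-lived address."""
    to_block = set()
    for raw in offending_ips:
        ip = ipaddress.ip_address(raw)
        for prefix in AZURE_PREFIXES:
            if ip in prefix:
                to_block.add(prefix)
                break
    return to_block

# A few hits from different corners of the same range are enough to cover it all.
print(prefixes_to_block(["20.1.2.3", "40.74.8.9", "20.31.200.5"]))
```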

When did we first notice this?

Going back to the Christmas break, one of our team members was working overnight and he encountered a large number of Azure IPs hitting a couple of sites on Cloud Container servers. So he went through the process of blocking and reporting them. Like I say, it’s sped up since then.


"It’s like a game of Whac-A-Mole. IP addresses are popping up, and you don't know which one's going to come up next"

He did well to spot something that’s changing IPs and using an innocuous user agent.

Yes, it's very needle-and-haystack and hard to pin down unless you look at it in aggregate. It was hard to be certain about anything except that we saw a lot more traffic and a lot more load. And when we blocked the Azure IPs we saw instant relief.

It’s not a normal day at work when you block so many IPs though, is it?

You've got to get out in front of it, right? The only way to do that is to look at where we're seeing lots of this behaviour. We saw that whole prefixes had a lot of this kind of behaviour so we’d block that whole prefix, but of course, that had downsides. There are plenty of legitimate people doing business with Azure, and suddenly they find themselves one house over from the tyre fire.

We ended up blocking traffic from places that didn’t deserve it, and which we’ve now whitelisted. It's not a permanent solution by any stretch of the imagination.

If blocking was too strict, what are we doing instead while the bot is still loose?

We changed to rate-limiting on Friday, January 20th. Basically, all of the Azure prefixes now get one slot per second. 

Think about a crowd going into a stadium. There's a line for “Azure holders only” and while everybody else is going through the gates, anyone from Azure is in a slower queue. It doesn't matter whether you're wearing a fedora, or a hoodie, or jandals. You're in that queue and for now we're not trying to distinguish who you are.

"Anyone from Azure is in a slower queue. It doesn't matter whether you're wearing a fedora, or a hoodie, or jandals. You're in that queue and for now we're not trying to distinguish who you are."

We’re not the only ones who are dealing with this bot, are we?

Others have reported on this overseas.

There are some characteristics about this bot that mean it's either doing a lot of traffic everywhere, or it's doing a lot of traffic in a really narrow part of the internet. I think it might be doing a lot of traffic globally. Otherwise we'd see it stop and start on our servers. I need to look at the data, but I don't think we've seen that.

For customers of ours who have been hit by this bot, what would they see?

You're just going to notice that you're getting a lot of traffic from a lot of different IP addresses, but not many requests from any of them. You'd see broken journeys that start in one place, then stop, until another IP picks up from where they left off.

It would look like someone’s browsing a website on mobile, then went home and switched to wifi, then down to the library and the public wifi there. One journey, three IPs. But all in a couple of seconds.
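In aggregate, that pattern is easier to spot than it is per IP. Here is a minimal sketch of the kind of log analysis involved; the field names are assumptions about a generic parsed access log, not our real pipeline.

```python
from collections import Counter

def suspicious_user_agents(log_records, max_hits_per_ip=10, min_distinct_ips=50):
    """Flag user agents that arrive from many distinct IPs, each of which
    only ever makes a handful of requests - the 'broken journeys' pattern."""
    hits = Counter()                      # (user_agent, ip) -> request count
    for record in log_records:
        hits[(record["user_agent"], record["ip"])] += 1

    low_volume_ips = Counter()            # user_agent -> number of low-volume IPs
    for (agent, ip), count in hits.items():
        if count <= max_hits_per_ip:
            low_volume_ips[agent] += 1

    return [agent for agent, ips in low_volume_ips.items() if ips >= min_distinct_ips]
```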

How well have common mitigation strategies worked?

We had customers in different data centres, behind Cloudflare and other services, and they were not protected from this. Or at least, not sufficiently protected. It hasn’t been enough to just put up a WAF [web application firewall] and move on. 

There’s a common narrative that you only need a DDoS mitigation tool, but there’s more to it than that.

It’s not a standard DDoS attack, so how well should we expect standard defences like Cloudflare to work?

We have only seen data from a few customers of ours using Cloudflare, and we need to bear in mind that Cloudflare has different plans which may have fared better than what we’ve seen. For this kind of pattern of traffic though, it isn’t enough to simply have a particular service in place. You also need to have a support team on hand, or be running on a proactively managed platform.

A number of our Government or Enterprise clients have WAFs in front of them as standard. These sorts of websites do not talk HTTP or HTTPS to the Internet except through the WAF, but we’ve still seen bad bots and other behaviour come through. And the general consensus when we’ve looked into this kind of traffic has been, “Well, it'd be easier for the host to block it”.

For DDoS attacks that are about sheer volume of bandwidth, which is what the leading WAF providers market themselves on, that kind of protection absolutely needs to be done at the edge. But when attacks are more sophisticated, WAFs are less useful. They could be more highly attuned to malicious traffic by looking at a broader range of characteristics. They could also change the judgement of legitimate vs illegitimate traffic on a per-website basis.

WAF or no WAF, we have still seen this traffic creep through.

Do we know what the bot is actually doing? Speculation in places like Hacker News has suggested everything from this being a misfiring Bingbot to a data scraper for AI training.

No, that’s all just people on the internet guessing. We have a bunch of data that we could pore over to try and figure it out. I think it could just be trying to mine a bunch of information from the Internet. It could be trying to find vulnerabilities. Who knows?

"There's been no engagement from Microsoft. It's like a black hole."

Since this bot’s requests are coming from Azure, have we heard anything from Microsoft about it?

There's been no engagement from their side. We tried contacting their Abuse team and didn't hear anything back. Companies always say that they take things like this seriously, and I’m sure that’s true, but there's no visibility from our side, not even a reply to say “Hey, we're looking at it”. It's like a black hole.

Because they haven’t engaged, we've set a very hard rate limit on all Azure IPs. We have no mitigating information from their side.

How can this bot cycle through so many IP addresses?

There's an increase in this kind of behaviour where IP addresses are for sale for milliseconds at a time. To a lesser extent Azure and the big hyperscalers are putting their reputation up for sale at the same time.

What do you mean by “putting their reputation up for sale”?

In the last decade we have changed the way that we sell servers on the Internet. You used to have servers for a month, then for a day, now you can have a server for an hour or less.

Look at things like AWS Lambda or Azure Functions, which are commonly billed in milliseconds. If you use these cloud providers’ serverless functions, every time you fire one up and do an outbound web request you'll get a different IP address. Now there could be 3,600 different people using an IP address in an hour. If that IP is on a reputation list, all 3,600 of those people get the same treatment. That will affect how reliably your app runs on those IPs over time, and what you can use those services for.
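A toy illustration of that shared-reputation problem, with invented names and addresses: a reputation list keyed only by IP can't tell tenants apart, so one abusive invocation taints everyone who borrows that address afterwards.

```python
# One bad crawler run puts the shared address on a blocklist.
blocklist = {"20.1.2.3"}

def treatment(ip: str, tenant: str) -> str:
    # The tenant is invisible at this layer; only the IP is consulted.
    return "blocked" if ip in blocklist else "allowed"

for tenant in ["honest-shop", "abusive-crawler", "startup-api"]:
    print(tenant, treatment("20.1.2.3", tenant))   # all three get "blocked"
```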

Is there any way of knowing how many of those 3,600 people are trustworthy?

No! Which means that the provider is selling their IP reputation every second as well. What we're effectively seeing is the weaponisation of that, as a feature inside a crawler. There's an industry problem here, and we're going to have to change the way that IP reputation databases work or how fast they are updated.

How else is the way that the industry treats IP addresses changing?

Cloudflare have blogged about how they're moving IP addresses across different parts of the world, just by using port maps. They are allowing port ranges to go to different parts of their infrastructure, which is quite interesting. 

Currently web logs do not include the source port that connects to you. They only log the IP address. So if this behaviour becomes more common, with multiple tenants sharing the same IP address, a number of things are going to have to be thought through. You already have this problem emerging with carrier-grade NAT (network address translation) on the telco side of things.

But if that stuff really gets going on steroids, what happens next? You could have outbound connections appearing to come from a single IP address, where an Azure enterprise customer is using one range of ports and Azure Functions is using another. How do we know which ones to block and which ones to trust?
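A hypothetical sketch of that ambiguity follows. The address, port ranges and tenant names are invented; the point is that once one IP is split by port range, an IP-only access log can no longer answer "who connected?".

```python
# Invented example: one public address split into port ranges per tenant.
PORT_MAP = {
    ("203.0.113.7", range(1024, 32768)):  "enterprise-tenant",
    ("203.0.113.7", range(32768, 65536)): "serverless-functions",
}

def tenant_for(ip, source_port=None):
    if source_port is None:
        return "unknown"          # today's web logs stop here: IP only
    for (mapped_ip, ports), tenant in PORT_MAP.items():
        if ip == mapped_ip and source_port in ports:
            return tenant
    return "unknown"

print(tenant_for("203.0.113.7"))            # unknown - a classic access log entry
print(tenant_for("203.0.113.7", 40000))     # serverless-functions
```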

Do you have any ideas for how the industry could maintain trust in the right IPs?

Even just a clear indication of which IPs are going to be floating and which IPs are dedicated. If we could see details from Azure, for example, saying that “these are the fast-changing IPs”, then we could treat that list differently.

On the other hand, when it comes to spinning up virtual machines there’s probably no difference in terms of the IP ranges. You might spin a VM up for an hour, or a year, or pay by the hour and keep it for a year. How would Azure allocate IPs then? I doubt Azure always knows how long each customer is going to need an IP for.

Azure might be able to make decisions depending on functions that we can’t see. It’s an industry problem that needs an industry solution.

The alternative, where no IP is trusted, would be a non-starter.

You don't want to impact the reachability for Googlebot, for example. That would be catastrophic for SEO. 

A strategy companies might adopt in the future is to leave certain IP ranges unfiltered, like SEO crawlers, and then apply different rate limits to different categories. So, you wouldn't rate-limit Googlebot. For trusted IPs you wouldn't rate-limit anybody who has shown up in the past and built up a reputation. For new ones you might say, okay, they get two requests per second. Next week, if they’re still not in any blocklist, you might give them a few more requests per second. And you might dynamically scale those requests per second over time.

I think that those sorts of things are where we are going to be. We’re going to need a multi-pronged approach.
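Here is a sketch of how those tiers might fit together. The categories, the limits and the growth rule are illustrative assumptions, not a finished policy.

```python
def requests_per_second(category, weeks_of_clean_history=0):
    """Pick a per-second budget for a source based on its reputation category."""
    if category == "seo_crawler":          # e.g. a verified search crawler: never limited
        return None
    if category == "trusted":              # long-standing good reputation: never limited
        return None
    if category == "new":                  # unknown IPs start on a tight budget...
        return 2 + weeks_of_clean_history  # ...and earn a little more each clean week
    if category == "listed":               # appears on a blocklist
        return 0
    return 1                               # everything else goes in the slow queue

print(requests_per_second("new"))                             # 2
print(requests_per_second("new", weeks_of_clean_history=3))   # 5
```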

Closer to home, how will this bot change things at SiteHost?

It’s something to learn from. Like every incident we’ll review how well our defences worked and how we can be more prepared next time. We’ll keep a close eye on others as well, to see how Microsoft or Cloudflare respond to this bot, for example.

It’s good to reflect on what went well, too. One of our team members spotted this bot before there were many - perhaps any - other reports of it online. That level of attention is crucial, and it proves the value of our team. Whatever new technical defences we stand up, we’ll keep improving our own processes and expertise as well.


Our Server Management includes round-the-clock monitoring of hardware and networks. We often respond to threats before you’d even notice. If you want to worry less about bad bots and other risks, talk with us about your management options.

Whac-A-Mole image: Steven Miller, CC BY 2.0, via Wikimedia Commons