I've submitted several complaints to AWS to get this traffic to stop; their typical follow-up is: We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time.
I've tried various 4XX responses to see if the bot will back off, and 30X redirects (which it follows), all to no avail.
The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.
Have others had a similar experience?
A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: waiting just 10 seconds before responding with your 404 will tie up ~7,000 ports on their box, which should be enough to crash most Linux processes. (nginx + mod-http-echo is a really easy way to set this up.)
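A minimal nginx sketch of the delayed-404 idea using the echo module, assuming a module version that supports the echo_status directive (if yours doesn't, the delay alone still works, just with a 200 status):

```nginx
location / {
    echo_status 404;   # status for the eventual response
    echo_sleep 10;     # hold the connection open for 10 seconds first
    echo "not found";  # then send a tiny body and close
}
```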
https://github.com/0x48piraj/gz-bomb/blob/master/gz-bomb-ser...
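The gzip-bomb trick in the link above boils down to this: a long run of identical bytes compresses to almost nothing, so you can serve a few kilobytes with Content-Encoding: gzip that a naive client inflates to many megabytes in memory. A minimal sketch:

```python
import gzip

def make_gzip_bomb(size_mb: int = 10) -> bytes:
    """Compress a large run of zero bytes: tiny on the wire, huge inflated.

    Serve the result with a `Content-Encoding: gzip` header so a naive
    client decompresses it automatically.
    """
    return gzip.compress(b"\x00" * (size_mb * 1024 * 1024), compresslevel=9)

bomb = make_gzip_bomb(10)
print(f"{len(bomb)} bytes on the wire, 10 MiB after decompression")
```

Nesting gzip streams (as the linked repo does) pushes the ratio far higher, but only hurts clients that recursively decompress.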
[1] https://github.com/TecharoHQ/anubis
I was so pissed off that I setup a redirect rule for it to send them over to random porn sites. That actually stopped it.
Wouldn't recommend Googling it. You either know or just take a guess.
The problem with DDoS attacks is generally the asymmetry: it takes more resources to handle a request than to make it. Cute attempts to get back at the attacker with various tarpits generally magnify this and make it hit even harder.
The TikTok Byte Dance / Byte Spider bots were making millions of image requests from my site.
Over and over again and they would not stop.
I eventually got Cloudinary to block all the relevant user agents, and initially just totally blocked Singapore.
It’s very abusive on the part of these bot-running AI scraping companies!
If I hadn’t been using the kind and generous Cloudinary, I could have been stuck with some seriously expensive hosting bills!
Nowadays I just block all AI bots with Cloudflare and be done with it!
This is from your own post, and is almost the best answer I know of.
I recommend configuring a Cloudflare WAF rule to block the bot, and then moving on with your life.
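For example, a WAF custom rule with a filter expression along these lines and the action set to Block (the user-agent substrings here are illustrative — match whatever your logs actually show):

```
(http.user_agent contains "Bytespider") or (http.user_agent contains "SomeOtherBot")
```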
Simply block the bot and move on with your life.
Depending on how the crawler is designed, this may or may not work. If they are using SQS with Lambda, a tarpit obviously won't crash anything, but it will still hurt them, because the serverless functions will run for longer (5–15 minutes) and they pay for that execution time.
Another technique that comes to mind is forcing the client to upgrade the connection (e.g. to a WebSocket) and seeing what happens. Mostly it will fail, but even if the bot gets stalled for 30 seconds, that is a win.
But since AWS considers this fine, I'd absolutely take the "redirecting the entirety of the traffic to aws abuse report page" approach. If they consider it abuse - great, they can go turn it off then. The bot could behave differently but at least curl won't add a referer header or similar when it is redirected, so the obvious target would be their instance hosting the bot, not you.
Actually, I would find the biggest file I can that is hosted by Amazon itself (not another AWS customer) and redirect them to it. I bet they're hosting Linux images somewhere. Besides being more annoying (and thus hopefully attention-getting) for Amazon, it should keep the bot busy for longer, reducing the amount of traffic hitting you.
If the bot doesn't eat files over a certain size, try to find something smaller or something that doesn't report the size in response to a HEAD request.
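If you go that route, a minimal nginx sketch — both the user-agent pattern and the target URL are placeholders to substitute with what the bot actually sends and whatever large Amazon-hosted file you find:

```nginx
# Hypothetical values: adjust the UA regex and target URL for your case.
if ($http_user_agent ~* "badbot") {
    return 301 https://amazon-hosted.example/very-large-file.iso;
}
```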
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
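The deterministic-page idea can be sketched in a few lines of Python: seed an RNG with a hash of the URL path, so every page is stable across visits while the link graph fans out forever. (Simple word salad stands in here for real Markov-babble; the word list and page shape are illustrative.)

```python
import hashlib
import random

WORDS = "the of and a to in is it you that he was for on are".split()

def tarpit_page(path: str, links: int = 12, sentences: int = 8) -> str:
    """Deterministically generate a tarpit page for a given URL path.

    Seeding the RNG with a hash of the path makes the page identical on
    every request, so it looks like a flat file to a crawler.
    """
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    babble = " ".join(
        " ".join(rng.choices(WORDS, k=10)).capitalize() + "."
        for _ in range(sentences)
    )
    hrefs = "".join(
        f'<a href="/{rng.getrandbits(64):016x}/">more</a>\n'
        for _ in range(links)
    )
    return f"<html><body><p>{babble}</p>\n{hrefs}</body></html>"
```

Add a `time.sleep()` before responding and you have the delay component as well.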
https://news.ycombinator.com/item?id=42725147
Is this a good solution??
Otherwise, maybe redirect to the AWS customer portal or something -_- maybe they will stop it if it hits themselves...
Make it follow redirects to some kind of illegal website. Be creative, I guess.
The reasoning being that if you can get AWS to trigger security measures on their side, maybe AWS will shut down their whole account.
The first demand letter from a lawyer will usually stop this. The great thing about suing big companies is that they have to show up. You have no contractual agreement which prevents suing; this is entirely from the outside.
AWS has become rather large and bloated and does stupid things sometimes, but they do still respond when you get their lawyers involved.
That depends on what's serving the requests. And if you're making the requests, it is your job to know that beforehand.