I've submitted several complaints to AWS to get this traffic to stop; their typical follow-up is: We have engaged with our customer, and based on this engagement have determined that the reported activity does not require further action from AWS at this time.
I've tried various 4XX responses to see if the bot will back off, and 30X redirects (which it follows), all to no avail.
The traffic is hitting numbers that require me to re-negotiate my contract with CloudFlare and is otherwise a nuisance when reviewing analytics/logs.
I've considered redirecting the entirety of the traffic to the AWS abuse report page, but at this scale it's essentially a small DDoS network, and sending it anywhere could be considered abuse in itself.
Have others had a similar experience?
A gzip bomb is good if the bot happens to be vulnerable, but even just slowing down their connection rate is often sufficient: waiting just 10 seconds before responding with your 404 will tie up ~7,000 ports on their box, which should be enough to crash most Linux processes. (nginx + mod-http-echo is a really easy way to set this up.)
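A minimal nginx sketch of the delayed-404 idea using the echo module, assuming a module version that supports the echo_status directive (if yours doesn't, the delay alone still works, just with a 200 status):

```nginx
location / {
    echo_status 404;   # status for the eventual response
    echo_sleep 10;     # hold the connection open for 10 seconds first
    echo "not found";  # then send a tiny body and close
}
```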
https://github.com/0x48piraj/gz-bomb/blob/master/gz-bomb-ser...
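The gzip-bomb trick in the link above boils down to this: a long run of identical bytes compresses to almost nothing, so you can serve a few kilobytes with Content-Encoding: gzip that a naive client inflates to many megabytes in memory. A minimal sketch:

```python
import gzip

def make_gzip_bomb(size_mb: int = 10) -> bytes:
    """Compress a large run of zero bytes: tiny on the wire, huge inflated.

    Serve the result with a `Content-Encoding: gzip` header so a naive
    client decompresses it automatically.
    """
    return gzip.compress(b"\x00" * (size_mb * 1024 * 1024), compresslevel=9)

bomb = make_gzip_bomb(10)
print(f"{len(bomb)} bytes on the wire, 10 MiB after decompression")
```

Nesting gzip streams (as the linked repo does) pushes the ratio far higher, but only hurts clients that recursively decompress.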
[1] https://github.com/TecharoHQ/anubis
I was so pissed off that I setup a redirect rule for it to send them over to random porn sites. That actually stopped it.
Wouldn't recommend Googling it. You either know or just take a guess.
The problem with DDoS attacks is generally the asymmetry: it takes more resources to handle a request than to make it. Cute attempts to get back at the attacker with various tarpits generally magnify this and make it hit even harder.
The TikTok Byte Dance / Byte Spider bots were making millions of image requests from my site.
Over and over again and they would not stop.
I eventually got Cloudinary to block all the relevant user agents, and initially just totally blocked Singapore.
It’s very abusive on the part of these bot-running AI scraping companies!
If I hadn’t been using the kind and generous Cloudinary, I could have been stuck with some seriously expensive hosting bills!
Nowadays I just block all AI bots with Cloudflare and be done with it!
This is from your own post, and is almost the best answer I know of.
I recommend configuring a Cloudflare WAF rule to block the bot, and then moving on with your life.
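For example, a WAF custom rule with a filter expression along these lines and the action set to Block (the user-agent substrings here are illustrative — match whatever your logs actually show):

```
(http.user_agent contains "Bytespider") or (http.user_agent contains "SomeOtherBot")
```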
Simply block the bot and move on with your life.
Depending on how the crawler is designed, this may or may not work. If they are using SQS with Lambda, a tarpit obviously won't crash anything, but it will still hurt them, because the serverless functions will run for longer (5–15 minutes) and they pay for that execution time.
Another technique that comes to mind is forcing the client to upgrade the connection (e.g. to a WebSocket) and seeing what happens. Mostly it will fail, but even if the bot gets stalled for 30 seconds, that is a win.
But since AWS considers this fine, I'd absolutely take the "redirecting the entirety of the traffic to aws abuse report page" approach. If they consider it abuse - great, they can go turn it off then. The bot could behave differently but at least curl won't add a referer header or similar when it is redirected, so the obvious target would be their instance hosting the bot, not you.
Actually, I would find the biggest file I can that is hosted by Amazon itself (not another AWS customer) and redirect them to it. I bet they're hosting Linux images somewhere. Besides being more annoying (and thus hopefully attention-getting) for Amazon, it should keep the bot busy for longer, reducing the amount of traffic hitting you.
If the bot doesn't eat files over a certain size, try to find something smaller or something that doesn't report the size in response to a HEAD request.
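If you go that route, a minimal nginx sketch — both the user-agent pattern and the target URL are placeholders to substitute with what the bot actually sends and whatever large Amazon-hosted file you find:

```nginx
# Hypothetical values: adjust the UA regex and target URL for your case.
if ($http_user_agent ~* "badbot") {
    return 301 https://amazon-hosted.example/very-large-file.iso;
}
```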
This is a tarpit intended to catch web crawlers. Specifically, it targets crawlers that scrape data for LLMs - but really, like the plants it is named after, it'll eat just about anything that finds its way inside.
It works by generating an endless sequence of pages, each with dozens of links that simply lead back into the tarpit. Pages are randomly generated, but in a deterministic way, so they appear to be flat files that never change. An intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.
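The deterministic-page idea can be sketched in a few lines of Python: seed an RNG with a hash of the URL path, so every page is stable across visits while the link graph fans out forever. (Simple word salad stands in here for real Markov-babble; the word list and page shape are illustrative.)

```python
import hashlib
import random

WORDS = "the of and a to in is it you that he was for on are".split()

def tarpit_page(path: str, links: int = 12, sentences: int = 8) -> str:
    """Deterministically generate a tarpit page for a given URL path.

    Seeding the RNG with a hash of the path makes the page identical on
    every request, so it looks like a flat file to a crawler.
    """
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    babble = " ".join(
        " ".join(rng.choices(WORDS, k=10)).capitalize() + "."
        for _ in range(sentences)
    )
    hrefs = "".join(
        f'<a href="/{rng.getrandbits(64):016x}/">more</a>\n'
        for _ in range(links)
    )
    return f"<html><body><p>{babble}</p>\n{hrefs}</body></html>"
```

Add a `time.sleep()` before responding and you have the delay component as well.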
https://news.ycombinator.com/item?id=42725147
Is this a good solution??
Otherwise, maybe redirect to the AWS customer portal or something -_- maybe they will stop it if it hits themselves...
Make it follow redirects to some kind of illegal website. Be creative, I guess.
The reasoning being that if you can get AWS to trigger security measures on their side, maybe AWS will shut down their whole account.
The first demand letter from a lawyer will usually stop this. The great thing about suing big companies is that they have to show up. You have no contractual agreement which prevents suing; this is entirely from the outside.
AWS has become rather large and bloated and does stupid things sometimes, but they do still respond when you get their lawyers involved.
That depends on what's serving the requests. And if you're making the requests, it is your job to know that beforehand.