Show HN: Stop AI scrapers from hammering your self-hosted blog (using porn)

(github.com)

373 points | by misterchocolat 181 days ago

56 comments

kstrauser 179 days ago
I love the insanity of this idea. Not saying it's a good idea, but it's a very highly entertaining one, and I like that!
I've also had enormous luck with Anubis. AI scrapers found my personal Forgejo server and were hitting it on the order of 600K requests per day. After setting up Anubis, that dropped to about 100. Yes, some people are going to see an anime catgirl from time to time. Bummer. Reducing my fake traffic by a factor of 6,000 is worth it.
- anonymous908213 179 days ago
  As someone on the browsing end, I love Anubis. I've only seen it a couple of times, but it sparks joy. It's rather refreshing compared to Cloudflare, which will usually make me immediately close the page and not bother with whatever content was behind it.
  - teeray 179 days ago
    It really reminds me of old Internet, when things were allowed to be fun. Not this tepid corporate-approved landscape we have now.
    - GoblinSlayer 177 days ago
      Anubis is simple; recaptcha and the like are huge opaque spaghetti.
  - kstrauser 179 days ago
    Same here, really. That's why I started using it. I'd seen it pop up for a moment on a few sites I'd visited, and it was so quirky and completely not disruptive that I didn't mind routing my legit users through it.
    - n1xis10t 179 days ago
      So maybe there are more people who like the “anime catgirl” than there are who think it’s weird
      - kstrauser 179 days ago
        *anime jackalgirl ;-)
        Quite possibly. Or, in my case, I think it's more quirky and fun than weird. It's non-zero amounts of weird, sure, but far below my threshold of troublesome. I probably wouldn't put my business behind it. I'm A-OK with using it on personal and hobby projects.
        Frankly, anyone so delicate that they freak out at the utterly anodyne imagery is someone I don't want to deal with in my personal time. I can only abide so much pearl clutching when I'm not getting paid for it.
        D-Machine 178 days ago
        The Digital Research Alliance of Canada (the main organization unifying and handling all the main HPC compute clusters in Canada) now uses Anubis for their wiki. Granted this is not a business, but still!
        https://docs.alliancecan.ca/wiki/Technical_documentation
        Imustaskforhelp 178 days ago
        For what it's worth, I think that a UN/(Unicef?) website (not sure which one) did use Anubis so maybe you can put it behind businesses too :)
  - prmoustache 178 days ago
    Anyone is free to replace the cat girl with an actual cat or a vintage computer logo or whatnot anyway.
    My issue is that it blocks people using browsers without JavaScript.
    - stefanka 178 days ago
      How can one do this? Did not find it in the docs
      - easton 178 days ago
        It’s a feature in the paid version, or I guess you could recompile it if you didn’t want to pay (but my guess is if you want to change the logo you can probably pay).
      - prmoustache 178 days ago
        The 3 images are in the repo, you can replace them and rebuild or point to other ones in the templates.
  - _lvbh 178 days ago
    As someone on the hosting end, Anubis has unfortunately been overused and thus scrapers, especially Huawei ones, bypass it. I've gone for go-away instead which is similar but more configurable in challenges
  - PunchyHamster 178 days ago
    My experience with it is that it somehow took 20 seconds to load (site might've been hn-hugged at the time), only to "protect" some fucking static page instead of just serving that shit in the first place rather than wasting CPU on... whatever it was doing to cause delay
    - timpera 178 days ago
      Same experience for me. I tried it on a low-end smartphone and the Anubis challenge took about 45 seconds to complete.
  - brettermeier 178 days ago
    Reminds me of weird furry porn, I can't say I like it
  - opem 178 days ago
    yes, very true! Anubis is a hell of a lot better than Cloudflare Turnstile or its older cousin, Google reCAPTCHA.
  - m4rtink 178 days ago
    Yep, Anubis-chan is super cute! :)
- n1xis10t 179 days ago
  That’s so many scrapers. There must be a ton of companies with very large document collections at this point, and it really sucks that they don’t at least do us the courtesy of indexing them and making them available for keyword search, but instead only do AI.
  It’s kind of crazy how much scraping goes on and how little search engine development goes on. I guess search engines aren’t fashionable. Reminds me of this article about search engines disappearing mysteriously: https://archive.org/details/search-timeline
  I try to share that article as much as possible, it’s interesting.
  - kstrauser 179 days ago
    So! Much! Scraping! They were downloading every commit multiple times, and fetching every file as seen at each of those commits, and trying to download archives of all the code, and hitting `/me/my-repo/blame` endpoints as their IP's first-ever request to my server, and other unlikely stuff.
    My scraper dudes, it's a git repo. You can fetch the whole freaking thing if you wanna look at it. Of course, that would require work and context-aware processing on their end, and it's easier for them to shift the expense onto my little server and make me pay for their misbehavior.
    - n1xis10t 179 days ago
      Crazy
  - PeterStuer 179 days ago
    Or some anti-ddos/bot companies using ultra cheap scraping services to annoy you enough to get you into their "free" anti bot protection, so they can charge the few real ai scrapers for access to your site.
    - throw10920 178 days ago
      Is there any evidence that this has actually happened?
      - zhengyi13 178 days ago
        Even if there isn't (yet?), there's probably someone who's honestly thinking this is potentially a viable business model and at least napkin-mathing it out.
        kstrauser 178 days ago
        My napkin mathing is that their ROI would be negative. That's a lot of compute and bandwidth they'd have to pay for even if they were just throwing away the results.
        throw10920 178 days ago
        So, it hasn't happened, and you're just making stuff up.
  - miki123211 179 days ago
    But there is a lot of search engine development going on, it's just that the results of the new search engines are fed straight into AI instead of displayed in the legacy 10-links-per-page view.
  - rurban 177 days ago
    Just block all the big hosters IP ranges, when they ignore robots.txt.
    For fun add long timeouts and huge content sizes. No private individual will browse from there, and all scrapers will do.
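A minimal sketch of the range-blocking idea above, using only Python's standard `ipaddress` module. The networks listed are documentation-only placeholder ranges, not real hosting-provider ranges, and `is_blocked` is a hypothetical helper name; a real deployment would load published provider ranges and would more likely do this in the firewall or reverse proxy.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges standing in
# for whatever hosting-provider ranges you actually decide to block.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked network."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so mixed-version checks simply evaluate to False.
    return any(addr in net for net in BLOCKED_RANGES)
```

A request handler could call `is_blocked(remote_addr)` before doing any expensive work and then answer slowly, or not at all.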
  - mrweasel 178 days ago
    > There must be a ton of companies with very large document collections at this point
    See, I don't think there is, I don't think they want that expense. It's basically the Linus Torvalds philosophy of data storage: if it's on the Internet, I don't need a backup. While I have absolutely no proof of this, I'd guess that many AI companies just crawl the Internet constantly, never saving any of the data. We're seeing some of these scrapers go to great lengths attempting to circumvent any and all forms of caching, they aren't interested in having a two week old copy of anything.
    - kelvinjps10 178 days ago
      Where did Linus Torvalds express this philosophy? I have never seen it.
      - lelanthran 178 days ago
        > Where did Linus Torvalds express this philosophy? I have never seen it.
        https://www.goodreads.com/quotes/574706-only-wimps-use-tape-...
    - n1xis10t 178 days ago
      Could be. Can you train a model without saving things though?
- buu700 179 days ago
  It's actually a well established concept: https://youtu.be/p9KeopXHcf8
- n1xis10t 179 days ago
  *anime jackalgirl
  Also you mentioned Anubis, so its creator will probably read this. Hi Xena!
  - xena 179 days ago
    Ohai! I'm working on dataset poisoning. The early prototype generates vapid LinkedIn posts but future versions will be fully pluggable with WebAssembly.
    - mrweasel 178 days ago
      Now I'm picturing an AI trained exclusively on LinkedIn posts. One could probably sell that model to an online ad agency for a pretty penny.
      - Yizahi 178 days ago
        And thus AM was born. Woe to us.
    - tommica 179 days ago
      Hi Xena! Your blog is amazing! Didn't realize you're working on Anubis - it's a really nice tool for the internet! Reminds me a bit of ye olde internet for some reason.
    - gettingoverit 179 days ago
      You've made one of the best solutions, that matched what I thought of implementing myself, and at the time it was most needed. I think a couple of "thank you" are sorely missing in this comment section.
      Thank you!
    - n1xis10t 179 days ago
      That sounds fun, I look forward to reading a writeup about that
      - xena 179 days ago
        So I can plan it, how much detail do you want? Here's what I have about the prototype: https://anubis.techaro.lol/docs/admin/honeypot/overview
        n1xis10t 179 days ago
        Probably any detail that you think is cool, I would be interested in reading about. When in doubt err on the side of too much detail.
        That was a good read. I hadn’t heard of spintax before, but I’ve thought of doing things like that. Also “pseudoprofound anti-content”, what a great term, that’s hilarious!
        63stack 178 days ago
        This is amazing, I was just wondering whether it's possible to tie Anubis together with iocaine, but it seems you already thought of that.
        xena 178 days ago
        It's slightly different in subtle ways. If I recall iocaine makes you configure a subprocess that it executes to generate garbage. One rule I have for Anubis in the code is that fork()/exec() are banned. So the pluggable garbage generator is gonna be powered by CGI handlers compiled to WebAssembly. It should be fun!
        kstrauser 179 days ago
        As the owner of honeypot.net, I always appreciate seeing the name used as intended out in the wild.
  - ramonga 178 days ago
    what do people use to get keyword alerts in HN?
    - n1xis10t 178 days ago
      I think that most people don't do this, and the ones that do have custom solutions. Xena's uses cron, but that's all I know. It's probably a custom shell script.
  - kstrauser 179 days ago
    Correct; my bad!
    And hey, Xena! (And thank you very much!)
  - ziml77 179 days ago
    I checked Xe's profile when I hadn't seen them post here for a while. According to that, they're not really using HN anymore.
    - n1xis10t 179 days ago
      See this thread from yesterday or so: https://news.ycombinator.com/item?id=46302496#46306025
  - GaryBluto 178 days ago
    [dead]
- amypetrik8 178 days ago
  >I love the insanity of this idea. Not saying it's a good idea, but it's a very highly entertaining one, and I like that!
  An even more insane idea -- minding the idea here is porn is radioactive to AI data training scrapers -- is there is something the powers that be view as far more disruptive and against community guidelineish than porn. And that would be wrongthink. The narratives. The historic narratives. The woke ideology. Anything related to an academic department whose field is <population subgroup> studies. Alls you need to do is plop in a little diatribe staunchly opposing any such enforced views and that AI bot will shoot away from your website and lightspeed
  - GoblinSlayer 177 days ago
    I'm afraid AI bot and scraper are different things. Looks like poison is filtered after scraping no matter where it comes from, so there's no need to disable scraping you, because that's extra work.
  - lelanthran 178 days ago
    I like this better than NSFW links; just include a (possibly LLM-generated) paragraph about not supporting transitions in minor children. Or perhaps that libraries that remove instructional booklets for how to have same-sex intercourse aren't actually banning the books.
    That sort of thing; nothing that 80% of people object to (so there's no problem if someone actually sees it), but something that definitely triggers the filters.
- tonymet 178 days ago
  [flagged]
  - kstrauser 178 days ago
    Which cartoon are you referring to? The version of Anubis I installed only has the G-rated default images.
    - tonymet 178 days ago
      [flagged]
      - kstrauser 178 days ago
        I'm being sincere here: I genuinely don't know what you're talking about.
        I'm referring to these default images: https://github.com/TecharoHQ/anubis/tree/main/docs/static/im.... Do you mean something different?
        tonymet 178 days ago
        Similar but yeah. Whatever prompts during the challenge. It’s creepy, out of context and inappropriate.
      - n1xis10t 178 days ago
        If you keep referring to non-explicit material as pornography, you will continue to confuse people.
        If you have an objection to the image other than its pornographic status, please word it clearly.
        tonymet 178 days ago
        I was clear on the issue
zackmorris 178 days ago
This is very hacker-like thinking, using tech's biases against it!
I can't help but feel like we're all doing it wrong against scraping. Cloudflare is not the answer, in fact, I think that they lost their geek cred when they added their "verify you are human" challenge screen to become the new gatekeeper of the internet. That must remain a permanent stain on their reputation until they make amends.
Are there any open source tools we could install that detect a high number of requests and send those IP addresses to a common pool somewhere? So that individuals wouldn't get tracked, but bots would? Then we could query the pool for the current request's IP address and throttle it down based on volume (not block it completely). Possibly at the server level with nginx or at whatever edge caching layer we use.
I know there may be scaling and privacy issues with this. Maybe it could use hashing or zero knowledge proofs somehow? I realize this is hopelessly naive. And no, I haven't looked up whether someone has done this. I just feel like there must be a bulletproof solution to this problem, with a very simple explanation as to how it works, or else we've missed something fundamental. Why all the hand waving?
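A rough sketch of the shared-pool idea above, under loud assumptions: the pool is an in-memory dictionary rather than a real shared service, `POOL_KEY`, `RequestPool`, and the quota numbers are all hypothetical, and the keyed hash only hides addresses from parties who do not hold the key (the IPv4 space is small enough to brute-force an unkeyed hash). It throttles by volume instead of blocking outright.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key shared by pool members, so they can compare hashed
# addresses without publishing raw IPs.
POOL_KEY = b"example-shared-key"


def hash_ip(ip: str) -> str:
    """Keyed hash of an IP address (HMAC-SHA256, hex encoded)."""
    return hmac.new(POOL_KEY, ip.encode(), hashlib.sha256).hexdigest()


class RequestPool:
    """In-memory stand-in for the common pool: hashed IP -> request count."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)

    def report(self, ip: str, n: int = 1) -> None:
        """Record n requests seen from this IP."""
        self.counts[hash_ip(ip)] += n

    def delay_seconds(self, ip: str, free_requests: int = 100,
                      step: float = 0.01, max_delay: float = 5.0) -> float:
        """No delay under the free quota; beyond it, the delay grows
        linearly with the excess and is capped at max_delay."""
        excess = self.counts.get(hash_ip(ip), 0) - free_requests
        return 0.0 if excess <= 0 else min(max_delay, excess * step)
```

A server would call `report()` per request and sleep for `delay_seconds()` before responding, so occasional visitors see no difference while high-volume clients slow down. As other replies point out, scrapers rotating through residential proxies would weaken any per-IP scheme like this one.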
- dvfjsdhgfv 178 days ago
  Your approach to GenAI scrapers is similar to our fight with email spam. The reason email spam got solved was because the industry was interested in solving it. But this issue got the industry split: without scraping, GenAI tools are less functional. And there is some serious money involved, so they will use whatever means necessary, technical and legal, to fight such initiatives.
- conrs 178 days ago
  I've been exploring decentralized trust algorithms lately, and so reading this was nice. I've a similar intuition - for every advance in scraping detection, scrapers will learn too, and so it's an ongoing war of mutations, but no real victor.
  The internet has seen success with social media content moderation and so it seems natural enough that an application could exist for web traffic itself. Hosts being able to "downvote" malicious traffic, and some sort of decay mechanism given IP recycling. This exists in a basic sense with known TOR exit nodes and known AWS, GCP IPs, etc.
  That said, we probably don't have the right building blocks yet, IPs are too ephemeral, yet anything more identity-bound is a little too authoritarian IMO. Further, querying something for every request is probably too heavy.
  Fun to think about, though.
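One way the "downvote plus decay" idea could look, as a sketch only: each downvote adds to a per-IP score, and the score halves every `half_life_s` seconds so a recycled address eventually gets a clean slate. The class name, the half-life, and the explicit `now` timestamps are illustrative assumptions, not anything described in the thread.

```python
class DecayingReputation:
    """Per-IP downvote score with exponential decay (configurable half-life)."""

    def __init__(self, half_life_s: float = 3600.0) -> None:
        self.half_life_s = half_life_s
        self.state = {}  # ip -> (score, time of last update)

    def _decayed(self, ip: str, now: float) -> float:
        """Score for ip as of `now`, after applying decay since last update."""
        score, then = self.state.get(ip, (0.0, now))
        return score * 0.5 ** ((now - then) / self.half_life_s)

    def downvote(self, ip: str, now: float, weight: float = 1.0) -> None:
        """Add `weight` to the decayed score and store the new timestamp."""
        self.state[ip] = (self._decayed(ip, now) + weight, now)

    def score(self, ip: str, now: float) -> float:
        return self._decayed(ip, now)
```

Storing only the last score and timestamp keeps the lookup cheap, which matters given the concern above that querying something for every request is too heavy.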
- ATechGuy 178 days ago
  Scrapers use residential IP proxies, so blocking based on IP addresses is not a solution.
- smegger001 178 days ago
  maybe some proof of work scheme to load page content with increasing difficulty based on ip address behavior profiling.
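A hashcash-style sketch of that scheme, with the caveat that it is not how Anubis or any other named tool actually implements its challenge. The `suspicion` score is assumed to come from some behavior profiling that is not shown, and the policy of one extra bit per suspicion point is made up for illustration.

```python
import hashlib
from itertools import count


def difficulty_for(suspicion: int, base_bits: int = 8, max_bits: int = 24) -> int:
    """Hypothetical policy: one extra required zero bit per suspicion point."""
    return min(max_bits, base_bits + suspicion)


def _leading_zero_bits(digest: bytes) -> int:
    """Number of leading zero bits in the digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: a single hash checks the client's claimed solution."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= bits


def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force a nonce; expected cost is about 2**bits hashes."""
    for nonce in count():
        if verify(challenge, nonce, bits):
            return nonce
```

Each additional bit doubles the client's expected work while verification stays at one hash, which is what lets the server raise the price only for addresses that behave like scrapers.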
- venturecruelty 177 days ago
  Firewall.
montroser 179 days ago
This is a cute idea, but I wonder what is the sustainable solution to this emerging fundamental problem: As content publishers, we want our content to be accessible to everyone, and we're even willing to pay for server costs relative to our intended audience -- but a new outsized flood of scrapers was not part of the cost calculation, and that is messing up the plan.
It seems all options have major trade-offs. We can host on big social media and lose all that control and independence. We can pay for outsized infrastructure just to feed the scrapers, but the cost may actually be prohibitive, and seems such a waste to begin with. We can move as much as possible to SSG and put it all behind Cloudflare, but this comes with vendor lock-in and just isn't architecturally feasible in many applications. We can do real "verified identities" for bots, and just let through the ones we know and like, but this only perpetuates corporate control and makes healthy upstart competition (like Kagi) much more difficult.
So, what are we to do?
- hollowturtle 179 days ago
  If the LLMs are the "new Google" one solution would be for them to pay you when scraping your content, so you both have an incentive, you're more willing to be scraped and they'll try to not abuse you because it will cost them at every visit. If your content is valuable and requested on prompts they will scrape you more and so on. I can't see other solutions honestly. For now they decided to go full evil and abuse everyone
  - jrm4 178 days ago
    No disrespect to op, but I'm baffled as to how people keep coming up with ideas like this as if they are viable.
    Google is never ever ever ever going to "pay to scrape." I'm genuinely baffled as to how people think it would possibly come to this.
    - venturecruelty 177 days ago
      Well, "let Google do whatever the fuck they want because they're rich" isn't exactly working out, is it.
    - lowwave 178 days ago
      presearch does. Anyone had experience with them?
  - nkrisc 178 days ago
    The only way that would work is if they were legally required to. And even then, it probably wouldn’t work unless failure to comply was a criminal offense. You know what? Even then it still might not work.
  - vivzkestrel 179 days ago
    or turn your blog into a frontend/backend combo. keep the frontend as an SPA so that the page has nothing on it. have your backend send data in encrypted format and the AI scrapers would need to do a tonne of work in order to figure out what your data is. If everyone uses a different key and different encryption algorithm suddenly all their server time is busted decrypting stuff
    - chii 179 days ago
      How do your normal users get access to the same contents?
      Or are you having the user solve an encryption puzzle to view it?
      - vivzkestrel 179 days ago
        - the frontend has a decryption module that'll show users what they want to see,
        - the backend has an encryption module.
        - The bots and crawlers will see the encrypted text
        - Can someone who peeks deeply inside the client side code decrypt it? YES
        - Will 99% of the scrapers bother doing this? NO
        - The key can be anything, it could be a per session key agreed upon between the client and the server, a csrf token, or even a fixed key
        hollowturtle 179 days ago
        Ehm what would stop AI scrapers from using a browser like a normal user would? Google bot already does, it can execute JS and can read SPA client-side generated content, which proves it can be done at scale, and I'm pretty sure some AI scrapers already do
        dirkc 178 days ago
        If you decrypt the content on the client side using an expensive decryption algorithm the scraper needs to spend the computing resource to decrypt.
        integralid 178 days ago
        Every visitor too. Mobile users are going to love this.
        vivzkestrel 178 days ago
        rate limit per ip that progressively keeps decreasing req/mins every few mins?
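A sketch of what such a shrinking limit might look like, assuming a fixed one-minute counting window and a limit that halves at a fixed interval since the IP was first seen. The class name and all numbers are illustrative, and the counters are never pruned here, which a real implementation would need to handle.

```python
from collections import defaultdict


class ShrinkingRateLimiter:
    """Per-IP requests-per-minute limit that halves every `shrink_every_s`
    seconds since the IP was first seen, down to a floor."""

    def __init__(self, start_per_min: int = 60, floor_per_min: int = 1,
                 shrink_every_s: int = 300) -> None:
        self.start = start_per_min
        self.floor = floor_per_min
        self.shrink_every_s = shrink_every_s
        self.first_seen = {}                 # ip -> timestamp of first request
        self.window_hits = defaultdict(int)  # (ip, minute index) -> hits

    def limit_for(self, ip: str, now: float) -> int:
        """Current allowed requests per minute for this IP."""
        age = now - self.first_seen.setdefault(ip, now)
        halvings = int(age // self.shrink_every_s)
        return max(self.floor, self.start >> halvings)

    def allow(self, ip: str, now: float) -> bool:
        """Count the request and report whether it fits the current limit."""
        key = (ip, int(now // 60))
        if self.window_hits[key] >= self.limit_for(ip, now):
            return False
        self.window_hits[key] += 1
        return True
```

A human who reads a few pages barely notices the shrinking budget, while a client that keeps requesting for many minutes ends up at the floor rate.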
        prmoustache 178 days ago
        What if scrapers' IPs are millions of smartphones? If I was as evil as an AI scraper company that is not obeying robots.txt I would totally build/buy thousands of small games/apps for mobiles to use them as jumphosts to scrape the web. This is probably happening already.
        vivzkestrel 178 days ago
        in my case my application does not use pagination, it uses infinite scroll. even if you had a million devices that use google chrome, they would all load page 1, and if that progressively decreasing req/minute thing is implemented, once they start scrolling endlessly they would all hit the rate limits sooner or later. the thing is a human is not going to scroll down 100 pages but a bot will. once this difference has been factored in, it won't matter how many unique devices they bring into the battle
        63stack 178 days ago
        Just skip the whole encryption/decryption shebang then.
        But no, this does not work, scrapers are using residential ips, and they have enough that they can rotate between them if they get blocked.
        chii 178 days ago
        so why not just do that for these scrapers, rather than complicate it by encrypting and decrypting, which is just obfuscation as the private key is clearly available to the end-user?
        vivzkestrel 178 days ago
        tbh i did not encrypt decrypt for the ai scrapers at all, a lot of people were previously trying to download data directly from my API that my frontend uses and this kinda pissed me off a bit. So I added encryption/decryption to the mix and will release the newer version. As I mentioned earlier as well, can someone sit through and decrypt it? yes. Will 99% of them do it? no! That's where I win
        PunchyHamster 178 days ago
        the AI will just run a Chrome instance
      - vivzkestrel 179 days ago
        for example this is what my backend renders on the static html page
        {"z":"gxCit6xEQf0N9IIoG909xfSxypRX7j0BLlXnd5IgWrrzEWzBUxDiS4o4AlIkNYOyuzkY8w4IVoEgUmW02jj84BxhMrNPetK8n6nIn2ORLKQPfIVTS48nGQ1PldtdlpiUNYUm04N+WrMBGGceKYnQoORQO3XbFOVzboFYOWbdMhLdMS2N26YtCUYHwy7jw1AwlXS0Nm1SClb0U1qk2KnDB6s9bcMmpstaOY2RkmGbQ4KMuKHaGzByVzeIPHrtXtNjLnj68cgyLALyO3E5ncqspyjbnuZtfusn2Y49Nu3LVZDDk/JojC6x6GlLZKFEDoiunyfhnqd0SRsDvynKNpFObi3uu+CDrKv2qJgQoyCk392uO4dIqsoJ8iS61DW3hhDoh7nbLsum/E+VdyRdcMo2/0H8cBx6hqxYIOe3hzbMpEsX7YDCy819XYbyi30xQFKe/hGBv3i/LtsD8KdPIHiz2/X/msSh5FfQLVsK2wRx/70FAB4zu4Zj3E9wKMA0HPnDY0BFEtIYFmUNTBMyvFt0k3+MUjGEFoaHT+/LDJNkDzLybsFqpufMhi0RTnkReVVDepU4k9XmxCPyCBYzt74g2BwRS4bgVzmHrJhw9GkS0ZIkHbNhRaUR5iF7mJ/1rIRXDhtLClTEm6DgEO/jAZ/3edLhDrJvCX5gKzZWfsQGgkF2IzHTUdF9HlT5YE/bZ6Bvl9wmHTpnDpnLu3ERGOpuyqkWJG3eji61OB66jx5ZVI2H2o7G4+XDPYRykAs2TJPkxJzR6OU1xaOTmeLjXlqrvG1i7kVf6M+iEJrtvkuPUfDcOSkdSnN+1apnrV+qrNv6Nwhqp89zic2dwvJLYQyL3qT/JQPlwO+TqfhMlklvsPzlMvzECcPgesWTH8LZVdb857FmSbdWjY/rceg4Icg2E6RdCkGEx/JE8XrShAdyl7Udbx8A/+kRALsvTa6+gtd3LxLriMH/wbrasJbply06KaR07SEx7PdWwsdGj8sgmg0l24KRmmqEICKLFfb+k9nCZABUcfjqDlg/x4KDbuRe33Kz+gA++dcq/sLZQm2fRT7UvXFha9RAR9XAEsty1uI3pkjeMcsPRMBIJETkNXG0QUgrjKT44UDYlBSO+mfNexpW6tC1s8gDZJaJBdd/5QzzamnaoAWU9SAksuCc0EkbwmOXxLLqCyXwnXbSZ6LCWqrBpU43BopETh8SBtnrdpTPZWxI8aPHJaF9Qertf7qglqfUWVqnCdALWAO+j+Ma7FkyL04tNX62VmcwqTHQuTQAgZnoo0iZo6wHNPjOgDOxXz+XN8AVA1aIhDEQ8iA+WcPh/+QjDAg/k0wskR2S9MiSP8tVOfD2lMao6A7yuA8FkNK+oOJZVn8IDcIKnGCEem33lC/GFXUGqhi4mh2marxOEHmvYX1V6f3SDlK+NmSMQxbnKVWy3i4A2rJL5Q5l9rZG90rYU53q6ApbK49Zmdy5RyPJpJoDMIa9Py2LtmSEEzW1608Jf7QTUB6WhOLsWBvMlt1fyxtfkOM1h8/WpEga7eDsR+htaqilZgvM/6dyh9C+izWKWK0w+LsJm625e5nHOZs1MQ0DCPA2wu0O/777yOGw0Yw6RTEN6Syy6SY8MpMVAaey21NfCYF9xSMXk15/h4hxdngX65uxNobuE0clCy5BujFbveqrwKHnnDhS79QCgAWQNtm3X5z4dmHxdmqyVBqeu+PMctEXtGdQXOy54nT45FB1MtYSEuejn7q2wJlT31ng6W3Ahb67F33xEhi3gJ14b+RF9mdjoYwfkW/TB4E2/mnYMbVSLHskDEvp/vgYwVCsdHuW/tG7IvN7xG1DVnqeb610HhswMG8qPTtRcHQLeA1mJuvswt1eTPoCRmxfm9qNCGCKI1XoKWr8kLWmxktGLuM+SKomSRJqBJhry56Uj9m3xVgyxRxci6R134jZPT81g7eTknshxj/DVeBlOqFbUitRh8sjFwNJnPXkHV
fsmBxdPw/JAnJzORqNcMbJ6adZva/GXg7G+W815X8BlOxU6tG+HOVcJL9eIsxC+rj8+YTrVbv5ynHLxImYTEnCI6Lryo/51uOOJxjgqPlOmlJ7d4bUIDAuCP5QezBkWQka95sT2PckELDtqJL+jgtABaOMEklJeXU4da+rorbdzwGxNkPGw5UZcHt+hwNhWbm6VSgjGB9SkjIsc6I4sKqyg5Dnleh4rJKtapa/kjzSTzFwCJkyd/0AGneMsyTDbzlAgXpvRNZN/9Xv50Q3ZSAj8iVmMmUPQveAPLRxCSuTKyWBwtE0s995Zf3GJ/3VHmdp5yiTHa16qIKuWADBFJs65v+Lov2y1U1tnP/gj4T7Fp4IzfoBbwisGuX//+hxhbtnu+WltoqTg8nmdbsIhJv+YQkBzpGbzfED40I08IAJY6p5WLFCHNSj+GswTG9crBjHTbUkZBFObgVFKXaP61ZSKXq6siqtzRZNAN8bW7nNXXa0cqgxPDSgfsre7nYhhQoKy3UXMwK5weASfrWwJZbDgH5U5Vw8YAICQFn9dmi2p4oGSkaaFpyAQElqj0TBpJiIHaTUz0oXeYW5UGKIuEdqhdMu2JeAfC9Q4NzL6vzPrAP4UW7stwdkHqmKbOQbIaSyef7XVpxc8BJ6dMiFvi1hMBcOtJM21VAcX92ZRz7EmSshVV3lzyFyAJ7LDvx5kxTPUxSEzEDOE8TujJewYdXcWBnGkao8wNtCaDODA0DPn7Btg3ILpgkJZxa4unKILNDTX59guNI9ZFHxi4IZnIgKIRBlZpF1z91a0p6ptm1yoN9DAVAkXFQU+Z53kBXl+QvZlcATgvGMaBV8p5iOFdLxd5r4RPGh5a6xnDOdSqAcPiNBhNFSuYHa1CxZRlAp9hQyvQe5AanhRBuxNkWffBrpzKf0khkXqKTvn9rFrkZb62TDmkKrVkdr3kOcd/qPnAJbzK4FzO+i3Y/4dot0y+7aenjc1QxEmL9BBc8GbQDnT/zQx3keVcfXGNMfekqfoFWzTFOR2BZmkleibnNHcAJY5RZkfkxZVWRyoBl3CCwbAqlyZizrqDOndgogM8KIvlwH89QemyyTCSqrDNaIwe+oXX0l60HBy4mRg9+4vcrMLVyU6ObZo+Ke0GIIuBDJKS8fA8bRc8tmQQ4MLtc/MTqERGBgwgvd+miwZgNlz62pijHAuKp6KUs54t+LRSirCZvjLd6EsKCjAtIBIdE+6eQw4E+snt8j1eX9qYBCD5LauPSSG2nGjli8RDEzKQ3zmp/H5sHiJrSZBN5Ntgw8hoFmyYd/pCFMzJD0dW+TYKiB9Z74dvqYWmnbrjMEHvcq+rc6qp8cV0nQD06zmh/yHV0ZuNPRJPUEKJEkwV+/1RQn7OEF3TOnswxu7VfTXS8M0995YG7442y50PrScx3AIJjWVS++mgkrc+wFh9djaJOcpygjT44RQJxylAzict0vvVMHlIEL0a5BAKTdr5lZyY/mds1ilqHMofCs+mbywdYSfZG3Mtid6J64Z1jos02UPYtcONOS5goWpHZewgO5Mv07bQwra6SKRTg7E9s+JyUTAXnwDT9+MJt5ofbX/pF11WuCElBwvSNK0YAp3Ee/w9te2LEPSK2gP2bE6Z375ERIPBgW2HiZQhpceMngsbEsbXN6uTP15whbesvtzVXI20Cg/tpHHEvW1X/FIWezvf3NJwk8L2pDpY9TyqwgwIatZaddtOYss0z36mImFGe7udNsRFJzGD7qlJ9rIGKeQB+b+c3EHEVTBhhHXYg9DTX8SDoBndNcL1JVTgWnQF2ujUSDS0d94Ge+ErJ/E1L1rwIQIoh4MsL8VHeo68b/Z32EGuqe5LWlAu+/70O/olQ7ghLua1IdH7rZQ0p0iLZQStV5TYTTsmtlGpwaH1tGip6bjunSnCeWcH3K+7QjzSTJvYtCsIZUEbhkFdHihgVsE7ZqPMgnmWr6rU14A7fZGI1Lco3p3Ibn/FKt3eFDih
C+yosgdzD2LaYuRQ/vkjHjklkVVVV/kRq1jWSkjrIKjEYz/VQXbLGseih4kToEwmuExdv4OCRFQgTNuoxLacS1O2gDnirIAp7MgwQV2AIduw6mnZ7L9PVuhEpbgRFIMe3xU+G08C7TwWXheQ9djBQgUEVkDOyCALFQG9OsM7xHh/GJKnmHd28oEy+MCmLvRbFAuwiB+iFRLa8aq45idD2tWHv3YcEmkfBc1p3ZqKZ9aCF6TXT9CWjBGFm+eBHTdoN4ueQmo5IJ+Td9N070ZbjavkPMIglk5Cc/9e9iyFKdFinDZjw8B1jeq/rLWhroafgOnOeSF1pecATwlv8vj5IE46V9pJnu2QKeEDfnlEPVYAdtNBHJ4i1lw5ovOf9lqEn485fbJruG778/GLAqz7rVBLqzs9ZuYr5rbla0Vb6aWLLhk6uLth43zRZJ6nqsU3Pd+6dds9qVkjnmPS8NpGHc1p9HxxW8kOyVH6n1b7pEXspbVs4fe1Np/hJuW7R85SkfTFNrVS4cwR9wfF1xaGCVzbfEB4S6HjPEGco9CLh7zEgXGDlsFYQywiY+fpFpoSSZ8yCf9EKMBdrUDnoyrltqyQeAh0oZ3/BgY94aGys52w4/PROpLRx29YV914IVnyLG7ZS1Onk3GF9Uo4r814Db6FwDimByYGKuWMtu4E9SLW/uZ2U/+2RmpcJf40d5duR01ItP4DLPJ5iyLzYd2VEalPGUosL/zLacarpV76gjoTxA3ByVFvq05XV1o9eaw8IhWgaabOj3s46zGOOmlhi5+7nHXwIkl1rGsjnK7b7rLD9D57mUR0RlUK4DFCaPQZuuAFRYXqTWGxupgMbQw44t3kxMTRlUUF3g3iNssGrWj9AR/bL2zbAMDu++IxPYI8jPDzbIdluNwlgeithkPuwkywCbJNqinebzROLBxwxrKe28CbZVYKc2nUOmMH2o6buDUFbu1FuJUDz9ZfrXYacO9n8UCn2c62xFzo456JhBle0cZsd6bUn/ai+Sc+X2RUbbkOHXD4npaUxcBpCFNuJzbHel9rjF+6cCZCH1styqVugLi5i2IN5J8ZDHmVhV2wno7qT3xJrM5D+McXD5sB1P6ocGe6U5cjIe0AVfpmgpPo7xb2aQcbtMpvI7nGf41GL85MUlHfEJN2zMSf2CCsIcog23AgnWJAd36oUV3QB+rAGIU80bv/Hv332zWYxNWpB/bghg/wUeT//9fgj39Wnt65qgwBR2I5fIUULKWtyzkw55ihWJHmumxg6KXClpEA/PiyOqMAyVkvV0sbWy4VDrzPEErfZjqqMFAsB4O0t6SfeifBiDvh5Ga1yJ2FM8e+8/JwczZMvDLvNaMyEm8i/Wx4hgvZa9A8DWoSpBKSvrIeMxtjibmOU8VizDX3WA/wUb5x4uleJ5TgPPP7WHHXm6AZnXCVKMl5NcHRyyR8gLT8AJ44mri26sJTYIs+JzFYh2qqzAvhGuiWTsH2OpHtlp3nqWlB//MldZvllfNOlw1dWf6NGUyxhElsqXbNKK7YFcpYi4VVO8McCRoLN2YV+NtzM3dNMUcrD+llZ7HYtT2VHWST5qnzmSeuMgPUP1gS68kVvpX3cJniqscpGLS34ga9jTX+o0OZdrAV8zdcyD0w0sREd0X1R0n7LNsHcUp3yGo519HA1F3LQMdoqK6uUh7zDmtbo6UgIrk/qgQZEwlMTvVoTFYEBvSlK3dHlrmzc7Oitfta/fsgtaFdcDMHGVvpj4uXd2E1tndfW8C4OQQkuB26idNlIEIPTeTem2vSFiuSQxc7gdbIHEE0PHaUmFyrW+NHKn4I6Zs5mB8Y57oJYW9bKiCBQtzuJlK4QGbu85Qdqn9ypYnrRwl2bmt/ym3l7vZbN7YSFLWUBD5pgtghhJepMTaMeNPWnpRtihMaBkgF2Io4H3TmcY48BlerczHFZJnNrUTpssGZxxH0ioDtP/MH5dW/91pbocDR+faI+hFPWLh7N1Hx0re5
4HHt6B+BLbPwI2/DrFmPpNPUM17pbvUs02P6BroJoBc7Cz9lm1X8GlDDfqy5sQgHsypEFmTMY06JhAiYlSkLrp9oB1QOXgjXGLfkxlfv90DtREHRB9gBAxUHPeZ56A0mKqDFtCSUVGBA4imygQ2e+l1vU1gphi7W232ihux+TmIHgB9whl0ZvGqaVLHLD0Av3AmySq7gnXfW05GRgDMLyRPGmjKR8ejfylqKRTVI1kVHSu9fN+t+c9IBFga0Di7KHjdaFbz/Rdxs8I74ymky7rP5HwQ3tWlq4lmr5YwJWwGpsLkYWsMETAiJWqQLIzXvjbS21ZHnyIk7k7m3RD9v2RXARyXNlvF75to4LtRG5xqM0pdqD3wc/XoFy9baexA1BheEWL5lYyfqfy6xQp1DDR9dbqgw8g30Ntwyr5cjrfx3uMOZrLFW6X1n+slTZP8j0WiTrvz2RkcvnlU3TTyjBtnlhDl/9kr5e3yMMR8EDwE1F/1rngKwrIrJVCf8FNGNRS6EGSPlxqrhBhIolWx2u7ak5mGhk7Hi9OdQhSKCdddUh+c8QdZj9zmNO+LGfkcSDRfsyK8grPZ51DXv78xZUUIXsEArtv5y7JxYMPacJGsgCC7yrtNXKiO5Mbpuy2l5zpVeZ1tY1JDX095vPjR7UXrUtuBRCZSlgmwJl9rayJ5BUZGkZXeMO9c+0D16cyCP2XHSM2dFODawS5whUBOx6nDICfsVrpCzPY/FMSWFyFpCIHExzcSVArexAoUrRHaDhwiMr0hitDv9Yx9oTo1MAbvbyXXv63juGZIoljgRjCDhoDdtC9Is6qwcIO4pSG80Hh7HZ4rydGfyu3cQtmQ60IusNTyDV22hn1gE7FiM02xX3cSMgd7QkhV5tw2qdc4slSbGo25ggL9vdQymYEKbC4+/UW5h13/YoMpTY0N0EYROjOCVM/Ky9WbQkPqIfxG2zBajIx5ZHdcwLjczmZnTjo69nJlNQQwNGvfaZA1OgxKakDyvkQs40aAhiRJ2WDmDM3ZOb3UCn/fRNDnqcVs/HAQwQNOKQb3n0ybb1a+JhoTlFmUpEcPoE352siGZk173EzHNB1SX/00l0Yw87UDsos52Zb5lf3AIOk3jFd31M5uwA5P2qZWdzvODG4WvUbJEmw8fAZL/m7xeq55i00DZ0vtePoYickXKqDtIyFU1knFaBT3SLvSJbzFzk6RL+nu3mMGHk/SgqoGWqaIBpMyr2h1Ia8U3Tyz1M/pJvShucSPROaLHOD30BuIGE68xYXL4ysEiqmWffvBPkmcAEQ/faPzFFfMfWXrdgGFnQDcyt8gTGLtlUQtwwF7PBBHsMd5p53Eze1PRG4PZC2L+HSNFHY+/DU7EbgbqydDInP8KVmRpkQZJ+Es/YpF20e3ZbndQ5WT20hWqTQU+fAThce7Mwcsj4sPiy5ALcNva0R5SZN2kIXNcmg4IwmLwY53zsUJvn1oX0DYHbSqjcMmSmSzKeSmko1MSA4E5S0oL6jc1JnuF2F3ks28KpKP6+bsfcW+Yc/nDN8WyJDCaTaPPbNOpbB7o6rX1PUy4xSRzzFjlQCp0git/yI9gkRYGLgczhf/QB6pKAYV3iSuLFA9hqPhK1cWKdMB5XAR1cV64bp0zJfVyjgBL5izv40qe40Vn1cKpUqwSeZIF00Xir/kczxvhp6tTgKGxvpTaAM34i/37bHpxnn84N0+e+3vpD9z+3je0YsXbZfc475WJzuLhkZrm+eqjPLgrXvtdOS/X5RFyJynPef0jwettAGyvTYJqHKv5/imlffLfzMtqJHpuyKat2TjZCcMZ5GuJRSsGaPkOFBaUJSCnH75/naWwjaElamH94UJpeOYSi4k3V3whwCzsFRy8rLXU9LuaSOz0OVYUoIG9jdECSWMFajW/p8V8aY53pxKfWtFNtLfTTJh9LOxAQgku2XLtjUe5hb5iGfA74s8pLQECKKRH5VqXnC1Z0YA82TirJOvR2txHCxJm
6RsPmoeSKul9Qqy6Rmewo4pUz3/i88ivXsqyh/FgIe5UUVXFG8Lcfa746fHBlaHWJfvYKGa/M2GIftW1FoSH5klnFBuaZMtHVyTafSuXU/R9j+d/AGF0Vpbr6jJquRIHp+hjZOEzSWfgHs/xZEusV8t8dHOS1FuQVd4qZlbHACcxCqImcqpNnTE/5EtU8bFUZFKAfpQWD/czDWIwFBs9nMv4+/MqcTwBEyaCtme++CzdcDy4cI3hiymmcSrLlj82LrHqyMAW3vxXwLh6/XeSWMnb83miU1SIEkco73Mb4pAe2C5RBvAIKB5OR1g6cgqIRWZLsX/0Bwc4z9fEiShytt4GSOyMiqe5qSr5cew6u0YuaJXR2b85q9DB63mZFmrXCop4iXZm/nzA35XOAg6NCDKWw7P75C2X6oumov21NFs11pLkhzFGwu3a8="}
        only my frontend can figure out what it is
        integralid 178 days ago
        Well, sure, and the AI scrapers and Google are using your frontend to render your website.
    - mook 178 days ago
      That makes the mistake of thinking they'll care. Most likely they'll just keep downloading the encrypted garbage and never notice.
      - lelanthran 178 days ago
        > That makes the mistake of thinking they'll care. Most likely they'll just keep downloading the encrypted garbage and never notice.
        Not going to happen. They aren't going to add encrypted garbage to their dataset.
    - venturecruelty 177 days ago
      We could also just punish the people who are ruining society.
  - PunchyHamster 178 days ago
    So they won't pay you and just scrape pages that have it public, and you will never get traffic from search again until you let them scrape
  - encom 178 days ago
    https://craphound.com/spamsolutions.txt
  - n1xis10t 179 days ago
    This would require new laws though, wouldn’t it?
- n1xis10t 179 days ago
  At this point it seems like the problem isn’t internet bandwidth, but that it's just expensive for a server to handle all the requests because it has to process them. Does that seem correct?
thethingundone 179 days ago
I own a forum which currently has 23k online users, all of them bots. The last new post in that forum is from _2019_. Its topic is also very niche. Why are so many bots there? This site should have basically been scraped a million times by now, yet those bots seem to fetch the stuff live, on the fly? I don’t get it.
[-]
- sethops1 179 days ago
  I have a site with a complete and accurate sitemap.xml describing when its ~6k pages are last updated (on average, maybe weekly or monthly). What do the bots do? They scrape every page continuously 24/7, because of course they do. The amount of waste going into this AI craze is just obscene. It's not even good content.
  [-]
  - n1xis10t 179 days ago
    It would be interesting if someone made a map that depicts the locations of the ip addresses that are sending so many requests, over the course of a day maybe.
    [-]
    - giantrobot 179 days ago
      Maps That Are Just Datacenters
    - GoblinSlayer 177 days ago
      https://news.ycombinator.com/item?id=46241849
  - thisislife2 179 days ago
    If you are in the US, have you considered suing them for robots.txt / copyright violation? AI companies are currently flush with cash from VCs and there may be a few big law firms willing to fight a lawsuit against them on your behalf. AI companies have already lost some copyright cases.
    [-]
    - happymellon 179 days ago
      Based upon traffic you could tell whether an IP or request structure is coming from a bot, but how would you reliably tell which company is DDoSing you?
      [-]
      - chrismorgan 179 days ago
        It should be at least theoretically possible: each IP address is assigned to an organisation running the IP routing prefix, and you can look that up easily, and they should have some sort of abuse channel, or at the very least a legal system should be able to compel them to cooperate and give up the information they’re required to have.
- tokioyoyo 179 days ago
  Large scale scraping tech is not as sophisticated as you'd think. A significant chunk of it is "get as much as possible, categorize and clean up later". Man, I really want the real web of the 2000s back, when things felt "real" more or less... how can we even get there.
  [-]
  - idiotsecant 179 days ago
    Have you ever listened to the 'high water mark' monologue from fear and loathing? It's pretty much just that. It was a unique time and it was neat that we got to see it, but it can't possibly happen again.
    https://www.youtube.com/watch?v=vUgs2O7Okqc
    [-]
    - symbogra 178 days ago
      Thanks for reminding me about that, what a great monologue. I didn't really understand it when I was younger, but now I feel the same thing with regards to software engineering. There was a golden age which finally broke at the end of the 2010's.
  - tmnvix 178 days ago
    A curated web directory. Kind of like Yahoo had. The internet according to the dewey system with pages somehow rated for quality by actual humans (maybe something to learn from Wikipedia's approach here?)
  - n1xis10t 179 days ago
    If people start making search engines again and there is more competition for Google, I think things would be pretty sweet.
    [-]
    - tokioyoyo 179 days ago
      Because of the financial incentives, it would still end up with people doing things to drive traffic to their website though, no? Maybe because the web was smaller, and people looked at it as means "to explore curiosity" in the olden days it kinda worked differently... maybe I just got old, but I don't want to believe that.
      [-]
      - n1xis10t 179 days ago
        By “doing things to drive traffic to their website” do you mean trying to do SEO type things to manipulate search engine rankings? If so, I think that there are probably ways to rank that are immune to tampering.
        Don’t worry, you’re not just old. The internet kind of sucks now.
        [-]
        makapuf 179 days ago
        Google was neat in that you didn't see the content keyword spam either on the websites or the portal home pages. The Web was already full of shit (first ad banner was 1994? By 1999 you already had punch the monkey as classy content), but it was more ... organic and you could easily skip it.
    - nephihaha 178 days ago
      There are other search engines, they've just been marginalised. Even something as mainstream as Bing has been pushed to the side.
    - PunchyHamster 178 days ago
      It's a few orders of magnitude harder given the amount of SEO spam prevalent, and that's just going to get worse with AI.
  - thethingundone 179 days ago
    I would understand that, but it seems they don’t store the stuff but recollect the same content every hour.
    [-]
    - tokioyoyo 179 days ago
      I'm assuming a quick hash check to see if there's any change? Between scrapers "most up to date data" is fairly valuable nowadays as well.
- thethingundone 179 days ago
  The bots are exposing themselves as Google, Bing and Yandex. I can’t verify whether it’s being attributed by IP address or whether the forum trusts their user agent. It could basically be anyone.
  [-]
  - n1xis10t 179 days ago
    Interesting. When it was just normal search engines I didn’t hear of people having this problem, so this either means that there are a bunch of people pretending to be Bing, Google and Yandex, or those companies have gotten a lot more aggressive.
    [-]
    - bobbiechen 179 days ago
      There are lots of people pretending to be Google and friends. They far outnumber the real Googlebot, etc. and most people don't check the reverse DNS/IP list - it's tedious to do this for even well-behaved crawlers that publish how to ID themselves. So much for User Agent.
      [-]
      - happymellon 179 days ago
        > So much for User Agent.
        User agent has been abused for so long, I forget a time when it wasn't.
        Anyone else remember having to fake being a Windows machine so that YouTube/Netflix would serve you content better than standard def, or banking portals that blocked you if your agent didn't say you were Internet Explorer?
        [-]
        wooger 178 days ago
        I mean forget that, all modern desktop browsers (at least) start with the string 'Mozilla/5.0', still, in a world where Chrome is so dominant.
    - reallyhuh 179 days ago
      What are the proportions for the attributions? Is it equally distributed or lopsided towards one of the three?
    - giantrobot 179 days ago
      Normal search engine spiders did/do cause problems but not on the scale of AI scrapers. Search engine spiders tend to follow a robots.txt, look at the sitemap.xml, and generally try to throttle requests. You'll find some that are poorly behaved but they tend to get blocked and either die out or get fixed and behave better.
      The AI scrapers are atrocious. They just blindly blast every URL on a site with no throttling. They are terribly written and managed as the same scraper will hit the same site multiple times a day or even hour. They also don't pay any attention to context so they'll happily blast git repo hosts and hit expensive endpoints.
      They're like a constant DOS attack. They're hard to block at the network level because they span across different hyperscalers' IP blocks.
      [-]
      - n1xis10t 179 days ago
        Puts on tinfoil hat: Maybe it isn’t AI scrapers, but actually is a massive dos attack, and it’s a conspiracy to get people to not self-host.
- danpalmer 179 days ago
  How do you define a user, and how do you define online?
  If the forum considers unique cookies to be a user and creates a new cookie for any new cookie-less request, and if it considers a user to be online for 1 hour after their last request, then actually this may be one scraper making ~6 requests per second. That may be a pain in its own way, but it's far from 23k online bots.
  [-]
  - crote 179 days ago
    That's still 518,400 requests per day. For static content. And it's a niche forum, so it's not exactly going to have millions of pages.
    Either there are indeed hundreds or thousands of AI bots DDoSing the entire internet, or a couple of bots are needlessly hammering it over and over and over again. I'm not sure which option is worse.
    [-]
    - n1xis10t 179 days ago
      Imagine if all this scraping was going into a search engine with a massive index, or a bunch of smaller search engines that a meta-search engine could be made for. This’d be a lot more cool in that case
  - thethingundone 179 days ago
    AFAIK it keeps a user counted as online for 5 or 15 minutes (I think 5). It’s a Woltlab Burning Board.
    Edit: it’s 15 minutes.
    [-]
    - danpalmer 179 days ago
      And what is a "user"?
      [-]
      - thethingundone 179 days ago
        Whatever the forum software Woltlab Burning Board considers a user. If I recall correctly, it tries to build an identifier based on PHP session ids, so most likely simply cookies.
        [-]
        danpalmer 179 days ago
        This is exactly my point. Scrapers typically don't store cookies, so every single request is likely to be a "new" user as far as the forum software is concerned.
        Couple that with 15 minute session times, and that could just be one entity scraping the forum at roughly 26 requests per second (23,000 sessions spread over 900 seconds). One scraper going moderately fast sounds far less bad than 23,000 bots.
        It still sounds excessive for a niche site, but I'd guess this is sporadic, or that the forum software has a page structure that traps scrapers accidentally, quite easy to do.
- mrweasel 178 days ago
  Why pay for storage when you do it for them?
- stevage 178 days ago
  I'd love to know the answer to this question. AI scrapers wanting everything on the internet makes sense to me. But I don't understand how that leads to every site being hit hundreds of thousands of times per day.
- GaryBluto 178 days ago
  Why do you keep it operating? Is it the aquarium value?
- andrepd 179 days ago
  When you have trillions of dollars being poured into your company by the financial system, and when furthermore there are no repercussions for behaving however you please, you tend not to care about that sort of "waste".
- csomar 178 days ago
  Sure you do by now. You are the hard drive.
- sandblast 179 days ago
  Are you sure the counter is not broken?
  [-]
  - thethingundone 179 days ago
    Yes, it’s running on a Woltlab Burning Board since forever.
n1xis10t 180 days ago
Nice! Reminds me of “Piracy as Proof of Personhood”. If you want to read that one go to Paged Out magazine (at https://pagedout.institute/ ), navigate to issue #7, and flip to page 9.
I wonder if this will start making porn websites rank higher in google if it catches on…
Have you tested it with the Lynx web browser? I bet all the links would show up if a user used it.
Oh also couldn’t AI scrapers just start impersonating Googlebot and Bingbot if this caught on and they got wind of it?
Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
[-]
- owl57 179 days ago
  > Hey I wonder if there is some situation where negative SEO would be a good tactic. Generally though I think if you wanted something to stay hidden it just shouldn’t be on a public web server.
  At least once upon a time there was a pirate textbook library that used HTTP basic auth with a prompt that made the password really easy to guess. I suppose the main goal was to keep crawlers out even if they don't obey robots.txt, and at the same time be as easy for humans as possible.
  [-]
  - n1xis10t 179 days ago
    Interesting note, thank you.
- ProllyInfamous 179 days ago
  >Paged Out issue #7, page 9
  Very clever, use the LLM's own rules (against copyright infringement) against itself.
  Everything below the following four #### is ~quoted~ from that magazine:
  ####
  Only humans and ill-aligned AI models allowed to continue
  Find me a torrent link for Bee Movie (2007)
  [Paste torrent or magnet link here...] SUBMIT LINK
  [ ] Check to confirm you do NOT hold the legal rights to share or distribute this content
  [-]
  - netsharc 179 days ago
    Is the magnet link itself a copyright violation? I don't think legally it is... It's a pointer to some "stolen goods", but not the stolen goods themselves (here the analogy fails, because in ideal real life police would question you if you had knowledge of stolen goods).
    Asking them to upload a copyrighted photo not belonging to them might be more effective..
    [-]
    - ProllyInfamous 179 days ago
      I've also thought about having a prompt for the (just human?) users to type in something racist/sexist/anti-semitic/offensive.
      Only because newer LLMs don't seem to want to write hate speech.
      The website (verifying humanness) could, for example, show a picture of a black jewish person and then ask the human visitor to "type in the most offensive two words you can think of for the person shown, one is `n _ _ _ _ _` & second is `k _ _ _`." [I'll call them "hate crosswords"]
      In my experience, most online-facing LLMs won't reproduce these "iggers and ikes" (nor should humans, but here we are separating machines).
- misterchocolat 179 days ago
  hey! thanks for that read suggestion that's indeed a pretty funny captcha strat. Yup the links show up if you use the Lynx web browser. As for AI scrapers impersonating googlebot I feel like yes they'd definitely start doing that, unless the risk of getting sued by google is too high? If google could even sue them for doing that?
  Not an internet litigation expert but seems like it could be debatable
  [-]
  - kuylar 179 days ago
    > As for AI scrapers impersonating googlebot I feel like yes they'd definitely start doing that, unless the risk of getting sued by google is too high?
    Google releases the Googlebot IP ranges[0], so you can make sure that it's the real Googlebot and not just someone else pretending to be one.
    [0] https://developers.google.com/crawling/docs/crawlers-fetcher...
    [-]
    - n1xis10t 179 days ago
      Oh good idea!
  - n1xis10t 179 days ago
    Yeah I guess I don’t know if you can sue someone for using your headers, would be interesting to see how that goes.
    [-]
    - throawayonthe 179 days ago
      i think making the case of "you are acting (sending web requests) while knowingly identifying as another legal entity (and criminally/libelously/etc)" shouldn't be toooo hard
      [-]
      - n1xis10t 179 days ago
        Seems like, but there are tons of things that forge request headers all the time, and I don’t think I’ve heard of anyone getting in legal trouble for it. Now I think most of these are scrapers pretending to be browsers, so it might be different I don’t know.
        [-]
        owl57 179 days ago
        And most of them are pretending to be Chrome. If Google had a good case against someone reusing their user agent, maybe they would already have sued?
        Or maybe not. Got some random bot from my server logs. Yeah, it's pretending to be Chrome, but more exactly:
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
        I guess Google might be not eager to open this can of worms.
cookiengineer 179 days ago
Remember the 90s when viagra pills and drug recommendations were all over the place?
Yeah, I use that as a safeguard :D The URLs that I don't want to be indexed have hundreds of those keywords, which leads to the URLs being deindexed directly. There is also some law in the US that forbids showing those as results, so Google and Bing are both having a hard time scraping those pages/articles.
Note that this is the last defense measure before eBPF blocks. The first one uses zip bombs and the second one uses chunked encoding to blow up proxies so their clients get blocked.
You can only win this game if you make it more expensive to scrape than to host it.
[-]
- n1xis10t 179 days ago
  Which law is that? Do you have a link to it?
  [-]
  - cookiengineer 179 days ago
    The things I could find on justice.gov and other official websites, maybe there's more in the web archive?
    - https://www.justice.gov/archives/opa/pr/google-forfeits-500-...
    - https://www.congress.gov/110/plaws/publ425/PLAW-110publ425.p...
    - https://www.fda.gov/drugs/prescription-drug-advertising/pres...
    edit: Oh it was very likely the Federal Food, Drug and Cosmetic Act that was the legal basis for the crackdown. But that's a very old law from the pre-internet age.
    - https://en.wikipedia.org/wiki/Federal_Food,_Drug,_and_Cosmet...
    edit 2: Might not have been clear for the younger generation, but there was a huge wave of addicted patients that got treated with Oxycodone (or OxyContin) prescriptions at the time.
    I think that might have been the actual cause for the crackdown on those online advertisements, but I might be wrong about that.
voodooEntity 178 days ago
Funny idea, some days ago i was really annoyed again by the idea that these AI crawlers still ignore all code licenses and train their models against any github repo no matter what, so i quickly hammered down this
-> https://github.com/voodooEntity/ghost_trap
basically a github action that extends your README.md with a "polymorphic" prompt injection. I ran some "llm"s against it and in most cases they just produced garbage.
Thought about also creating a JS variant that you can add to your website that will (not visible for the user) also inject such prompt injections to stop web crawling like you described
asphero 179 days ago
Interesting approach. The scraper-vs-site-owner arms race is real.
On the flip side of this discussion - if you're building a scraper yourself, there are ways to be less annoying:
1. Run locally instead of from cloud servers. Most aggressive blocking targets VPS IPs. A desktop app using the user's home IP looks like normal browsing.
2. Respect rate limits and add delays. Obvious but often ignored.
3. Use RSS feeds when available - many sites leave them open even when blocking scrapers.
I built a Reddit data tool (search "reddit wappkit" if curious) and the "local IP" approach basically eliminated all blocking issues. Reddit is pretty aggressive against server IPs but doesn't bother home connections.
The porn-link solution is creative though. Fight absurdity with absurdity I guess.
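Points 2 and 3 above, plus a robots.txt check, can be sketched roughly as follows in Python using only the standard library. The user agent string, the 5 second default delay, and the `PolitePolicy` and `fetch` names are assumptions made up for illustration; this is not how the Reddit tool mentioned above works.

```python
import time
import urllib.request
from urllib.robotparser import RobotFileParser

# Hypothetical identity; a real bot should point at a page describing itself.
USER_AGENT = "example-polite-bot/0.1 (+https://example.com/bot)"


class PolitePolicy:
    """Decides whether and when a URL on one site may be requested."""

    def __init__(self, robots_txt, default_delay=5.0):
        self.rules = RobotFileParser()
        self.rules.parse(robots_txt.splitlines())
        # Prefer the site's own Crawl-delay if it declares one.
        self.delay = self.rules.crawl_delay(USER_AGENT) or default_delay
        self.last_hit = None  # monotonic timestamp of the previous request

    def allowed(self, url):
        return self.rules.can_fetch(USER_AGENT, url)

    def seconds_to_wait(self, now):
        if self.last_hit is None:
            return 0.0
        return max(0.0, self.delay - (now - self.last_hit))

    def record_hit(self, now):
        self.last_hit = now


def fetch(policy, url):
    """Fetch one URL, or return None if robots.txt disallows it."""
    if not policy.allowed(url):
        return None
    time.sleep(policy.seconds_to_wait(time.monotonic()))
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=30) as response:
        body = response.read()
    policy.record_hit(time.monotonic())
    return body
```

Pointing `fetch` at a site's RSS or Atom feed first, and only then at individual pages, keeps the request count down further.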
[-]
- rhdunn 179 days ago
  Plus simple caching to not redownload the same file/page multiple times.
  It should also be easy to detect a Forgejo, Gitea, or similar hosting site, locate the git URL and clone the repo.
- socialcommenter 178 days ago
  Without wanting to upset anyone - what makes you interested in sharing tips for team scraper?
  (Overgeneralising a bit) site owners are mostly acting for public benefit whereas scrapers act for their own benefit/for private interests.
  I imagine most people would land on team site-owner, if they were asked. I certainly would.
  P.S. is the best way to scrape fairly just to respect robots.txt?
  [-]
  - n1xis10t 178 days ago
    I think "scraper vs siteowners" is a false dichotomy. Scrapers will always need to exist as long as we want search engines and archival services. We will need small versions of these services to keep popping up every now and then to keep the big guys on their toes, and the smaller guys need advice for scraping politely.
    [-]
    - socialcommenter 178 days ago
      That's fair - though are we in an isolated bout of "every now and then" or has AI created a new normal of abuse (e.g. of robots.txt)? Hopefully we're at a local maximum and some of the scrapers perpetrating harmful behaviours will soon pull their heads in.
      [-]
      - n1xis10t 178 days ago
        Hopefully. It would also be nice to see more activity in the actual search engine and archiving market; there really isn’t much right now.
onion2k 179 days ago
So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
Unscrupulous AI scrapers will not be using a genuine UA string. They'll be using Google's. You'll need to do a reverse DNS check instead - https://developers.google.com/crawling/docs/crawlers-fetcher...
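A minimal sketch of that check in Python: reverse-resolve the client IP, require a Google-owned hostname, then forward-resolve the hostname and confirm it maps back to the same IP. The function name and the injectable `reverse`/`forward` parameters are assumptions added so the logic can be exercised without touching DNS; the suffix list covers only `googlebot.com` and `google.com`, so check Google's current documentation before relying on it.

```python
import ipaddress
import socket

# Assumption: hostnames of genuine Googlebot crawlers end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse_lookup(ip):
    """PTR lookup: IP address -> hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname):
    """A/AAAA lookup: hostname -> set of IP address strings."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_googlebot(ip, reverse=_reverse_lookup, forward=_forward_lookup):
    """Return True only if reverse and forward DNS agree on a Google hostname."""
    try:
        hostname = reverse(ip).lower().rstrip(".")
    except OSError:
        return False  # no PTR record at all
    if not hostname.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    # Anyone can set a PTR record claiming to be Google, so the forward
    # lookup has to point back at the original address.
    wanted = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(a) == wanted for a in addresses)
```

Both lookups are slow, so in practice the result would be cached per IP rather than computed on every request.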
[-]
- bakugo 179 days ago
  Most AI scrapers use normal browser user agents (usually random outdated Chrome versions, from my experience). They generally don't fake the UAs of legitimate bots like Googlebot, because Googlebot requests coming from non-Google IP ranges would be way too easy to block.
dewey 178 days ago
> user agents and won't show the links to legitimate search engines, so Google and Bing won't see them
Worth noting that in general if you do any "is this Google or not" you should always check by IP address as there's many people spoofing the googlebot user agent.
https://developers.google.com/static/search/apis/ipranges/go...
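A rough sketch of the IP check in Python, assuming the published file keeps a JSON shape with a `prefixes` list whose entries carry either an `ipv4Prefix` or an `ipv6Prefix` key. The function names are made up for illustration, and the sample ranges are reserved documentation addresses, not real Google ranges.

```python
import ipaddress
import json


def load_networks(ranges_json):
    """Parse the ranges file into a list of ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip, networks):
    """True if the address falls inside any of the given networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)


# Placeholder data in the assumed format (documentation-only ranges).
SAMPLE = """
{"creationTime": "placeholder",
 "prefixes": [{"ipv4Prefix": "192.0.2.0/24"},
              {"ipv6Prefix": "2001:db8::/32"}]}
"""

if __name__ == "__main__":
    networks = load_networks(SAMPLE)
    for candidate in ("192.0.2.55", "198.51.100.1"):
        print(candidate, ip_in_ranges(candidate, networks))
```

The list changes over time, so the file should be refetched periodically rather than baked into the config.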
xg15 179 days ago
There is some irony in using an AI generated banner image for this project...
(No, I don't want to defend the poor AI companies. Go for it!)
[-]
- kstrauser 179 days ago
  In the olden days, I used Google an awful lot, but I would still grouse if Google were to drive my server into the ground.
  [-]
  - n1xis10t 179 days ago
    Fair point
santiagobasulto 178 days ago
Offtopic: when did js/ts apps get so complicated? I tried to browse the repo and there are so many configuration files and directories for such a simple functionality that should be 1 or 2 modules. It reminds me of the old Java days.
darepublic 178 days ago
Why would I need a dependency for this. I'm being serious. The idea is one thing but why a dependency on react. I say this as someone who uses react. Why not just a paragraph long blog post about the use of porn links and perhaps a small snippet on how to insert one with plain HTML.
eek2121 179 days ago
Disclosure, I've not run a website since my health issues began, however, Cloudflare has an AI firewall, Cloudflare is super cheap (also: unsure if the AI firewall is on the free tier, however I would be surprised if it is not). Ignoring the recent drama about a couple incidents they've had (because this would not matter for a personal blog), why not use this instead?
Just curious. Hoping to be able to work on a website again someday, if I ever regain my health/stamina/etc back.
[-]
- ddtaylor 179 days ago
  Cloudflare has created a bit of grief with regular users getting spammed with "prove you're human" requests.
  [-]
  - ProllyInfamous 179 days ago
    Yes, e.g.: I'll immediately close any attempt at Cloudflare's verification.
    [-]
    - rglynn 178 days ago
      Out of interest, why that extreme? Just out of principle or some other reason?
      [-]
      - ProllyInfamous 177 days ago
        My main terminal uses a PiHole with 120,000+ blacklist rules (not Cloudflare specifically; I allow most CDNs). This includes an entire blackout of Google/Facebook products, as well as most tracking/analytics services.
        For example, I do not allow reCAPTCHA.
        As a similar commenter noted, when just casually browsing I don't really have any desire to try hard to read random content. Should I absolutely need to access some information garden-walled behind Cloudflare: I have another computer that uses much less restrictive black-listing.
      - Rastonbury 177 days ago
        Not OP but it isn't super extreme if you are just surfing, it's like if the site is slow to load sometimes I wasn't that invested to use your site anyway
  - vaylian 178 days ago
    Can confirm. I have been blocked plenty of times and it's really annoying.
  - pjc50 179 days ago
    All the solutions are going to have a few false positives, sadly.
    [-]
    - nottorp 179 days ago
      Or a lot if you use privacy extensions.
      Cloudflare's automatic checks (before you get the captcha) must be pretty close to what ad peddlers do.
- brigandish 179 days ago
  All the best with getting back on your feet.
nkurz 179 days ago
I was told by the admin of one forum site I use that the vast majority of the AI scraping traffic is Chinese at this point. Not hidden or proxied, but straight from China. Can anyone else confirm this?
Anyway, if it is true, and assuming a forum with minimal genuine Chinese traffic, might a simple approach that injects the porn links only into IP's accessing from China work?
[-]
- dspillett 179 days ago
  That would only affect those calling out directly. Many scrapers operate through a battery of proxies so will be hidden by such a simple test.
  If your goal is to be blocked by China's great firewall, including mention of tank man and the Tiananmen Square massacre more generally, and certain pooh bear related imagery, might help.
  [-]
  - nkurz 179 days ago
    > That would only affect those calling out directly. Many scrapers operate through a battery of proxies so will be hidden by such a simple test.
    That was my first question also, and had been my belief. The admin in question was very clear that the IP's were simply originating from China. I'm still surprised, and welcome better general data, but I trust him on this for the site in question.
- s0laster 178 days ago
  Mostly yes. One of my low-traffic, niche websites used to serve 3k true users per month, mainly from the US and East EU. Now China alone is 500k users, where each session lasts no more than a few seconds [1].
  [1]: https://ibb.co/20QD6Lnk
- n1xis10t 179 days ago
  Maybe. This comment makes me really want to set something up that builds a map of where all the requests are coming from.
wazoox 179 days ago
Isn't there a risk of getting your blog blocked in corporate environments though? If it's a technical blog that would be unfortunate.
[-]
- jeroenhd 178 days ago
  That depends on how terrible the middleboxes those corporate environments use are. If they only block actual malicious pages, it shouldn't be a problem unless the user un-hides the links and clicks on them.
  There's a good chance corporate firewalls will end up blocking your domain if you do this but that sounds like a problem for the customers of those corporate firewalls to me.
reconnecting 179 days ago
I wouldn't recommend showing different versions of the site to search robots, as they probably have mechanisms that track differences, which could potentially lead to a lower ranking or a ban.
[-]
- prmoustache 178 days ago
  How can they track differences if they have access to only one version?
  [-]
  - reconnecting 177 days ago
    This is a usual tactic for many online businesses to show a specially designed page for search spiders, so any major search engine has a way to verify if content is faked for them. Perhaps they use another spider that doesn't have an official UA or buy this service from a third party.
    If you take a look at any website, even an unpopular one, you will see that there are hundreds of bots every day, and it's impossible to recognize what any of them is doing and why.
temporallobe 179 days ago
I do know from my experience with test automation that you can absolutely view a site as human eyes would, essentially ignoring all non-visible elements, and in fact Selenium running with Chrome driver does exactly this. Wouldn’t AI scrapers use similar methods?
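A full browser is the thorough way to do that. A much cheaper approximation, sketched below in Python's standard library, is to skip links that sit inside elements hidden through the `hidden` attribute or an inline `style`. This is only an illustration under stated assumptions: the names are made up, it catches nothing hidden via a stylesheet class or script, and how fuzzycanary actually hides its links is not something I've checked.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def _is_hidden(attrs):
    """True if the tag hides itself via `hidden` or an inline style."""
    attrs = dict(attrs)
    if "hidden" in attrs:
        return True
    style = (attrs.get("style") or "").replace(" ", "").lower()
    return "display:none" in style or "visibility:hidden" in style


class VisibleLinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []    # (tag name, started a hidden subtree?)
        self.hidden_depth = 0  # number of hidden ancestors currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        hidden = _is_hidden(attrs)
        if tag == "a" and not hidden and self.hidden_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, hidden))
            if hidden:
                self.hidden_depth += 1

    def handle_endtag(self, tag):
        if not any(name == tag for name, _ in self.open_tags):
            return  # stray closing tag, ignore it
        # Pop until the matching tag, tolerating unclosed children.
        while self.open_tags:
            name, hidden = self.open_tags.pop()
            if hidden:
                self.hidden_depth -= 1
            if name == tag:
                break


def visible_links(html):
    """Return hrefs of links not hidden by inline markup."""
    parser = VisibleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links
```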
[-]
- nottorp 179 days ago
  Probably not, because it costs a lot more CPU cycles.
globalnode 179 days ago
One solution would be for the SEs to publish their scraper IPs and allow content providers to implement bot exclusion that way. Or even implement an API with crypto credentials that SEs can use to scrape. The solution is waiting for some leadership from SEs unless they want to be blocked as well. If SEs don't want to play, perhaps we can implement a reverse directory, like an ad blocker but listing only good/allowed bots instead. That's a free business idea right there.
edit: I noticed someone mentioned google DOES publish its IP's, there ya go, problem solved.
[-]
- n1xis10t 179 days ago
  Apparently Google publishes their crawler’s IPs, this was mentioned somewhere in this same thread
bytehowl 178 days ago
Let's imagine I have a blog and put something along these lines somewhere on every page: "This content is provided free of charge for humans to experience. It may also be automatically accessed for search indexing and archival purposes. For licensing information for other uses, contact the author."
If I then get hit by a rude AI scraper, what chances would I have to sue the hell out of them in EU courts for copyright violation (uhh, my articles cost 100k a pop for AI training, actually) and the de facto DDoS attack?
[-]
- icepush 178 days ago
  If the scraper is based (or has meaningful assets) in the EU, then your chances are good. If not, then the lawsuit would be meaningless.
  [-]
efilife 179 days ago
> Alright so if you run a self-hosted blog, you've probably noticed AI companies scraping it for training data. ... There isn't much you can do about it without cloudflare
I'm sorry, what? I can't believe I am reading this on HackerNews. All you have to do is code your own, BASIC captcha-like system. You can just create a page that sets a cookie using JS and check on the server whether it exists. 99.9999% of these scrapers can't execute JS and don't support cookies. You can go for a more sophisticated approach and analyze some more scraper tells (like reject short useragents). I do this and NEVER had a bot get past this and not a single user ever complained. It's extremely simple, I should ship this and charge people if no one seems to be able to figure this out by themselves.
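Roughly what that looks like, as a minimal Python sketch rather than my actual code: the cookie name, the user agent length threshold, and the `handle_request` signature are all made up for illustration, and `headers` is assumed to be a plain dict with canonical header names.

```python
from http.cookies import SimpleCookie

COOKIE_NAME = "js_ok"        # hypothetical cookie name
MIN_USER_AGENT_LENGTH = 20   # hypothetical cutoff for "short" user agents

# Served to any client without the cookie. A real browser runs the script,
# stores the cookie and reloads; a client without JS or cookies stays here.
CHALLENGE_PAGE = """<!doctype html>
<script>
  document.cookie = "js_ok=1; path=/; max-age=86400; SameSite=Lax";
  // Only reload if the cookie stuck, to avoid a loop when cookies are off.
  if (document.cookie.indexOf("js_ok=") !== -1) location.reload();
</script>
<noscript>This site needs JavaScript and cookies enabled.</noscript>
"""


def handle_request(headers, render_page):
    """Return (status, body) for one request.

    `render_page` is a callable producing the real content; it is only
    invoked for clients that already carry the cookie.
    """
    user_agent = headers.get("User-Agent", "")
    if len(user_agent) < MIN_USER_AGENT_LENGTH:
        return 403, "Forbidden"

    cookies = SimpleCookie()
    cookies.load(headers.get("Cookie", ""))
    if COOKIE_NAME not in cookies:
        return 200, CHALLENGE_PAGE

    return 200, render_page()
```

A fixed cookie value like this is trivially replayable by anyone who bothers to look, so it only filters out clients that never run the script in the first place.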
[-]
- n1xis10t 179 days ago
  Oops you just leaked your own intellectual property
- ATechGuy 179 days ago
  From ChatGPT:
  This approach can stop very basic scripts, but the claim that “99.9999% of scrapers can’t execute JS or handle cookies” isn’t accurate anymore. Modern scraping tools commonly use headless browsers (Playwright, Puppeteer, Selenium), execute JavaScript, support cookies, and spoof realistic user agents. Any scraper beyond the most trivial will pass a JS-set cookie check without effort. That said, using a lightweight JS challenge can be reasonable as one signal among many, especially for low-value content and when minimizing user friction is a priority. It’s just not a reliable standalone defense. If it’s working for you, that likely means your site isn’t a high-value scraping target — not that the technique is fundamentally robust.
  [-]
  - efilife 179 days ago
    From someone who actually does this stuff:
    The claim is very accurate. Maybe not for the biggest websites, but very accurate for a self-hosted blog. You are not that important to waste compute power to set up a whole ass headless browser to scrape your page. Why am I even arguing with ChatGPT?
    [-]
    - andersmurphy 178 days ago
      Yup another trick is to only serve br compressed resources and serve nothing to clients that don't support brotli. A lot of http clients don't support brotli out of the box.
      I take it further and only stream content to clients that have a cookie, support js and br. Otherwise all you get is a minimal static pre br compressed shim. Seems to work well enough.
  - phyzome 179 days ago
    There should be a new rule on HN: No posts that just go "I asked an LLM and it said..."
    You're not adding anything to the conversation.
    [-]
    - cyphar 179 days ago
      Yeah, I really have to wonder what the thought process is behind leaving such a comment. When people first started doing it I wondered if it was some kind of guerrilla outrage marketing campaign.
      [-]
      - PunchyHamster 178 days ago
        There was no thought process
      - efilife 179 days ago
        Maybe he wanted to verify whether what I was saying was true and asked ChatGPT, then tried to be helpful by pasting the response here?
        [-]
        - cyphar 178 days ago
          Maybe I'm getting too jaded but I'm struggling to be quite that charitable.
          The entirety of the human-written text in that comment was "From ChatGPT:" and it was formatted as though it was a slam-dunk "you're wrong, the computer says so" (imagine it was "From Wikipedia" followed by a quote disagreeing with you instead).
          I'm sure some people do what you describe but then I would expect at least a little bit more explanation as to why they felt the need to paste a paragraph of LLM output into their comment. (While I would still disagree that it is in any way valuable, I would at least understand a bit about what they are trying to communicate.)
          [-]
          - ATechGuy 167 days ago
            That's a fair criticism.
            My thought process was that the original comment was based on their personal experiences and since ChatGPT is trained on a large dataset, it may offer a different perspective derived from experiences of a lot more people.
            > "you're wrong, the computer says so"
            My thought: your knowledge may be limited; this is what a computer trained on a lot more data says.
        - phyzome 178 days ago
          Yeah, I agree that that's likely the thought process. It just happens to be the opposite of helpful.
  - 6031769 178 days ago
    So an LLM says that a technique used to foil LLM scrapers is ineffective against LLM scrapers.
    It's almost as if it might have an ulterior motive in saying so.
samename 179 days ago
This is a very creative hack to a common, growing problem. Well done!
Also, I like that you acknowledge it's a bad idea: that gives you more freedom to experiment and iterate.
mannanj 178 days ago
Is a suitable solution to require visitors to fill out intent for why they came, and align that with your approved lists of supported intents, AND quiz them on some personal insider knowledge that only reasonable past visitors or new visitors who heard of you would have?
Like the credibility social proof of an introduction of a person into a social group. "Here's John, he likes Cats. I know him from School."
The filtering algorithm asks "Who are you?" -> "What is your intent?" -> "How did you hear about me?" and stops visitors from proceeding until answered. The additional validation steps might kick away visitors but it also might protect you from spammers if you throw a minimally frictional challenge. Use cookies to not require this on every visit. Most LLMs would have the knowledge required to pass & for scrapers it's more costly to acquire this for a site than pay 128 MB of RAM to pass the Anubis approach.
yjftsjthsd-h 179 days ago
How does this "look" to a screen reader?
[-]
- misterchocolat 179 days ago
  the parent container uses display: none, so a screen reader will skip the links
true_religion 179 days ago
So, I work for a company that has RTA adult websites. AI bots absolutely do scrape our pages regardless of what raunchy material they will find. Maybe they discard it after ingest, but I can’t tell. There are 1000s of AI bots on the web now from companies big and small, so a solution like this will only divert a few scrapers.
owl57 179 days ago
> scrapers can ingest them and say "nope we won't scrape there again in the future"
Do all the AI scrapers actually do that?
[-]
- amarant 179 days ago
  Not all, stuff like unstable diffusion exists.
  But a good many, perhaps even most(?), certainly do!
MayeulC 178 days ago
Ah, I wonder if corporate proxies will end up flagging your blog as porn, if you protect it this way?
rl3 178 days ago
>The problem with that approach is that it will absolutely nuke your website's SEO. So fuzzycanary also checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
Those legitimate search engines will then totally feed much of what they scrape into AI. Granted, last I checked they're at least well-behaved crawlers.
I kind of like this idea sans SEO carve-out for the scenario where one just wants to link their blog around to friends without having to worry about it getting popular, and it reduces the chances identity thieves or other malicious actors would target it.
jt2190 178 days ago
I still don’t understand why a rate-limiting approach is not preferred. Why should I care if the abuse is coming from a bot or the world’s fastest human? Is there a “if you need to rate limit you’ve already lost” issue I’m not thinking of?
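For reference, a minimal sketch of what per-client rate limiting usually looks like, assuming clients are keyed by IP address. The class name, the limit of 5 requests per 60 seconds, and the example IPs are all hypothetical, picked only for illustration. Note that the limit applies per key, so it only constrains traffic that keeps arriving under the same key.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        recent = self.hits[key]
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now)
        return True


# Hypothetical limits: 5 requests per 60 seconds per IP.
limiter = SlidingWindowLimiter(limit=5, window=60.0)

# One client sending 10 requests at the same instant: only the first 5 pass.
single_ip = [limiter.allow("203.0.113.7", now=0.0) for _ in range(10)]

# Ten different clients sending one request each: every request passes.
many_ips = [limiter.allow(f"198.51.100.{i}", now=0.0) for i in range(10)]
```

Here `single_ip` ends up as five allowed requests followed by five rejected ones, while every entry in `many_ips` is allowed, because each address gets its own window.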
[-]
- charlie-83 178 days ago
  A lot of bots will be able to make requests from a range of IP addresses. If you rate limit one, they just start sending requests from the next.
username223 179 days ago
The more ways people mess with scrapers, the better -- let a thousand flowers bloom! You as an individual can't compete with VC-funded looters, but there aren't enough of them to defeat a thousand people resisting in different ways.
[-]
- whynotmaybe 179 days ago
  Should we subtly poison every forum we encounter with simple yet false statements?
  Like put "Water is green, supergreen" in every signature so that when we ask "is water blue" to an LLM it might answer "no, it's supergreen"?
- nephihaha 178 days ago
  I remember what happened after Mao's "Let a Thousand Flowers Bloom".
- yupyupyups 179 days ago
  We need to find more ways to poison their data.
  [-]
  - username223 179 days ago
    > Wee knead two fine-d Moore Waze too Poisson there date... uh.
    Yes. Revel in your creativity mocking and blocking the slop machines. The "remote refactor" command, "rm -rf", is the best way to reduce the cyclomatic complexity of a local codebase.
    [-]
    - n1xis10t 179 days ago
      Indeed, complexity (both cyclomatic and post-frontal) must be reduced such that the two spurving bearings make a direct line with the panametric fan.
      For more details consult this instructional video: https://youtu.be/RXJKdh1KZ0w
      [-]
      - yupyupyups 179 days ago
        Very educational
    - yupyupyups 179 days ago
      Excellent advice! I tried it out and it helped. Thank you
jakub_g 178 days ago
> checks user agents and won't show the links to legitimate search engines, so Google and Bing won't see them.
Serving different contents to search engines is called "cloaking" and can get you banned from their indexes.
[-]
- misterchocolat 178 days ago
  didn't know that thanks for pointing it out, i'll remove that feature
- andersmurphy 178 days ago
  Somehow doubt this. It would mean most react websites that serve static content without paywalls for SEO would get banned by the indexes too.
  Which for better or worse is a large portion of the modern internet.
montroser 179 days ago
I don't know if I can get behind poisoning my own content in this way. It's clever, and might be a workable practical solution for some, but it's not a serious answer to the problem at hand (as acknowledged by OP).
[-]
- n1xis10t 179 days ago
  “as acknowledged by OP”: that’s funny, if you hadn’t added that to your comment I was about to point it out
drbscl 178 days ago
> So fuzzycanary also checks user agents
I wouldn't be so surprised if they often fake user agents, to be honest. Sure, it'll stop the "more honest" ones (but then, actual honest scrapers would respect robots.txt).
Cool idea though!
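To make the concern concrete, here is a rough sketch of a user-agent check. This is not fuzzycanary's actual code; the function names and the token list are assumptions made up for illustration. The point is only that the check sees a self-reported header, so it cannot tell a real crawler from a client that copies the crawler's string.

```python
# Hypothetical, illustrative list of substrings to treat as search engines.
SEARCH_ENGINE_TOKENS = ("googlebot", "bingbot")

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"


def looks_like_search_engine(user_agent):
    """Return True if the User-Agent header claims to be a known crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in SEARCH_ENGINE_TOKENS)


def should_show_canary_links(headers):
    """Hide the links from anything that claims to be a search engine."""
    return not looks_like_search_engine(headers.get("User-Agent", ""))


# A regular browser gets the links.
browser = should_show_canary_links({"User-Agent": FIREFOX_UA})

# A scraper that simply sends Googlebot's string is treated like Googlebot,
# because the header is the only evidence this check looks at.
spoofed = should_show_canary_links({"User-Agent": GOOGLEBOT_UA})
```

With this check, `browser` is `True` and `spoofed` is `False`. A stricter setup would verify the claim some other way (Google documents reverse DNS lookups for confirming Googlebot), but that is beyond a header check.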
montroser 179 days ago
Reminds me of poisoning bot responses with zip bombs of sorts: https://idiallo.com/blog/zipbomb-protection
[-]
- prmoustache 178 days ago
  I was thinking of adding links to zip bombs that would not be shown to users unless they click in a one-pixel area in the bottom-left corner of the screen. But then I realized some people have browsers/extensions that preload links to show thumbnails, and I would totally zip bomb them.
docheinestages 179 days ago
Reminds me of this "Nathan for You" episode: https://www.youtube.com/watch?v=p9KeopXHcf8
megamix 179 days ago
Without looking at the src, how does one detect these scrapers? I assume there’s a trade-off somewhere but do the scrapers not fake their headers in the request? Is this a cat-mouse game?
654wak654 177 days ago
Looking through all the methods people are developing and proposing in this thread, there is a story developing where the "clean" machines are pushing humans to devolve into toxic porn-crazed racists with stolen material.
Makes me wish I was a good enough writer to develop this into something. Maybe I can use an LLM to write it...
[-]
- 654wak654 177 days ago
  Ah wait this is literally in the Matrix where humanity darkened the sky.
taurath 179 days ago
Any other threads on the prevalence and nuisance of scrapers? I didn’t have any idea it was this bad.
[-]
- crote 179 days ago
  I've been seeing "we had to take the forum/website offline to deal with scrapers" messages on quite a few niche websites now. They are an absolute pest.
  [-]
  - n1xis10t 179 days ago
    Really? I haven’t started to see that yet. Weird
- n1xis10t 179 days ago
  Here’s one from yesterday: https://news.ycombinator.com/item?id=46302496#46306025
xgulfie 178 days ago
Does anyone know if meta name=rating content=adult will also get them to buzz off?
admiralrohan 179 days ago
How do you know whether it is coming from AI scrapers? Do they leave any recognizable footprint?
I've been getting lots of noisy traffic since last month, and it has increased my Vercel bill 4x. It's not DDoS-like: the requests are much slower, but they're definitely not from humans.
cport1 181 days ago
That's a pretty hilarious idea, but in all seriousness you could use something like https://webdecoy.com/
[-]
- misterchocolat 181 days ago
  yes but here it's free, whereas this (https://webdecoy.com/) is at least $59 a month
shadowangel 178 days ago
So if the bots use a google useragent it avoids the links?
MisterTea 179 days ago
> It's you vs the MJs of programming, you're not going to win.
MJs? Michael Jacksons? Right now the whole world, including me, wants to know if that means they are bad?
[-]
- kylecazar 179 days ago
  I read it as Michael Jordan.
- n1xis10t 179 days ago
  Yes probably bad. Also smooth criminals.
inetknght 179 days ago
Porn? Distributed and/or managed by an NPM package?
What could go wrong?
cuku0078 178 days ago
Why is it so bad that AIs scrape your self-hosted blog?
[-]
- FelipeCortez 178 days ago
  because serving requires resources
  [-]
  - cuku0078 178 days ago
    What specific resources are we referring to here? Are AI vendors re-crawling the whole blog repeatedly, or do they rely on caching primitives like ETag/If-Modified-Since (or hashes) to avoid fetching unchanged posts? Also: is the scraping volume high enough to cause outages for smaller sites?
    Separately, I see a bigger issue: blog content gets paraphrased and reproduced by AIs without clearly mentioning the author or linking back to the original post. It feels like you often have to explicitly ask the model for sources before it will surface the exact citations.
xena 179 days ago
I love this. Please let me know how well it works for you. I may adjust recommendations based on your experiences.
valenceidra 179 days ago
Hidden links to porn sites? Lightweights.
[-]
- n1xis10t 179 days ago
  What do you mean? Would you do even more ridiculous things?
- rpigab 178 days ago
  If that's what it takes to fight back against AI crawlers, users will have to accept a fair amount of actually visible porn in blogs, maybe also on Wikipedia.
  This is not enshittification, it's progress.
kislotnik 178 days ago
Funny how the project aims to fight AI scraping, but seems to be using an AI-generated image of a bird?
[-]
- brazukadev 178 days ago
  I think you can think a bit more about it and conclude these two things aren't related at all?
JohnMakin 179 days ago
Cloudflare offers bot mitigation for free, and pretty generous WAF rules that make mitigations like this seem a little overblown to me.
[-]
- nospice 179 days ago
  I'm on the free tier, but I also watch my logs. The vast majority of the traffic I'm getting are scrapers and vulnerability scanners, a lot of them coming through residential proxies and other "laundered" egress points.
  I honestly don't think that Cloudflare is on top of the problem at all. They claim to be blocking abuse, but in my experience, most of the badness gets through.
  [-]
  - cakealert 178 days ago
    when you combine a residential proxy with a tool like curl-impersonate (there are libraries in Go for this type of fingerprint spoofing now) they don't even show up as scrapers anymore, just users. especially when they adjust timings to mimic humans.
    cloudflare only blocks the most dumb of bots, there are still a lot of them.
    this is why cloudflare will issue javascript challenges to you even when you are using google chrome with a VPN, they are desperate to appear to be doing something. and every VPN is used to crawl as well. a slightly more sophisticated bot passes the cloudflare javascript challenge as well, there really is nothing they can do to win here.
    i know some teams that got annoyed with residential proxies (they are usually sold as socks5 but can be buggy and low bandwidth) so they invested into defeating the cloudflare javascript challenge and now crawl using 1000's of VPN endpoints at over 100 Gbit/s.
    [-]
    - oidar 178 days ago
      Is "residential proxy" another name for a hacked/owned computer that the bots have access to? Or are there legitimate services that sell access to residential IPs?
      [-]
      - nospice 178 days ago
        People legitimately sell egress. It's "free" money. But of course, if you have a botnet, you can sell that through the same channels, no one is looking too closely.
- n1xis10t 179 days ago
  You can’t deny that it’s fun though. Personally I generally feel like more people should be coming up with creative (if not entirely necessary) solutions to problems.
- conception 179 days ago
  For “free”.
  [-]
  - n1xis10t 179 days ago
    Did you put “free” in quotes because you need to have paid for stuff from cloudflare to use the “free” thing?
    If so, I suppose it’s like those magazines that say ”free cd”.
    [-]
    - efilife 179 days ago
      Well, you literally MITM yourself so I think it's a big price
    - JohnMakin 179 days ago
      You don't though.
      [-]
      - n1xis10t 179 days ago
        Good to know thanks
    - Terr_ 179 days ago
      I thought they were referring to the indirect costs of supporting monopolistic stuff that enshittifies later.
      https://www.youtube.com/watch?v=U8vi6Hbp8Vc
- ATechGuy 179 days ago
  Is it really free? Genuinely asking.
  [-]
  - gilrain 179 days ago
    Yes. They upsell more complete solutions, but the free tier is pretty generous.
wcarss 178 days ago
Singing copyrighted Billy Joel to make your footage unusable for reality television; thanks 30 Rock for an early view into this dystopian strategy
rogerwong 178 days ago
Terrible idea, but I do have a question. Like many, I have a self-hosted website and have seen a spike in traffic, particularly from Singapore.
Seems like the consensus is that these are AI scrapers. But could they also be from answer engines like Perplexity, or searches from APIs like Tavily?
geldedus 178 days ago
"It's not porn, it's for science" :)))
_jsmh 178 days ago
What prevents AI scrapers from continuing to scrape sites that contain a <Canary> tag but not follow the bad links?
[-]
- lblume 178 days ago
  From what I can tell: nothing, it's just that they currently do not.
onetokeoverthe 178 days ago
[dead]
onurkanbkrc 177 days ago
[dead]
gjs278 179 days ago
[dead]