LWN is currently under the heaviest scraper attack seen yet

(social.kernel.org)

93 points | by luu 1 hour ago

10 comments

  • fancyfredbot 30 minutes ago
    Who are these aggressive scrapers run by?

    It is difficult to figure out the incentives here. Why would anyone want to pull data from LWN (or any other site) at a rate which would cause a DDOS-like attack?

    If I run a big, data-hungry AI lab consuming training data at 100Gb/s, it's much, much easier to scrape 10,000 sites at 10Mb/s than to DDOS a smaller number of sites with more traffic. Of course the big labs want this data, but why would they risk the reputational damage of overloading popular sites in order to pull it in an hour instead of a day or two?
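
As a rough check of the figures in this comment, here is a minimal sketch of the arithmetic. The 100 Gb/s total and the 10,000 sites come from the comment; the function name `perSiteMbps` and the 100-site comparison are hypothetical, and decimal units (1 Gb = 1000 Mb) are assumed.

```javascript
// Per-site bandwidth (Mb/s) needed to sustain a total ingest rate (Gb/s)
// when the load is spread evenly across `siteCount` sites.
// Assumes decimal units: 1 Gb/s = 1000 Mb/s.
function perSiteMbps(totalGbps, siteCount) {
  return (totalGbps * 1000) / siteCount;
}

// Figures from the comment: 100 Gb/s spread over 10,000 sites.
console.log(perSiteMbps(100, 10000)); // 10

// Hypothetical comparison: the same total concentrated on only 100 sites.
console.log(perSiteMbps(100, 100)); // 1000
```

Spreading the load keeps each site at 10 Mb/s, while concentrating it on a hundredth as many sites multiplies the per-site rate by a hundred.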

    • velox_neb 5 minutes ago
      I bet some guy just told Claude Code to archive all of LWN for him on a whim.
    • philipkglass 26 minutes ago
      I don't think that most of them are from big-name companies. I run a personal web site that has been periodically overwhelmed by scrapers, prompting me to update my robots.txt with more disallows.

      The only big AI company I recognized by name was OpenAI's GPTBot. Most of them are from small companies that I'm only hearing of for the first time when I look at their user agents in the Apache logs. Probably the shadiest organizations aren't even identifying their requests with a unique user agent.

      As for why there are a lot of dumb bots interested in my web pages now, when they're already available through Common Crawl, I have no idea.

      • iamnothere 10 minutes ago
        Maybe someone is putting out public “scraper lists” that small companies or even individuals can use to find potentially useful targets, perhaps with some common scraper tool they are using? That could explain it? I am also mystified by this.
    • bjackman 19 minutes ago
      LWN includes archives of a bunch of mailing lists so that might be a factor. There are a LOT of web pages on that domain.
    • mikkupikku 15 minutes ago
      NSA, trying to force everybody onto their Cloudflare reservation.
    • kylehotchkiss 28 minutes ago
      China (Alibaba and Tencent)
      • fancyfredbot 19 minutes ago
        I'm not at all sure Alibaba or Tencent would actually want to DDOS LWN or any other popular website.

        They may face less reputational damage than, say, Google or OpenAI would, but I expect LWN has Chinese readers who would take a dim view of this sort of thing. Some of those readers probably work for Alibaba and Tencent.

        I'm not necessarily saying they wouldn't do it if there was some incentive to do so but I don't see the upside for them.

  • tedivm 45 minutes ago
    I solved this problem for my blog by simply not being interesting.
    • naiv 8 minutes ago
      TIL about Git Brag because of your blog. It is interesting.
    • fancyfredbot 26 minutes ago
      If you can bore an LLM that's exciting.
  • jacquesm 51 minutes ago
    AI allows companies to resell open source code as if they wrote it themselves doing an end run around all license terms. This is a major problem.

    Of course they're not going to stop at just code. They need all the rest of it as well.

    • zipy124 46 minutes ago
      From the creators of easy money laundering (crypto bros), we now bring you easy money laundering 2: intellectual property laundering, coming to a theatre near you soon!
      • gruez 29 minutes ago
        >From the creators of easy money laundering (crypto bros),

        Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people other than being vaguely "tech" and hated by HN? At best you have someone like Altman who founded openai and had a crypto project (worldcoin), but the latter was approximately used by nobody. What about everyone else? Did Ilya Sutskever have a shitcoin a few years ago? Maybe Changpeng Zhao has an AI lab?

        • themafia 22 minutes ago
          > and had a crypto project (worldcoin)

          That was a biometric surveillance project disguised as a crypto project.

          > Is there even any evidence that "crypto bros" and "AI bros" are even the same set of people

          No, the "AI" people are far worse. I always had a choice to /not/ use crypto. The "AI" people want to hamfistedly shove their flawed investment into every product under the sun.

  • iamnothere 19 minutes ago
    I am starting to think these are not just AI scrapers blindly seeking out data. All kinds of FOSS sites including low volume forums and blogs have been under this kind of persistent pressure for a while now. Given the cost involved in maintaining this kind of widespread constant scraping, the economics don’t seem to line up. Surely even big budget projects would adjust their scraping rates based on how many changes they see on a given site. At scale this could save a lot of money and would reduce the chance of blocking.
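
To illustrate what adjusting the scraping rate to observed changes could look like, here is a minimal sketch. It is not taken from any real crawler: the function name `nextIntervalHours`, the halve/double policy, and the 1-hour and 720-hour bounds are all assumptions chosen for illustration.

```javascript
// Hypothetical revisit policy: visit a page half as often each time it is
// found unchanged, and twice as often each time it has changed.
// The interval is clamped to [minHours, maxHours].
function nextIntervalHours(currentHours, changed, minHours = 1, maxHours = 720) {
  const proposed = changed ? currentHours / 2 : currentHours * 2;
  return Math.min(maxHours, Math.max(minHours, proposed));
}

// A page that never changes is visited less and less often.
let interval = 1;
const schedule = [];
for (let visit = 0; visit < 5; visit++) {
  schedule.push(interval);
  interval = nextIntervalHours(interval, false);
}
console.log(schedule); // [ 1, 2, 4, 8, 16 ]
```

Under this assumed policy a static archive page would soon be fetched only rarely, which is the cost saving the comment describes.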

    I haven’t heard of the same attacks facing (for instance) niche hobby communities. Does anyone know if those sites are facing the same scale of attacks?

    Is there any chance that this is a deniable attack intended to disrupt the tech industry, or even the FOSS community in particular, with training data gathered as a side benefit? I’m just struggling to understand how the economics can work here.

    • zomiaen 9 minutes ago
      How many of these scrapers are written by AI for data-science folks who don't remotely care how often they're hitting the sites, where the request rate is something they wouldn't even think to give to or ask the LLM about?
  • gulugawa 1 hour ago
    I've had luck blocking scrapers by overwriting JavaScript methods

    "a.getElementsByTagName = function (...args) { /* clear page content */ };"

    One can also hide components inside Shadow DOM to make it harder to scrape.

    However, these methods will interfere with automated testing tools such as Playwright and Selenium. Also, search engine indexing is likely to be affected.
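
The override described in this comment might be sketched as follows. This is an assumption about the general shape, not the commenter's actual code: the helper name `trapGetElementsByTagName` and its `onCall` callback are hypothetical, and the sketch only works if the page's own scripts never call the trapped method.

```javascript
// Replace getElementsByTagName on a document-like object so that every call
// runs `onCall` first (for example, a function that clears the page), then
// falls through to the original method.
// Returns a function that restores the original method.
function trapGetElementsByTagName(target, onCall) {
  const original = target.getElementsByTagName;
  target.getElementsByTagName = function (...args) {
    onCall(args);
    return original.apply(this, args);
  };
  return () => {
    target.getElementsByTagName = original;
  };
}

// Browser usage (assumption: nothing on the page itself calls this method):
// trapGetElementsByTagName(document, () => { document.body.textContent = ""; });
```

Taking the target as a parameter keeps the sketch testable outside a browser; in a page it would be passed `document` or an element.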

    • TurdF3rguson 10 minutes ago
      You think you've had luck. The truth is you have no way of knowing if this ever had any effect at all.
    • bogwog 17 minutes ago
      This is a fun idea, especially if you make those functions procedurally generate garbage to get them stuck
  • blakesterz 50 minutes ago

      "It is a DDOS attack involving tens of thousands of addresses"
    
    It is amazing just how distributed some of these things are. Even on the small sites that I help host we see these types of attacks from very large numbers of diverse IPs. I'd love to know how these are being run.
    • wongarsu 9 minutes ago
      There are plenty of providers selling "residential proxies", distributing your crawler traffic through thousands of residential IPs. BrightData is probably the biggest, but it's a big and growing market.

      And if you don't care about the "residential" part, you can get proxies with data center IPs for much cheaper from the same providers. But those are easily blocked.

    • PaulDavisThe1st 13 minutes ago
      Another reference point: we've had well over 1M unique IP addresses hit git.ardour.org as part of a stupid-as-hell git scraping effort. 1M !!!
    • smitty1e 36 minutes ago
      Call it a "Distributed Intelligence Logic Denial Of Service" (DILDOS) attack both to name it distinctly and characterize the source.
  • bloppe 29 minutes ago
    I'm curious how they concluded this was done to scrape for AI training. If the traffic was easily distinguishable from regular users, they would be able to firewall it. If it was not, then how can they be sure it wasn't just a regular old malicious DDOS? Happens way more often than you might think. Sometimes a poorly-managed botnet can even misfire.
    • MBCook 27 minutes ago
      Why would anyone ever DDOS them? They’ve been around for about three decades now, I don’t know if they’ve ever had a DDOS attack before the AI crawling started.
  • zahlman 1 hour ago
    Is it still ongoing? The thread appears to be over 24 hours old and as a quick test I had no issue loading the main page (which is as snappy and responsive as expected from a low-bandwidth site like LWN).
    • jzb 3 minutes ago
      Not at the moment. It’s subsided for now.
  • blibble 1 hour ago
    the perverse incentive is if you ddos the website such that it shuts down, no other "AI" parasites can get the valuable data

    big tech incentivised to ddos... what a world they've built

    • phkahler 1 hour ago
      It's called pulling up the ladder behind you, or building a moat!
    • ronsor 1 hour ago
      This sounds like a conspiracy theory.
      • MBCook 55 minutes ago
        I don’t think they’re saying that’s actually happening here, just that it could happen and is accidentally incentivized.
      • pwdisswordfishy 54 minutes ago
        If it's a conspiracy, it would be one where the Minimum Viable Conspirator Count is 1 (inclusive of one's own self).

        In that case, by that rubric literally anything that you conspire with yourself to accomplish (buying next week's groceries, making a turkey sandwich...) would also be a conspiracy.

    • NitpickLawyer 57 minutes ago
      Umm... what data? That's a very old newsletter-like site. Everything that's public on it has been long scraped and parsed by whoever needed it. There's 0 valuable data there for "parasites" to parasite off of.

      I also don't get the comments on the linked social site. IIUC the users posting there are somehow involved with kernel work, right? So they should know a thing or two about technical stuff? How / why are they so convinced that the big bad AI baddies are scraping them, and not some misconfigured thing that someone or another built? Is this their first time? Again, there's nothing there that hasn't been indexed dozens of times already. And... sorry to say it, but neither newsletters nor the 1-3 comments on each article are exactly "prime data" for any kind of training.

      These people have gone full tinfoil hat and spewing hate isn't doing them any favours.

      • MBCook 55 minutes ago
        I don’t think they were talking about LWN specifically but just in general.
      • homebrewer 39 minutes ago
        Because it started in 2022 and hasn't subsided since? This is just the latest iteration of "AI" scrapers destroying the site, and the worst one yet.

        https://lwn.net/Articles/1008897

        Your nonsense about LWN being a "newsletter" and having "zero valuable data" isn't doing you any favors. It is the prime source of information about Linux kernel development, and Linux development in general.

        "AI" cancer scraping the same thing over and over and over again is not news for anybody even with a cursory interest in this subject. They've been doing it for years.

        • NitpickLawyer 33 minutes ago
          > LWN.net is a reader-supported news site

          I mean...

          Again, the site is so old that anything worthwhile is already in cc or any number of crawls. I am not saying they weren't scraped. I'm saying they likely weren't scraped by the bad AI people. And certainly not by AI companies trying to limit others from accessing that data (as the person who I replied to stated).

          • MBCook 26 minutes ago
            Why is it each of your comments seems to include a dig attacking LWN?
  • chrisjj 59 minutes ago
    So which is it? DDOS attack or "AI" scrapers?
    • TurdF3rguson 7 minutes ago
      Scrapers because DDOS implies that it's malicious rather than accidental and there's no reason to think that.
    • fabian2k 42 minutes ago
      Sufficiently aggressive and inconsiderate scraping is indistinguishable from a DDOS attack.
    • Y-bar 41 minutes ago
      A sufficiently stupid and egregious AI scraper is indistinguishable from a DDOS attack.

      Edit: Fabian2k was ten seconds ahead. Damn!