Ask HN: Best practice to protect against back-end data exfiltration via a website?

Hey, I am the odd non-engineer/dev here (although a long-time geek for sure!) who has read HN for years. So please consider this, first, as a “thank you” to everyone from whom I have learned so much just reading the fascinating articles everyone has posted, and likewise following the many lively discussions that have ensued.

And if I am asking this question the wrong way (or thinking about it the wrong way), please don’t hesitate to let me know.

I have been thinking through a B2C concept where the moat/value, if you will, would be the data. The website would allow you to query those data via increasingly specific filters. If I go through with this, collecting and organizing these data will be extremely expensive/time-consuming.

And that leads me to ask… is there a best practice for keeping a bot from running through every permutation of every filter and re-creating my master data set? The data itself isn’t proprietary. It’s just the organization and presentation of it that no one else is doing and I think would solve an existing pain point and provide real value to folks.

Would love thoughts on this. I get that no solution is bulletproof, but I am sure others have had to contend with the same thing, so I would love to at least understand the best practices here so that I am, at minimum, not leaving the front door wide open if there’s a cheap lock I can put on it. :)

Also, at least in an ideal world, I don’t want to force people to create accounts to access this, so it seems like I would need something in place to try to identify bots and block/confuse them. The good news is the number of permutations will be so large that it wouldn’t be realistically possible to do this by hand… so it seems like identifying “real” vs. “bot” should/could (?) be relatively straightforward.

Thank you!!

2 points | by markden 15 hours ago

1 comment

  • solardev 14 hours ago
    Web dev here, but not cybersec focused... if I'm wrong, someone will be along to correct me shortly :)

    That said, I'm reasonably confident that what you want isn't doable/practical, unfortunately :(

    While there are certainly companies that make valuable datasets available over the web, the usual way they prevent mass scraping is by enforcing account limits, making retrieval expensive and limiting it to one tiny slice of data at a time. An example industry that does this is mass data harvesting/targeting: companies like Meta, Alphabet, or political outfits (NGPVan, Actblue, etc.). They cross-reference a lot of PII floating around the internet, and/or harvest their own, and then sell that to advertisers or political campaigns, but only a slice at a time, and at prices that they determine. You can of course pay to scrape any one slice of it, but if you wanted the whole dataset, you'd probably end up paying more than the entire company is worth.
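
    To make "account limits" concrete, here's a rough sketch of the usual shape in Python/Flask. Everything in it is illustrative: the endpoint, the quota numbers, and run_query() are all made up, and a real service would keep the counters in Redis or similar rather than in process memory:

        import time
        from collections import defaultdict, deque
        from flask import Flask, abort, jsonify, request

        app = Flask(__name__)

        MAX_REQUESTS_PER_HOUR = 100  # hypothetical per-key quota
        MAX_ROWS_PER_QUERY = 50      # one "tiny slice" at a time

        # In-memory log of request timestamps per API key.
        request_log = defaultdict(deque)

        def check_quota(api_key):
            """Sliding-window rate limit: reject keys over their hourly quota."""
            now = time.time()
            window = request_log[api_key]
            while window and window[0] < now - 3600:
                window.popleft()  # drop requests older than an hour
            if len(window) >= MAX_REQUESTS_PER_HOUR:
                abort(429)  # Too Many Requests
            window.append(now)

        def run_query(filters):
            # Stand-in for the real filtered lookup against your dataset.
            return [{"id": i, "filters": dict(filters)} for i in range(1000)]

        @app.route("/search")
        def search():
            api_key = request.headers.get("X-Api-Key")
            if not api_key:
                abort(401)
            check_quota(api_key)
            # Cap the response so no single query returns the whole dataset.
            return jsonify(run_query(request.args)[:MAX_ROWS_PER_QUERY])

    The row cap is the important half: even a well-behaved key only ever sees the data one sliver at a time, so reassembling the whole set costs the scraper real time and money.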

    That, or their data is inherently time-sensitive, such that older copies of it aren't as valuable. Stocks, real estate sites, news tickers, etc. come to mind, where sure, you can scrape their stuff, but unless you perform some sort of value-added collation/analysis on top of it, it's going to be stale by the time you serve it to your own users. The data originators are always one step ahead of you.

    If your data isn't proprietary to begin with (i.e. you're not the one making it and adding updates) AND you want it to be publicly accessible without an account... it's only a matter of time before some botnet or another scrapes all of it.

    You can do things to slow down the scraping, such as adding Cloudflare, but realistically, bots and labor are very cheap in much of the world, and if someone really wants your data, they'll get it. It's essentially free to them, especially if you've done all the hard work of collecting it and putting it all on a single website.

    It will always take more time for you to manually add filter permutations than it takes a script and botnet to enumerate them. They can just tweak parameters and send the requests through thousands of headless browsers running in dispersed instances across the world.
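
    For a sense of scale, the enumeration side is only a few lines. This is a toy sketch (the URL and filter names are hypothetical), but it really is the whole job:

        import itertools
        import requests

        # Hypothetical filters; every combination is one Cartesian product away.
        FILTERS = {
            "category": ["a", "b", "c"],
            "region": ["us", "eu", "apac"],
            "year": [2021, 2022, 2023],
        }

        for combo in itertools.product(*FILTERS.values()):
            params = dict(zip(FILTERS.keys(), combo))
            resp = requests.get("https://example.com/search", params=params)
            # A real scraper would store resp here, and rotate IPs and
            # user agents across a botnet to stay under any rate limits.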

    You can require account signup and verification before accessing the data, but that's also trivially faked unless you're requiring real payments.

    Identifying real users vs bots is anything BUT trivial. Google and Cloudflare and hCaptcha have spent decades trying to solve that with huge teams and world-class researchers. And even they only have limited success rates, especially since anybody can spend pennies to hire real humans to run through your captchas. And that problem is only going to get harder, much harder, with all the advancements in machine learning, natural language processing, and machine vision.

    Sorry for the bad news =/ I hope I'm wrong, but I'm fairly confident you can't really accomplish this.

    • markden 14 hours ago
      While you are right, and this isn’t what I was hoping to hear :), I do really appreciate the helpful response. Thank you!
      • solardev 13 hours ago
        You're welcome, but also keep in mind that it's just my opinion :) Someone else might come along and tell you all the ways I'm wrong.

        Also, it's not a black & white situation. If your dataset isn't super valuable, or if it's just niche enough, it's possible that adding Cloudflare by itself would be "good enough" protection. It's a LOT better than nothing, and also much better protection than what most people can DIY on their own.

        • markden 13 hours ago
          Yeah, and that’s kind of exactly what I am looking for. This is niche enough that I am likely overly concerned someone would do real work to “steal” it. But I also always lock my car, even if someone can still smash the window. :)
          • solardev 13 hours ago
            That's a good analogy. If you have the first-mover advantage and can earn user loyalty through good UX or whatever, it might not really matter that much even if someone does steal your data. Worth a shot?
            • markden 12 hours ago
              Haha still thinking through that. But potentially!