And if I am asking this question the wrong way (or thinking about it the wrong way), please don’t hesitate to let me know.
I have been thinking through a B2C concept where the moat/value, if you will, would be the data. The website would let you query those data via increasingly specific filters. If I go through with this, collecting and organizing these data will be extremely expensive and time-consuming.
And that leads me to ask… is there a best practice for keeping a bot from running through every permutation of every filter and re-creating my master data set? The data itself isn’t proprietary. It’s just the organization and presentation of it that no one else is doing and I think would solve an existing pain point and provide real value to folks.
Would love thoughts on this. I get no solution is bulletproof, but I'm sure others have had to contend with the same thing, so I'd love to at least understand what best practices are here so that, at minimum, I'm not leaving the front door wide open if there's a cheap lock I can put on it. :)
Also, at least in an ideal world, I don't want to force people to create accounts to access this, so it seems like I would need something in place to try to identify bots and block/confuse them. The good news is the number of permutations will be so large, it wouldn't be realistically possible to do this by hand… so it seems like identifying "real" vs "bot" should/could (?) be relatively straightforward.
Thank you!!
That said, I'm reasonably confident that what you want isn't doable/practical, unfortunately :(
While there are certainly companies that make valuable datasets available over the web, the usual way they prevent mass scraping is by enforcing account limits, making retrieval expensive and limited to one tiny slice of data at a time. An example industry that does this is mass data harvesting/targeting: companies like Meta, Alphabet, or political-data firms (NGPVan, Actblue, etc.). They cross-reference a lot of PII floating around the internet, and/or harvest their own, and then sell that to advertisers or political campaigns, but only a slice at a time, and at prices that they determine. You can of course pay to scrape any one slice of it, but if you wanted the whole dataset, you'd probably end up paying more than the entire company is worth.
That, or their data is inherently time-sensitive, such that older copies of it aren't as valuable. Stocks, real estate sites, news tickers, etc. come to mind, where sure, you can scrape their stuff, but unless you perform some sort of value-added collation/analysis on top of it, it's going to be stale by the time you serve it to your own users. The data originators are always one step ahead of you.
If your data isn't proprietary to begin with (i.e. you're not the one making it and adding updates) AND you want it to be publicly accessible without an account... it's only a matter of time before some botnet or another scrapes all of it.
You can do things to slow down the scraping, such as adding Cloudflare, but realistically, bots and labor are very cheap in much of the world, and if someone really wants your data, they'll get it. It's essentially free to them, especially if you've done all the hard work of collecting it and putting it all on a single website.
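To make the "slow it down" idea concrete, here's a minimal sketch of the kind of cheap lock that's better than nothing: a per-client token-bucket rate limiter. All the names and numbers here are illustrative assumptions, not a recommendation; in practice you'd put Cloudflare or a reverse proxy in front instead of rolling this by hand, and a determined scraper with thousands of IPs walks right past per-IP limits anyway.

```python
# Illustrative per-client token-bucket rate limiter (assumed parameters).
# Each client gets a bucket that refills at RATE tokens/sec up to BURST;
# each request spends one token, and requests with an empty bucket are refused.
import time
from collections import defaultdict

RATE = 5.0    # tokens refilled per second (assumption)
BURST = 10.0  # maximum bucket size, i.e. allowed burst (assumption)

buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

def allow(client_ip: str) -> bool:
    """Return True if this client may make a request right now."""
    b = buckets[client_ip]
    now = time.monotonic()
    # Refill proportionally to elapsed time, capped at BURST.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["last"]) * RATE)
    b["last"] = now
    if b["tokens"] >= 1.0:
        b["tokens"] -= 1.0
        return True
    return False

# A burst of 15 back-to-back requests from one (made-up) IP:
# roughly the first 10 pass, the rest get throttled.
results = [allow("203.0.113.7") for _ in range(15)]
print(results.count(True))
```

The point of the sketch is the limitation, not the defense: the limit is keyed on `client_ip`, so a botnet spreading requests across many addresses never empties any one bucket.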
It will always take you more time to manually add filter permutations than it takes a script and a botnet to enumerate them. They can just tweak parameters and send them through thousands of headless browsers running in dispersed instances across the world.
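To show just how lopsided that race is, here's a sketch of the enumeration side. The filter names and values are made-up placeholders, but the mechanism is exactly this simple: a scraper takes the Cartesian product of every filter's options and turns each combination into a query.

```python
# Sketch: enumerating every combination of (hypothetical) site filters.
# A scraper would turn each tuple into a query URL and fan them out
# across headless browsers / proxies.
from itertools import product

# Placeholder filters — assumptions, not the real site's schema.
filters = {
    "category": ["a", "b", "c"],
    "region": ["north", "south", "east", "west"],
    "year": list(range(2000, 2025)),
}

combos = list(product(*filters.values()))
print(len(combos))  # 3 * 4 * 25 = 300 combinations, generated instantly
```

Three small filters already yield 300 queries; each additional filter multiplies the count, but generating the list still takes milliseconds. The combinatorics that make your dataset feel too big to copy by hand are no obstacle at all to a script.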
You can require account signup and verification before accessing the data, but that's also trivially faked unless you're requiring real payments.
Identifying real users vs bots is anything BUT trivial. Google and Cloudflare and hCaptcha have spent decades trying to solve that with huge teams and world-class researchers. And even they only have limited success rates, especially since anybody can spend pennies to hire real humans to run through your captchas. And that problem is only going to get harder, much harder, with all the advancements in machine learning, natural language processing, and machine vision.
Sorry for the bad news =/ I hope I'm wrong, but I'm fairly confident you can't really accomplish this.
Also, it's not a black & white situation. If your dataset isn't super valuable, or if it's just niche enough, it's possible that adding Cloudflare by itself would be "good enough" protection. It's a LOT better than nothing, and also much better protection than what most people can DIY on their own.