How we run Firecracker VMs inside EC2 and start browsers in less than 1s

(browser-use.com)

93 points | by gregpr07 1 day ago

12 comments

losteric 1 hour ago
> Plain headless Chromium is easy to detect by websites with anti-bot measures. Plain headless Chromium avoided getting blocked by websites only 2% of the time, according to our stealth benchmark.
> Our browsers avoid blocks 81% of the time on our stealth benchmark, and 84.8% on Halluminate BrowserBench, the highest of any provider.
Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
These kinds of services inevitably make the web more human-hostile and expensive. Websites will continue pushing back on automated usage, meaning more hurdles to access content.
No doubt part of why we see this push for verified ID on the web - not just age gating and "protect the children", but also protect sites from bots, and protect ad revenue (not a statement of support; just seems like an obvious higher-order effect).
- mikeocool 11 minutes ago
  Whether or not scraping publicly available websites is unethical is probably up for debate. In some cases at least, courts have found it to be legal, even when the site throws up technical barriers or issues cease-and-desist letters.
  What is likely unethical is the fact that they offer residential proxies. The residential providers of those proxies are frequently not aware they’ve been opted in to provide such a service.
- embedding-shape 49 minutes ago
  > Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
  Unethical just because it does something someone else doesn't want? I guess it depends on why and what the intention is. I don't have time to sit 24/7 in front of a computer to get a ticket to some events, does that mean it's unethical for me to use my own bot so I can purchase a ticket to bands I'm a fan of? Probably not. But if I did so for scalping purposes? Then yeah, I'd agree it's unethical.
  The whole point of anti-anti-bot measures is to be able to do things even if others don't think that thing should be automated, so from the hacker news audience, I think quite a lot of us have at one point or another engaged in stuff like that. Doing so merely for profits of course stinks, but for you to be able to have a fighting chance against scalpers? Probably OK.
  - turtlebits 8 minutes ago
    It's unethical because you're intentionally bypassing restrictions. Just because others do it doesn't mean it's okay.
    If you saw a sign in a store that said "1 per person" or "for registered guests only", would you ignore it?
    - orf 3 minutes ago
      No, but what if the sign said “white people only”?
      The point is that the context matters: both the users context and the context of the restriction. It’s not as clear cut as “ignoring restrictions = bad”.
      To take it to the extreme: was Rosa Parks unethical for sitting down on a bus?
    - embedding-shape 5 minutes ago
      > It's unethical because you're intentionally bypassing restrictions
      I'd still consider why the restriction is there and why I'm thinking of breaking it, before deciding if it's unethical or not.
      It depends, basically. Generally I follow the rules and restrictions, but maybe see them more as guidelines or suggestions.
  - joatmon-snoo 44 minutes ago
    An example I ran into recently: I wanted to scrape pricing data for used cars, to better inform a friend's decision about what to purchase.
    I know there's a relationship between mileage and depreciation, but wanted to have a better sense of what that relationship is to know whether a given car was over or underpriced.
    Similarly, if I was pulling that data to build a service of my own to offer to users... is that unethical?
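The mileage-versus-price question above can be sketched as an ordinary least-squares line fit. This is only an illustration under stated assumptions: the listings are made-up numbers, not scraped data, and `fit_line` and `price_gap` are hypothetical helper names. Real depreciation is unlikely to be this linear.

```python
# Made-up (mileage, asking price) pairs for one hypothetical model and year.
LISTINGS = [(20_000, 19_000), (40_000, 18_000), (60_000, 17_000), (80_000, 16_000)]


def fit_line(points):
    """Ordinary least-squares fit of price = slope * mileage + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def price_gap(slope, intercept, mileage, asking_price):
    """Asking price minus the fitted price; positive suggests overpriced."""
    return asking_price - (slope * mileage + intercept)


slope, intercept = fit_line(LISTINGS)
gap = price_gap(slope, intercept, mileage=50_000, asking_price=18_500)
print(f"price change per mile: {slope:.3f}, gap vs. fitted line: {gap:.0f}")
```

A positive gap means the listing sits above the fitted line for its mileage, which is one rough signal that it may be overpriced relative to comparable cars.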
    - sroussey 14 minutes ago
      All of these questions are easily answered by the question: can I run the bot on the same PC I use regularly? If so, then do it there. If not, then don’t do it at all.
  - mystifyingpoi 35 minutes ago
    > even if others don't think that thing should be automated
    It's an interesting thought that can be further explored. Could anything that's considered "unwanted" by a third party be considered unethical, if I do it anyway?
    If the hotel self-service restaurant has a sign "don't take the food out" and I take 1 apple in my pocket for a snack, is it unethical? Or maybe the sign is just for people that would otherwise take $100 of watermelons out of the cantina daily and try to resell it on the beach.
  - skybrian 22 minutes ago
    What do you think of Anubis and Cloudflare? If they block your bot, is that unethical?
    Seems like doing business with other people should normally be based on mutual consent, not whatever you can get away with technically.
- mystifyingpoi 43 minutes ago
  > Seems very unethical, no?
  I don't think one can judge it ethically without considering the context. Are we talking about mass automated scraping? Or are we talking about me trying to get a good deal by scraping local used car dealership listings once per day for my personal needs (just so I don't have to do it manually)?
  One of these is strictly more ethical, but both will be blocked by Cloudflare for example. I'd happily use such service in my personal case.
- nateb2022 45 minutes ago
  > Seems very unethical, no? Who uses service providers like this? The whole point of anti-bot measures is to get rid of bots - you are not wanted there.
  I'm familiar with companies automating access to software that is only accessible via the web and has poor or no API support. This is software they pay (usually a lot of money) for, and it usually has built-in captchas to guard logins. They aren't a large enough customer to ask for the captchas to be removed or to be whitelisted (they're just one of many SaaS tenants), so they simply work around the restriction.
- ge96 32 minutes ago
  I briefly tried to do this job, where it was scraping Steam for CS:GO skins (think a knife skin for $2,000.00), and yeah, trying to find proxy providers and get around the IP limit... tough one, but there's a market for it, with people paying for the tool (not mine).
- wnevets 1 hour ago
  > Who uses service providers like this?
  People who don't want their headless browser to get blocked?
- sillysaurusx 48 minutes ago
  (I haven't tried this out yet.) My use case would be to take a snapshot of each HN story. This is surprisingly hard, because most websites prevent bots from doing that.
  For example, Claude has a lot of trouble reading HN's front page. HN itself is fine, but the moment you ask it to pick out an article, it often chokes. The website has put up a verification captcha, or it's a paywall, etc. Paywalls can be bypassed by reading HN comments and looking for archive links. But those archives often block bots too, so you're back to square one.
  Whether it's unethical is an interesting question. I believe I should have the right to do what I want with internet content, as long as I'm not abusive. Merely having a bot isn't abusive. It would be one thing if the bot is hammering a server or vacuuming up training data, but having a bot at all is presently very hard.
  This service caught my attention because it could potentially solve the problem I'm running into. Simply taking snapshots of articles that hit HN shouldn't be so hard, but it is. HN sends millions of views to websites; one bot taking a snapshot isn't going to make a difference. I don't think it counts as "unethical" just because we're going against the website owner's wishes. When you post content to the internet, you sign up to share that content with everyone, other than what's denied by robots.txt. If it's not blacklisted by robots.txt, it should be possible for well-behaved bots to access.
  I don't expect very many people here to care about the poor bot creators. Most of the bot creators are malicious anyway. But I personally lament the loss of being able to write a program that can process information from the browser in arbitrary ways. You should be able to, yet we're buying into the notion that it's okay for website owners to say "this content is only accessible by approved bots like Google, and everyone else can sod off."
  HN proves it doesn't need to be like that. It gets dozens of millions of page views a day, a lot of which is bot traffic. HN only uses captchas for creating accounts or logging in. You're free to scrape any content as long as you respect the crawl delay of 30 seconds specified in robots.txt, and don't try to visit links that perform actions a human would take (like adding things to favorites or voting). That's how the internet should work: just deliver content.
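A minimal sketch of the "well-behaved bot" idea described above, using Python's standard `urllib.robotparser`. The robots.txt text, the `example.com` URLs, the `snapshot-bot` user agent, and the helper names are all assumptions for illustration, and the rules are inlined so the sketch runs offline. A real bot would load the site's actual robots.txt instead.

```python
from urllib import robotparser

# Hypothetical robots.txt, inlined for the example (not copied from any site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /vote
Disallow: /fave
Crawl-delay: 30
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetches(parser, user_agent, urls):
    """Return the URLs the bot may fetch and the delay to wait between requests."""
    delay = parser.crawl_delay(user_agent) or 0
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, delay


parser = build_parser(ROBOTS_TXT)
allowed, delay = plan_fetches(
    parser,
    "snapshot-bot",
    ["https://example.com/item?id=1", "https://example.com/vote?id=1"],
)
print(allowed, delay)
```

The bot would then sleep for `delay` seconds between requests to the allowed URLs and never touch the action links.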
- cute_boi 45 minutes ago
  Exactly. Crappy companies like Browser Use are causing more captchas, etc. All these scraper companies should have been heavily regulated. They use residential proxies, creating an incentive for hacking IoT devices, etc.
- stogot 43 minutes ago
  I wish simpler bots existed for consumers. I want to know when someone replies to me, when a price drops, when airlines open new seat reservations, when a new seat opens for a college class, when a concert is coming to my area for a musician I listen to, when my local grocer has new stock, when a new Hyatt offer is available in a city I want to visit, etc. That doesn’t mean I’m abusive. I can have it check once a day. In almost all those cases, I want to spend money with the business but I don’t want to manually check.
- ranger_danger 48 minutes ago
  Web archival/preservation services/projects that need to get past captchas and other bot checks are a prime target for a service like this... but I think their main customers are people just mass scraping parts of the internet for less altruistic reasons.
- zuzululu 14 minutes ago
  Once again I'd like to remind everyone that violating Terms of Service isn't the same as violating some moral ethics. They are literally just expectations with no enforceable or legal boundaries.
  For example I could write in my Terms of Service that you do not view more than one page on my website and expect you to send me a written permission to read the rest. I don't expect anybody to follow and I sure don't think less of those that do.
  The push for verified IDs is not related to this; it's more of a politically motivated attempt at selling fear to justify more surveillance.
swazzy 13 minutes ago
> During a burst in traffic, the system, instead of reacting on its own, required humans to adjust it.
Isn't this solvable with autoscaling? How is this not an issue with Firecracker as well?
CompuIves 1 hour ago
Very cool to see more use of userfaultfd. It's a really powerful API because you can fully control how and from where memory is loaded during a page fault.
wewewedxfgdf 34 minutes ago
But Firecracker is not compatible with GPU for Chrome, is that right?
That means Chrome is slow - quite the tradeoff.
- Reformedot 25 minutes ago
  Our browsers beat competitors in performance too. Chrome mainly uses the CPU, not the GPU.
  We support GPU via software, though.
rbbydotdev 1 hour ago
> The catch is that regular EC2 is already a VM. AWS runs our host inside its own isolation layer, and then we run browser VMs inside that host. In other words, every browser is a VM inside a VM.
yes but i think there are specifically some ec2 instance types which give you hypervisor access and thereby firecracker too - someone correct me if im wrong?
- roboben 1 hour ago
  yes only c8i, m8i and r8i instance types support it. It is called nested virtualization[1]
  [1] https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-ec...
  - thundergolfer 1 hour ago
    Unfortunately supply is quite limited. If you want to horizontally scale on these instances you need to have a good relationship with AWS so they'll give you a big allocation before c9i is a thing.
    - roboben 56 minutes ago
      also i found them much less stable than metal instances running into weird kvm failures
      - Reformedot 52 minutes ago
        Yes, it is. It was a challenge to make it work smoothly without metal. The scale-out speed was one of the main reasons.
- torginus 48 minutes ago
  When we needed quite big machines (AWS metal instances), we found the performance differential between metal and the equivalent-size VM was 10-20% for CPU-heavy workloads.
gozzoo 1 hour ago
The article doesn't mention docker at all. I don't understand why containers are not a viable solution for headless browsers.
- kevmo314 56 minutes ago
  Their competitive advantage is not so much running the browser but rather making the browser undetectable.
  They boast a large residential proxy network too, which tells you all you need to know.
  - sroussey 9 minutes ago
    Yeah, where is the blog post on the residential network?
- dizhn 17 minutes ago
  Startup time probably. They can start firecracker from a snapshot state.
- torginus 51 minutes ago
  Or processes. Chrome has built-in process isolation for every browser tab. It starts up darn near instantly, and scores as 'pretty good' as far as sandboxing is concerned.
- roboben 55 minutes ago
  docker is not a security boundary but a resource boundary.
  - cute_boi 43 minutes ago
    It is a security boundary, but a weak one. Escaping from docker is very hard.
- Reformedot 53 minutes ago
  Docker does not isolate, consumes more resources and is slower
jauntywundrkind 16 minutes ago
I love that they start with no core pinning, then switch over to having cores pinned.
This could be a bit of a tricky one, but I'd expect Checkpoint/Restore In Userspace (CRIU) eventually tackles a lot of this. An image of a running Chromium process on a tmpfs (in-memory filesystem) that can just be launched endlessly tackles the memory slowdown problem and eliminates conventional startup costs. This feels like an ideal CRIU use case.
I imagine there's a lot of things Chrome needs to run though, bits of state to save/restore.
rbbydotdev 1 hour ago
crazy that the maker of chrome (google), who also owns a massive amount of cloud services, has not made a cloud product identical to this yet
- _pdp_ 9 minutes ago
  not google but cloudflare has a similar product - though I am not sure how good it is
- bfeynman 37 minutes ago
  they kind of do.. gcp has their lambda equivalent which i believe comes with chromium preinstalled, it's how major search tools like jina work. sure, there's probably something about session management that they neuter to prevent abuse though
- ranger_danger 38 minutes ago
  They have IMO: https://web.archive.org/web/20180823072111/https://cloud.goo...
  They just don't have access to giant pools of residential IPs, so too many sites end up blocking all the cloud providers by IP range/ASN anyway, even if they could get through a captcha.
stogot 1 hour ago
How do you handle browser sessions?
- Reformedot 49 minutes ago
  We persist profiles to maintain sessions if needed. This includes cookies, session storage and everything needed to keep your account logged in.
nisten 45 minutes ago
fancy terms aside... they likely just run alpine linux.
eptcyka 1 hour ago
[flagged]
fsuts 1 day ago
“click this button, type this text, read this page, take this screenshot.”
You left in the AI’s instructions. lol
Interesting read though, thanks
- gregpr07 1 day ago
  well that's how browser agents work in a nutshell lol