> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.
As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.
> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions
Also appreciate the honesty here.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]
> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.
Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)
Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but it taking a further 4 hours to resolve the problem and 8 hours total for everything to be back to normal isn't great.
Because we initially thought it was an attack. And then when we figured it out we didn’t have a way to insert a good file into the queue. And then we needed to reboot processes on (a lot) of machines worldwide to get them to flush their bad files.
Thanks for the explanation! This definitely reminds me of CrowdStrike outages last year:
- A product depends on frequent configuration updates to defend against attackers.
- A bad data file is pushed into production.
- The system is unable to easily/automatically recover from bad data files.
(The CrowdStrike outages were quite a bit worse though, since it took down the entire computer and remediation required manual intervention on thousands of desktops, whereas parts of Cloudflare were still usable throughout the outage and the issue was 100% resolved in a few hours)
It'd be fun to read more about how you all procedurally respond to this (but maybe this is just a fixation of mine lately). Like: are you tabletopping this scenario? Are teams building out runbooks for how to quickly resolve this? What's the balancing test between "this needs a functional change to how our distributed systems work" and "instead of layering additional complexity on, we should just have a process for quickly, and maybe even speculatively, restoring this part of the system to a known good state in an outage"?
We incorrectly thought at the time it was attack traffic coming in via WARP into LHR. In reality it was just that the failures started showing up there first because of how the bad file propagated and where in the world it was working hours.
Probably because it was the London team that was actively investigating the incident and initially came to the conclusion that it may be a DDoS while being unable to authenticate to their own systems.
Question from a casual bystander, why not have a virtual/staging mini node that receives these feature file changes first and catches errors to veto full production push?
Or do you have something like this, but the specific DB permission change in this context only failed in production?
I think the reasoning behind this is because of the nature of the file being pushed - from the post mortem:
"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."
In this case, the file fails quickly. A pretest that consists of just attempting to load the file would have caught it. Minutes is more than enough time to perform such a check.
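For illustration, a minimal sketch of such a pretest, assuming a line-per-feature file and the 200-feature limit mentioned elsewhere in this thread (neither is Cloudflare's actual format): run the candidate file through the same loader the edge uses and veto the push if it fails.

```rust
use std::{fs, process};

// Hypothetical loader mirroring what the edge proxy would do with the file.
fn load_feature_file(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("{} features exceeds the hard limit of 200", features.len()));
    }
    Ok(features)
}

fn main() {
    // Path is illustrative; in practice this would be the freshly generated candidate.
    let raw = fs::read_to_string("candidate_features.conf").unwrap_or_default();
    match load_feature_file(&raw) {
        Ok(features) => println!("preflight passed ({} features), safe to publish", features.len()),
        Err(e) => {
            eprintln!("veto: candidate feature file failed preflight: {e}");
            process::exit(1); // block the production push
        }
    }
}
```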
Just asking out of curiosity, but roughly how many staff would've been involved in some way in sorting out the issue? Either outside regular hours or redirected from their planned work?
Is it though? Or is it an "oh, this is such a simple change that we really don't need to test it" attitude? I'm not saying this applies to TFA, but some people are so confident that no pressure is felt.
However, you forgot that the lighting conditions are such that only red light from the klaxons is showing, so you really can't differentiate the colors of the wires.
Side thought as we're working on 100% onchain systems (for digital assets security, different goals):
Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.
That could have blocked propagation of the oversized file long before it reached the edge :)
> much better than their completely false “checking the security of your connection” message
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
I'm curious about how their internal policies work such that they are allowed to publish a post mortem this quickly, and with this much transparency.
At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.
Well… we have a culture of transparency we take seriously. I spent 3 years in law school that many times over my career have seemed like wastes but days like today prove useful.

I was in the triage video bridge call nearly the whole time. Spent some time after we got things under control talking to customers. Then went home. I'm currently in Lisbon at our EUHQ. I texted John Graham-Cumming, our former CTO and current Board member whose clarity of writing I've always admired. He came over. Brought his son ("to show that work isn't always fun"). Our Chief Legal Officer (Doug) happened to be in town. He came over too.

The team had put together a technical doc with all the details. A tick-tock of what had happened and when. I locked myself on a balcony and started writing the intro and conclusion in my trusty BBEdit text editor. John started working on the technical middle. Doug provided edits here and there on places we weren't clear. At some point John ordered sushi but from a place with limited delivery selection options, and I'm allergic to shellfish, so I ordered a burrito.

The team continued to flesh out what happened. As we'd write we'd discover questions: how could a database permission change impact query results? Why were we making a permission change in the first place? We asked in the Google Doc. Answers came back. A few hours ago we declared it done. I read it top-to-bottom out loud for Doug, John, and John's son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate.

I sent a draft to Michelle, who's in SF. The technical teams gave it a once over. Our social media team staged it to our blog. I texted John to see if he wanted to post it to HN. He didn't reply after a few minutes so I did. That was the process.
> I texted John to see if he wanted to post it to HN. He didn’t reply after a few minutes so I did
Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.
We're in a Live Fast Die Young karma world. If you can't get a TikTok ready within 2 minutes of the post mortem dropping, you might as well quit and become a barista instead.
> I read it top-to-bottom out loud for Doug, John, and John’s son. None of us were happy — we were embarrassed by what had happened — but we declared it true and accurate.
I'm so jealous. I've written postmortems for major incidents at a previous job: a few hours to write, a week of bikeshedding by marketing and communication and tech writers and ... over any single detail in my writing. Sanitizing (hide a part), simplifying (our customers are too dumb to understand), etc, so that the final writing was "true" in the sense that it "was not false", but definitely not what I would call "true and accurate" as an engineer.
How do you guys handle redaction? I'm sure even when trusted individuals are in charge of authoring, there's still a potential of accidental leakage which would probably be best mitigated by a team specifically looking for any slip ups.
Team has a good sense, typically. In this case, the names of the columns in the Bot Management feature table seemed sensitive. The person who included that in the master document we were working from added a comment: “Should redact column names.” John and I usually catch anything the rest of the team may have missed. For me, pays to have gone to law school, but also pays to have studied Computer Science in college and be technical enough to still understand both the SQL and Rust code here.
I mean the CEO posted the post-mortem, so there aren't that many layers of stakeholders above. For other post-mortems by engineers, Matthew once said that the engineering team runs the blog and that he wouldn't even know how to veto it even if he wanted to [0]
The person who posted both this blog article and the Hacker News post is Matthew Prince, one of the highly technical billionaire founders of Cloudflare. I'm sure if he wants something to happen, it happens.
There’s lots of things we did while we were trying to track down and debug the root cause that didn’t make it into the post. Sorry the WARP takedown impacted you. As I said in a comment above, it was the result of us (wrongly) believing that this was an attack targeting WARP endpoints in our UK data centers. That turned out to be wrong but based on where errors initially spiked it was a reasonable hypothesis we wanted to rule out.
Why give this sort of content more visibility/reach?
I'm sure that's not your intent, so I hope my comment gives you an opportunity to reflect on the effects of syndicating such stupidity, no matter what platform it comes from.
Mainly to make others aware of what's happening in the context of this Cloudflare outage. Sure, I can avoid giving it visibility/reach, but it's growing and proliferating on its own, and I think ignoring it isn't going to stop it, so I am hoping awareness will help. I've noticed a huge rise in open racism against Chinese and Indian workers and workers of other origins, even when they're here on a legal visa that we have chosen as a nation to grant for our own benefit.
The legislation that MTG (Marjorie Taylor Green) just proposed a few days ago to ban H1B entirely, and the calls to ban other visa types, is going to have a big negative impact on the tech industry and American innovation in general. The social media stupidity is online but it gives momentum to the actual real life legislation and other actions the administration might take. Many congress people are seeing the online sentiment and changing their positions in response, unfortunately.
I'm not the person you were replying to, but there is a rule I often see about not directly replying/quote tweeting because "engagement" appears to boost support for the ideas expressed. The recommendation then, would be to screenshot it (often with the username removed) and link to that.
FWIW it seems pretty obvious that this was ragebait. OP's profile is pretty much non-stop commentary on politics with nearly zero comments or submissions pertaining to the broader tech industry.
Posts like that deserve to be flagged if the sum of their substance is jingoist musing & ogling dumb people on Twitter.
It feels like their list of after actions is lacking a bit to me.
How about
1. The permissions change project is paused or rolled back until
2. all impacted database interactions (SQL queries) are evaluated for improper assumptions, or better,
3. their design that depends on database metainfo and schema is replaced with one that uses specific tables and rows instead of treating the metadata as part of the application, and
4. all hard-coded limits are centralized in a single global module, referenced from their users, and back-propagated to any separate generator processes so that they validate against the limit before pushing generated changes (see the sketch below).
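A hedged sketch of what point 4 could look like; the module name, the 200-feature value, and both functions are illustrative rather than Cloudflare's code:

```rust
mod limits {
    /// Single source of truth for the feature-count limit, referenced by both
    /// the generator and the consumer.
    pub const MAX_FEATURES: usize = 200;
}

/// Generator side: refuse to publish a file the consumer would reject.
fn publish(features: &[String]) -> Result<(), String> {
    if features.len() > limits::MAX_FEATURES {
        return Err(format!(
            "refusing to publish {} features (limit {})",
            features.len(),
            limits::MAX_FEATURES
        ));
    }
    // ... write the file and propagate it ...
    Ok(())
}

/// Consumer side: validate against the same constant before swapping configs.
fn load(features: Vec<String>) -> Result<Vec<String>, String> {
    if features.len() > limits::MAX_FEATURES {
        return Err("feature file exceeds limit; keeping previous config".into());
    }
    Ok(features)
}

fn main() {
    let features = vec!["feature_a".to_string(), "feature_b".to_string()];
    publish(&features).unwrap_or_else(|e| eprintln!("{e}"));
    let _ = load(features);
}
```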
The unwrap: not great, but understandable. Better to silently run with a partial config while paging oncall on some other channel, but that's a lot of engineering for a case that apparently is supposed to be "can't happen".
The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
The query is surely faulty: Even if this wasn’t a huge distributed database with who-knows-what schemas and use cases, looking up a specific table by its unqualified name is sloppy.
But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata vs. a SSOT seems like a bigger problem to me.
IMO: there should be an explicit error path for invalid configuration, so the program would abort with a specific exit code and/or message. And there should be a supervisor which would detect this behaviour, roll back to the old working config, and wait a few minutes before trying to apply the new config again (of course with corresponding alerts).
So basically bad config should be explicitly processed and handled by rolling back to known working config.
You don’t even need all the ceremony. If the config gets updated every 5 minutes, it surely is being hot-reloaded. If that’s the case, the old config is already in memory when the new config is being parsed. If that’s the case, parsing shouldn’t have panicked, but logged a warning, and carried on with the old config that must already be in memory.
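A minimal sketch of that idea, with the file name, format, refresh interval, and 200-feature limit all assumed for illustration:

```rust
use std::{fs, thread, time::Duration};

#[derive(Clone, Debug)]
struct BotConfig {
    features: Vec<String>,
}

// Hypothetical parser standing in for whatever the proxy does with the file.
fn parse(raw: &str) -> Result<BotConfig, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(BotConfig { features })
}

fn main() {
    let mut current = BotConfig { features: Vec::new() };
    loop {
        match fs::read_to_string("features.conf")
            .map_err(|e| e.to_string())
            .and_then(|raw| parse(&raw))
        {
            Ok(new_config) => current = new_config,
            // The crucial branch: a bad file is a loud warning, not a panic,
            // and the config already in memory keeps being served.
            Err(e) => eprintln!(
                "rejecting new feature file, keeping previous ({} features): {e}",
                current.features.len()
            ),
        }
        thread::sleep(Duration::from_secs(300)); // the file refreshes every few minutes
    }
}
```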
> If that’s the case, the old config is already in memory when the new config is being parsed
I think that's explicitly a non-goal. My understanding is that Cloudflare prefers fail safe (blocking legitimate traffic) over fail open (allowing harmful traffic).
It's probably not ok to silently run with a partial config, which could have undefined semantics. An old but complete config is probably ok (or, the system should be designed to be safe to run in this state).
Even if you want it to crash, you almost never unwrap. At the very least you would use .expect() so you get a reasonable error message -- or even better you handle the potential error.
This wasn't a runtime property that could not be validated at compile time. And you don't need to fall back on "OS level security and reliability" when your type system is enforcing application-level invariants.
In fact I'd argue that crashing is bad. It means you failed to properly enumerate and express your invariants, hit an unanticipated state, and thus had to fail in a way that requires you to give up and fall back on the OS to clean up your process state.
[edit]
Sigh, HN and its "you're posting too much". Here's my reply:
> Why? The end user result is a safe restart and the developer fixes the error.
Look at the thread you're commenting on. The end result was a massive world-wide outage.
> That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?
Because you don't have to crash at all.
> We don’t want to enumerate all possible paths. We want to limit them.
That's the exact same thing. Anything not "limited" is a possible path.
> If my program requires a config file to run, crash as soon as it can’t load the config file. There is nothing useful I can do (assuming that’s true).
Of course there's something useful you can do. In this particular case, the useful thing to do would have been to fall back on the previous valid configuration. And if that failed, the useful thing to do would be to log an informative, useful error so that nobody has to spend four hours during a worldwide outage to figure out what was going wrong.
The world wide outage was actually caused by deploying several incorrect programs in an incorrect system.
The root one was actually a bad query as outlined in the article.
Let’s get philosophical for a second. Programs WILL be written incorrectly - you will deploy to production something that can’t possibly work.
What should you do with a program that can’t work? Pretend this can’t happen? Or let you know so you can fix it?
This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is the textbook "parse, don't validate" anti-pattern.
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
I’ve led multiple incident responses at a FAANG, here’s my take. The fundamental problem here is not Rust or the coding error. The problem is:
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
How can you write the proxy without handling a config that contains more than the maximum feature limit you set yourself?
How can the database export query not have a limit set if there is a hard limit on the number of features?
Why do they do non-critical changes in production before testing in a stage environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? That does not leave me confident that they will handle the next Aisuru attack.
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
Exactly. The only way this could happen in the first place was _because_ they failed at so many levels. And as a result, more layers of Swiss cheese will be added, and holes in existing ones will be patched. This process is the reason flying is so safe, and the reason why Cloudflare will be a little bit more resilient tomorrow than it was yesterday.
> Why do they do non-critical changes in production before testing in a stage environment?
I guess the noncritical change here was the change to the database? My experience has been a lot of teams do a poor job having a faithful replica of databases in stage environments to expose this type of issue.
In part because it is somewhere between really hard and impossible. Is your staging DB going to be as big? Seeing the same RPS as prod? Seeing the same scenarios?
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
I don't think these are realistic requirements for any engineered system, to be honest. What is realistic is to have contingencies for such cases, which are simply errors.
But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.
What is not realistic? To do simple input validation on data that has the potential to break 20% of the internet? To not have a system in place to roll back to the last known good state when things crash?
Cloudflare builds a global scale system, not an iPhone app. Please act like it.
> To do simple input validation on data that has the potential to break 20% of the internet?
There will always be bugs in code, even simple code, and sometimes those things don't get caught before they cause significant trouble.
The failing here was not having a quick rollback option, or having it and not hitting the button soon enough (even if they thought the problem was probably something else, I think my paranoia about my own code quality is such that I would have been rolling back much sooner just in case I was wrong about the “something else”).
Cloudflare's success was the simplicity of building a distributed system across data centers around the world that could be implemented by third-party IT workers while Cloudflare itself was only a few people. There are probably a lot of shitty iPhone apps that do less important work and are vastly more complex than that early Cloudflare server node configuration.
Every system has an irreducible risk, and no data rollback is trivial, especially for a CDN.
Yeah, I don't quite understand the people cutting Cloudflare massive slack. It's not about nailing blame on a single person or a team, it's about keeping a company that is THE closest thing to a public utility for the web accountable. They more or less did a Press Release with a call to action to buy or use their services at the end and everybody is going "Yep, that's totally fine. Who hasn't sent a bug to prod, amirite?".
It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
>It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
Or the CEO or CTO replying to comments here?
>Press Release
This is not press release, they always did these outage posts from the start of the company.
> Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
The code sample might as well be COBOL for people not familiar with Rust and its error handling semantics.
> Or the CEO or CTO replying to comments here?
I've looked around the thread and I haven't seen the CTO here nor the CEO, probably I'm not familiar with their usernames and that's on me.
> This is not press release, they always did these outage posts from the start of the company.
My mistake calling them press releases. Newspapers and online publications also skim this outage report to inform their news stories.
I wasn't clear enough in my previous comment. I'd like all major players in internet and web infrastructure to be held to higher standards. As it stands, compared with the tech department of a retail store, the retail store must answer to more laws when the surface area of their combined activities is taken into account.
Yes, Cloudflare excels where others don't or barely bother, and I too enjoyed the pretty graphs and diagrams and learned some nifty Rust tricks.
EDIT: I've removed some unwarranted snark from my comment which I apologize for.
Name me global, redundant systems that have not (yet) failed.
And if you used cloudflare to protect against botnet and now go off cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic.
I mean, no service has 100% uptime - just that some have more nines than others.
We had better uptime with AWS WAF in us-east-1 than we've had in the last 1.5 years of Cloudflare.
I do like the flat cost of Cloudflare and feature set better but they have quite a few outages compared to other large vendors--especially with Access (their zero trust product)
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
> There are many self-hosted alternatives to protect against botnet.
What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?
On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset, apparently the Nginx version worked a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ), there was also the Coraza project but I haven't heard much about it https://coraza.io/ or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).
There is haproxy-protection, which I believe is the basis of Kiwiflare. Clients making new connections have to solve a proof-of-work challenge that takes about 3 seconds of compute time.
Well, if you self-host a DDoS protection service, that would be VERY expensive. You would need to rent rack space along with a very fast internet connection at multiple data centers to host this service.
Ask yourself rather the question: is your service really so important that it needs 99.999% uptime? Because I get the impression that people are so fixated on this uptime concept that the idea of being down for a few hours is the most horrible issue in the world. To the point that they would rather hand over control of their own system to a third party than accept some downtime.
The fact that Cloudflare can literally read every bit of communication (as it sits between the client and your server) is already plenty bad. And yet, we accept this more easily than a bit of downtime. We shall not ask about the prices for that service ;)
To me it's nothing more than the whole "everybody on the cloud" issue, where most do not need the resources that cloud companies like AWS provide (and the bill), and yet get totally tied down to this one service.
If you're buying transit, you'll have a hard time getting away with less than 10% commit, i.e. you'll have to pay for 10 Gbps of transit to have a 100 Gbps port, which will typically run into 4 digits USD / month. You'll need a few hundred Gbps of network and scrubbing capacity to handle common DDoS attacks using amplification from script kids with a 10 Gbps uplink server that allow spoofing, and probably on the order of 50+ Tbps to handle Aisuru.
If you're just renting servers instead, you have a few options that are effectively closer to a 1% commit, but better have a plan B for when your upstreams drop you if the incoming attack traffic starts disrupting other customers - see Neoprotect having to shut down their service last month.
But at the same time, what value do they add if they:
* Took down the customers' sites due to their bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks that they could help our customers with; that's why we're using them in the first place. But until the customers are hit with multiple DDoS attacks that we cannot handle ourselves, it's just not worth it.
> • Took down the customers' sites due to their bug.
That is always a risk with using a 3rd party service, or even adding extra locally managed moving parts. We use them in DayJob, and despite this huge issue and the number of much smaller ones we've experienced over the last few years their reliability has been pretty darn good (at least as good as the Azure infrastructure we have their services sat in front of).
> • Never protected against an attack that our infra could not have handled by itself.
But what about the next one… Obviously this is a question sensitive to many factors in our risk profiles and attitudes to that risk, there is no one right answer to the “but is it worth it?” question here.
On a slightly facetious point: if something malicious does happen to your infrastructure, that it does not cope well with, you won't have the “everyone else is down too” shield :) [only slightly facetious because while some of our clients are asking for a full report including justification for continued use of CF and any other 3rd parties, which is their right both morally and as written in our contracts, most, especially those who had locally managed services affected, have taken the “yeah, half our other stuff was affected too, what can you do?” viewpoint].
> • Don't think that they will be able to handle the "next big ddos" attack.
It is a war of attrition. At some point a new technique, or just a new botnet significantly larger than those seen before, will come along that they might not be able to deflect quickly. I'd be concerned if they were conceited enough not to be concerned about that possibility. Any new player is likely to practise on smaller targets first before directly attacking CF (in fact I assume that it is rather rare that CF is attacked directly) or a large enough segment of their clients to cause them specific issues. Could your infrastructure do any better if you happen to be chosen as one of those earlier targets?
Again, I don't know your risk profile so can't say which is the right answer, if there even is an easy one other than “not thinking about it at all” being a truly wrong answer. Also DDoS protection is not the only service many use CF for, so those need to be considered too if you aren't using them for that one thing.
Does their ring based rollout really truly have to be 0->100% in a few seconds?
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
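A purely illustrative sketch of such a configurable ramp (not how Cloudflare's rollout works): each node hashes its identity into a bucket and applies the new config only once the elapsed fraction of the ramp window passes that bucket, so a zero-second ramp degenerates to "everywhere at once" for emergencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

fn should_apply(node_id: &str, elapsed: Duration, ramp: Duration) -> bool {
    if ramp.is_zero() {
        return true; // emergency mode: apply everywhere immediately
    }
    let mut h = DefaultHasher::new();
    node_id.hash(&mut h);
    let bucket = (h.finish() % 10_000) as f64 / 10_000.0; // node's position in [0, 1)
    let progress = (elapsed.as_secs_f64() / ramp.as_secs_f64()).min(1.0);
    bucket < progress
}

fn main() {
    // With a 1-hour ramp, roughly half the fleet has the new config after 30 minutes.
    let apply = should_apply(
        "edge-node-042",
        Duration::from_secs(30 * 60),
        Duration::from_secs(60 * 60),
    );
    println!("apply now? {apply}");
}
```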
The configuration file is updated every five minutes, so clearly they have some past experience where they’ve decided an hour is too long. That said, even a roll out over five minutes can be helpful.
This was not about DDoS defense but the Bot Management feature, which is a paid Enterprise-only feature not enabled by default to block automated requests regardless of whether an attack is going on.
Bots can also cause a DoS/DDoS. We use the feature to restrict certain AI scraper tools by user agent that adversely impact performance (they have a tendency to hammer "export all the data" endpoints much more than regular users do).
It would still fail if you were unlucky enough to be on the new proxy (it's indeed not very clear why, if the feature was not enabled):
> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.
> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.
Maybe, but in that case maybe have some special casing logic to detect that yes indeed we're under a massive DDOS at this very moment, do a rapid rollout of this thing that will mitigate said DDOS. Otherwise use the default slower one?
Of course, this is all so easy to say after the fact..
I don't understand why they didn't validate and sanitize the new config file revision.
If it's bad (whatever the reason is), throw an error and revert back to the previous version. You don't need to take down the whole internet for that.
Same as for almost every bug I think: the dev in question hadn't considered that the input could be bad in the way that it turned out to be. Maybe they were new, or maybe they hadn't slept much because of a newborn baby, or maybe they thought it was a reasonable assumption that there would never be more than 200 ML features in the array in question. I don't think this developer will ever make the same mistake again at least.
Let those who have never written a bug before cast the first stone.
> Maybe they were new, or maybe they hadn't slept much because of a newborn baby
Reminds me of House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. “I don’t understand why X wasn’t built in such a way that wouldn’t mean we live in an underground bunker now” is the sentence that comes to mind.
> I don't understand why they didn't validate and sanitize the new config file revision.
The new config file was not (AIUI) invalid (syntax-wise) but rather too big:
> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I've also led a team of Incident Commanders at a FAANG.
If this was a routine config change, I could see how it could take 2 hours to start the remediation plan. However, they should have dashboards that correlate config setting changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.
The root cause document is mostly for high-level audiences and the public. The details on this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe some linter/monitor to prevent it from happening again.
I would say that whilst this is a good top-down view, that `.unwrap()` should have been caught at code review and not allowed. A Clippy rule could have saved a lot of money.
That, and why the hell wasn't their alerting surfacing the colossal number of panics in their bot manager thing?
Yes the lack of observability is really the disturbing bit here. You have panics in a bunch of your core infrastructure, you would expect there to be a big red banner on the dashboard that people look at when they first start troubleshooting an incident.
This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.
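One way to get that breadcrumb trail, as a minimal sketch (the log format and whatever alerting sits on top are assumptions, not Cloudflare's setup): install a panic hook so every panic is emitted with a backtrace into the normal logging pipeline.

```rust
use std::backtrace::Backtrace;
use std::panic;

fn main() {
    panic::set_hook(Box::new(|info| {
        // In a real service this line would go to the logging/metrics pipeline
        // that dashboards and alerts are built on, not just stderr.
        let backtrace = Backtrace::force_capture();
        eprintln!("panic: {info}\nbacktrace:\n{backtrace}");
    }));

    // Deliberately trigger a panic to show the hook firing.
    let features: Vec<&str> = Vec::new();
    let _first = features.first().expect("feature list unexpectedly empty");
}
```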
Exactly the right take. Even when you want to have rapid changes on your infra, do it at least by region. You can start with the region where the least amount of users are impacted and if everything is fine, there is no elevated number of crashes for example, you can move forward. It was a standard practice at $RANDOM_FAANG when we had such deployments.
Thank you. I am sympathetic to CF's need to deploy these configs globally fast and don't think slowing down their DDoS mitigation is necessarily a good trade-off. What I am saying is this presents a bigger reliability risk and needs correspondingly finely crafted observability around such config changes and a rollback runbook. Greater risk -> greater attention.
But the rapid deployment mechanism for bot features wasn’t where the bug was introduced.
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly the analysis doesn’t go into how these erroneous queries made it into production OR whether the assumption was “to spec” and it’s the security principal change work that was faulty. Seems more likely to be the former.)
The "coding error" is a somewhat deliberate choice to fail eagerly that is usually safe but doesn't align with the need to do something (propagation of the configuration file) without failing.
I'm sure that there are misapplied guidelines to do that instead of being nice to incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
In a productive way, this view also shifts the focus to improving the system (visibility etc.) and empowering the team, rather than focusing on the code which broke (which probably strikes fear into individuals about doing anything at all!)
You can write the safest code in the world, but if you're shipping config changes globally every few minutes without a robust rollback plan or telemetry that pinpoints when things go sideways, you're flying blind
The bot is efficient. This is by design. It will push out mistakes just as efficiently as it pushes out good changes. Good or bad... the plane of control is unchanged.
This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.
It is just 2 different layers. Of course the code is also a problem, if it is in fact as the GP describes it. You are taking the higher level view, which is the second layer of dealing with not only this specific mistake, but also other mistakes, that can be related to arbitrary code paths.
Both are important, and I am pretty sure, that someone is gonna fix that line of code pretty soon.
Partial disagree. There should be lints against 'unwrap's. An 'expect' at least forces you to write down why you are so certain it can't fail. An unwrap is not just hubris, it's also laziness, and has no place in sensitive code.
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
I use unwrap a lot, and my most frequent target is unwrapping the result of Mutex::lock. Most applications have no reasonable way to recover from lock poisoning, so if I were forced to write a match for each such use site to handle the error case, the handler would have no choice but to just call panic anyway. Which is equivalent to unwrap, but much more verbose.
Perhaps it needs a scarier name, like "assume_ok".
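A hedged sketch of what that could look like as an extension trait for the `Mutex::lock` case; the trait and the name `assume_ok` are hypothetical, not part of the standard library:

```rust
use std::sync::{Mutex, MutexGuard, PoisonError};

trait AssumeOk<T> {
    /// Same behavior as unwrap(), but the name advertises the assumption.
    fn assume_ok(self) -> T;
}

impl<'a, T> AssumeOk<MutexGuard<'a, T>>
    for Result<MutexGuard<'a, T>, PoisonError<MutexGuard<'a, T>>>
{
    fn assume_ok(self) -> MutexGuard<'a, T> {
        // Panicking is deliberate: a poisoned lock means another thread already
        // panicked while holding it, and this application cannot recover.
        self.unwrap_or_else(|_| panic!("mutex poisoned: assumed unreachable"))
    }
}

fn main() {
    let counter = Mutex::new(0u64);
    *counter.lock().assume_ok() += 1;
    println!("{}", counter.lock().assume_ok());
}
```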
I use locks a lot too, and I always return a Result from lock access. Sometimes an anyhow::Result, but still something to pass up to the caller.
This lets me do logging at minimum. Sometimes I can gracefully degrade. I try to be elegant in failure as possible, but not to the point where I wouldn't be able to detect errors or would enter a bad state.
That said, I am totally fine with your use case in your application. You're probably making sane choices for your problem. It should be on each organization to decide what the appropriate level of granularity is for each solution.
My worry is that this runtime panic behavior has unwittingly seeped into library code that is beyond our ability and scope to observe. Or that an organization sets a policy, but that the tools don't allow for rigid enforcement.
Pretty much - the time spent ruling out the hypothesis that it was a cyberattack would have been time spent investigating the uptick in deliberately written error logs, since you would expect alerts to be triggered if those exceed a threshold.
I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
> Their bot management system is designed to push a configuration out to their entire network rapidly.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is as much that the output of a periodic run (a ClickHouse query), affected by an on-the-surface unrelated change, caused this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (the ML Bot Management DB).
Ideally, the system that vends a complex configuration also vends & tests the library to consume it, or the system that consumes it does so as if it were "tasting" the configuration first before devouring it unconditionally [1].
Of course, as with all distributed system failures, this is all easier said than done in hindsight.
Isn't rapidly more of how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
It seems people have a blind spot for unwrap, perhaps because it's so often used in example code. In production code an unwrap or expect should be reviewed exactly like a panic.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
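A sketch of how that enforcement can look; the lint names (`clippy::unwrap_used`, `clippy::expect_used`) and the Cargo `[lints]` table are real, while the function and its claimed invariant are illustrative:

```rust
// In Cargo.toml:
//
//   [lints.clippy]
//   unwrap_used = "deny"
//   expect_used = "warn"
//
// Every surviving unwrap then has to be explicitly allowed and justified.

// INFALLIBILITY: reduce() only returns None for an empty iterator, and this
// function is only ever called with the non-empty slice below (illustrative).
#[allow(clippy::unwrap_used)]
fn max_score(scores: &[f64]) -> f64 {
    scores.iter().copied().reduce(f64::max).unwrap()
}

fn main() {
    println!("{}", max_score(&[0.2, 0.9, 0.4]));
}
```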
Yes, I always thought it was wrong to use unwrap in examples. I know, people want to keep examples simple, but it trains developers to use unwrap() as they see that everywhere.
Yes, there are places where it's ok as that blog post explains so well: https://burntsushi.net/unwrap/
But most devs IMHO don't have the time to make the call correctly most of the time... so it's just better to do something better, like handle the error and try to recover, or if impossible, at least do `expect("damn it, how did this happen")`.
There is a prevailing mentality that LLMs make it easy to become productive in new languages, if you are already proficient in one. That's perhaps true until you suddenly bump up against the need to go beyond your superficial understanding of the new language and its idiosyncrasies. These little collisions with reality occur until one of them sparks an issue of this magnitude.
In theory, experienced human code reviewers can course-correct newer LLM-guided devs' work before it blows up. In practice, reviewers are already stretched thin, and submitters' ability to now rapidly generate more and more code to review makes that exhaustion effect way worse. It becomes less likely they spot something small but obvious amongst the haystack of LLM-generated code coming their way.
> at least do `expect("damn it, how did this happen")`
That gives you the same behavior as unwrap with a less useful error message though. In theory you can write useful messages, but in practice (and your example) expect is rarely better than unwrap in modern rust
I disagree with that characterization. Using unwrap() like you suggest in your blog post is an intentional, well-thought-out choice. Using unwrap() the way Cloudflare did it is, with hindsight, a bad choice, that doesn't utilize the language's design features.
Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem. (In particular they were not criticizing you.)
I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
Echelon's comment was "We shouldn't be using unwrap() or expect() at all. [...] unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers". Even in my most generous interpretation I can't see how that is anything except a rejection of all unwraps (and equivalent constructs like expect()).
I fully agree with burntsushi that echelon is taking an extreme and arguably wrong stance. His sentiment becomes more and more correct as Rust continues to evolve ways to avoid unwrap as an ergonomic shortcut, but I don't think we are quite there yet for general use. There absolutely is code that should never panic, but that involves tradeoffs and design choices that aren't true for every project (or even the majority of them)
> We shouldn't be using unwrap() or expect() at all.
So the context of their comment is not some specific nuanced example. They made a blanket statement.
> Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem.
I have the same interpretation.
> I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
Yes? I didn't say it wasn't reasonable to question the use of unwrap() here. I don't think we really have enough information to know whether it was inappropriate or not.
unwrap() is all about nuance. I hope my blog post conveyed that. Because unwrap() is a manifestation of an assertion on a runtime invariant. A runtime invariant can be arbitrarily complicated. So saying things like, "we shouldn't be using unwrap() or expect() at all" is an extreme position to carve out that is also way too generalized.
I stand by what I said. They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
> So the context of their comment is not some specific nuanced example. They made a blanket statement.
That is their opinion, I disagree with it, but I don't think it's an insulting or invalid opinion to have. There are codebases that ban nulls in other languages too.
> They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
It's an opinion about a stylistic choice. I don't see what fact there is here that could be mistaken.
I'm finding this exchange frustrating, and now we're going in circles. I'll say this one last time in as clear language as I can. They said this:
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
The factually incorrect part of this is the statement that use of `unwrap()`, `expect()` and so on is caused by X or Y, where X is "lazy Rust developers" and Y is "Rust developers not utilizing the language's design features." But there are, factually, other causes than X or Y for use of `unwrap()`, `expect()` and so on. So stating that it is all caused by X or Y is factually incorrect. Moreover, X is 100% insulting when applied to any one specific individual. Y can be insulting when applied to any one specific individual.
Now this:
> We shouldn't be using unwrap() or expect() at all.
That's an opinion. It isn't factually incorrect. And it isn't insulting.
I'm sorry I'm frustrating you. It was not my intention. For what it's worth, I use ripgrep every day, and it's made my life appreciably better. (Same goes for Astral products.) Thank you for that, and I wish your day improves.
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features
I just read that line as shorthand for large outages caused by misuse of unwrap(), expect(), bad math etc. - all caused by...
That's also an opinion, by my reading.
I assumed we were talking specifically about misuses, not all uses of unwrap(), or all bad bugs. Anyway, I think we're ultimately saying the same thing. It's ironic in its own way.
I have to disagree that unwrap is ever OK. If you have to use unwrap, your types do not match your problem. Fix them. You have encoded invariants in your types that do not match reality.
Change your API boundary, surface the discrepancy between your requirements and the potential failing case at the edges where it can be handled.
If you need the value, you need to handle the case that it’s not available explicitly. You need to define your error path(s)
`slice[i]` is also a hole in the type system, but at least it’s generally relying on a local invariant, immediate to the surrounding context, that does not require lying about invariants across your API surface.
The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”

Enough said.
`slice[i]` is just sugar for `slice.get(i).unwrap()`. And whether it's a "local" invariant or not is orthogonal. And `unwrap()` does not "require lying about invariants across your API surface."
> The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
It very explicitly addresses it! It even gives real examples.
> Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”
>
> Enough said.
Ad hominem... I don't have experience working on, e.g., medical devices upon which someone's life depends. So the point of that sentence is to say, "yes, I acknowledge this advice may not apply there." You also cherry picked that quote and left off the context, which is relevant here.
And note that you said:
> I have to disagree that unwrap is ever OK.
That's an extreme position. It isn't caveated to only apply to certain contexts.
This is a failure caused by lazy Rust programming and not relying on the language's design features.
It's a shame this code can even be written. It is surprising and escapes the expected safety of the language.
I'm terrified of some dependency using unwrap() or expect() and crashing for something entirely outside of my control.
We should have an opt-in strict Cargo.toml declaration that forbids compilation of any crate that uses entirely preventable panics. The only panics I'll accept are those relating to memory allocation.
This is one of the sharpest edges in the language, and it needs to be smoothed away.
> If you have to use unwrap, your types do not match your problem
The problem starts with Rust stdlib. It panics on allocation failure. You expect Rust programmers to look at stdlib and not imitate it?
Sure, you can try to taboo unwrap(), but 1) it won't work, and 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
The real solution is to go back in time, bonk the Rust designers over the head with a cluebat, and have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Of course it will. I've built enormous systems, including an entire compiler, without once relying on the local language equivalent of `.unwrap()`.
> 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
That's a failure to model invariants in your API correctly.
> ... have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Unchecked exceptions aren't a solution. They're a way to avoid taking the thought, time, and effort to model failure paths, and instead leave that inherent unaddressed complexity until a runtime failure surprises users. Like just happened to Cloudflare.
Dunno, I think the alternatives have their own pretty significant downsides. All would require front loading more in-depth understanding of error handling and some would just be quite a bit more verbose.
IMO making unwrap a clippy lint (or perhaps a warning) would be a decent start. Or maybe renaming unwrap.
This strikes me as a culture issue more than one of language.
A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence. It doesn’t matter if it is Rust, C, etc. Knowing how to write systems code is unrelated to knowing a systems language. Rust is a systems language but most people coming into Rust have no systems code experience and are “holding it wrong”. It has been a recurring theme I’ve seen with Rust development in a systems context.
C is pretty broken as a language but one of the things going for it is that it has a strong systems code culture surrounding it that remembers e.g. why we do all of this extra error handling work. Rust really needs systems code practice to be more strongly visible in the culture around the language.
Unwrap _is_ explicitly handling an error at the point of occurrence. You have explicitly decided to panic, which is sometimes a valid choice. I use it (on startup only) when server configs are missing or invalid or in CLI tools when the options aren't valid. Crashing a pod on startup before it goes Ready is a valid pattern in k8s and generally won't cause an outage because the previous pod will continue working.
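A minimal sketch of that startup-only pattern; the path and the validation are illustrative, not anyone's real service:

```rust
use std::fs;

fn main() {
    // If the config is missing or invalid, crash before the process ever
    // reports itself ready; in an orchestrator like Kubernetes the previous
    // healthy pod keeps serving, so this is a contained failure.
    let raw = fs::read_to_string("/etc/myapp/config.toml")
        .expect("config file missing or unreadable: refusing to start");
    assert!(!raw.trim().is_empty(), "config file is empty: refusing to start");

    serve(raw);
}

fn serve(_config: String) {
    // ... run the server with the validated configuration ...
}
```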
Yes? Funnily enough, I don't often use indexed access in Rust. Either I'm looping over elements of a data structure (in which case I use iterators), or I'm using an untrusted index value (in which case I explicitly handle the error case). In the rare case where I'm using an index value that I can guarantee is never invalid (e.g. graph traversal where the indices are never exposed outside the scope of the traversal), then I create a safe wrapper around the unsafe access and document the invariant.
If that's the case then hats off. What you're describing is definitely not what I've seen in practice. In fact, I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access. Even security-critical cryptography crates that passed audits don't do that. Personally, I found it quite hard to avoid indexing for graph-heavy code, so I'm always on the lookout for interesting ways to enforce access safety. If you have some code to share that would be very interesting.
My rule of thumb is that unchecked access is okay in scenarios where both the array/map and the indices/keys are private implementation details of a function or struct, since an invariant is easy to manually verify when it is tightly scoped as such. I've seen it used in:
* Graph/tree traversal functions that take a visitor function as a parameter
> I don't think I've ever seen a crate or production codebase that documents infallibility of every single slice access.
The smoltcp crate typically uses runtime checks to ensure slice accesses made by the library do not cause a panic. It's not exactly equivalent to GP's assertion, since it doesn't cover "every single slice access", but it at least covers slice accesses triggered by the library's public API. (i.e. none of the public API functions should cause a panic, assuming that the runtime validation after the most recent mutation succeeds).
I think this goes against the Rust goals in terms of performance. Good for safe code, of course, but usually Rust users prefer compile-time safety that makes runtime safety checks unnecessary.
Sure, these days I'm mostly working on a few compilers. Let's say I want to make a fixed-size SSA IR. Each instruction has an opcode and two operands (which are essentially pointers to other instructions). The IR is populated in one phase, and then lowered in the next. During lowering I run a few peephole and code motion optimizations on the IR, and then do regalloc + asm codegen. During that pass the IR is mutated and indices are invalidated/updated. The important thing is that this phase is extremely performance-critical.
One normal "trick" is phantom typing. You create a type representing indices and have a small, well-audited portion of unsafe code handling creation/unpacking, where the rest of the code is completely safe.
The details depend a lot on what you're doing and how you're doing it. Does the graph grow? Shrink? Do you have more than one? Do you care about programmer error types other than panic/UB?
Suppose, e.g., that your graph doesn't change sizes, you only have one, and you only care about panics/UB. Then you can get away with:
1. A dedicated index type, unique to that graph (shadow / strong-typedef / wrap / whatever), corresponding to whichever index type you're natively using to index nodes.
2. Some mechanism for generating such indices. E.g., during graph population phase you have a method which returns the next custom index or None if none exist. You generated the IR with those custom indexes, so you know (assuming that one critical function is correct) that they're able to appropriately index anywhere in your graph.
3. You have some unsafe code somewhere which blindly trusts those indices when you start actually indexing into your array(s) of node information. However, since the very existence of such an index is proof that you're allowed to access the data, that access is safe.
Techniques vary from language to language and depending on your exact goals. GhostCell [0] in Rust is one way of relegating literally all of the unsafe code to a well-vetted library, and it uses tagged types (via lifetimes), so you can also do away with the "only one graph" limitation. It's been awhile since I've looked at it, but resizes might also be safe pretty trivially (or might not be).
The general principle though is to structure your problem in such a way that a very small amount of code (so that you can more easily prove it correct) can provide promises that are enforceable purely via the type system (so that if the critical code is correct then so is everything else).
That's trivial by itself (e.g., just rely on option-returning .get operators), so the rest of the trick is to find a cheap place in your code which can provide stronger guarantees. For many problems, initialization is the perfect place (e.g., you can bounds-check on init and then not worry about it again) (e.g., if even bounds-checking on initialization is too slow then you can still use the opportunity at initialization to write out a proof of why some invariant holds and then blindly/unsafely assert it to be true, but you then immediately pack that hard-won information into a dedicated type so that the only place you ever have to think about it is on initialization).
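A rough sketch of that shape (type names invented; as with the parent's assumptions, it ignores resizing and the possibility of mixing indices from different graphs):

    /// Holding a NodeIdx is proof it was in bounds when it was created,
    /// because the only way to get one is through `Graph::idx`.
    #[derive(Clone, Copy)]
    pub struct NodeIdx(usize);

    pub struct Graph {
        nodes: Vec<u32>, // payload type is irrelevant to the example
    }

    impl Graph {
        pub fn new(nodes: Vec<u32>) -> Self {
            Graph { nodes }
        }

        /// The single place where bounds are checked.
        pub fn idx(&self, raw: usize) -> Option<NodeIdx> {
            (raw < self.nodes.len()).then_some(NodeIdx(raw))
        }

        /// No bounds check here.
        pub fn get(&self, idx: NodeIdx) -> u32 {
            // SAFETY: a NodeIdx can only be created by `idx`, which
            // bounds-checks, and `nodes` never shrinks.
            unsafe { *self.nodes.get_unchecked(idx.0) }
        }
    }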
I do use a combination of newtyped indices + singleton arenas for data structures that only grow (like the AST). But for the IR, being able to remove nodes from the graph is very important. So phantom typing wouldn't work in that case.
Usually you'd want to write almost all your slice or other container iterations with iterators, in a functional style.
For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.
You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
I didn't restate the context from the code we're discussing: it must not panic. If you don't care if the code panics, then go ahead and unwrap/expect/index, because that conforms to your chosen error handling scheme. This is fine for lots of things like CLI tools or isolated subprocesses, and makes review a lot easier.
So: first, identify code that cannot be allowed to panic. Within that code, yes, in the rare case that you use [i], you need to at least try to justify why you think it'll be in bounds. But it would be better not to.
There are a couple of attempts at getting the compiler to prove that code can't panic (e.g., the no-panic crate).
What about memory allocation: how will you stop that from panicking? `Vec::resize` can always panic in Rust (on allocation failure), and this is just one example out of thousands in the Rust stdlib.
Unless the language addresses no-panic in its governing design or allows try-catch, not sure how you go about this.
That is slowly being addressed, but meanwhile it’s likely you have a reliable upper bound on how much heap your service needs, so it’s a much smaller worry. There are also techniques like up-front or static allocation if you want to make more certain.
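For what it's worth, `Vec::try_reserve` is stable today, so the fallible path can already be written for the allocations you care about; a minimal sketch:

    use std::collections::TryReserveError;

    fn append_records(buf: &mut Vec<u8>, records: &[u8]) -> Result<(), TryReserveError> {
        // Fail gracefully if the allocator can't satisfy the request, instead
        // of letting a later push/resize panic or abort the process.
        buf.try_reserve(records.len())?;
        buf.extend_from_slice(records);
        Ok(())
    }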
This is ridiculous. We're probably going to start seeing more of these. This was just the first big, highly visible instance.
We should have a name for this similar to "my code just NPE'd". I suggest "unwrapped", as in, "My Rust app just unwrapped a present."
I think we should start advocating for the deprecation and eventual removal of the unwrap/expect family of methods. There's no reason engineers shouldn't be handling Options and Results gracefully, either passing the state to the caller or turning to a success or fail path. Not doing this is just laziness.
Indexing is comparatively rare given the existence of iterators, IMO. If your goal is to avoid any potential for panicking, I think you'd have a harder time with arithmetic overflow.
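To illustrate the overflow point: plain arithmetic panics in debug builds (or with overflow checks enabled) and silently wraps in release, so avoiding both means opting into the checked variants. A small sketch:

    fn scaled_total(counts: &[u32], factor: u32) -> Option<u32> {
        // checked_* returns None on overflow, so the caller decides what to
        // do instead of the code panicking (debug) or wrapping (release).
        counts.iter().try_fold(0u32, |acc, &c| {
            let scaled = c.checked_mul(factor)?;
            acc.checked_add(scaled)
        })
    }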
Your pair of posts is very interesting to me. Can you share with me: What is your programming environment such that you are "fine with allocation failures"? I'm not doubting you, but for me, if I am doing systems programming with C or C++, my program is doomed if a malloc fails! When I saw your post, I immediately thought: Am I doing it wrong? If I get a NULL back from malloc(), I just terminate with an error message.
I mean, yeah, if I am using a library, as a user of this library, I would like to be able to handle the error myself. Having the library decide to panic, for example, is the opposite of that.
If I can't allocate memory, I'm typically okay with the program terminating.
I don't want dependencies deciding to unwrap() or expect() some bullshit and that causing my entire program to crash because I didn't anticipate or handle the panic.
Code should be written, to the largest extent possible, to mitigate errors using Result<>. This is just laziness.
I want checks in the language to safeguard against lazy Rust developers. I don't want their code in my dependency tree, and I want static guarantees against this.
edit: I just searched unwrap() usage on Github, and I'm now kind of worried/angry:
Something that allows me to annotate a function (or my whole crate) as "no panic", and get a compile error if the function or anything it calls has a reachable panic.
This would allow it to work with many unmodified crates, as long as constant propagation can prove that any panics are unreachable. This approach would also allow crates to provide panicking and non-panicking versions of their API (which many already do).
I think the most common solution at the moment is dtolnay's no_panic [0]. That has a bunch of caveats, though, and the ergonomics leave something to be desired, so a first-party solution would probably be preferable.
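For reference, a minimal use of that crate looks roughly like this; the main caveat is that violations surface as link-time errors (the macro roughly works by leaving an undefined symbol on any reachable panic path), which is part of the ergonomic awkwardness:

    // Cargo.toml: no-panic = "0.1"
    use no_panic::no_panic;

    #[no_panic]
    fn first_byte(data: &[u8]) -> Option<u8> {
        // `.first()` is panic-free; writing `data[0]` here instead would
        // leave a reachable panic and make the build fail at link time.
        data.first().copied()
    }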
Yes, I want that. I also want to be able to (1) statically apply a badge on every crate that makes and meets these guarantees (including transitively with that crate's own dependencies) so I can search crates.io for stronger guarantees and (2) annotate my Cargo.toml to not import crates that violate this, so time isn't wasted compiling - we know it'll fail in advance.
On the subject of this, I want more ability to filter out crates in our Cargo.toml. Such as a max dependency depth. Or a frozen set of dependencies that is guaranteed not to change so audits are easier. (Obviously we could vendor the code in and be in charge of our own destiny, but this feels like something we can let crate authors police.)
I would be fine just getting rid of unwrap(), expect(), etc. That's still a net win.
Look at how many lazy cases of this there are in Rust code [1].
Some of these are no doubt tested (albeit impossible to statically guarantee), but a lot of it looks like sloppiness or not leaning on the language's strong error handling features.
It's disappointing to see. We've had so much of this creep into the language that eventually it caused a major stop-the-world outage. This is unlikely to be the last time we see it.
I don't write Rust so I don't really know, but from someone else's description here it sounds similar to `fromJust` in Haskell which is a common newbie footgun. I think you're right that this is a case of not using the language properly, though I know I was seduced into the idea that Haskell is safe by default when I was first learning, which isn't quite true — the safety features are opt-in.
A language DX feature I quite like is when dangerous things are labelled as such. IIRC, some examples of this are `accursedUnutterablePerformIO` in Haskell, and `DO_NOT_USE_OR_YOU_WILL_BE_FIRED_EXPERIMENTAL_CREATE_ROOT_CONTAINERS` in React.js.
I would be in favor of renaming unwrap() and its family to `unwrap_do_not_use_or_you_will_break_the_internet()`
I still think we should remove them outright or make production code fail to compile without a flag allowing them. And we also need tools to start cleaning up our dependency tree of this mess.
For iteration, yes. But there's other cases, like any time you have to deal with lots of linked data structures. If you need high performance, chances are that you'll have to use an index+arena strategy. They're also common in mathematical codebases.
It's the same blind spot people have with Java's checked exceptions. People commonly resort to Pokemon exception handling, either blindly ignoring exceptions or rethrowing them as runtime exceptions. When Rust got popular, I was a bit confused by people talking about how great Result is; it's essentially a checked exception without a stack trace.
"Checked Exceptions Are Actually Good" gang, rise up! :p
I think adoption would have played out very differently if there had only been some more syntactic sugar. For example, an easy syntax for saying: "In this method, any (checked) DeepException e that bubbles up should immediately be replaced by a new (checked) MylayerException(e) that contains the original one as a cause."
We might still get lazy programmers making systems where every damn thing goes into a generic MylayerException, but that mess would still be way easier to fix later than a hundred scattered RuntimeExceptions.
Exception handling would be better than what we're seeing here.
The problem is that any non-trivial software is composition, and encapsulation means most errors aren't recoverable.
We just need easy ways to propagate exceptions out to the appropriate reliability boundary, ie. the transaction/ request/ config loading, and fail it sensibly, with an easily diagnosable message and without crashing the whole process.
C# or unchecked Java exceptions are actually fairly close to ideal for this.
The correct paradigm is "prefer throw to catch" -- requiring devs to check every ret-val just created thousands of opportunities for mistakes to be made.
By contrast, a reliable C# or Java version might have just 3 catch clauses and handle errors arising below sensibly without any developer effort.
There is also the problem that they decided to make all references nullable, so `NullPointerException`s could appear everywhere. This "forced" them to introduce the escape hatch of `RuntimeException`, which of course was way overused immediately, normalizing it.
I'm with you! Checked exceptions are actually good and the hate for them is super short sighted. The exact same criticisms levied at checked exceptions apply to static typing in general, but people acknowledge the great value static types have for preventing errors at compile time. Checked exceptions have that same value, but are dunked on for some reason.
1. in most cases they don't want to handle `InterruptedException` or `IOException` and yet need to bubble them up. In that case the code is very verbose.
2. It makes lambdas and functions incompatible. So e.g. if you're passing a function to forEach, you're forced to wrap any checked exception in a runtime exception.
3. Due to (1) and (2), most people become lazy and do `throws Exception` which negates most advantages of having exceptions in the first place.
In line-of-business apps (where Java is used the most), an uncaught exception is not a big deal. It will bubble up and gets handled somewhere far up the stack (eg: the server logger) without disrupting other parts of the application. This reduces the utility of having every function throw InterruptedException / IOException when those hardly ever happen.
Java checked exceptions suffer from a lack of generic exception types ("throws T", where T can be e.g. "Exception", "Exception1|Exception2", or "never"). This would also require union types and a bottom type.
Without generics, higher order functions are very hard to use.
In my experience, it actually is a big deal, leaving a wake of indeterminate state behind after stack unwinding. The app then fails with heisenbugs later, raising more exceptions that get ignored, compounding the problem.
People just shrug off that unreliability as an unavoidable cost of doing business.
Yeah, in both cases it's a layering situation, where it's the duty of your code to decide what layers of abstraction need to be bridged, and to execute on that decision. Translating/wrapping exception types from deeper functions is the same as translating/wrapping return types in the same places.
I think it comes down to a psychological or use-case issue: People hate thinking about errors and handling them, because it's that hard stuff that always consumes more time than we'd like to think. Not just digitally, but in physical machines too. It's also easier to put off "for later."
Checked exceptions in theory were good, but Java simply did not add facilities to handle or support them well in many APIs. Even the newer APIs in Java (Streams, etc.) do not support checked exceptions.
It's a lot lighter: a stack trace takes a lot of overhead to generate; a result has no overhead for a failure. The overhead (panic) only comes once the failure can't be handled. (Most books on Java/C# don't explain that throwing exceptions has high performance overhead.)
Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."
In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.
Exceptions do not force panic at all. In most practical situations, an exception unhandled close to where it was thrown will eventually get logged. It's kind of a "local" panic, if you will, that will terminate the specific function, but the rest of the program will remain unaffected. For example, a web server might throw an exception while processing a specific HTTP request, but other HTTP requests are unaffected.
Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.
> Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.
That's not what a panic means. Take a read through Go's panic / recover mechanism; it's similar to exceptions, but the semantics (with multiple return values) make it clear that panic is for exceptional situations. (IE, panic isn't for "file not found," but instead it's for when code isn't written to handle "file not found.")
Sure, but the same is true of any error handling strategy.
When you work with exceptions, the key is to assume that every line can throw unless proven otherwise, which in practice means almost all lines of code can throw. Once you adopt that mental model, things get easier.
Explicit error handling strategies allow you to not worry about all the code paths that explicitly cannot throw -- which is a lot of them. It makes life a lot easier in the non-throwing case, and doesn't complicate life any more in the throwing case as compared to exception-based error handling.
It also makes errors part of the API contract, which is where they belong, because they are.
It can and that optimization has existed for a while.
Actually it can also just turn off the collection of stack traces entirely for throw sites that are being hit all the time. But most Java code doesn't need this because code only throws exceptions for exceptional situations.
> it's essentially a checked exception without a stack trace
In theory, theory and practice are the same. In practice...
You can't throw a checked exception in a stream, and that fact actually underlines the key difference between an exception and a Result: a Result is in return position, while exceptions are a sort of side effect with their own control flow. Because of that, once your method throws an Exception, or you are writing code in a try block that catches one, you become blind to further exceptions of that type, even if you might be able to (or be required to) fix those errors. Results have to be handled individually, and you get syntactic sugar to easily propagate them back up.
It is trivial to include a stack trace, but stack traces are really only useful for identifying where something occurred; what is generally superior is attaching context as you propagate back up, which happens trivially with judicious use of custom error types with From impls. Doing this means that the error message uniquely defines the origin and the path it passed through, without intermediate, unimportant stack noise. With exceptions you would always need to catch each exception and rethrow a new exception containing the old one to add contextual information, and then, to avoid catching too much, you need variables that will be initialized inside the try block to be declared outside of it. So stack traces are basically only useful when you are doing Pokemon exception handling.
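A rough illustration of that propagation style (error and function names invented for the example):

    use std::{fmt, fs, io, num::ParseIntError};

    #[derive(Debug)]
    enum ConfigError {
        Read(io::Error),
        BadLimit(ParseIntError),
    }

    // `?` uses these From impls to wrap the lower-level error with
    // layer-specific context as it propagates, with no catch/rethrow.
    impl From<io::Error> for ConfigError {
        fn from(e: io::Error) -> Self {
            ConfigError::Read(e)
        }
    }

    impl From<ParseIntError> for ConfigError {
        fn from(e: ParseIntError) -> Self {
            ConfigError::BadLimit(e)
        }
    }

    impl fmt::Display for ConfigError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                ConfigError::Read(e) => write!(f, "reading feature config: {e}"),
                ConfigError::BadLimit(e) => write!(f, "parsing feature limit: {e}"),
            }
        }
    }

    fn feature_limit(path: &str) -> Result<usize, ConfigError> {
        let raw = fs::read_to_string(path)?;      // io::Error -> ConfigError::Read
        let limit = raw.trim().parse::<usize>()?; // ParseIntError -> ConfigError::BadLimit
        Ok(limit)
    }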
checked exceptions failed because when used properly they fossilize method signatures. they're fine if your code will never be changed and they're fine when you control 100% of users of the throwing code. if you're distributing a library... no bueno.
That’s just not true. They required that you use hierarchical exception types and define your own library exception type that you declare at the boundary.
The same is required for any principled error handling.
> When Rust got popular, I was a bit confused by people talking about how great Result is; it's essentially a checked exception without a stack trace
It's not a checked exception without a stack trace.
Rust doesn't have Java's checked or unchecked exception semantics at the moment. Panics are more like Java's Errors (e.g. OOM error). Results are just error codes on steroids.
Really not! This is a huge faceplant for writing things in Rust. If they had been writing their code in Java/Kotlin instead of Rust, this outage either wouldn't have happened at all (a failure to load a new config would have been caught by a defensive exception handler), or would have been resolved in minutes instead of hours.
The most useful thing exceptions give you is not static compile time checking, it's the stack trace, error message, causal chain and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening.
A proxy stack written in a managed language with exceptions would have given an error message like this:
com.cloudflare.proxy.botfeatures.TooManyFeaturesException: 200 > 60
at com.cloudflare.proxy.botfeatures.FeatureLoader(FeatureLoader.java:123)
at ...
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours.
In the past I've been able to diagnose production problems from stack traces so many times that I have been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
That has nothing to do with exceptions, just the ability to unwind the stack. Rust can certainly give you a backtrace on panics; you don’t even have to write a handler to get it. I would find it hard to believe Cloudflare’s services aren’t configured to do it. I suspect they just didn’t put the entire message in the post.
tldr: Capturing a backtrace can be quite an expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or selectively enabling it in some programs.
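For completeness, a sketch of wiring that up explicitly: a panic hook that captures a backtrace regardless of the environment variables (`Backtrace::force_capture` is precisely the "pay the runtime cost anyway" choice described above).

    use std::backtrace::Backtrace;

    fn main() {
        // Replace the default hook with one that prints the panic message
        // plus a captured backtrace.
        std::panic::set_hook(Box::new(|info| {
            eprintln!("panic: {info}\n{}", Backtrace::force_capture());
        }));

        let maybe: Option<u32> = None;
        let _ = maybe.unwrap(); // panics and prints the enriched message
    }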
It's one of the problems with using result types. You don't distinguish between genuinely exceptional events and things that are expected to happen often on hot paths, so the runtime doesn't know how much data to collect.
Alternatively you can look at actually innovative programming languages to peek at the next 20 years of innovation.
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping Option is potentially dangerous really warms my heart. I can't wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and that even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install life boats.
To me the situation has barely changed. The industry has been refusing to put in place strong reliability practices for decades, keeps significantly under investing in tools mitigating errors outside of a few fields where safety was already taken seriously before software was a thing and keeps hiding behind the excuse that we need to move fast and safety is too complex and costly while regulation remains extremely lenient.
I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
Let's try to make effect systems happen quicker than that.
> I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
Probably nothing, because most paying customers of cloudflare are probably signing away their rights to sue Cloudflare for damages by being down for a while when they purchase Cloudflare's services (maybe some customers have SLAs with monetary values attached, I dunno). I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see. Should it be illegal under US law to use `unwrap` in Rust code? Should it be illegal for any single internet services company to have more than X number of customers? A lot of the internet also breaks when AWS goes down because many people like to use AWS, so maybe they should be included in this regulatory framework too.
> I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
We have collectively agreed to a world where software service providers have no incentive to be reliable, as they are shielded from the consequences of their mistakes, and somehow we see it as acceptable that software has a ton of issues and defects. The side effect is that research on actually lowering the cost of safety has little return on investment. It doesn't have to be so.
> Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see.
I want software providers to be liable for the damage they cause, and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation-of-liability provisions and that people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.
How to do proper testing, formal methods, and resilient design has been known for decades. I would personally be more than okay with "let's move less fast and stop breaking things".
I find myself in the odd position of agreeing with you both.
That we’re even having this discussion is a major step forward. That we’re still having this discussion is a depressing testament to how slowly the mainstream has adopted better ideas.
> I can’t wait for the complete meltdown when they discover effect systems in 2040
Zig is undergoing this meltdown. Shame it's not memory safe. You can only get so far in developing programming wisdom before Eternal September kicks in and we're back to re-learning all the lessons of history as punishment for the youthful hubris that plagues this profession.
That's kind of what I'm saying with the blind spot comment. The words "unwrap" and "expect" should be just as much a scary red flag as the word "panic", but for some reason it seems a lot of people don't see them that way.
Even in lowly Java, they later added to Optional the orElseThrow() method since the name of the get() method did not connote the impact of unwrapping an empty Optional.
I've found both methods very useful. I'm using `get()` when I've checked that the value is present and I don't expect any exceptions. I'm using `orElseThrow()` when I actually expect that value can be absent and throwing is fine. Something like
    if (userOpt.isPresent()) {
        var user = userOpt.get();
        var accountOpt = accountRepository.selectAccountOpt(user.getId());
        var account = accountOpt.orElseThrow();
    }
IntelliJ IDEA checks it by default and highlights if I've used `get()` without a previous check. It's not enforced at the compiler level, but it's good enough for me.
The `unsafe` keyword means something specific in Rust, and panicking isn't unsafe by Rust's definition. Sometimes avoiding partial functions just isn't feasible, and an unwrap (or whatever you want to call the method) is a way of providing a (runtime-checked) proof to the compiler that the function is actually total.
unwrap() should effectively work as a Result<> where the user must manually invoke a panic in the failure branch. Make special syntax if a match and panic is too much boilerplate.
This is like an implicit null pointer exception that cannot be statically guarded against.
I want a way to statically block any crates doing this from my dependency chain.
Same thing that would happen if it did a match statement and panicked. The problem is the panic, not the unwrap.
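To make the equivalence concrete, a small sketch (function and inputs invented): both forms abort the thread with a panic on the same bad input; the explicit match is just noisier.

    fn parse_port(raw: &str) -> u16 {
        // Terse form: raw.parse().unwrap()

        // "Explicit" form a ban on unwrap would force; it still panics on
        // the same input, it just spells it out.
        match raw.parse() {
            Ok(port) => port,
            Err(e) => panic!("invalid port {raw:?}: {e}"),
        }
    }

    fn main() {
        println!("{}", parse_port("8080")); // ok
        println!("{}", parse_port("oops")); // panics either way
    }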
I don’t think you can ever completely eliminate panics, because there are always going to be some assumptions in code that will be surprisingly violated, because bugs exist. What if the heap allocator discovers the heap is corrupted? What if you reference memory that’s paged out and the disk is offline? (That one’s probably not turned into a panic, but it’s the same principle.)
That would require an effects system[0] like Koka's[1]. Then one could not only express the absence of panics but also allocations, infinite loops and various other undesirable effects within some call-trees.
This is a desirable feature, but an enormous undertaking.
Not sure what you're saying with the "work as a Result<>" part...unwrap is a method on Result. I think you're just saying the unwrap/expect methods should be eliminated?
Then they are going to write `None | Err => yolo()`, which has the same impact. It is not the syntax or the semantics that are the problem here, but the fact that there is no monitoring around elevated error counts after a deployment.
Software engineers tend to get stuck in software problems and thinking that everything should be fixed in code. In reality there are many things outside of the code that you can do to operate unreliable components safely.
Exactly. People are very hung up on "unwrap", but even if it wasn't there at all, you would have devs just manually writing the match. Or, even more likely, using a trivial `unwrap!` macro.
There's also an assumption here that if the unwrap wasn't there, the caller would have handled the error properly. But if this isn't part of some common library at CF, then chances are the caller is the same person who wrote the panicking function in the first place. So if a new error variant they introduced was returned they'd probably still abort the thread either by panicking at that point or breaking out of the thread's processing loop.
It's not about whether you should ban unwrap() in production. You shouldn't. Some errors are logic bugs beyond which a program can't reasonably continue. The problem is that the language makes it too easy for junior developers (and AI!) to ignore non-logic-bug problems with unwrap().
Programmers early in their careers will do practically anything to avoid having to think about errors and they get angry when you tell them about it.
> In production code an unwrap or expect should be reviewed exactly like a panic.
An unwrap should never make it to production IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to grep for `unwrap` in your code, replace those that can happen with proper error management, and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.
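A minimal sketch of that convention (the value here is hypothetical): `expect` carries the justification for the "cannot happen unless there's a bug" case, and anything that genuinely can fail returns a Result instead.

    use std::net::Ipv4Addr;

    fn loopback() -> Ipv4Addr {
        // Fallible input would be handled with a Result; expect is reserved
        // for invariants, and the message documents why it can't fire.
        "127.0.0.1"
            .parse()
            .expect("hard-coded literal is a valid IPv4 address")
    }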
I would say, sure, if you feel the same way about panic calls making it to production. In other words, review all of them the same way, because writing unwrap/expect is exactly the same as writing "if error, panic".
I don't understand your point: panic! is akin to expect: you think about it consciously, use it explicitly, and you write down a panic message explaining its rationale.
It should be. If you aren’t treating it exactly the same as panic and expect, that’s what I’m calling the “blind spot”. And why should you have to make up a message every time when the backtrace is going to tell you what was wrong?
Isn't the point of this article that pieces of infrastructure don't go down due to single root causes, but due to bad combinations of components that are individually correct? After reading "Engineering a Safer World", I find root cause analysis rather reductionist, because it wasn't just an unwrap: it was that the payload was larger than normal, because of a query that didn't select by database, because a ClickHouse permission change made more databases visible. Hard to say "it was just due to an unwrap" imo, especially in terms of how to fix the issue going forwards. I think the article lists a lot of good ideas that aren't just "don't unwrap", like enabling more global kill switches for features, or eliminating the ability for core dumps or other error reports to overwhelm system resources.
You're right. A good postmortem/root cause analysis would START from "unwrap" and continue from there.
You might start with a basic timeline of what happened, then you'd start exploring: why did this change affect so many customers (this would be a line of questioning to find a potential root cause), why did it take so long to discover or recover (this might be multiple lines of questioning), etc.
> This is the multi-million dollar .unwrap() story.
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
> If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.
It wouldn't, not meaningfully. The outage was caused by a change in how they processed the queries. They had no way to observe the change, nor canaries to see that the change was killing them. Plus, they would still need to manually feed and restart the services that ingested bad configs.
`expect` would shave a few minutes; you would still spend hours figuring out and fixing it.
Granted, using expect is better, but it's not a silver bullet.
A billion alerts in DD/Sentry/whatever stating the exact problem, coinciding with the exact graph of failures, would probably be helpful if someone looked at them.
In general for unexpected errors like these the internal function should log the error, and I assume it was either logged or they can quickly deduce reason based on the line number.
I'm not sure if this is serious or not, but to take it at face value: the value of this sort of thing in Rust is not that it prevents crashes altogether but rather that it prevents _implicit_ failures. It forces a programmer to make the explicit choice of whether to crash.
There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
The language semantics do not surface `unwrap` usage or make any guarantees. It should be limited to use in `unsafe` blocks.
> There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
I would prefer the boilerplate of a match / if-else / if-let, etc. to call attention to it, if you absolutely must explicitly panic. Or better: just return an error Result.
It doesn't matter how smart your engineers are. A bad unwrap can sneak in through refactors, changing business logic, changing preconditions, new data, etc.
Restricting unwrap to unsafe blocks adds negative value to the language. It won't prevent unwrap mistakes (people who play fast and loose with it today will just switch to "foo = unsafe { bar.unwrap() };" instead). And it'll muddy the purpose of unsafe by adding in a use that has nothing to do with memory safety. It's not a good idea.
That would be a fairly significant expansion of what `unsafe` means in Rust, to put it lightly. Not to mention that I think doing so would not really accomplish anything; marking unwrap() `unsafe` would not "surface `unwrap` usage" or "make any guarantees", as it's perfectly fine for safe functions to contain `unsafe` blocks with zero indication of that in the function signature.
> fairly significant expansion of what `unsafe` means in Rust
I want an expansion of panic free behavior. We'll never get all the way there due to allocations etc., but this is the class of error the language is intended to fix.
This turned into a null pointer, which is exactly what Rust is supposed to quench.
I'll go as far as saying I would like to statically guarantee none of my dependencies use the unwrap() methods. We should be able to design libraries that provably avoid panics to the greatest extent possible.
Sure, and I'd hardly be one to disagree that a first-party method to guarantee no panics would be nice, but marking unwrap() `unsafe` is definitely not an effective way to go about it.
> but this is the class of error the language is intended to fix.
Is it? I certainly don't see any memory safety problems here.
> This turned into a null pointer, which is exactly what Rust is supposed to quench.
There's some subtlety here - Rust is intended to eliminate UB due to null pointer dereferences. I don't think Rust was ever intended to eliminate panics. A panic may still be undesirable in some circumstances, but a panic is not the same thing as unrestricted UB.
> We should be able to design libraries that provably avoid panics to the greatest extent possible.
Yes, this would be nice indeed. But again, marking unwrap() `unsafe` is not an effective way to do so.
dtolnay's no_panic is the best we have right now IIRC, and there are some prover-style tools in an experimental stage which can accomplish something similar. I don't think either of those are polished enough for first-party adoption, though.
> The language says "safe" on the tin. It advertises safety.
Rust advertises memory safety (and other closely related things, like no UB, data race safety, etc.). I don't think it's made any promises about hard guarantees of other kinds of safety.
Rust has grown beyond its original design as a "memory safe" language. People are using this as an HTTP/RPC server programming language now. WASM serverless jobs, etc. Rust has found itself deployed in a lot of unexpected places.
These folks are not choosing Rust for the memory safety guarantees. They're choosing Rust for being a fast language with a nice type system that produces "safe" code.
Rust is widely known for producing relatively defect-free code on account of its strong type system and ergonomics. Safety beyond memory safety.
Unwrap(), expect(), and their kin are a direct affront to this.
There are only two use cases for these: (1) developer laziness, (2) the engineer spent time proving the method couldn't fail, but unfortunately they're not using language design features that allow this to be represented in the AST with static guarantees.
In both of these cases, the engineer should instead choose to (1) pass the Result<T,E> or Option<T> to the caller and let the caller decide what to do, (2) do the same, but change the type to be more appropriate to the caller, (3) handle it locally so the caller doesn't have to deal with it, (4) silently turn it into a success. That's it. That's idiomatic Rust.
I'm now panicked (hah) that some dependency of mine will unwrap something and panic at runtime. That's entirely invisible to users. It's extremely dangerous.
Today a billion people saw the result of this laziness. It won't be the last time. And hopefully it never happens in safety-critical applications like aircraft. But the language has no say in this because it isn't taking a stand against this unreasonably sharp edge yet. Hopefully it will. It's a (relatively) easy fix.
>> This is the multi-million dollar .unwrap() story.
> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.
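To sketch what that would look like (types and bodies invented here; only the function names echo the snippet under discussion), the shortcut and the propagating version differ by very little:

    struct Features;
    struct FeatureError;

    impl Features {
        fn append_with_names(&mut self, _names: &[String]) -> Result<(), FeatureError> {
            Ok(())
        }
    }

    fn fetch_features(names: &[String]) -> Result<Features, FeatureError> {
        let mut features = Features;
        // features.append_with_names(names).unwrap(); // panics the worker thread
        features.append_with_names(names)?;            // propagates to the caller
        Ok(features)
    }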
> Instead, the routine likely should have worked within `Result`.
But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.
Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.
But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.
Of course it depends on the situation. But I don't see how you could think that in this case, crashing is better than stale config.
Crashing on a config update is usually only done if it could cause data corruption if the configs aren't in sync. That's obviously not the case here since the updates (although distributed in real time) are not coupled between hosts. Such systems usually are replicated state machines where config is totally ordered relative to other commands. Example: database schema and write operations (even here the way many databases are operated they don't strongly couple the two).
This is assuming that the process could have done anything sensible while it had the malformed feature file. It might be in this case that this was one configuration file of several and maybe the program could have been built to run with some defaults when it finds this specific configuration invalid, but in the general case, if a program expects a configuration file and can't do anything without it, panicking is a normal thing to do. There's no graceful handling (beyond a nice error message) a program like Nginx could do on a syntax error in its config.
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
I don't know too much about how the feature file distribution works but in the event of failure to read a new file, wouldn't logging the failure and sticking with the previous version of the file be preferable?
That's exactly the point (ie just prior to distribution) where a simple sanity check should have been run and the config replacement/update pipeline stopped on failure. When they introduced the 200 entry limit memory optimised feature loader it should have been a no-brainer to insert that sanity check in the config production pipeline.
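For illustration, a sanity check of that sort might look roughly like this (the limit and the error cases are assumptions for the sketch, not Cloudflare's actual pipeline):

    const MAX_FEATURES: usize = 200; // must match the consumer's preallocated limit

    #[derive(Debug)]
    enum PublishError {
        TooManyFeatures(usize),
        DuplicateName(String),
    }

    fn validate_feature_file(names: &[String]) -> Result<(), PublishError> {
        if names.len() > MAX_FEATURES {
            return Err(PublishError::TooManyFeatures(names.len()));
        }
        let mut seen = std::collections::HashSet::new();
        for name in names {
            if !seen.insert(name) {
                return Err(PublishError::DuplicateName(name.clone()));
            }
        }
        Ok(())
    }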
Yea, Rust is safe but it’s not magic. However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message. The question is then could the cloudflare code have exited cleanly in a way that made recovery easier instead of just straight panicking.
> However Nginx doesn’t panic on malformed config. It exits with hopefully a helpful error code and message.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up:
nginx: [emerg] host not found in upstream "my-app"
Obviously making 19 sites also unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying variables, like so (non-Kubernetes example, regular Nginx web server running in a container, talking to other containers over an internal network, like Docker Compose or Docker Swarm):
    location / {
        resolver 127.0.0.11 valid=30s; # Docker DNS
        set $proxy_server my-app;
        proxy_pass http://$proxy_server:8080/;
        proxy_redirect default;
    }
Sadly, if you try to use that approach, then you just get:
nginx: [emerg] "proxy_redirect default" cannot be used with "proxy_pass" directive with variables
Sadly, switching the redirect configuration away from the default makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue.
That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.
Would expect with a message meet that criterion of exiting with a more helpful error message? From the postmortem it seems to me like they just didn't know it was even panicking.
One feature failing like this should probably log the error and fail closed. It shouldn't take down everything else in your big proxy that sits in front of your entire business.
To be fair, this failed in the non-Rust path too, because the bot management returned that all traffic was a bot. But yes, FL2 needs to catch panics from individual components, though I'm not sure failing open is necessarily that much better (it was in this case, but the next incident could easily be the result of failing open).
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
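As a hedged sketch of that missing layer (the module and verdict types are invented), catching the panic per component and making the fail-open/fail-closed choice explicit might look like:

    use std::panic::{catch_unwind, AssertUnwindSafe};

    enum Verdict {
        Allow,
        Block,
    }

    fn run_module<F: FnOnce() -> Verdict>(name: &str, fail_open: bool, f: F) -> Verdict {
        match catch_unwind(AssertUnwindSafe(f)) {
            Ok(verdict) => verdict,
            Err(_) => {
                // The component crashed; apply its declared failure policy
                // instead of taking down the whole proxy.
                eprintln!("module {name} panicked; failing {}", if fail_open { "open" } else { "closed" });
                if fail_open { Verdict::Allow } else { Verdict::Block }
            }
        }
    }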
Catching panics probably isn’t a great idea if there’s any unsafe code in the system. (Do the unsafe blocks really maintain heap invariants across panics?)
Unsafe blocks have nothing to do with it. Yes, they maintain all the same invariants as safe blocks, or those unsafe blocks are unsound regardless of panics. But there are millions of ways to architect this (e.g. a supervisor process that notices which layer in FL2 is crashing and completely disables that layer when it starts the proxy back up). There are challenges here, because then you have to figure out what constitutes perma-crashing (e.g. what if it’s just 20% of all sites? Do you disable?). And in the general case you have the fail-open/fail-closed decision anyway, which you should just annotate individual layers with.
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That’s the source of 99% of all widespread outages
The unwrap should be replaced by code that creates enough alerting to make a P0 incident from their canary deployment immediately.
Or even: the bot code crashing should itself be generating alerts.
Canary deployment would be automatically rolled back until P0 incident resolved.
All of this could probably have happened and been contained at their scale in less than a minute, as they would likely generate enough "omg the proxy cannot handle its config" alerts off of a deployment of 0.001% near immediately.
I think the parent is implying that the panic should be "caught" via a supervisor process, Erlang-style, rather than implying the literal use of `catch_unwind` to resume within the same process.
Supervisor is the brutalist way. But catch_unwind may be needed for perf and other reasons.
But ultimately it’s not the panic that’s the problem but a failure to specify how panics within FL2 layers should be handled; each layer is at least one team and FL2’s job is providing a safe playground for everyone to safely coexist regardless of the misbehavior of any single component
But as always such failures are emblematic of multiple things going wrong at once. You probably want to end up using both catch_unwind for the typical case and the supervisor for the case where there’s a segfault in some unsafe code you call or native library you invoke.
I also mention the fundamental tension of do you want to fail open or closed. Most layers should probably fail open. Some layers (eg auth) it’s safer to fail closed.
I'm not a fan of Rust, but I don't think that is the only takeaway. All systems have assumptions about their input, and if an assumption is violated it has to be caught somewhere. It seems like it was caught too deep in the system.
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
> It seems like it was caught too deep in the system.
Agreed, that's also my takeaway.
I don't see the problem being "lazy programmers shouldn't have called .unwrap()". That's reductive. This is a complex system and complex system failures aren't monocausal.
The function in question could have returned a smarter error rather than panicking, but what then? An invariant was violated, and maybe this system, at this layer, isn't equipped to take any reasonable action in response to that invariant violation and dying _is_ the correct thing to do.
But maybe it could take smarter action. Maybe it could be restarted into a known good state. Maybe this service could be supervised by another system that would have propagated its failure back to the source of the problem, alerting operators that a file was being generated in such a way that violated consumer invariants. Basically, I'm describing a more Erlang model of failure.
Regardless, a system like this should be able to tolerate (or at least correctly propagate) a panic in response to an invariant violation.
The takeaway here isn’t about Rust itself, but that the Rust marketing crew’s claims that we constantly read on HN and elsewhere about the Result type magically saving you from making mistakes is not a good message to send.
They would also tell you that .unwrap() has no place in production code, and should receive as much scrutiny as an `unsafe` block in code review :)
The point of Option is that the crash path is more verbose and explicit than the crash-free path. It takes more code to check for NULL in C or nil in Go; it takes more code in Rust to not check for Err.
> Today, many friends pinged me saying Cloudflare was down. As a core developer of the first generation of Cloudflare FL, I'd like to share some thoughts.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: Even handling the unwrap correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system requires “bad → reject, keep last-known-good” logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
This tweet thread invokes genuine despair in me. Do we really have to outsource even our tweets to LLMs? Really? I mean, I get spambots and the like tweeting mass-produced slop. But what compels a former engineer of the company in question to offer LLM-generated "insight" to the outage? Why? For what purpose?
* For clarity, I am aware that the original tweets are written in Chinese, and they still have the stench of LLM writing all over them; it's not just the translation provided in the above comment.
This particular excerpt is reeking of it with pretty much every line. I'll point out the patterns in the English translation, but all of these patterns apply cross-language.
"Classic/typical "x + y"", particularly when diagnosing an issue. This one is a really easy tell because humans, on aggregate, do not use quotation marks like this. There is absolutely no reason to quote these words here, and yet LLMs will do a combined quoted "x + y" where a human would simply write something natural like "hidden assumptions and configuration chains" without extraneous quotes.
> The configuration system requires “bad → reject, keep last-known-good” logic.
Another pattern with overeager usage of quotes is this ""x → y, z"" construct with very terse wording.
> This wasn't an attack, but a classic chain reaction
LLMs aggressively use "Not X, but Y". This is also a construct commonly used by humans, of course, but aside from often being paired with an em-dash, another tell is whether it actually contributes anything to the sentence. "Not X, but Y" is strongly contrasting and can add a dramatic flair to the thing being contrasted, but LLMs overuse it on things that really don't need to be dramatised or contrasted.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
Two lists of three concepts back-to-back. LLMs enjoy, love, and adore this construct.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
This kind of completely vapid, feel-good word soup utilising a heroic analogy for something relatively mundane is another tell.
And more broadly speaking, there's a sort of verbosity and emptiness of actual meaning that permeates through most LLM writing. This reads absolutely nothing like what an engineer breaking down an outage looks like. Like, the aforementioned line of... "Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.". What is that actually communicating to you? It piles on technical lingo and high-level concepts in a way that is grammatically correct but contains no useful information for the reader.
Bad writing exists, of course. There's plenty of bad writing out there on the internet, and some of it will suffer from flaws like these even when written by a human, and some humans do like their em-dashes. But it's generally pretty obvious when the writing is taken on aggregate and you see recognisable pattern after pattern combined with em-dashes combined with shallowness of meaning combined with unnecessary overdramatisations.
While this is true, I wish that Rust had more of a first-class support for `no_panic`. Every solution we do have is hacky. I wish that I could guarantee that there were no panic calls anywhere in a code path.
By the way - does this discussion matter and were they wrong to use unwrap()?
The way they wrote the code means that having more than 200 features is a hard non-transient error - even if they recovered from it, it meant they'd have had the same error when the code got to the same place.
I'm sure when the process crashed, k8s restarted the pod or something - then it reran the same piece of code and crashed in the same place.
While I don't necessarily agree with crashing as a business strategy, I don't think doing anything other than either dropping the extra rules or allocating more memory would have helped, and the original code was built to do neither (probably by design).
The code made the local hard assumption that there won't ever be more than 200 rules and that it's okay to crash if that count is exceeded.
If you design your code around an invariant never being violated (which is fine), you have to make it clear on a higher level that you did.
This isn't a Rust problem (though Rust does make it easy to do the wrong thing here imo)
Instead of crashing when applying the new config, it's more common to simply ignore the new config if it cannot be applied. You keep running in the last known good state. Operators then get alerts about the failures and can diagnose and resolve the underlying issue.
That's not always foolproof, e.g. a freshly (re)started process doesn't have any prior state it can fall back to, so it just hard crashes. But restarts are going to be rate limited anyways, so even then there is time to mitigate the issue before it becomes a large scale outage
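A minimal sketch of that "keep the last known good state" pattern in Rust, with invented types and limits (the FeatureConfig struct, the newline-delimited format, and the 200-entry cap are illustrative, not Cloudflare's actual code):

struct FeatureConfig {
    features: Vec<String>,
}

fn parse_config(raw: &str) -> Result<FeatureConfig, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(FeatureConfig { features })
}

// Called on every refresh: a bad file is logged and ignored, and the process
// keeps serving with whatever it loaded last.
fn reload(current: &mut FeatureConfig, raw: &str) {
    match parse_config(raw) {
        Ok(fresh) => *current = fresh,
        Err(e) => eprintln!("config rejected, keeping last known good: {e}"),
    }
}

fn main() {
    let mut cfg = FeatureConfig { features: vec![] };
    reload(&mut cfg, "feature_a\nfeature_b");
    println!("{} features active", cfg.features.len());
}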
Swift has implicit unwrap (!), and explicit unwrap (?).
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
So what happens if it ends up being nil? How does your app react?
In this particular case, I would rather crash. It’s easier to spot in a crash report and you get a nice stack trace.
Silent failure is ultimately terrible for users.
Note: for the things I control I try to very explicitly model state in such a way as I never need to force unwrap at all. But for things beyond my control like this situation, I would rather end the program than continue with a state of the world I don’t understand.
Yeah @IBOutlets are generally the one thing that are allowed to be implicitly-unwrapped optionals. They go along with using storyboards & xib files with Interface Builder. I agree that you really should just crash if you are attempting to access one and it is nil. Either you have done something completely incorrect with regards to initializing and accessing parts of your UI and want to catch that in development, or something has gone horribly, horribly, horribly wrong with UIKit/AppKit and storyboard/xib files are not being loaded properly by the system.
A good tool for catching stuff during development, is the humble assert()[0]. We can use precondition()[1], to do the same thing, in ship code.
The main thing is, is to remain in control, as much as possible. Rather than let the PC leave the stack frame, throw the error immediately when it happens.
> Silent failure is ultimately terrible for users.
Agreed.
Unfortunately, crashes in iOS are “silent failures,” and are a loss of control.
What this practice does, is give me the option to handle the failure “noisily,” and in a controlled manner; even if just emitting a log entry, before calling a system failure. That can be quite helpful, in threading. Also, it gives me the option to have a valid value applied, if there’s a structural failure.
But the main reason that I do that with @IBOutlets, is that it forces me to acknowledge, throughout the rest of the code, that it’s an optional. I could always treat implicit optionals as if they were explicit, anyway. This just forces me to.
I have a bunch of practices that folks can laugh at, but my stuff works pretty effectively, and I sleep well.
Crashes are silent failures but as I mentioned: you can get a lot of your crashes reported via the App Store. This is why I prefer crashes in this situation: it gives me something actionable over silent failures on the client.
This is terrible. The whole reason they introduced this is because IBOutlets would get silently disconnected and then in the field a user would complain that a feature stopped working.
Crash early, crash often. Find the bugs and bad assumptions.
> without taking the time to understand why they do things, the way they do.
Oh I am aware. They do it because
A) they don’t have a mental model of correct execution. Events just happen to them with a feeling of powerlessness. So rather than trying to form one, they just litter the code with cases for things that might happen
> As I've gotten older, the sharp edges have been sanded off.
B) they have grown in bad organizations with bad incentives that penalize the appearance of making mistakes. So they learn to hide them.
For example there might be an initiative that rewards removing crashes in favor of silent error.
Say what you want, exception haters, but at least in exceptions-as-default languages, whether a particular issue is fatal to the whole program can be decided centrally at a high level, rather than every choice being left to individual discretion.
But I could screw it up in Go, if I made the same assumptions
fvs, err := features.AppendWithNames(..)
if err != nil {
    // this will NEVER break
    panic(err)
}
Ultimately I don't think language design can be the sole line of defence against system failures; it can only guide developers to think about error cases
Right, but the point isn't to make errors impossible; the point is to have them be 1) less likely to write, and 2) easier to spot on review.
People's biggest complaints about golang's errors:
1. You have to _TYPE_OUT_ what to do on EVERY.SINGLE.ERROR. SOO BOORING!
2. They clutter up the code and make it look ugly.
Rust is so much cleaner and more convenient (they say)! Just add ?, or .unwrap()!
Well, with ".unwrap()", you can type it fast enough that you're on to the next problem before it occurs to your brain to think about what to do if there is an error. Whereas, in golang, by the time you type in, "if err != nil {", you've broken the flow enough that now you're much more likely to be thinking, "Hmm, could this ever fail? What should we do if it does?" That break in flow is annoying, but necessary.
And ".unwrap()" looks so unassuming, it's easy to overlook on review; that "panic()" looks a lot more dangerous, and again, would be more likely to trigger a reviewer into thinking, "Wait, is it OK if this thing panics? Is this really so unlikely to happen?"
Renaming it `.unwrap_or_panic()` would probably help with both.
Unfortunately none of the meanings Wikipedia knows [1] seems to fit this usage. Did you perhaps mean "taboo"?
I disagree that "unwrap()" seems as scary as "panic()", but I will certainly agree that sibling commenters have a point when they say that "bar, _ := foo()" is a lot less scary than "unwrap()".
That may just be me, but `.unwrap()` is much more obvious than `_`:
- it's literally written out that you're assuming it to be Ok
- there are no indications that the `_` is an error: it could very well be some other return value from the function. in your example, it could be the number of appended features, etc
That's why Go's error handling is indeed noisy: it's noise, and you reduce noise by not handling errors. Rust's is terse yet explicit: if you add stuff, it's because you're doing something wrong. You've explicitly spelled out that the error is being ignored.
> And it would be far more obvious that an error message is being ignored.
Haven't used Go so maybe I'm missing some consideration, but I don't see how ", _" is more obvious than ".unwrap()". If anything it seems less clear, since you need to check/know the function's signature to see that it's an error being ignored (wouldn't be the case for a function like https://pkg.go.dev/math#Modf).
I haven't been writing Rust for that long (about 2 years) but every time I see .unwrap() I read it as 'panic in production'. Clippy needs to have harder checks on unwrap.
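For what it's worth, Clippy already ships opt-in lints for exactly this; promoting them to hard errors is a per-project policy choice. A minimal sketch (parse_limit is just a hypothetical stand-in):

#![deny(clippy::unwrap_used)] // real Clippy lint, allow-by-default
#![deny(clippy::expect_used)] // ditto, if you want to go further

fn parse_limit(s: &str) -> Result<usize, std::num::ParseIntError> {
    // With the lints above, `s.parse().unwrap()` here would fail `cargo clippy`;
    // you have to propagate or handle the error instead.
    s.parse()
}

fn main() {
    match parse_limit("200") {
        Ok(n) => println!("limit = {n}"),
        Err(e) => eprintln!("bad limit: {e}"),
    }
}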
Interesting to see Rust error handling flunk out in practice.
It may be that forcing handling at every call tends to make code verbose and leave devs desensitized to bad practice. And the diagnostic Rust provided seems pretty garbage.
There is bad practice here too -- config failure manifesting as request failure, failure to fail safe, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
> It may be that forcing handling at every call tends to makes code verbose
Rust has a lot of helpers to make it less verbose; even the error they demonstrate could've been written as something like `...code()?`, using the `?` operator to propagate the error onwards.
However, I do acknowledge that writing error types is sometimes boring, so people don't bother to change their error types and just unwrap. But even in my dinky little apps for personal use, I do a simple search for `unwrap` and make sure I have as few as possible.
I don't understand how your takeaway is that this is a language flaw other than to assume that you have some underlying disdain for Rust. That's fine, but state it clearly please.
The end result would've been the exact same if they "handled" the error: a bunch of 500s. The language being used doesn't matter if an invariant in your system is broken.
This is why the Erlang/Elixir methodology of having supervision and letting things crash gracefully is so useful. You can either handle every single error gracefully or handle crashing gracefully - it's much easier and more realistic in large codebases to do the later.
This would not have helped: the code would crash before doing anything useful at all.
If anything, the "crash early" mentality may even be nefarious: instead of handling the error and keeping the old config, you would spin on trying to load a broken config on startup.
Continuing only makes sense for cases you know you can handle.
_In theory_ they could have used the old config, but maybe there are reasons that’s not possible in Cloudflare’s setup. Whether or not that’s an invariant violation or just an error that can be handled and recovered from is a matter of opinion in system design.
And crashing on an invariant violation is exactly the right thing to do rather than proceed in an undefined state.
Given the context and what the configuration file contains, I'd argue it's mission-critical for the software to keep running with the previous configuration. Otherwise you might shut down the internet. Honestly, I'm pretty sure their pre-rewrite version had such logic, and it was forgotten or still on the TODO pile for the rewrite version.
At a previous job (cloud provider), we've had exactly this kind of issue, with exactly the same root cause. The entrypoint for the whole network had a set of rules (think a NAT gateway) that were reloaded periodically from the database. Someone rewrote that bit of plumbing from Python to Go. Someone else performed a database migration. Suddenly, the plumbing could not find the data, and pushed an empty file to prod. The rewrite lacked "if empty, do nothing and raise an alert", that the previous one had. I'll let you imagine what happened next :)
Non-panicking code is tedious to write. It is not realistic to expect everything to be panic-free. There is a reason that panicking exists in the first place.
Them calling unwrap on a limit check is the real issue imo. Everything that takes in external input should assume it is bad input and should be fuzz tested imo.
In the end, what is the point of having a limit check if you are just unwrapping on it
Using the question mark operator [1] and even adding in some anyhow::context goes a long way to being able to fail fast and return an Err rather then panicking.
Sure you need to handle Results all the way up the stack but it forces you to think about how those nested parts of your app will fail as you travel back up the stack.
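A rough sketch of that combination, assuming the anyhow crate; the feature-file path and the 200-entry limit are invented for illustration:

use anyhow::{Context, Result};

fn load_features(path: &str) -> Result<Vec<String>> {
    let raw = std::fs::read_to_string(path)
        .with_context(|| format!("reading feature file {path}"))?;
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    anyhow::ensure!(
        features.len() <= 200,
        "feature file has {} entries, limit is 200",
        features.len()
    );
    Ok(features)
}

fn main() -> Result<()> {
    // The error carries the whole chain of context instead of panicking.
    let features = load_features("features.txt").context("loading bot features")?;
    println!("{} features loaded", features.len());
    Ok(())
}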
I wonder if similar to infrastructure resilience, code resilience is also required for critical services that can never go down? Instead of relying on a single implementation for a critical service, have multiple independent implementations in different languages.
Back when I was running my own DNS servers, I did always ensure that primary and secondary were running on different platforms and different software.
Safe things should be easy, dangerous things should be hard.
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
Any project has to reason about what sort of errors can be tolerated gracefully and which cannot. Unwrap is reasonable in scenarios you expect to never be reached, because otherwise your code will be full of all sorts of possible permutations and paths that are harder to reason about and may cascade into extremely nuanced or subtle errors.
Rust also has a version of unwrap called "expect" where you provide a string that logs why the unwrap occurred. It's similar, but for pieces of code that are crucial it could be a good idea to require all 'unwraps' to instead be 'expects' so that people at least are forced to write down a reason why they believe the unwrap can never be reached.
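A sketch of the difference; the environment variable and the message are made up:

fn main() {
    // Behaves like unwrap() on failure (a panic), but the message records why
    // the author believed the failure couldn't happen, and it shows up in the
    // panic output and crash reports.
    let path = std::env::var("FEATURE_FILE")
        .expect("FEATURE_FILE is set by deploy tooling before the process starts");
    println!("loading features from {path}");
}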
That's such a bad take after reading the article. If you're going to write a system that preallocates and is based on hard assumptions about max size - the panic/unwrap approach is reasonable.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
It's reasonable when testing protocols exercise the panic scenario. This is the problem with punting on error recovery. Nobody checks faults that propagate across domains of responsibility.
Some languages and style guides simply forbid throwing exceptions without catching / proper recovery. Google C++ bans exceptions and the main mechanism for propagating errors is `absl::Status`, which the caller has to check. Not familiar with Rust but it seems unwrap is such a thing that would be banned.
> Not familiar with Rust but it seems unwrap is such a thing that would be banned.
Panics aren't exceptions, any "panic" in Rust can be thought of as an abort of the process (Rust binaries have the explicit option to implement panics as aborts). Companies like Dropbox do exactly this in their similar Rust-based systems, so it wouldn't surprise me if Cloudflare does the same.
"Banning exceptions" wouldn't have done anything here, what you're looking for is "banning partial functions" (in the Haskell sense).
Yeah I know, but isn't unwrap basically a trivial way to: (1) give up catching the exception/error (the E part in `Result<T, E>`) that the callee returns; and also (2) escalate it to the point that nothing can catch it? It has such a benign name.
Unwrap is used in places where in C++ you would just have undefined behavior. It wouldn't make any more sense to blanket ban it than it would ban ever dereferencing a pointer just in case its null - even if you just checked that it wasn't null.
Rust's `foo: Option<&T>` is the rough equivalent of C++'s `const T* foo`. The C++ `*foo` is equivalent to the Rust `unsafe { *foo.unwrap_unchecked() }`, or in safe code `*foo.unwrap()` (which changes the undefined behavior to a panic).
Rust's unwrap isn't the same as std::expected::value. The former panics - i.e. either aborts the program or unwinds depending on context and is generally not meant to be handled. The latter just throws an exception that is generally expected to be handled. Panics and exceptions use similar machinery (at least they can depending on compiler options) but they are not equivalent - for example nested panics in destructors always abort the program.
In code that isn't meant to crash, `unwrap` should be treated as a sign saying "I'm promising that this will never happen", but just like in C++, where you promise that pointers you dereference are valid and signed integers you add don't overflow, making promises like that is a necessary part of productive programming.
Linting is not good enough. The compiler should refuse to compile the code unless it is marked with an explicit annotation. Too much Rust code is panic-happy, since casual use of `unwrap` is perma-etched into everyone's minds by the amount of demo code out there using it.
> same like chess, engine is better than human grandmaster because its solvable math field
Might be worth noting that your description of chess is slightly incorrect. Chess technically isn't solved in the sense that the optimal move is known for any arbitrary position; it's just that chess engines are using what amounts to a fancy brute force for most of the game and the combination of hardware and search algorithm produces a better result than the human brain does. As such, chess engines are still capable of making mistakes, even if actually exploiting them is a challenge.
> because these thing called BEST MOVE and BAD MOVE there in chess
The thing is that there is no known general objective criteria for "best" and "bad" moves. The best we have so far is based on engine evaluations, but as I said before that is because chess engines are better at searching the board's state space than humans, not because chess engines have solved chess in the mathematical sense. Engines are quite capable of misevaluating positions, as demonstrated quite well by the Top Chess Engine Championship [0] where one engine thinks it made a good move while the other thinks that move is bad, and this is especially the case when resources are limited.
The closest we are to solving chess are via tablebases, which are far from covering the entire state space and are basically as much of an exemplar of pure brute force as you can get.
> "chess engines are still capable of making mistakes", I'm sorry no
If you think chess engines are infallible, then why does the Top Chess Engine Championship exist? Surely if chess engines could not make mistakes they would always agree on a position's evaluation and what move should be made, and therefore such an exercise would be pointless?
> inaccurate yes but not mistake
From the perspective of attaining perfect play, an inaccuracy is a mistake.
"The thing is that there is no known general objective criteria for "best" and "bad" moves."
Are you playing chess or not? If you play chess, then it's obvious how to differentiate a bad move from the best move.
Yes, it is objective; these things are called best moves not without reason.
"If you think chess engines are infalliable, then why does the Top Chess Engine Championship exist?"
To create better chess engines; what are we even talking about here? Are you saying that just because there are older, worse engines, the whole thing is pointless?
If you play chess at a decent level, 1700+ (like me), you know these arguments are wrong, and I'd encourage you to learn chess to a decent level,
up to the point where you know that high-level chess is a brute-force game and therefore solvable math.
> If you play chess, then it's obvious how to differentiate a bad move from the best move.
The key words in what I said are "general" and "objective". Yes, it's possible to determine "good" or "bad" moves in specific positions. There's no known method to determine "good" or "bad" moves in arbitrary positions, as would be required for chess to be considered strongly solved.
Furthermore, if it's "obvious" how to differentiate good and bad moves then we should never see engines blundering, right?
So (for example) how do you explain this game between Stockfish and Leela where Stockfish blunders a seemingly winning position [0]? After 37... Rdd8 both Stockfish and Leela think white is clearly winning (Stockfish's evaluation is +4.00, while Leela's evaluation is +3.81), but after 38. Nxb5 Leela's evaluation plummets to +0.34 while Stockfish's evaluation remains at +4.00. In the end, it turns out Leela was correct after 40... Rxc6 Stockfish's evaluation also drops from +4.28 to 0.00 as it realizes that Leela has a forced stalemate.
Or this game also between Stockfish and Leela where Leela blunders into a forced mating sequence and doesn't even realize it for a few moves [1]?
Engines will presumably always play what they think is the "best" move, but clearly sometimes this "best" move is wrong. Evidently, this means differentiating "good" and "bad" moves is not always obvious.
> Yes, it is objective; these things are called best moves not without reason.
If it's objective, then why is it possible for engines to disagree on whether a move is good or bad, as they do in the above example and others?
> To create better chess engines; what are we even talking about here?
The ability to create better chess engines necessarily implies that chess engines can and do make mistakes, contrary to what you asserted.
> Are you saying that just because there are older, worse engines, the whole thing is pointless?
No. What I'm saying is that your explanation for why chess engines are better than humans is wrong. Chess engines are not better than humans because they have solved chess in the mathematical sense; chess engines are better than humans because they search the state space faster and more efficiently than humans (at least until you reach 7 pieces on the board).
> up to the point where you know that high-level chess is a brute-force game and therefore solvable math
"Solvable" and "solved" are two very different things. Chess is solvable, in theory. Chess is very far from being solved.
I'm not completely sure I agree. I mean, I do agree about the .unwrap() culture being a bug trap. But I don't think this example qualifies.
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
It sounds to me like there was validation, but the system wasn't designed for the validation to ever fail - at which point crashing is the only remaining option. You've essentially turned it into an assertion error rather than a parsing/validation error.
Ideally every validation should have a well-defined failure path. In the case of a config file rotation, validation failure of the new config could mean keeping the old config and logging a high-priority error message. In the case of malformed user-provided data, it might mean dropping the request and maybe logging it for security analysis reasons. In the case of "pi suddenly equals 4" checks the most logical approach might be to intentionally crash, as there's obviously something seriously wrong and application state has corrupted in such a way that any attempt to continue is only going to make things worse.
But in all cases there's a reason behind the post-validation-failure behavior. At a certain point leaving it up to "whatever happens on .unwrap() failure" isn't good enough anymore.
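A rough sketch of giving each class of validation failure its own deliberate policy; the enum and handlers are invented for illustration:

// Every class of validation failure gets a documented, intentional response.
enum ValidationFailure {
    NewConfig,     // bad rotated config: keep the old one, page operators
    UserInput,     // malformed request data: drop the request, log it
    CoreInvariant, // "pi suddenly equals 4": state is corrupt, crash on purpose
}

fn handle_failure(kind: ValidationFailure, detail: &str) {
    match kind {
        ValidationFailure::NewConfig => {
            eprintln!("ALERT: config rejected, keeping last known good: {detail}")
        }
        ValidationFailure::UserInput => eprintln!("request dropped: {detail}"),
        ValidationFailure::CoreInvariant => panic!("invariant violated: {detail}"),
    }
}

fn main() {
    handle_failure(ValidationFailure::UserInput, "unparsable header");
}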
Tokio's default behavior within a task is to contain panics, such as an Err/None unwrap, and only crash that task, so the impact is limited, which is nice; maybe that's where the snow blindness came from.
It'd be kinda hard to amend the Clippy lints to ignore coroutine unwraps but still pipe up on system ones, I guess.
Edit: I think they'd have to be "solely-task-color-flavored", so definitely probably not trivial to infer.
> This is textbook "parse, don't validate" anti-pattern.
How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
This is a bummer. The unwrap()'ing function already returned a result and should have just propagated the error. Presumably the caller could have handled more sensibly than just panic'ing.
In addition, it looks like this system wasn't on any kind of 1%/10%/50%/100% rollout gating. Such a rollout would trivially have shown the poison input killing tasks.
To me it reads like there was a gradual rollout of the faulty software responsible for generating the config files, but those files are generated on approximately one machine, then propagated across the whole network every 5 minutes.
> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
It looks like changing the permissions triggered creation of a new feature file, and it was ingestion of that file leading to blowing a size limit that crashed the systems.
The file should be versioned and rollout of new versions should be staged.
(There is definitely a trade-off; often times in the security critical path, you want to go as fast as possible because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")
> It looks like changing the permissions triggered creation of a new feature file, and it was ingestion of that file leading to blowing a size limit that crashed the systems.
That feature file is generated every 5 minutes at all times; the change to permissions was rolled out gradually over the clickhouse cluster, and whether a bad version of that file was generated depended on whether the part of the cluster that had the bad permissions generated the file.
If the error had been an exception instead of a result, it could have bubbled up.
I have been saying for years that Rust botched error handling in unfixable ways. I will go to the grave believing Rust fumbled.
The design of the Rust language encourages people to use unwrap() to turn foreseeable runtime problems into fatal errors. It's the path of least resistance, so people will take it.
Rust encourages developers to consider only the happy path. No wonder it's popular among people who've never had to deal with failure.
All of the concomitant complexity--- Result, ?, the test thing, anyhow, the inability for stdlib to report allocation failure --- is downstream of a fashion statement against exceptions Rust cargo-culted from Go.
The funniest part is that Rust does have exceptions. It just calls them panics. So Rust code has to deal with the ergonomic footgun of Result but pays anyway for the possibility of exceptions. (Sure, you can compile with panic=abort. You can't count on it.)
I could not be more certain that Rust should have been a language with exceptions, not Result, and that error objects are a gross antipattern we'll regret for decades.
Errors work just like exceptions especially if you use the ? operator and let the error bubble up the chain. This is the Rust equivalent of an unhandled exception and the ripcord being pulled.
In C++, functions are error-colored by default. You write "noexcept" if you want your function to be infallible-colored instead.
(You usually want to make a function infallible if you're using your noexcept function as part of a cleanup path or part of a container interface that allows for more optimizations if it knows certain container operations are infallible.)
Rust makes infallibility the syntactic default and makes you write Result to indicate fallibility. People often don't want to color their functions this way. Guess what happens when a programmer is six levels deep in infallible-colored function calls and does something that can fail.
.unwrap()
Guess what, in Rust, is fallible?
Mutex acquire.
Guess what you need to do often on infallible cleanup paths?
At Facebook they name certain "escape hatch" functions in a way that inescapably make them look like a GIANT EYESORE. Stuff like DANGEROUSLY_CAST_THIS_TO_THAT, or INVOKE_SUPER_EXPENSIVE_ACTION_SEE_YOU_ON_CODE_REVIEW. This really drives home the point that such things must not be used except in rare extraordinary cases.
If unwrap() were named UNWRAP_OR_PANIC(), it would be used much less glibly. Even more, I wish there existed a super strict mode when all places that can panic are treated as compile-time errors, except those specifically wrapped in some may_panic_intentionally!() or similar.
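A hypothetical sketch of what such a loudly named escape hatch could look like as an extension trait; nothing like this exists in std, it's purely illustrative:

use std::fmt::Debug;

trait UnwrapOrPanic<T> {
    // Deliberately an eyesore, and it demands a written justification.
    fn unwrap_or_panic(self, justification: &str) -> T;
}

impl<T, E: Debug> UnwrapOrPanic<T> for Result<T, E> {
    fn unwrap_or_panic(self, justification: &str) -> T {
        self.unwrap_or_else(|e| panic!("UNWRAP_OR_PANIC ({justification}): {e:?}"))
    }
}

fn main() {
    let n = "42".parse::<i32>().unwrap_or_panic("literal is a valid i32");
    println!("{n}");
}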
React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED comes to mind. I did have to reach for this before, but it certainly works for keeping it out of example code, and you can't read another implementation that uses it without the danger being very apparent.
At some point it was renamed to __CLIENT_INTERNALS_DO_NOT_USE_OR_WARN_USERS_THEY_CANNOT_UPGRADE which is much less fun.
right and if the language designers named it UNWRAP_OR_PANIC() then people would rightfully be asking why on earth we can't just use a try-catch around code and have an easier life
But a panic can be caught and handled safely (e.g. via std::panic tools). I'd say that this is the correct use case for exceptions (ask Martin Fowler, of all people).
There is already a try/catch around that code, which produces the Result type, which you can presumptuously .unwrap() without checking if it contains an error.
Instead, one should use the question mark operator, that immediately returns the error from the current function if a Result is an error. This is exactly similar to rethrowing an exception, but only requires typing one character, the "?".
How so? An exception is a value that's given the closest, conceptually appropriate, point that was decided to handle the value, allowing you to keep your "happy path" as clean code, and your "exceptional circumstances" path at the level of abstraction that makes sense.
It's way less book-keeping with exceptions, since you, intentionally, don't have to write code for that exceptional behavior, except where it makes sense to. The return by value method, necessarily, implements the same behavior, where handling is bubbled up to the conceptually appropriate place, through returns, but with much more typing involved. Care is required for either, since not properly bubbling up an exception can happen in either case (no re-raise for exceptions, no return after handling for return).
There are many many pages of text discussing this topic, but having programmed in both styles, exceptions make it too easy for programmer to simply ignore them. Errors as values force you to explicitly handle it there, or toss it up the stack. Maybe some other languages have better exception handling but in Python it’s god awful. In big projects you can basically never know when or how something can fail.
I would claim the opposite. If you don't catch an exception, you'll get a halt.
With return values, you can trivially ignore an error.
let _ = fs::remove_file("file_doesn't_exist");
or
value, error = some_function()
// carry on without doing anything with error
In the wild, I've seen far more ignoring of returned errors, because of the mechanical burden of having to type out handling at every function call.
This is backed by decades of writing libraries. I've tried to implement libraries without exceptions (admittedly my cargo-cult preference long ago), but ignoring errors was so prevalent among the users of those libraries that I now always include a "raise"-type boolean that defaults to True on any call that would otherwise return an error value, to force exceptions, and their handling, as the default behavior.
> In big projects you can basically never know when or how something can fail.
How is this fundamentally different than return value? Looking at a high level function, you can't know how it will fail, you just know it did fail, from the error being bubbled up through the returns. The only difference is the mechanism for bubbling up the error.
Maybe some water is required for this flame war. ;)
I'd categorize them more as "event handlers" than "hidden". You can't know where the execution will go at a lower level, but that's the entire point: you don't care. You put the handlers at the points where you care.
Correction: unchecked exceptions are hidden control flow. Checked exceptions are quite visible, and I think that more languages should use them as a result.
...and you can? try-catch is usually less ergonomic than the various ways you can inspect a Result.
try {
    data = some_sketchy_function();
} catch (e) {
    handle the error;
}
vs
result = some_sketchy_function();
if let Err(e) = result {
    handle the error;
}
Or better yet, compare the problematic cases where the error isn't handled:
data = some_sketchy_function();
vs
data = some_sketchy_function().UNWRAP_OR_PANIC();
In the former (the try-catch version that doesn't try or catch), the lack of handling is silent. It might be fine! You might just depend on your caller using `try`. In the latter, the compiler forces you to use UNWRAP_OR_PANIC (or, in reality, just unwrap) or `data` won't be the expected type and you will quickly get a compile failure.
What I suspect you mean, because it's a better argument, is:
which is fair, although how often is it really the right thing to let all the errors from 4 independent sources flow together and then get picked apart after the fact by inspecting `e`? It's an easier life, but it's also one where subtle problems constantly creep in without the compiler having any visibility into them at all.
Unwrap isn't a synonym for laziness, it's just like an assertion, when you do unwrap() you're saying the Result should NEVER fail, and if does, it should abort the whole process. What was wrong was the developer assumption, not the use of unwrap.
It also makes it very obvious in the code, something very dangerous is happening here. As a code reviewer you should see an unwrap() and have alarm bells going off. While in other languages, critical errors are a lot more hidden.
> What was wrong was the developer assumption, not the use of unwrap.
How many times can you truly prove that an `unwrap()` is correct and that you also need that performance edge?
Ignoring the performance aspect, which is often pulled out of a hat, to prove such a thing you need to be aware of the inner workings of the call giving you a `Result`. That knowledge is only valid at the time of writing your `unwrap()`, but won't necessarily hold later.
Also, aren't you implicitly forcing whoever changes the function to check for every smartass dev that decided to `unwrap` at their callsite? That's bonkers.
I doubt that this unwrap was added for performance reasons; I suspect it was rather added because the developer temporarily didn't want to deal with what they thought was an unlikely error case while they were working on something else; and no other system recognized that the unwrap was left in and flagged it before it was deployed on production servers.
If I were Cloudflare I would immediately audit the codebase for all uses of unwrap (or similar rust panic idioms like expect), ensure that they are either removed or clearly documented as to why it's worth crashing the program there, and then add a linter to their CI system that will fire if anyone tries to check in a new commit with unwrap in it.
Panics are for unexpected error conditions, like your caller passed you garbage. Results are for expected errors, like your caller passed you something but it's your job to tell if it's garbage.
So the point of unwrap() is not to prove anything. Like an assertion it indicates a precondition of the function that the implementer cannot uphold. That's not to say unwrap() can't be used incorrectly. Just that it's a valid thing to do in your code.
> No more than returning an int by definition means the method can return -2.
What? Returning an int does in fact mean that the method can return -2. I have no idea what your argument is with this, because you seem to be disagreeing with the person while actually agreeing with them.
The difference is functions which return Result have explicitly chosen to return a Result because they can fail. Sure, it might not fail in the current implementation and/or configuration, but that could change later and you might not know until it causes problems. The type system is there to help you - why ignore it?
As a hypothetical example, when making a regex, I call `Regex::new(r"\d+")`, which returns a result because my regex could be malformed and it could miscompile. It is entirely reasonable to unwrap this, though, as I will find out pretty quickly whether it works or fails once I test the program.
Because it would be a huge hassle to go into that library and write an alternate version that doesn't return a Result. So you're stuck with the type system being wrong in some way. You can add error-handling code upfront but it will be dead code at that point in time, which is also not good.
Yeah, I think I expressed wrongly here. A more correct version would be: "when you do unwrap() you're saying that an error on this particular path shouldn't be recoverable and we should fail-safe."
It's a little subtler than this. You want it to be easy to not handle an error while developing, so you can focus on getting the core logic correct before error-handling; but you want it to be hard to deploy or release the software without fully handling these checks. Some kind of debug vs release mode with different lints seems like a reasonable approach.
It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
At my employer, we have a small script that automatically checks such generated config files. It does a diff between the old and the new version, and if the diff size exceeds a threshold (either total or relative to the size of the old file), it refuses to do the update, and opens a ticket for a human to look over it.
It has somewhat regularly saved us from disaster in the past.
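Roughly the shape of such a check; the thresholds, the file contents, and the refusal action here are invented:

use std::collections::HashSet;

// Refuse an update whose diff against the previous file is suspiciously large.
fn update_looks_safe(old: &str, new: &str) -> bool {
    let old_lines: HashSet<&str> = old.lines().collect();
    let changed = new.lines().filter(|l| !old_lines.contains(l)).count();
    // Arbitrary thresholds: 50 changed lines, or 20% of the old file's size.
    let threshold = std::cmp::max(50, old.lines().count() / 5);
    changed <= threshold
}

fn main() {
    let old = "feature_a\nfeature_b\n";
    let new = "feature_a\nfeature_b\nfeature_c\n";
    if update_looks_safe(old, new) {
        println!("applying new config");
    } else {
        println!("refusing update; opening a ticket for a human");
    }
}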
I don't think this system is best thought of as "deployment" in the sense of CI/CD; it's a control channel for a distributed bot detection system that (apparently) happens to be actuated by published config files (it has a consul-template vibe to it, though I don't know if that's what it is).
That's why I likened it to Crowdstrike. It's a signature database that blew up the consumer of said database. (You probably caught my post mid-edit, too. You may be replying to the snarky paragraph I thought better of and removed.)
Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.
Code and Config should be treated similarly. If you would use a ring based rollout, canaries, etc for safely changing your code, then any config that can have the same impact must also use safe rollout techniques.
You're the nth person on this thread to say that and it doesn't make sense. Events that happen multiple times per second change data that you would call "configuration" in systems like these. This isn't `sendmail.cf`.
If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.
But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.
> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.
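A sketch of how points 1 and 2 could fit together; the file names and the validation rule are invented:

use std::fs;

// Reject files over the (hypothetical) 200-entry limit.
fn validate(raw: &str) -> Result<(), String> {
    let n = raw.lines().count();
    if n > 200 {
        return Err(format!("feature file has {n} entries, limit is 200"));
    }
    Ok(())
}

// Try the freshly pushed file first; if it's missing or invalid, log the error
// and fall back to the previous known-good copy instead of crashing.
fn load_feature_file() -> Result<String, String> {
    for path in ["features.new", "features.last-good"] {
        match fs::read_to_string(path) {
            Ok(raw) => match validate(&raw) {
                Ok(()) => return Ok(raw),
                Err(e) => eprintln!("{path}: {e}, trying fallback"),
            },
            Err(e) => eprintln!("{path}: {e}, trying fallback"),
        }
    }
    Err("no usable feature file".to_string())
}

fn main() {
    match load_feature_file() {
        Ok(raw) => println!("loaded {} features", raw.lines().count()),
        Err(e) => eprintln!("{e}"),
    }
}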
Yep, a decent canary mechanism should have caught this. There's a trade off between canarying and rollout speed, though. If this was a system for fighting bots, I'd expect it to be optimized for the latter.
Presumably optimal rollout speed entails something like or as close to ”push it everywhere all at once and activate immediately” that you can get — that’s fine if you want to risk short downtime rather than delays in rollout, what I don’t understand is why the nodes don’t have any independent verification and rollback mechanism. I might be underestimating the complexity but it really doesn’t sound much more involved than a process launching another process, concluding that it crashed and restarting it with different parameters.
I think they need to strongly evaluate if they need this level of rollout speed. Even spending a few minutes with an automated canary gives you a ton of safety.
Even if the servers weren't crashing, it is possible that a bad set of parameters results in far too many false positives, which may as well be complete failure.
Everyone is hating on unwrap, but to me the odd and more interesting part is that it took 3 hours to figure this out? Even with a DDoS red herring, shouldn’t there have been a crash log or telemetry anomaly correlated? Also, shouldn’t the next steps and resolution focus more on this aspect, since it’s a high leverage tool for identifying any outage caused by a panic rather than just preventing a recurrence of random weird edge case #9999999?
I have nowhere near the experience managing such complex systems, but I can empathize with this. In high-pressure situations the most obvious things get missed. If someone is convinced System X is at fault, your mind can make leaps to justify that every other degraded system is a downstream effect of that. Cause and effect can get switched.
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
I have plenty of empathy, having been in plenty of similar situations. It's not a matter of "I can't BELIEVE it took that long" (although it is a bit surprising) so much as that I disagree with the key takeaways here in the HN comments section and in the blog itself, which focus strongly on fixing rare edge case issues (the bad ClickHouse query and a bad config file causing a panic via unwrap), rather than reducing MTTR for all issues by improving the debug and monitoring experience.
I'm also suspicious that
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
from the blog had a lot more to do with the issue than perhaps the narrative is letting on.
Yes this is the weird part for me. With good monitoring, the panic at unwrap should have been detected immediately. I assume they weren't looking at the right place, but still. If you use Sentry for example, a brand new panic should be pretty visible.
Indeed, nothing about the root issues is particularly surprising, but it's unclear why a critical service panicking across their fleet didn't bubble up.
My best guess is too many alerts firing without a clear hierarchy or a way to separate cause from effect. It's a typical challenge, but I wish they would shed some light on that. And it's a bit concerning that improving observability is not part of their follow-up steps.
This took one of the three hours; it seems to have taken from 11:28 to 13:37 to recognize that the configuration file panic was the cause of the issue.
> work has already begun on how we will harden them against failures like this in the future. In particular we are:
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
They require the bot management config to update and propagate quickly in order to respond to attacks - but this seems like a case where updating a single instance first would have seen the panic and stopped the deploy.
I wonder why ClickHouse is used to store the feature flags here, as it has its own duplication footguns[0] which could have also easily led to a query blowing up 2-3x in size. OLTP/SQLite seems more suited, but I'm sure they have their reasons.
I don't think sqlite would come close to their requirements for permissions or resilience, to name a couple. It's not the solution for every database issue.
Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.
I think you're oversimplifying the problem they had, and I would encourage you to dive in to the details in the article. There wasn't a problem with the database, it was with the query used to generate the configs. So if an analogous issue arose with a query against one of many ad-hoc replicated sqlite databases, you'd still have the failure.
I love sqlite for some things, but it's not The One True Database Solution.
Global configuration is useful for low response times to attacks, but you need to have very good ways to know when a global config push is bad and to be able to rollback quickly.
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
It seems they had this continuous rollout for the config service, but the services consuming it were affected even by a small percentage of these config providers being faulty, since they were auto-updating their configs every few minutes. And it seems there is a reason for these updating so fast, presumably having to react to threat actors quickly.
It's in everyone's interest to mitigate threats as quickly as possible. But it's of even greater interest that a core global network infrastructure service provider not DOS a significant proportion of the Internet by propagating a bad configuration too quickly. The key here is to balance responsiveness against safety, and I'm not sure they struck the right balance here. I'm just glad that the impact wasn't as long and as severe as it could have been.
In my 30 years of reliability engineering, I've come to learn that this is a distinction without a difference.
People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.
They narrowed down the actual problem to some Rust code in the Bot Management system that enforced a hard limit on the number of configuration items by returning an error, but the caller was just blindly unwrapping it.
A dormant bug in the code is usually a condition precedent to incidents like these. Later, when a bad input is given, the bug then surfaces. The bug could have laid dormant for years or decades, if it ever surfaced at all.
The point here remains: consider every change to involve risk, and architect defensively.
If they're going to yeet configs into production, they ought to at least have plenty of mitigation mechanisms, including canary deployments and fault isolation boundaries. This was my primary point at the root of this thread.
And I hope fly.io has these mechanisms as well :-)
It's great that you're working on regionalization. Yes, it is hard, but 100x harder if you don't start with cellular design in mind. And as I said in the root of the thread, this is a sign that CloudFlare needs to invest in it just like you have been.
I recoil from that last statement not because I have a rooting interest in Cloudflare but because the last several years of working at Fly.io have drilled Richard Cook's "How Complex Systems Fail"† deep into my brain, and what you said runs afoul of Cook #18: Failure free operations require experience with failure.
If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.
Thank you for saying it. I’m getting exasperated at how many people in the comments are making some variant of the “lazy programmer wrote code that took a shortcut” argument.
Complex system failures are not monocausal! Complex systems are in a continuous state of partial failure!
Suppose they did have the cellular architecture today, but every other fact was identical. They'd still have suffered the failure! But it would have been contained, and the damage would have been far less.
Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).
Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.
As Cook points out, "Safety is a characteristic of systems and not of their components."
What variant of cellular architecture are you referring to? Can you give me a link or few? I'm fascinated by it and I've led a team to break up a monolithic solution running on AWS to a cellular architecture. The results were good, but not magic. The process of learning from failures did not stop, but it did change (for the better).
No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.
If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required. Did your service ever fail in multiple regions simultaneously?
Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:
It wasn't worth thinking about. I'm not going to defend myself against arguments and absolute claims I didn't make. The key word here is mitigation, not perfection.
> If your AWS service is properly regionalized, that’s the minimum amount of cellular architecture required
Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe whatever you are proposing solves that exact problem by relying on multi-regions.
Come to think of it, Cloudflare’s outage today is another good counterexample.
It has been a very, very long time since AWS had a simultaneous failure across multiple regions. Even customers impacted by the loss of Route 53 control plane functionality in last month’s us-east-1 outage were able to gracefully fail over to a backup region if they configured failover records in advance, had Application Recovery Controller set up, or fronted their APIs or websites with Global Accelerator.
Customers survive incidents on a daily basis by failing over across regions (even in the absence of an AWS regional failure, they can fail due to a bad deployment or other cause). The reason you don’t hear about it is because it works.
Reframe this problem: instead of bot rules being propagated, it's the enrollment of a new customer or a service at an existing customer --- something that must happen at Cloudflare several times a second. Does it still make sense to you to think about that in terms of "pushing new configuration to prod"?
Those aren't the facts before us. Also, CRUD operations relating to a specific customer or user tend not to cause the sort of widespread incidents we saw today.
it's always a config push. people rollout code slowly but don't have the same mechanisms for configs. But configs are code, and this is a blind spot that causes an outsized percentage of these big outages.
"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
The status page is hosted on AWS Cloudfront, right? It sure looks like Cloudfront was overwhelmed by the traffic spike, which is a bit concerning. Hope we'll see a post from their side.
CloudFront has quotas[0] and they likely just hit those quota limits. To request higher quotas requires a service ticket. If they have access logs enabled in CloudFront they could see what the exact error was.
And since it seems this is hosted by Atlassian, this would be up to Atlassian.
It looks a lot like a CloudFront error we randomly saw today from one of our engineers in South America. I suspect there was a small outage in AWS but can't prove it.
Probably a non-zero number of companies use CloudFront and other CDNs as a fallback for Cloudflare, or run a blended CDN, so it's not surprising to see other CDNs hit with a thundering herd when Cloudflare went down.
This situation reminds me of risk assessment, where you sometimes assume two rare events are independent, but later learn they are actually highly correlated.
It seems like there's a good chance that, despite thinking their status page was completely independent of Cloudflare, enough of the internet is dependent on Cloudflare now that they're simply wrong about the status page's independence.
Based on this writeup it seems that Cloudflare defaults to a 0 score for bot prevention if there's a failure. Could instead it default to a passing score? Default open instead of default closed? This would have been a non-event to a lot of websites if that change was made.
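In sketch form (the names and the passing value are invented, and Cloudflare's actual scoring semantics may differ):

// Low bot scores get blocked by customer rules, so the default used when the
// score can't be computed decides whether the system fails open or closed.
fn effective_score(computed: Option<u8>) -> u8 {
    // Fail closed would be computed.unwrap_or(0): failures look like "definitely a bot".
    // Fail open: treat failures as "probably human" and let the traffic through.
    computed.unwrap_or(99)
}

fn main() {
    println!("score when computation fails: {}", effective_score(None));
}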
When I first read about it I assumed it would have been a "poison pill" - a bad config where the ingestion of the config leads the process to crash/restart. And due to that crash on startup, there is no automated possibility to revert to a good config. These things are the worst issues that all global control planes have to deal with.
The report actually seems to confirm this - it was indeed a crash on ingesting the bad config.
However I'm actually surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".
The problem mostly seems to have been "we didn't know what's going on". A look into the proxy logs would hopefully have shown the stack trace/unwrap, and metrics about the incoming requests would hopefully have shown that there was no abnormal amount of requests coming in.
Why does Cloudflare allow unwraps in their code? I would've assumed they'd have Clippy lints stopping that sort of thing. Why not just match with `{ Ok(value) => ..., Err(error) => ... }`? The function already has a Result return type.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
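To make that concrete, here's a minimal sketch (hypothetical names and limits, nothing from Cloudflare's actual codebase) of the two alternatives: propagating with `?`, or matching and degrading at the call site:

```rust
// Hypothetical error type and feature-file parser for illustration only.
#[derive(Debug)]
enum ConfigError {
    Empty,
    TooManyFeatures(usize),
}

fn parse_feature_file(raw: &str) -> Result<Vec<String>, ConfigError> {
    if raw.is_empty() {
        return Err(ConfigError::Empty);
    }
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        // Assumed preallocation limit; erroring here beats panicking later.
        return Err(ConfigError::TooManyFeatures(features.len()));
    }
    Ok(features)
}

fn load_features(raw: &str) -> Result<Vec<String>, ConfigError> {
    // `?` propagates the error to the caller instead of panicking in place.
    let features = parse_feature_file(raw)?;
    Ok(features)
}

fn load_or_keep_previous(raw: &str, previous: Vec<String>) -> Vec<String> {
    // A `match` lets the caller degrade gracefully, e.g. keep the old file.
    match load_features(raw) {
        Ok(features) => features,
        Err(e) => {
            eprintln!("feature file rejected ({e:?}); keeping previous version");
            previous
        }
    }
}
```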
Not a cloudflare employee but I do write a lot of Rust. The amount of things that can go wrong with any code that needs to make a network call is staggeringly high. unwrap() is normal during development phase but there are a number of times I leave an expect() for production because sometimes there's no way to move forward.
Yeah it seems likely that even if there wasn't an unwrap, there would have been some error handling that wouldn't have panicked the process, but would have still left it inoperable if every request was instead going through an error path.
I'm in a similar boat; at the very least an expect can give hints as to what happened. However, this can also be problematic if you're a library developer. Sometimes Rust is expected to never panic, especially in situations like WASM. This is a major problem for companies like Amazon Prime Video since they run in a WASM context for their TV app. Any panic crashes everything. Personally I usually either create a custom error type (preferred) or erase it away with `Box<dyn Error>` (no other option). Random unwraps and expects haunt my dreams.
unwrap() is only the most superficial part of the problem. Merely replacing `unwrap()` with `return Err(code)` wouldn't have changed the behavior. Instead of "error 500 due to panic" the proxy would fail with "error 500 due to $code".
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have been even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
We don't know what the surrounding code looks like, but I'd expect it handles the error case that's expressed in the type signature (unless they `.unwrap()` there too).
The problem is that they didn't surface a failure case, which means they couldn't handle rollouts of invalid configurations correctly.
The use of `.unwrap()` isn't superficial at all -- it hid an invariant that should have been handled above this code. The failure to correctly account for and handle those true invariants is exactly what caused this failure mode.
Lots of people here are (perhaps rightfully) pointing to the unwrap() call being an issue. That might be true, but to me the fact that a reasonably "clean" panic at a defined line of code was not quickly picked up in any error monitoring system sounds just as important to investigate.
Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
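For illustration, a rough sketch (not Cloudflare's setup, just the standard library hooks) of making those panics stand out: a process-wide panic hook that logs the exact file and line, which a Sentry-like aggregator can then group on:

```rust
use std::panic;

fn install_panic_reporting() {
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        let location = info
            .location()
            .map(|l| format!("{}:{}", l.file(), l.line()))
            .unwrap_or_else(|| "unknown location".to_string());
        // In a real system this would go to structured logs / an error tracker,
        // where identical locations cluster into one obvious signal.
        eprintln!("PANIC at {location}: {info}");
        default_hook(info);
    }));
}
```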
As an IT person, I wonder what it's like to work for a company like this, where presumably IT stuff has priority. Unlike the companies I've worked for, where IT takes a backseat to everything until something goes wrong. The company I work for had a huge new office built, with the plan that it would be big enough for future growth, yet despite repeated attempts to reserve a larger space, our server room and infrastructure is actually smaller than in our old building and has no room to grow.
As a former CF employee, I'd say it's a mixed bag.
There are plenty of resources, yet it's somehow never enough. You do tons of pretty amazing things with pretty amazing tools that also have notable shortcomings.
You're surrounded by smart people who do lots of great work, but you also end up in incident reviews where you find facepalm-y stuff. Sometimes you even find out it was a known corner case that was deemed too unlikely to prioritize.
The last incident for my team that I remember dealing with there ended up with my coworker and I realizing the staging environment we'd taken down hours earlier was actually the source of data for a production dashboard, so we'd lost some visibility and monitoring for a bit.
I've also worked at Facebook (pre-Meta days) and at Datadog, and I'd say it was about the same. Most things are done quite well, but so much stuff is happening that you still end up with occasional incidents that feel like they shouldn't have happened.
There's (obviously) a lot of discussion around the use of `unwrap` in production code. I feel like I'm watching comments speak past each other right now.
I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There are lots of good reasons to make the system fail hard. I'd lean towards an `expect` here, but whatever.
That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.
Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it was the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20/20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.
Also, it seems like the initial ClickHouse changes could've been tested much better, but I'm sure the CF team realizes that.
On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.
> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
At the end of the day, let's pick languages/tech because of what we love to do. If you love Rust, pick it all day. I actually want to try it for industrial robot stuff or small controllers, etc.
There's no bad language, just occasional hiccups from us users who use those tools.
You misunderstand what Rust’s guarantees are. Rust has never promised to solve or protect programmers from logical or poor programming. In fact, no such language can do that, not even Haskell.
Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic example of over-indexing on the failure cause when something fails, while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever this Cloudflare outage cost, whether millions or hundreds of millions of dollars, it has been more than paid for by the savings from safe memory access.
Well, no, most Rust programmers misunderstand what the guarantees are because they keep parroting this quote. Obviously the language does not protect you from logic errors, so saying "if it compiles, it works" is disingenuous, when really what they mean is "if it compiles, it's probably free of memory errors".
No, the "if it compiles, it works" is genuinely about the program being correct rather than just free of memory errors, but it's more of a hyperbolic statement than a statement of fact.
It's a common thing I've experienced and seen a lot of others say that the stricter the language is in what it accepts the more likely it is to be correct by the time you get it to run. It's not just a Rust thing (although I think Rust is _stricter_ and therefore this does hold true more of the time), it's something I've also experienced with C++ and Haskell.
So no, it's not a guarantee, but that quote was never about Rust's guarantees.
I have definitely noticed this when I've tried doing Advent of Code in Rust: by the time my code compiles it typically spits out the right answer. It doesn't help me when I don't know whatever algorithm I need to reach for in order to solve it before the heat death of the universe, but it is a somewhat magical feeling while it lasts.
> Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought
I mean, that's an unfalsifiable statement, not really fair. C is used to successfully launch spaceships.
Whereas we have a real Rust bug that crashed a good portion of the internet for a significant amount of time. If this was a C++ service everyone would be blaming the language, but somehow Rust evangelicals are quick to blame it on "unidiomatic Rust code".
A language that lets this easily happen is a poorly designed language. Saying you need to ban a commonly used method in all production code is broken.
Only formal proof languages are immune to such problems. Therefore all languages are poorly designed by your metric.
Consider that the set of possible failures enabled by language design should be as small as possible.
Rust's set is small enough while also being productive. Until another breakthrough in language design as impactful as the borrow checker is invented, I don't imagine more programmers will be able to write such a large amount of safe code.
> Rust won't save you from the usual programming mistake.
Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.
> Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No one treats it like that, and nearly every Rust project is filled with unwraps all over the place, even in production systems like Cloudflare's.
It's literally not, Rust tutorials are littered with `.unwrap()` calls. It might be Rust 102, but the first impression given is that the language is surprisingly happy with it.
If you haven't read the Rust Book at least, which is effectively Rust 101, you should not be writing Rust professionally. It has a chapter explaining all of this.
Yep, unwrap() and unsafe are escape hatches that need very good justifications. It's fine for casual scripts where you don't care if it crashes. For serious production software they should be either banned, or require immense scrutiny.
> you can have a linter forbid it entirely if you like.
It would be better if it were the other way round: the linter forbids it unless you explicitly ask it not to. It's never wrong to allow users to shoot themselves in the foot, but it should be explicit.
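For what it's worth, Clippy already ships restriction lints for exactly this; they're off by default, but opting in is a one-liner at the crate root (a sketch, assuming you run `cargo clippy` in CI):

```rust
// Turn the debate into tooling: forbid unwrap()/expect() outside of tests,
// so bypassing them requires an explicit, reviewable #[allow(...)].
#![cfg_attr(not(test), deny(clippy::unwrap_used, clippy::expect_used))]

fn main() {
    let parsed: Result<u32, _> = "42".parse();

    // let n = parsed.unwrap(); // would now fail `cargo clippy` in non-test builds

    // The handled version compiles and lints cleanly.
    let n = parsed.unwrap_or(0);
    println!("{n}");
}
```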
> Well Rust won't save you from the usual programming mistake
This is not a Rust problem. Someone consciously chose NOT to handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
And the people doing the testing failed to ignore the "this will never happen" excuse and test it anyway. With this kind of system you need a separate group that ignores any "this will never happen" and still checks what happens if it does.
Now it might be that it was tested, but then ignored or deprioritised by management...
What people are saying is that idiomatic prod rust doesn't use unwrap/expect (both of which panic on the "exceptional" arm of the value) --- instead you "match" on the value and kick the can up a layer on the call chain.
What happens to it up the callstack? Say they propagated it up the stack with `?`. It has to get handled somewhere. If you don't introduce any logic to handle the duplicate databases, what else are you going to do when the types don't match up besides `unwrap`ing, or maybe emitting a slightly better error message? You could maybe ignore that module's error for that request, but if it was a service more critical than bot mitigation you'd still have the same symptom of getting 500'd.
as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.
so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.
that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.
and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.
that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.
in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
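a rough sketch of that shape (hypothetical types and limits; a real deployment would use a Prometheus counter and a proper file watcher rather than a sleep loop):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, RwLock};
use std::time::Duration;

// "Count of errors from attempting to load the definition file" -- should be
// zero in normal conditions, so it is easy to alert on.
static FEATURE_FILE_LOAD_ERRORS: AtomicU64 = AtomicU64::new(0);

struct FeatureDefs {
    features: Vec<String>,
}

fn try_load(path: &str) -> Result<FeatureDefs, String> {
    let raw = std::fs::read_to_string(path).map_err(|e| e.to_string())?;
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        return Err(format!("too many features: {}", features.len()));
    }
    Ok(FeatureDefs { features })
}

fn ingest_loop(path: String, current: Arc<RwLock<FeatureDefs>>) {
    loop {
        match try_load(&path) {
            Ok(new_defs) => {
                // Swap in the newly validated file.
                let mut guard = match current.write() {
                    Ok(g) => g,
                    Err(poisoned) => poisoned.into_inner(),
                };
                *guard = new_defs;
            }
            Err(e) => {
                // Keep serving the file loaded ~5 minutes ago and bump the metric.
                FEATURE_FILE_LOAD_ERRORS.fetch_add(1, Ordering::Relaxed);
                eprintln!("feature file rejected, keeping previous version: {e}");
            }
        }
        std::thread::sleep(Duration::from_secs(300)); // the ~5 minute cadence
    }
}
```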
Given that the bug was elsewhere in the system (the config file parser spuriously failed), it’s hard to justify much of what you suggested.
Panics should be logged, and probably grouped by stack trace for things like prometheus (outside of process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at cloudflare scale.
Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.
One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.
Sure, you could add an error handler for that too, and for prometheus is being slow, and an infinite other things. Or, you could just move process management and reporting out of process.
The way I’ve seen this on a few older systems was that they always keep the previous configuration around so it can switch back. The logic is something like this:
1. At startup, load the last known good config.
2. When signaled, load the new config.
3. When that passes validation, update the last-known-good pointer to the new version.
That way, a crash like this is recoverable, on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn't even attempt to parse the latest one until it was changed again.
For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.
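A rough in-memory sketch of that two-slot pattern with the delayed promotion from step 3 (names and structure are hypothetical, not any particular production system):

```rust
use std::time::{Duration, Instant};

struct ConfigStore {
    // Config contents that survived a soak period without breaking anything.
    last_known_good: String,
    // Newly loaded config awaiting promotion, plus when it was loaded.
    candidate: Option<(String, Instant)>,
    soak: Duration,
}

impl ConfigStore {
    fn new(initial: String, soak: Duration) -> Self {
        Self { last_known_good: initial, candidate: None, soak }
    }

    // Step 2: when signaled, load the new config as a candidate only.
    fn load_candidate(&mut self, new_config: String) {
        self.candidate = Some((new_config, Instant::now()));
    }

    // Step 3, delayed: promote the candidate once it has run long enough.
    fn tick(&mut self) {
        let ready = matches!(&self.candidate, Some((_, since)) if since.elapsed() >= self.soak);
        if ready {
            if let Some((cfg, _)) = self.candidate.take() {
                self.last_known_good = cfg;
            }
        }
    }

    // What the proxy should currently be using (the candidate while it soaks).
    fn active(&self) -> &str {
        self.candidate
            .as_ref()
            .map(|(cfg, _)| cfg.as_str())
            .unwrap_or(self.last_known_good.as_str())
    }

    // Crash-recovery path: drop the candidate and fall back to known-good.
    fn revert(&mut self) {
        self.candidate = None;
    }
}
```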
The config file subsystem was where the bug lived, not the code with the unwrap, so this sort of change is a special case of “make the unwrap never fail and then fix the API so it is not needed”.
"if it compiles it runs" - this is indeed an inaccurate marketing slogan. A more precise formulation would be "if it compiles then the static type system, pattern matching, explicit errors, Send bounds, etc. will have caught a lot of bugs that in other languages would have manifested as runtime errors".
Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either.
Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.
> I don't use Rust, but a lot of Rust people say if it compiles it runs.
Do you grok what the issue was with the unwrap, though...?
Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.
It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.
While I agree that Rust got it right by being more explicit, a lot of bugs in C/C++ can also easily be avoided with good engineering practices. The Rust argument that it is mainly the fault of the programming language with C/C++ was always a huge and unfair exaggeration. Now, with this entirely predictable ".unwrap" disaster (in general, not necessarily this exact scenario), the "no true Rustacean would have put unwrap in production" fallacy is sad and funny at the same time.
Unwrap is controversial. The problem is that if you remove it, it makes the bar even higher for newcomers to Rust. One solution is to make it unsafe (along with panic).
Long time ago Google had a very similar incident where ddos protection system ingested a bad config and took everything down. Except it was auto resolved in like four minutes by an automatic rollback system before oncall was even able to do anything. Perhaps Cloudflare should invest in a system like that
absolutely not normal, this is why in my opinion it took them so long to understand the core issue. instead of a nice error message and a backtrace saying something like "failed to parse config feature names" they thought they were under attack because the service was just crashing instead.
The most surprising thing to me here is that it took 3 hours to root cause, and points to a glaring hole in the platform observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.
That’s not accurate. As with any incident response, there were a number of theories of the cause we were working in parallel. The feature file failure was identified as a potential cause in the first 30 minutes. However, the theory that seemed the most plausible, based on what we were seeing (intermittent failures, initially concentrated in the UK, a spike in errors for certain API endpoints) as well as what else we’d been dealing with (a botnet that had escalated DDoS attacks from 3 Tbps to 30 Tbps against us and others like Microsoft over the last 3 months), was that we were under attack. We worked multiple theories in parallel. After an hour we ruled out the DDoS theory. We had other theories also running in parallel, but at that point the dominant theory was that the feature file was somehow corrupt. One thing that made us initially question the theory was that nothing in our changelogs seemed like it would have caused the feature file to grow in size. It was only after the incident that we realized the database permissions change had caused it, but that was far from obvious.
Even after we identified the problem with the feature file, we did not have an automated process to roll the feature file back to a known-safe previous version. So we had to shut down the reissuance and manually insert a file into the queue. Figuring out how to do that took time and waking people up, as there are lots of security safeguards in place to prevent an individual from easily doing that. We also needed to double check we wouldn’t make things worse. The propagation then takes some time, especially because there are tiers of caching of the file that we had to clear. Finally, we chose to restart the FL2 processes on all the machines that make up our fleet to ensure they all loaded the corrected file as quickly as possible. That’s a lot of processes on a lot of machines.
So I think the best description is that it took us an hour for the team to coalesce on the feature file being the cause and then another two to get the fix rolled out.
Thank you for the clarification and insight, with that context it does make more sense to me. Is there anything you think can be done to improve the ability to identify issues like this more quickly in the future?
Any "limits" on a system should be alerted on, e.g. at a 70% or 80% threshold. It might be worth it for an SRE to revisit the system limits and ensure there is threshold-based alerting around them.
If one actually looks at the current pingora API, it has limited ability to initialize async components at startup - the current pattern seems to be to lazily initialize on first call. An obvious downside of this is that a service can startup in a broken state. e.g. https://github.com/cloudflare/pingora/issues/169
I can imagine that this could easily lead to less visibility into issues.
We shouldn't be having critical internet-wide outages on a monthly basis. Something is systematically wrong with the way we're architecting our systems.
Cloudflare, Azure, and other single points of failure are solving issues inherent to webhosting, and those problems have become incredibly hard due to the massive scale of bad actors and the massive complexity of managing hardware and software.
What would you propose to fix it? The fixed cost of being DDoS-proof is in the hundreds of millions of dollars.
> Costs to architect systems that serve millions of request daily have gone down. Not up.
I never said serving millions of requests is more expensive. Protecting your servers is more expensive.
> Hell, I would be very curious to know the costs to keep HackerNews running. They probably serve more users than my current client.
HN uses Cloudflare. You're making my point for me. If you included the fixed costs that Cloudflare's CDN/proxy is giving to HN incredibly cheaply, then running HN at the edge with good performance (and protecting it from botnets) would cost hundreds of millions of dollars.
> People want to chase the next big thing to write it on their CV, not architect simple systems that scale. (Do they even need to scale?)
Again, attacking your own straw men here.
Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
From the ping output, I can see HN is using m5hosting.com. This is why HN was up yesterday, even though everything on CF was down.
> Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
Writing proper high-throughput applications was never easy and never will be. It is a little bit easier because we have highly optimized tools like nginx or Node.js, so we can offload the critical parts. And hosting is "harder than ever" only if you complicate the matter, which is a quite common pattern these days. I've seen people running monstrosities to serve some HTML & JS in the name of redundancy. You'd be surprised how much a single bare-metal server (hell, even a proper VM from DigitalOcean or Vultr) can handle.
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally an error with bot management config files should probably disable bot management vs crash the core proxy.
I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.
Right, but this is also a pretty common pattern in distributed systems that publish from databases (really, any large central source of truth); it might be the problem in systems like this. When you're lucky the corner cases are obvious; in the big one we experienced last year, a new row in our database tripped an if-let/mutex deadlock, which our system dutifully (and very quickly) propagated across our entire network.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
I integrated Turnstile with a fail-open strategy that proved itself today. Basically, if the Turnstile JS fails to load in the browser (or in a few specific frontend error conditions), we allow the user to submit the web form with a dummy challenge token. On the backend, we process the dummy token like normal, and if there is an error or timeout checking Turnstile's siteverify endpoint, we fail open.
Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.
Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
Only if they are able to block the siteverify check performed by our backend server. That's not the kind of attack we are trying to mitigate with Turnstile.
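For anyone curious what the backend half of that looks like, here's a rough sketch (not our exact code; it assumes reqwest with the `json` feature, serde, and tokio):

```rust
use std::time::Duration;

#[derive(serde::Deserialize)]
struct SiteverifyResponse {
    success: bool,
}

// Fail-open verification: if the siteverify call errors or times out, allow
// the request and rely on other bot mitigations instead.
async fn turnstile_allows(secret: &str, token: &str) -> bool {
    let client = match reqwest::Client::builder()
        .timeout(Duration::from_secs(3))
        .build()
    {
        Ok(c) => c,
        Err(_) => return true, // fail open
    };

    let result = client
        .post("https://challenges.cloudflare.com/turnstile/v0/siteverify")
        .form(&[("secret", secret), ("response", token)])
        .send()
        .await;

    match result {
        Ok(resp) => match resp.json::<SiteverifyResponse>().await {
            Ok(body) => body.success, // normal path: trust the verdict
            Err(_) => true,           // malformed response: fail open
        },
        Err(_) => true, // timeout / network error: fail open
    }
}
```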
Thanks for the detailed writeup and for explaining the root cause in detail.
However, I have a question from a release deployment process perspective. Why was this issue not detected during internal testing? I didn't find the RCA covering this aspect. Doesn't Cloudflare have an internal test stage as part of its CI/CD pipeline? Looking at the description of the issue, it should have been immediately detected in an internal staging test environment.
People really like to hate on Rust for some reason. This wasn’t a Rust problem, no language would have saved them from this kind of issue. In fact, the compiler would have warned that this was a possible issue.
I get it, don’t pick languages just because they are trendy, but if any company’s use case is a perfect fit for Rust it’s cloudflare.
Yeah even if you handled this situation without unwrap() if you just went down an error path that didn't panic, the service would likely still be inoperable if every single request went down the error path.
The reason why people are criticizing is because Rust evangelicals say stuff like "if it compiles it works" or talk about how Rust's type system is so much better than other languages that it catches logic errors like this. You won't see Go or Java developers making such strong claims about their preferred languages.
Come on now. You can't blame the compiler when the programmer explicitly told the compiler to not worry about it. There is nothing in existence that can protect against something like that.
"Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."
This simply means the exception handling quality of the new FL2 is non-existent and is not on par with, or logically equivalent to, FL.
I hope it was not because of AI driven efficiency gains.
In most domains, silently returning 0 in a case where your logic didn't actually calculate the thing you were trying to calculate is far worse than giving a clear error.
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled the available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.
> As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.
While I heavily frown upon using `unwrap` and `expect` in Rust code and make sure to have Clippy tell me about every single usage of them, I also understand that without them Rust might have been seen as an academic curiosity language.
They are escape hatches. Without those your language would never take off.
But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.
---
Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.
Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.
---
I have advocated for this before, but IMO Rust at this point would benefit greatly from disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines would break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables a stricter mode? Or have that flag just remove all the panicky API _at compile time_, though I believe this might be a Gargantuan effort and is likely never happening (sadly).
In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.
I agree the failure is in testing, but what you can and should do is raise an alert in your APM system before the runtime panic, in the code path that is deemed impossible to hit.
I am not trashing on them, I've made such mistakes in the past, but I do expect more from them is all.
And you will not believe how many alerts I got for the "impossible" errors.
I do agree there was not too much that could have been done, yes. But they should have invested in more visibility and been more thorough. I mean, hobbyist Rust devs seem to do that better.
It was just a bit disappointing for me. As mentioned above, I'd understand and sympathise with many other mistakes but this one stung a bit.
There's certainly a discipline involved here, but it's usually something like guaranteeing all threads are unwind safe (via AssertUnwindSafe) and logging stack traces when your process keeps dying/can't be started after a fixed number of retries. Which would lead you to the culprit immediately.
I'm just pushing back a bit on the idea that unwrap() is unsafe - it's not, and I wouldn't even call it a foot gun. The code did what it was written to do, when it saw the input was garbage it crashed because it couldn't make sense of what to do next. That's a desirable property in reliable systems (of course monitoring that and testing it is what makes it reliable/fixable in the first place).
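Something like this, as a sketch (hypothetical request/response types, not anyone's production code):

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

struct Request;
struct Response {
    status: u16,
}

fn handle(_req: &Request) -> Response {
    // Imagine module code that might panic on a "can't happen" input.
    Response { status: 200 }
}

// Wrap per-request work in catch_unwind so a panic is logged and counted
// instead of taking the whole worker down.
fn handle_safely(req: &Request) -> Response {
    match catch_unwind(AssertUnwindSafe(|| handle(req))) {
        Ok(resp) => resp,
        Err(_) => {
            // Log / increment a "panics that should never happen" counter here;
            // the stack trace itself is captured by a panic hook.
            Response { status: 500 }
        }
    }
}
```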
We don't disagree, my main point was a bit broader and admittedly hijacked the original topic a bit, namely: `unwrap` and `expect` make many Rust devs too comfortable and these two are very tempting mistresses.
Using those should be done in an extremely disciplined manner. I agree that there are many legitimate uses but in the production Rust code I've seen this has rarely been the case. People just want to move on and then forget to circle back and add proper error handling. But yes, in this case that's not quite true. Still, my point that an APM alert should have been raised on the "impossible" code path before panicking, stands.
Oh for sure. I even think there deserve to be lints like "no code path reachable from main() is unwind-unsafe" which is a heavy hammer for many applications (like one-off CLI utils) but absolutely necessary for something like a long-lived daemon or server that's responsible for critical infrastructure.
Say you panic, now you need to have an external system that catches this panic and reports back; and does something meanwhile to recover your system.
If you think about it, it's not really different from handling the bubbled-up error inside of Rust. You don't `?` your results and have your errors go away; they just move up the chain.
> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
> As of 17:06 all systems at Cloudflare were functioning as normal
The real takeaway is that so much functionality depends on a few players. This is a fundamental flaw in design that is getting worse by the year as the winner-take-all winners keep winning. Not saying they didn't earn their wins. But the fact remains: the system is not robust. Then again, so what. It went down for a while. Maybe we shouldn't depend on the internet being "up" all the time.
Makes me wonder which team is responsible for that feature generating query, and if they follow full engineering level QA. It might be deferred to an MLE team that is better than the data scientists but less rigorous than software needs to be.
Cloudflare Access is still experiencing weird issues for us (it’s asking users to SSO login to our public website even though our zone rules - set on a completely different zone - haven’t changed).
I don’t think the infrastructure has been as fully recovered as they think yet…
Interesting technical insight, but I would be curious to hear firsthand accounts from the teams on the ground, particularly regarding how the engineers felt the increasing pressure, frantically refreshing their dashboards, searching for a phantom DDoS, scrolling through code updates...
Why call .unwrap() in a function which returns Result<_,_>?
For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.
Fly writes a lot of Rust, do you allow `unwrap()` in your production environment? At Modal we only allow `expect("...")` and the message should follow the recommended message style[1].
I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.
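Not Modal's or Cloudflare's actual code, but to illustrate the style being referenced: the message states the invariant you believe holds, so a violation reads like a broken precondition rather than a mystery panic:

```rust
use std::collections::HashMap;

// Hypothetical lookup; the expect message documents why the entry must exist.
fn lookup_score(scores: &HashMap<String, u32>, host: &str) -> u32 {
    *scores
        .get(host)
        .expect("host should have been inserted at config load; a missing entry indicates a corrupt feature file")
}
```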
After The Great If-Let Outage Of 2024, we audited all our code for that if-let/rwlock problem, changed a bunch of code, and immediately added a watchdog for deadlocks. The audit had ~no payoff; the watchdog very definitely did.
I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure than an errant `unwrap` wouldn't produce stable failure modes.
But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).
A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.
Sounds like if nothing else, additional attention around (their?) use of unwrap() is still warranted from where you're sitting then though, no? I don't think there's anything wrong with flagging that.
It's one thing to not want to be the one to armchair it, but that doesn't mean that one has to suppress their normal and obvious reactions. You're allowed to think things even if they're kitsch, you too are human, and what's kitsch depends and changes. Applies to everyone else here by extension too.
Fair. I agree that saying "it's the unwrap" and calling it a day is wrong. Recently actually we've done an exercise on our Worker which is "assume the worst kind of panic happens. make the Worker be ok with it".
But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).
Rust has debug asserts for that. Using expect with a comment about why the condition should not/can't ever happen is idiomatic for cases where you never expect an Err.
This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.
Oh absolutely, that's how it would have been treated.
Surely an unwrap_or_default() would have been a much better fit: if fetching features fails, continue processing with an empty set of rules rather than stopping the world.
I am one of those old grey beards (or at least, I got started shipping C code in the 1990s), and I'd leave asserts in prod serverside code given the choice; better that than a totally unpredictable error path.
I don't think "implicitly panicked" is an accurate description since unwrap()'s entire reason for existing is to panic if you unwrap an error condition. If you use unwrap(), you're explicitly opting into the panicking behavior.
I suppose another way to think about it is that Result<T, E> is somewhat analogous to Java's checked exceptions - you can't get the T out unless you say what to do in the case of the E/checked exception. unwrap() in this context is equivalent to wrapping the checked exception in a RuntimeException and throwing that.
Yes, it's meant to be used in test code. If you're sure it can't fail, then use .expect(); that way it shows you made a choice and it wasn't just a dev oversight.
Limits in systems like these are generally good. They mention the reasoning around it explicitly. It just seems like the handling of that limit is what failed and was missed in review.
A lot of outages of late seem to be related to automated config management.
Companies seem to place a lot of trust in configs being pushed automatically, without human review, into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before being pushed to production systems?
Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.
Cloudflare tried to build their own feature store, and got a grade of F.
I wrote a book on feature stores for O'Reilly. The bad query they wrote in ClickHouse could also have been caused by another error: duplicate rows in materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could have the same bug again due to a bug in feature ingestion, and given they hacked together their feature store, it is not beyond the bounds of possibility.
That's interesting.
> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
It's like the issues with HOSTS.TXT needing to be copied among the network of the early internet to allow routing (taking days to download etc) and DNS having to be created to make that propagation less unwieldy.
May I just say that Matthew Prince is the CEO of Cloudflare and a lawyer by training (and a very nice guy overall). The quality of this postmortem is great but the fact that it is from him makes one respect the company even more.
kudos to getting this blog post out so fast, it’s well written and is appreciated.
i’m a little confused on how this was initially confused for an attack though?
is there no internal visibility into where 5xx’s are being thrown? i’m surprised there isn’t some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that over an attack.
also a bit taken aback that .unwrap()’s are ever allowed within such an important context.
1. Cloudflare is in the business of being a lightning rod for large and targeted DoS attacks. A lot of cases are attacks.
2. Attacks that make it through the usual defences make servers run at rates beyond their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify severity of their attack by being computationally expensive, holding a lock, or trigger an error path that restarts a service — like this one.
If the software has a limit on the size of the feature file, then the process that propagates the file should probably validate the size before propagating.
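A sketch of that producer-side check (the limits and function names here are made up, not Cloudflare's):

```rust
const MAX_FEATURE_FILE_BYTES: usize = 1_000_000;
const MAX_FEATURES: usize = 200;

fn validate_feature_file(contents: &str) -> Result<(), String> {
    if contents.len() > MAX_FEATURE_FILE_BYTES {
        return Err(format!(
            "feature file is {} bytes, limit is {}",
            contents.len(),
            MAX_FEATURE_FILE_BYTES
        ));
    }
    let feature_count = contents.lines().count();
    if feature_count > MAX_FEATURES {
        return Err(format!(
            "feature file has {} features, limit is {}",
            feature_count, MAX_FEATURES
        ));
    }
    Ok(())
}

fn publish(contents: &str) -> Result<(), String> {
    // Refuse to push a file the consumers are known to reject.
    validate_feature_file(contents)?;
    // ...push to the distribution queue here...
    Ok(())
}
```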
Just a moment to reflect on how much freaking leverage computers give us today - a single permission change took down half the internet. Truly crazy times.
How many changes to production systems does Cloudflare make throughout a day? Are they a part of any change management process? That would be the first place I would check after a random outage, recent changes.
The Cloudflare outage on November 18, 2025 highlights how critical internet infrastructure dependencies can impact services globally. The post-mortem provides transparency on root causes and recovery, offering valuable lessons in resilience and incident management.
I don't get why that SQL query was even used in the first place. It seems it fetches feature names at runtime instead of using a static hardcoded schema. Considering this decides the schema of a global config, I don't think the dynamicity is a good idea.
Question: customers having issues also couldn't switch their DNS to bypass the service. Why is the control plane impacted along with the data plane here? It seems a lot of users could have preserved business continuity if they could have changed their DNS entries temporarily.
The outage sucked for everyone. The root cause also feels like something they could have caught much earlier in a canary rollout from my reading of this.
All that said, to have an outage report turned around practically the same day, and in this much detail, is quite impressive. Here's to hoping they make their changes from this learning, and we don't see this exact failure mode again.
Wondering why they didn’t disable the bot management temporarily to recover. Websites could have survived temporarily without it compared to the outage itself.
Is dual sourcing CDNs feasible these days? Seems like having the capability to swap between CDN providers is good both from a negotiating perspective and a resiliency one.
Speaking of resiliency, the entire Bot Management module doesn't seem to be a critical part of the system. For example, what happens if that module goes down for an hour? The other parts of the system should still work. So I would rank every module and its role in the system, and design it in a way that when a non-critical module fails, other parts can still function.
Why have a limit on the file size if the thing that happens when you hit the limit is the entire network goes down? Surely not having a limit can't be worse?
>Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.
So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error catching that would fallback and send logs/alerts to their internal monitoring system and this is why half of the internet went down?
Hold up - when I used C or a similar language for accessing a database and wanted to clamp down on memory usage to deterministically control how much I allocate, I would explicitly limit the number of rows in the query.
There never was an unbound "select all rows from some table" without a "fetch first N rows only" or "limit N"
If you knew that this design is rigid, why not leverage the query to actually do it ?
Because nothing forced them to and they didn't think of it. Maybe the people writing the code that did the query knew that the tables they were working with never had more than 60 rows and figured "that's small" so they didn't bother with a limit. Maybe the people who wrote the file size limit thought "60 rows isn't that much data" and made a very small file size limit and didn't coordinate with the first people.
Anyway regardless of which language you use to construct a SQL query, you're not obligated to put in a max rows
I imagine there are numerous ways to protect against it, and protection should've been added by whoever decided on this optimization. In the data layer, create some kind of view that never returns more than 200 rows from the base table(s). In code, use some kind of iterator. I'm not a Rust guy, just a C defensive-practices type of dude, but maybe they just missed a biggie during a code review.
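A defensive-read sketch of that idea (hypothetical row source; a real client would also put a LIMIT in the SQL itself):

```rust
const MAX_FEATURE_ROWS: usize = 200;

// Cap the number of rows consumed from the query regardless of what the
// database returns, and surface the violated assumption as an error.
fn collect_features<I>(rows: I) -> Result<Vec<String>, String>
where
    I: IntoIterator<Item = String>,
{
    let mut features = Vec::with_capacity(MAX_FEATURE_ROWS);
    for row in rows {
        if features.len() == MAX_FEATURE_ROWS {
            return Err(format!(
                "query returned more than {} feature rows",
                MAX_FEATURE_ROWS
            ));
        }
        features.push(row);
    }
    Ok(features)
}
```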
> Given Cloudflare's importance in the Internet ecosystem any outage of any of our systems is unacceptable.
Excuse me, what did you just say? Who decided on “Cloudflare's importance in the Internet ecosystem”? Some see it differently, you know; there's no need for that self-assured arrogance of an inseminating alpha male.
Dear Matthew Prince, don't you think we (the ones affected by your staff's mistake) should get some sort of compensation??? Yours truly, a Cloudflare client who lost money during the November 18th outage.
Would be nice if their Turnstile could be turned off on their login page when something like this happens, so we can attempt to route traffic away from Cloudflare during the outage. Or at least have a simple app where this can be modified from.
> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.
also cloudflare:
> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.
thanks for clarifying! i guess then they never explained why the status page went down, even though it's supposed to be running on independent infrastructure.
Yes, that was missing (along with the London WARP thing). Other comments mentioned that their status page is an Atlassian Statuspage solution, hosted on AWS CloudFront.
Unclear to me if it's an Atlassian-managed deployment they have, or if it's self-managed, I'm not familiar with Statuspage and their website isn't helping. Though if it's managed, I'm not sure how they can know for sure there's no interdependence. (Though I guess we could technically keep that rabbit hole going indefinitely.)
If you deploy a change to your system, and things start to go wrong that same day, the prime suspect (no matter how unlikely it might seem) should be the change you made.
I’ll be honest, I only understand about 30% of what is being said in this thread and that is probably generous. But it is very interesting seeing so many people respond to each other “it’s so simple! what went wrong was…” as they all disagree on what exactly went wrong.
Wow. What a post mortem.
Rather than Monday morning quarterbacking how many ways this could have been prevented, I'd love to hear people sound-off on things that unexpectedly broke. I, for one, did not realize logging in to porkbun to edit DNS settings would become impossible with a cloudflare meltdown
That's unfortunate. I'll need to investigate whether Porkbun plans on decoupling its auth from being reliant on CloudFlare, otherwise I will need to migrate a few domains off of that registrar.
While it's certainly worthwhile to discuss the Technical and Procedural elements that contributed to this Service Outage, the far more important (and mutually-exclusive aspect) to discuss should be:
Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?
Who's "we"? This is not a trick question: what specific people do you think acted wrongly here? I don't use Cloudflare personally. I don't run any of the sites that do use it. The people who did make the decision to put their websites behind Cloudflare could stop, and maybe some will, but presumably they're paying for it because they think, perhaps accurately, that they get value out of it. Should some power compel them not to use Cloudflare?
Cloudflare’s write-up is clear and to the point. A small change spread wider than expected, and they explained where the process failed. It’s a good reminder that reliability depends on strong workflows as much as infrastructure.
tl;dr
A permissions change in a ClickHouse database caused a query to return duplicate rows for a “feature file” used by Cloudflare's Bot Management system, which doubled the file size. That oversized file was propagated to their core proxy machines, triggered an unhandled error in the proxy's bot module (it exceeded its pre-allocated limit), and as a result the network started returning 5xx errors. The issue wasn't a cyber-attack — it was a configuration/automation failure.
Here's a random post from their blog by the same author from 2017 with an em dash:
> As we wrote before, we believe Blackbird Tech's dangerous new model of patent trolling — where they buy patents and then act their own attorneys in cases — may be a violation of the rules of professional ethics.
Unwraps are so very easy to use, and they have bitten me so many times: you can go ages without ever running into a problem, and then suddenly you get crashes from an unwrap that was almost always fine.
> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.
And here is the query they used ** (OK, so it's not exactly):
SELECT * FROM feature JOIN permissions ON feature.feature_type_id = permissions.feature_type_id
someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.
** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using much less queries (but i do have an idea).
more edits: OK apparently it's described later in the post as a query against clickhouse's table metadata table, and because users were granted access to an additional database that was actually the backing store to the one they normally worked with, some row level security type of thing doubled up the rows. Not sure why querying system.columns is part of a production level query though, seems overly dynamic.
Honestly... everyone shit themselves that internet doesn't work, but next week this outage will be forgotten by 99% of population. I was doing something on my PC when I saw clear information that Cloudflare is down, so I decided to just go take a nap, then read a book, then go for a walk. Once I was done, the internet was working again. Panic was not necessary on my side.
What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.
It's funny how everyone seems to be having a meltdown over this. I didn't even notice anything was wrong until I read about it on Reddit 5 hours later, even though I was working all day. Sounds to me like people are too reliant on random websites.
> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
It would only have been caught in staging if there was a similar amount of data in the database. If staging has 2x less data it would never have occurred there. It's not super clear how easy it would have been to keep the staging database exactly like the production database in terms of quantity and similarity of data, etc.
I think it's quite rare for any company to have exactly the same scale and size of storage in staging as in prod.
But now consider how much extra data Cloudflare, at its size, would have to have just for staging, doubling their costs or more to have staging exactly match production. They would have to simulate a similar amount of requests on top of that constantly, since presumably they have 100s or 1000s of deployments per day.
In this case the database table in question seems modest in size (the features for ML), so naively they could at least have kept staging features always in sync with prod; but it could be they didn't consider that 55 rows vs 60 rows, or similar, could be a breaking point given a certain specific bug.
It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.
That just means it takes longer to test. It may not be possible to do it in a reasonable timeframe with the volumes involved, but if you already have 100k servers running to serve 25M requests per second, maybe briefly booting up another 100k isn’t going to be the end of the world?
Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.
> maybe briefly booting up another 100k isn’t going to be the end of the world
Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?
The speed and transparency with which Cloudflare published this post mortem are excellent.
I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests checking that the queries return functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe-results behaviour.
In theory there are a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected not to cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
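To make that concrete, here's a minimal sketch of such a golden-output test in Rust; `run_query` and the file paths are hypothetical stand-ins, not anything Cloudflare has described:

    // Golden-output regression test sketch. `run_query` is a hypothetical helper
    // that executes SQL against a disposable test database seeded with a small
    // curated dataset and returns each row rendered as a string.
    fn run_query(sql: &str) -> Vec<String> {
        unimplemented!("wire this up to a throwaway test database: {sql}")
    }

    fn normalise(mut rows: Vec<String>) -> Vec<String> {
        rows.sort(); // ignore row-order differences
        rows
    }

    #[test]
    fn feature_columns_query_matches_golden_output() {
        let got = normalise(run_query(
            "SELECT name, type FROM system.columns \
             WHERE table = 'http_requests_features' ORDER BY name",
        ));
        let golden = normalise(
            std::fs::read_to_string("testdata/feature_columns.golden")
                .expect("golden file is checked into the repo")
                .lines()
                .map(str::to_owned)
                .collect(),
        );
        assert_eq!(got, golden, "DB change altered the query's results");
    }

The golden file is annoying to maintain, as noted above, but a diff in it is exactly the "this permission change altered query results" signal that was missing here.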
They only recently rewrote their core in Rust (https://blog.cloudflare.com/20-percent-internet-upgrade/) -- given the newness of the system and things like "Over 100 engineers have worked on FL2, and we have over 130 modules", I wouldn't be surprised by further similar incidents.
I can never get used to the error surfacing at the call site rather than within the function where the early return of Err happened. It is not "much cleaner": at the call site you have no idea which line and file caused it. By default, returning an Err should have a way of setting a marker that can then be used to map back to the line() and file(). 10+ years and still no ergonomics.
So they made a newbie mistake in SQL that would not even pass an AI review. They did not verify the change in a test environment. And I guess the logs are so full of errors that it is hard to pinpoint which ones matter. Yikes.
The internet hasn't been the internet in years. It was originally built to withstand wars. The whole idea of our IP-based internet was to reroute packets should networks go down. Decentralisation was the mantra and how it differed from early centralised systems such as AOL et al.
This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.
Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.
Excellent write up. Cybersecurity professionals read the story and learn. It's a textbook lesson in post-mortem incident analysis - an MVP for what is expected from us all in a similar situation.
Reputationally this is extremely embarrassing for Cloudflare, but imo they seem to be getting back on their feet. I was surprised to see not just one, but two apologies to the internet. This just cements how professional and dedicated the Cloudflare team is to ensuring a stable, resilient internet, and how embarrassed they must have been.
A reputational hit for sure, but the outcome is lessons learned and hopefully stronger resilience.
So, an unhandled error condition after a configuration update, similar to CrowdStrike. If only they had used a programming language where this can't happen thanks to its superior type system, such as Rust. Oh wait.
This is where change management really shines: in a change management environment this would have been prevented by a backout procedure, and it would never have been rolled out to production before going into QA, with peer review happening before that... I don't know if they lack change management, but it's definitely something to think about.
I think it's the data rather than the code where this falls short; in a way you need both stringent and better-safeguarded code. It's like if everyone sends you 64 KB posts because that's all your proxy layer lets in: someone checked that sending 128 KB gave an error before reaching your app, then the proxy layer changes, someone sends 128 KB, and your app crashes because it had an assert against anything over 64 KB.
Actually tracking issues with erroneous or oversized data isn't so much about code tests as fuzz testing, brute-force testing, etc., which I think people should do. But that really means you need strong test networks, and those test networks may need to be more internet-like to reflect real issues, so the whole testing infrastructure becomes difficult to get right. They have their own tunneling system, for example; they could segregate some of their servers and build a test system with better error diagnosis.
To my mind, though, better error propagation back to the operators, something that really identified what was happening and where, would help a lot in general. Sure, start doing that on a test network. This is something I've been thinking about: I made a simple RPC system for sending real-time Rust tracing logs back from multiple end servers (it lets you use the normal tracing framework with a thin RPC layer), but that's mostly for granular debugging.
I've never quite understood why systems like systemd-journald aren't more network-centric when they're already big, complex kitchen-sink approaches (apparently there's D-Bus support). What I want is something in between debug-level logging and warning/info: even at 1/20 of the log volume it's too much, but if we could see, as things run, that large files are getting close to limits, and whether that's localised or common, it would help build more resilient systems. Something may already exist along these lines, but I haven't come across anything reasonably passive; there are debugging tools like dtrace that have been around for ages.
28M 500 errors/sec for several hours from a single provider. Must be a new record.
No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
Something like a major telco going out, for example the AT&T 1990 outage of long distance calling:
> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.
> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.
> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.
Yes, all (well, most) eggs should not be in one basket. Perfect opportunity to set up a service that checks Cloudflare and then switches a site's DNS to Akamai as a backup.
Absolute volume, maybe [1]; as a relative % of global digital communication traffic, the era of the early telegraph probably has it beat.
In the pre-digital era, the East India Company dwarfs every other company by considerable margins in any metric: commerce controlled, global shipping, communication traffic, private army size, % of GDP, % of workforce employed.
The default throughout history was the large consolidated organization, like, say, Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.
[1] Although I suspect either AWS's or MS/Azure's recent downtimes in the last couple of years are likely higher
Best post mortem I've read in a while, this thing will be studied for years.
A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust; it should never have made it to production.
This is the first significant outage that has involved Rust code, and as you can see, .unwrap() is known to carry the risk of a panic and should never be used in production code.
I think you should give me a credit for all the income I lost due to this outage. Who authorized a change to the core infrastructure during the period of the year when your customers make the most income? Seriously, this is a management failure at the highest levels of decision-making. We don't make any changes to our server infrastructure/stack during the busiest time of the year, and neither should you. If there were an alternative to Cloudflare, I'd leave your service and move my systems elsewhere.
I think you should get exactly what the contract you signed said you'd get. Outages happen in all infrastructure. Planned and unplanned ones both. The SLA and SLO are literal acknowledgements of that fact, and part of the contract for that reason.
It's fair to be upset at their decision making; use that to renegotiate your contract.
Did some $300k chief of IT blame it all on some overworked secretary clicking a link in an email they should have run through a filter? Because that’s the MO.
> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:
    SELECT
        name,
        type
    FROM system.columns
    WHERE
        table = 'http_requests_features'
    order by name;
Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.
Here is a bit more context in addition to the quote above. A ClickHouse permissions change made a metadata query start returning duplicate column metadata from an extra schema, which more than doubled the size and feature count of a Bot Management configuration file. When this oversized feature file was deployed to edge proxies, it exceeded a 200-feature limit in the bot module, causing that module to panic and the core proxy to return 5xx errors globally.
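Given that chain of events, the generating side could also defensively dedupe and bounds-check before publishing anything. A hedged sketch; the 200-feature limit comes from the post, everything else (names, types) is made up:

    // Producer-side guard sketch (hypothetical names, not Cloudflare's code).
    use std::collections::BTreeSet;

    const MAX_FEATURES: usize = 200; // the limit cited in the post-mortem

    #[derive(Debug)]
    enum PublishError {
        TooManyFeatures(usize),
    }

    /// Dedupe the column names returned by the metadata query and refuse to
    /// publish a feature file that the consumer is known to reject.
    fn build_feature_list(raw_columns: &[(String, String)]) -> Result<Vec<String>, PublishError> {
        // Duplicate rows coming back from an extra schema (the r0 database in
        // the incident) collapse to a single feature name here.
        let unique: BTreeSet<&str> = raw_columns.iter().map(|(name, _ty)| name.as_str()).collect();
        if unique.len() > MAX_FEATURES {
            return Err(PublishError::TooManyFeatures(unique.len()));
        }
        Ok(unique.into_iter().map(str::to_owned).collect())
    }

With either the dedupe or the size check in place, the bad query result fails loudly at generation time instead of panicking inside every proxy.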
- Their database permissions changed unexpectedly (??)
- This caused a 'feature file' to be changed in an unusual way (?!)
- Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
- Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
- They hit an internal application memory limit and that just... crashed the app
- The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
- After fixing it, they were vulnerable to a thundering herd problem
- Customers who were not using bot rules were not affected; CloudFlare's bot-scorer generated a constant bot score of 0, meaning all traffic is bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.
From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
People jump to say things like "where's the rollback" and, like, probably yeah, but keep in mind that speculative rollback features (that is: rollbacks built before you've experienced the real error modes of the system) are themselves sources of sometimes-metastable distributed system failures. None of this is easy.
How about: where's the most basic test to check if your config file will actually run at all in your application? It was a hard-coded memory limit; a git-hook test suite run on a MacBook would have caught this. But nooo, let's not run the app for 0.01 seconds with this config before sending it out to determine the fate of the internet?
This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.
This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.
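For what it's worth, the "run the app against the candidate config for 0.01 seconds" test really is small. A sketch, assuming the bot module exposes some loader; `load_feature_file` and the artifact path are made-up names:

    // Pre-publish smoke test sketch: the exact artifact about to ship is parsed
    // by the exact code that will consume it, before it leaves the building.
    fn load_feature_file(bytes: &[u8]) -> Result<(), String> {
        // Hypothetical stand-in for the real consumer-side loader and its limits.
        if bytes.len() > 1_000_000 {
            return Err(format!("feature file too large: {} bytes", bytes.len()));
        }
        Ok(())
    }

    #[test]
    fn candidate_feature_file_loads() {
        let candidate = std::fs::read("artifacts/candidate-feature-file.bin")
            .expect("the build step should have produced a candidate file");
        load_feature_file(&candidate).expect("consumer rejected the candidate feature file");
    }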
Just before this outage I was exploring BunnyCDN, as the idea of Cloudflare taking over DNS still irks me slightly; there are competitors. But there's a certain amount of scale that Cloudflare offers which I think can help performance in general.
That said, in the past I found Cloudflare performance terrible when I was doing lots of testing. They are predominantly a pull-based system, not a push, so if content isn't current the cache-miss performance can be kind of blah. I think their general backhaul paths have improved, but at least from New Zealand they used to seem to do worse than hitting a Los Angeles proxy that then hits origin. (Google was in a similar position before, where both 8.8.8.8 and www.google.co.nz/.com were faster via Los Angeles than via normal paths; I think Google was using an Asia parent, so if testing 8.8.8.8 on misses it was super far away.)
I think now that we have HTTP/3 etc. that performance is a bit simpler to achieve, and that DDoS and bot protection is kind of the differentiator, and I think Cloudflare's bot protection may work reasonably well in general?
Outages are in a large majority of cases caused by change, either deployments of new versions or configuration changes.
https://how.complexsystems.fail/#18
For London customers this made the impact more severe temporarily.
"This feature file is refreshed every few minutes and published to our entire network and allows us to react to variations in traffic flows across the Internet. It allows us to react to new types of bots and new bot attacks. So it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly."
However, you forgot that the lighting conditions are such that only the red lights from the klaxons are showing, so you really can't differentiate the colors of the wires.
Side thought as we're working on 100% onchain systems (for digital assets security, different goals):
Public chains (e.g. EVMs) can be a tamper‑evident gate that only promotes a new config artifact if (a) a delay or multi‑sig review has elapsed, and (b) a succinct proof shows the artifact satisfies safety invariants like ≤200 features, deduped, schema X, etc.
That could have blocked propagation of the oversized file long before it reached the edge :)
The exact wording (which I can easily find, because a good chunk of the internet gives it to me, because I’m on Indian broadband):
> example.com needs to review the security of your connection before proceeding.
It bothers me how this bald-faced lie of a wording has persisted.
(The “Verify you are human by completing the action below.” / “Verify you are human” checkbox is also pretty false, as ticking the box in no way verifies you are human, but that feels slightly less disingenuous.)
At any other large-ish company, there would be layers of "stakeholders" that would slow this process down. They would almost never allow code to be published.
Damn corporate karma farming is ruthless, only a couple minute SLA before taking ownership of the karma. I guess I'm not built for this big business SLA.
I'm so jealous. I've written postmortems for major incidents at a previous job: a few hours to write, a week of bikeshedding by marketing and communication and tech writers and ... over any single detail in my writing. Sanitizing (hide a part), simplifying (our customers are too dumb to understand), etc, so that the final writing was "true" in the sense that it "was not false", but definitely not what I would call "true and accurate" as an engineer.
0/10, get it right the first time, folks. (/s)
Fantastic for recruiting, too.
I'd consider applying based on this alone
Thanks for the insight.
[0] https://news.ycombinator.com/item?id=45588305
> Spent some time after we got things under control talking to customers. Then went home.
What did sama / Fidji say? ;) Turnstile couldn't have been worth that.
I'm sure that's not your intent, so I hope my comment gives you an opportunity to reflect on the effects of syndicating such stupidity, no matter what platform it comes from.
The legislation that MTG (Marjorie Taylor Green) just proposed a few days ago to ban H1B entirely, and the calls to ban other visa types, is going to have a big negative impact on the tech industry and American innovation in general. The social media stupidity is online but it gives momentum to the actual real life legislation and other actions the administration might take. Many congress people are seeing the online sentiment and changing their positions in response, unfortunately.
Posts like that deserve to be flagged if the sum of their substance is jingoist musing & ogling dumb people on Twitter.
How about:
1. The permissions change project is paused or rolled back until
2. All impacted database interactions (SQL queries) are evaluated for improper assumptions, or better,
3. Their design that depends on database metainfo and schema is replaced with one that uses specific tables and rows in tables instead of using the meta info as part of their application.
4. All hard-coded limits are centralized in a single global module and referenced from their users, and then back-propagated to any separate generator processes that validate against the limit before pushing generated changes.
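Point 4 is particularly cheap to do. A sketch of the shared-limit idea, with an entirely hypothetical module layout:

    // One module owns the number; producer and consumer both import it,
    // so the two can never silently drift apart. (Hypothetical layout.)
    mod limits {
        pub const MAX_BOT_FEATURES: usize = 200;
    }

    mod generator {
        // Validate against the shared constant before publishing.
        pub fn validate(feature_count: usize) -> bool {
            feature_count <= super::limits::MAX_BOT_FEATURES
        }
    }

    mod proxy {
        // Consumer side preallocates against the same constant.
        pub fn preallocate() -> Vec<f32> {
            Vec::with_capacity(super::limits::MAX_BOT_FEATURES)
        }
    }

    fn main() {
        assert!(generator::validate(180));
        assert!(!generator::validate(400)); // the incident's shape: > 200 features
        let _scores = proxy::preallocate();
    }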
The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.
The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.
The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.
But the architectural assumption that the bot file build logic can safely obtain this operationally critical list of features from derivative database metadata vs. a SSOT seems like a bigger problem to me.
So basically bad config should be explicitly processed and handled by rolling back to known working config.
I think that's explicitly a non-goal. My understanding is that Cloudflare prefers fail safe (blocking legitimate traffic) over fail open (allowing harmful traffic).
Crashing is not an outage. It’s a restart and a stack trace for you to fix.
Are you in the right thread?
The problem was a query producing incorrect data. The crash helped them find it.
What do you think happens when a program crashes?
But you’re still missing it. Crashing is not bad. It’s good. It’s how you leverage OS level security and reliability.
In fact I'd argue that crashing is bad. It means you failed to properly enumerate and express your invariants, hit an unanticipated state, and thus had to fail in a way that requires you to give up and fall back on the OS to clean up your process state.
[edit]
Sigh, HN and its "you're posting too much". Here's my reply:
> Why? The end user result is a safe restart and the developer fixes the error.
Look at the thread you're commenting on. The end result was a massive worldwide outage.
> That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?
Because you don't have to crash at all.
> We don’t want to enumerate all possible paths. We want to limit them.
That's the exact same thing. Anything not "limited" is a possible path.
> If my program requires a config file to run, crash as soon as it can’t load the config file. There is nothing useful I can do (assuming that’s true).
Of course there's something useful you can do. In this particular case, the useful thing to do would have been to fall back on the previous valid configuration. And if that failed, the useful thing to do would be to log an informative, useful error so that nobody has to spend four hours during a worldwide outage to figure out what was going wrong.
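To be concrete about how little code that takes, a hedged sketch with made-up types (not Cloudflare's actual proxy):

    // Keep the last config that parsed; refuse to replace it with one that doesn't.
    struct Config {
        features: Vec<String>,
    }

    fn parse(bytes: &[u8]) -> Result<Config, String> {
        // Stand-in for the real feature-file parser and its limits.
        let text = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
        let features: Vec<String> = text.lines().map(str::to_owned).collect();
        if features.len() > 200 {
            return Err(format!("too many features: {}", features.len()));
        }
        Ok(Config { features })
    }

    struct BotModule {
        active: Config,
    }

    impl BotModule {
        /// On a bad update: log loudly and keep serving with the previous config.
        fn apply_update(&mut self, bytes: &[u8]) {
            match parse(bytes) {
                Ok(cfg) => {
                    eprintln!("applied feature file with {} features", cfg.features.len());
                    self.active = cfg;
                }
                Err(e) => eprintln!("rejected feature file, keeping previous config: {e}"),
            }
        }
    }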
The world wide outage was actually caused by deploying several incorrect programs in an incorrect system.
The root one was actually a bad query as outlined in the article.
Let’s get philosophical for a second. Programs WILL be written incorrectly - you will deploy to production something that can’t possibly work. What should you do with a program that can’t work? Pretend this can’t happen? Or let you know so you can fix it?
Why? The end user result is a safe restart and the developer fixes the error.
> fall back on the OS to clean up your process state.
That’s what it’s there for. Why is it bad to use its reliable error detection and recovery mechanism?
> It means you failed to properly enumerate and express your invariants
We don’t want to enumerate all possible paths. We want to prune them.
If my program requires auth info to run, crash as soon as it can’t load it. There is nothing useful I can do (assuming that’s true).
I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.
1. Their bot management system is designed to push a configuration out to their entire network rapidly. This is necessary so they can rapidly respond to attacks, but it creates risk as compared to systems that roll out changes gradually.
2. Despite the elevated risk of system wide rapid config propagation, it took them 2 hours to identify the config as the proximate cause, and another hour to roll it back.
SOP for stuff breaking is you roll back to a known good state. If you roll out gradually and your canaries break, you have a clear signal to roll back. Here was a special case where they needed their system to rapidly propagate changes everywhere, which is a huge risk, but didn’t quite have the visibility and rapid rollback capability in place to match that risk.
While it’s certainly useful to examine the root cause in the code, you’re never going to have defect free code. Reliability isn’t just about avoiding bugs. It’s about understanding how to give yourself clear visibility into the relationship between changes and behavior and the rollback capability to quickly revert to a known good state.
Cloudflare has done an amazing job with availability for many years and their Rust code now powers 20% of internet traffic. Truly a great team.
How can you write the proxy without handling a config containing more than the maximum feature limit you set yourself?
How can the database export query not have a limit set if there is a hard limit on the number of features?
Why do they do non-critical changes in production before testing in a stage environment?
Why did they think this was a cyberattack and only after two hours realize it was the config file?
Why are they that afraid of a botnet? This does not leave me confident that they will handle the next Aisuru attack.
I'm migrating my customers off Cloudflare. I don't think they can swallow the next botnet attacks, and everyone on Cloudflare will go down with the ship, so it will be safer not to be behind Cloudflare when it hits.
Yet you fail to acknowledge that the remaining 99.99999% of the logic that powers Cloudflare works flawlessly.
Also, hindsight is 20/20
Isn’t getting cyberattacked their core business?
That's often the case with human error as especially aviation safety experts know: https://en.wikipedia.org/wiki/Swiss_cheese_model
Any big and noticeable incident is one of the "we failed on so many levels here" kind, by definition.
I guess the noncritical change here was the change to the database? My experience has been that a lot of teams do a poor job of keeping a faithful replica of their databases in stage environments to expose this type of issue.
Permissions stuff might be caught without a completely faithful replica, but there are always going to be attributes of the system that only exist in prod.
But the case for Cloudflare here is complicated. Every engineer is very free to make a better system though.
Cloudflare builds a global scale system, not an iphone app. Please act like it.
There will always be bugs in code, even simple code, and sometimes those things don't get caught before they cause significant trouble.
The failing here was not having a quick rollback option, or having it and not hitting the button soon enough (even if they thought the problem was probably something else, I think my paranoia about my own code quality is such that I would have been rolling back much sooner just in case I was wrong about the “something else”).
Every system has a non-reducible risk and no data rollback is trivial, especially for a CDN.
It goes over my head why Cloudflare is HN's darling while others like Google, Microsoft and AWS don't usually enjoy the same treatment.
Do the others you mentioned provide such detailed outage reports, within 24 hours of an incident? I’ve never seen others share the actual code that related to the incident.
Or the CEO or CTO replying to comments here?
>Press Release
This is not press release, they always did these outage posts from the start of the company.
https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
Azure (albeit pretty old): https://devblogs.microsoft.com/devopsservice/?p=17665
AWS: https://aws.amazon.com/message/101925/
GCP: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...
The code sample might as well be COBOL for people not familiar with Rust and its error handling semantics.
> Or the CEO or CTO replying to comments here?
I've looked around the thread and I haven't seen the CTO here nor the CEO, probably I'm not familiar with their usernames and that's on me.
> This is not press release, they always did these outage posts from the start of the company.
My mistake calling them press releases. Newspapers and online publications also skim this outage report to inform their news stories.
I wasn't clear enough in my previous comment. I'd like all major players in internet and web infrastructure to be held to higher standards. As it stands, when you compare them to the tech department of a retail store, the retail store must answer to more laws once the surface area of their combined activities is taken into account.
Yes, Cloudflare excels where others don't or barely bother and I too enjoyed the pretty graphs, diagrams and I've learned some nifty Rust tricks.
EDIT: I've removed some unwarranted snark from my comment which I apologize for.
They explain that at some length in TFA.
Is that an overreaction?
Name me global, redundant systems that have not (yet) failed.
And if you used Cloudflare to protect against botnets and now go off Cloudflare... you are vulnerable and may experience more downtime if you cannot swallow the traffic.
I mean, no service has 100% uptime; it's just that some have more nines than others.
I do like the flat cost of Cloudflare and feature set better but they have quite a few outages compared to other large vendors--especially with Access (their zero trust product)
I'd lump them into GitHub levels of reliability
We had a comparable but slightly higher quote from an Akamai VAR.
What would some good examples of those be? I think something like Anubis is mostly against bot scraping, not sure how you'd mitigate a DDoS attack well with self-hosted infra if you don't have a lot of resources?
On that note, what would be a good self-hosted WAF? I recall using mod_security with Apache and the OWASP ruleset, apparently the Nginx version worked a bit slower (e.g. https://www.litespeedtech.com/benchmarks/modsecurity-apache-... ), there was also the Coraza project but I haven't heard much about it https://coraza.io/ or maybe the people who say that running a WAF isn't strictly necessary also have a point (depending on the particular attack surface).
Genuine questions.
There is haproxy-protection, which I believe is the basis of Kiwiflare. Clients making new connections have to solve a proof-of-work challenge that takes about 3 seconds of compute time.
Enterprise: https://www.haproxy.com/solutions/ddos-protection-and-rate-l...
FOSS: https://gitgud.io/fatchan/haproxy-protection
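For anyone curious what that looks like mechanically, here's a toy hashcash-style illustration (not haproxy-protection's actual scheme), using the sha2 crate:

    // Toy proof of work: a solution is valid if SHA-256(challenge || nonce)
    // starts with `difficulty` zero bytes. Illustrative only; requires the
    // `sha2` crate and is not what haproxy-protection actually implements.
    use sha2::{Digest, Sha256};

    fn is_valid(challenge: &[u8], nonce: u64, difficulty: usize) -> bool {
        let mut hasher = Sha256::new();
        hasher.update(challenge);
        hasher.update(nonce.to_le_bytes());
        hasher.finalize().iter().take(difficulty).all(|&b| b == 0)
    }

    /// What the client burns a few seconds of CPU on before being let in.
    fn solve(challenge: &[u8], difficulty: usize) -> u64 {
        (0u64..)
            .find(|&n| is_valid(challenge, n, difficulty))
            .expect("a solution exists somewhere in u64 space")
    }

    fn main() {
        let challenge = b"per-connection-random-token";
        let nonce = solve(challenge, 2); // ~65k hashes on average at 2 zero bytes
        assert!(is_valid(challenge, nonce, 2));
        println!("solved with nonce {nonce}");
    }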
Whatever you do, unless you have their bandwidth capacity, at some point those "self-hosted" setups will get flooded with traffic.
The fact that Cloudflare can literally read every bit of communication (as it sits between the client and your server) is already plenty bad. And yet we accept this more easily than a bit of downtime. We shall not ask about the prices for that service ;)
To me it's nothing more than the whole "everybody on the cloud" issue, where most do not need the resources that cloud companies like AWS provide (and the bill), and yet get totally tied down to this one service.
I am getting old lol ...
What is the cost of many-9s uptime from Cloudflare? For DDoS protection it is $0/month on their free tier:
* https://www.cloudflare.com/en-ca/plans/
How do they magically manage a DDoS larger than their bandwidth?
If the plan is to have more bandwidth than any DDoS, it is going to get expensive, quickly.
If you're just renting servers instead, you have a few options that are effectively closer to a 1% commit, but better have a plan B for when your upstreams drop you if the incoming attack traffic starts disrupting other customers - see Neoprotect having to shut down their service last month.
But at the same time, what value do they add if they:
* Took down the customers' sites due to their bug.
* Never protected against an attack that our infra could not have handled by itself.
* Don't think that they will be able to handle the "next big ddos" attack.
It's just an extra layer of complexity for us. I'm sure there are attacks that they could help our customers with; that's why we're using them in the first place. But until the customers are hit with multiple DDoS attacks that we cannot handle ourselves, it's just not worth it.
That is always a risk with using a 3rd party service, or even adding extra locally managed moving parts. We use them in DayJob, and despite this huge issue and the number of much smaller ones we've experienced over the last few years their reliability has been pretty darn good (at least as good as the Azure infrastructure we have their services sat in front of).
> • Never protected against an attack that our infra could not have handled by itself.
But what about the next one… Obviously this is a question sensitive to many factors in our risk profiles and attitudes to that risk; there is no one right answer to the “but is it worth it?” question here.
On a slightly facetious point: if something malicious does happen to your infrastructure, that it does not cope well with, you won't have the “everyone else is down too” shield :) [only slightly facetious because while some of our clients are asking for a full report including justification for continued use of CF and any other 3rd parties, which is their right both morally and as written in our contracts, most, especially those who had locally managed services affected, have taken the “yeah, half our other stuff was affected too, what can you do?” viewpoint].
> • Don't think that they will be able to handle the "next big ddos" attack.
It is a war of attrition. At some point a new technique, or just a new botnet significantly larger than those seen before, will come along that they might not be able to deflect quickly. I'd be concerned if they were conceited enough not to be concerned about that possibility. Any new player is likely to practise on smaller targets first before directly attacking CF (in fact I assume that it is rather rare that CF is attacked directly) or a large enough segment of their clients to cause them specific issues. Could your infrastructure do any better if you happen to be chosen as one of those earlier targets?
Again, I don't know your risk profile so I can't say which is the right answer, if there even is an easy one other than “not thinking about it at all” being a truly wrong answer. Also, DDoS protection is not the only service many use CF for, so those need to be considered too if you aren't using them for that one thing.
"To make error is human. To propagate error to all server in automatic way is #devops"
This saying dates back to 1969: To err is human but to really foul things up requires a computer.
* https://quoteinvestigator.com/2010/12/07/foul-computer/
Also: I know there’s a proverb which says ‘To err is human,’ but a human error is nothing to what a computer can do if it tries.
* https://quoteinvestigator.com/2017/05/26/computer-error/
I don’t really buy this requirement. At least make it configurable with a more reasonable default for “routine” changes. E.g. ramping to 100% over 1 hour.
As long as that ramp rate is configurable, you can retain the ability to respond fast to attacks by setting the ramp time to a few seconds if you truly think it’s needed in that moment.
https://developers.cloudflare.com/bots/get-started/bot-manag...
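The configurable ramp suggested above is also a small amount of code; a toy sketch (deliberately ignoring coordination, persistence, and the emergency override path):

    // A machine picks up the new artifact once its stable hash bucket falls
    // under the current rollout fraction; the ramp time is the only knob.
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    use std::time::Duration;

    fn rollout_fraction(elapsed: Duration, ramp: Duration) -> f64 {
        (elapsed.as_secs_f64() / ramp.as_secs_f64()).clamp(0.0, 1.0)
    }

    fn machine_bucket(machine_id: &str) -> f64 {
        let mut h = DefaultHasher::new();
        machine_id.hash(&mut h);
        (h.finish() % 10_000) as f64 / 10_000.0
    }

    fn should_apply(machine_id: &str, elapsed: Duration, ramp: Duration) -> bool {
        machine_bucket(machine_id) < rollout_fraction(elapsed, ramp)
    }

    fn main() {
        // Routine change: ramp over an hour. Emergency: set ramp to a few seconds.
        let ramp = Duration::from_secs(3600);
        println!("{}", should_apply("edge-lhr-042", Duration::from_secs(900), ramp));
    }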
> Unrelated to this incident, we were and are currently migrating our customer traffic to a new version of our proxy service, internally known as FL2. Both versions were affected by the issue, although the impact observed was different.
> Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero. Customers that had rules deployed to block bots would have seen large numbers of false positives. Customers who were not using our bot score in their rules did not see any impact.
Of course, this is all so easy to say after the fact..
> In the internal incident chat room, we were concerned that this might be the continuation of the recent spate of high volume Aisuru DDoS attacks:
Let those who have never written a bug before cast the first stone.
Reminds me of House of Dynamite, the movie about nuclear apocalypse that really revolves around these very human factors. This outage is a perfect example of why relying on anything humans have built is risky, which includes the entire nuclear apparatus. “I don’t understand why X wasn’t built in such a way that wouldn’t mean we live in an underground bunker now” is the sentence that comes to mind.
The new config file was not (AIUI) invalid (syntax-wise) but rather too big:
> […] That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
If this was a routine config change, I could see how it could take 2 hours to start the remediation plan. However, they should have dashboards that correlate config setting changes with 500 errors (or equivalent). It gets difficult when you have many of these going out at the same time and they are slowly rolled out.
The root cause document is mostly high level and for the public. The details on this specific outage will be in an internal document with many action items, some of them maybe quarter-long projects, including fixing this specific bug and maybe some linter/monitor to prevent it from happening again.
That, and why the hell wasn't their alerting showing the colossal number of panics in their bot manager thing?
This is also a pretty good example why having stack traces by default is great. That error could have been immediately understood just from a stack trace and a basic exception message.
In fact, the root bug (faulty assumption?) was in one or more SQL catalog queries that were presumably written some time ago.
(Interestingly the analysis doesn’t go into how these erroneous queries made it into production OR whether the assumption was “to spec” and it’s the security principal change work that was faulty. Seems more likely to be the former.)
I'm sure that there are misapplied guidelines to do that instead of being nice to incoming bot management configuration files, and someone might have been scolded (or worse) for proposing or attempting to handle them more safely.
In a productive way, this view also shifts the focus to improving the system (visibility etc.) and empowering the team, rather than focusing on the code which broke (which probably strikes fear into individuals about doing anything!)
If every time there's a new bot someone needs to write code that can blow up their whole service, maybe they need to iterate a bit on this design?
This is the danger of automated control systems. If they get hacked or somehow push out bad things (CrowdStrike), they will have complete control and be very efficient.
Both are important, and I am pretty sure that someone is gonna fix that line of code pretty soon.
And yes, there is a lint you can use against slicing ('indexing_slicing') and it's absolutely wild that it's not on by default in clippy.
This is sobering.
My new fear is some dependency unwrap()ing or expect()ing something where they didn't prove the correctness.
Unwrap() and expect() are an anti-pattern and have no place in idiomatic Rust code. The language should move to deprecate them.
Perhaps it needs a scarier name, like "assume_ok".
The handler could log the error and then panic. Much better than chasing bad hunches about a DDoS.
This lets me do logging at minimum. Sometimes I can gracefully degrade. I try to be as elegant in failure as possible, but not to the point where I wouldn't be able to detect errors or would enter a bad state.
That said, I am totally fine with your use case in your application. You're probably making sane choices for your problem. It should be on each organization to decide what the appropriate level of granularity is for each solution.
My worry is that this runtime panic behavior has unwittingly seeped into library code that is beyond our ability and scope to observe. Or that an organization sets a policy, but that the tools don't allow for rigid enforcement.
As the user, I can't tell the difference, but it might have sped up their recovery a bit.
I imagine it would also require less time debugging a panic. That kind of breadcrumb trail in your logs is a gift to the future engineer and also customers who see a shorter period of downtime.
Once every 5m is not "rapidly". It isn't uncommon for configuration systems to do it every few seconds [0].
> While it’s certainly useful to examine the root cause in the code.
I believe the issue is as much that the output of a periodic run (a ClickHouse query) was altered by an on-the-surface unrelated change, causing this failure. That is, the system that validated the configuration (FL2) was different from the one that generated it (the ML Bot Management DB).
Ideally, either the system that vends a complex configuration also vends & tests the library to consume it, or the system that consumes it does so as if it were "tasting" the configuration first before devouring it unconditionally [1].
Of course, as with all distributed system failures, this is all easy to say in hindsight.
[0] Avoiding overload in distributed systems by putting the smaller service in control (pg 4), https://d1.awsstatic.com/builderslibrary/pdfs/Avoiding%20ove...
[1] Lessons from CloudFront (2016), https://youtube.com/watch?v=n8qQGLJeUYA&t=1050
Isn't "rapidly" more about how long it takes to get from A to Z rather than how often it is performed? You can push out a configuration update every fortnight, but if it goes through all of your global servers in three seconds, I'd call it quite rapid.
It's not necessarily invalid to use unwrap in production code if you would just call panic anyway. But just like every unsafe block needs a SAFETY comment, every unwrap in production code needs an INFALLIBILITY comment. clippy::unwrap_used can enforce this.
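Concretely, that combination looks something like this; the lint names are real clippy lints, the function is just an illustration:

    // Crate root: turn unchecked unwraps into a compile-time conversation.
    #![deny(clippy::unwrap_used, clippy::expect_used)]

    #[allow(clippy::unwrap_used)]
    fn first_char(s: &str) -> char {
        // INFALLIBILITY: callers only pass identifiers already validated as
        // non-empty at parse time, so `next()` cannot return None here.
        s.chars().next().unwrap()
    }

    fn main() {
        assert_eq!(first_char("bot_score"), 'b');
    }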
And because it gets picked up by LLMs. It would be interesting to know if this particular .unwrap() was written by a human.
In theory, experienced human code reviewers can course-correct newer LLM-guided devs' work before it blows up. In practice, reviewers are already stretched thin, and submitters' newfound ability to rapidly generate more and more code to review makes that exhaustion effect way worse. It becomes less likely they spot something small but obvious amongst the haystack of LLM-generated code coming their way.
That gives you the same behavior as unwrap with a less useful error message though. In theory you can write useful messages, but in practice (and in your example) expect is rarely better than unwrap in modern Rust.
This is Rust's Null Pointer Exception.
unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.
I don't want some library I use to `unwrap()` and cause my application to crash because I didn't anticipate their stupid panic.
Rust developers have clearly leaned on this crutch far too often:
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
The Rust team needs to plug this leak.
My blog on this topic was linked above, you should read it: https://burntsushi.net/unwrap/
> The language should grow the ability to mark this code as dangerous, and we should have static tools to exclude this code from our dependency tree.
Might be useful to point out that this static tool exists (clippy::unwrap_used).
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
That's factually incorrect. (And insulting.)
Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem. (In particular they were not criticizing you.)
I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
^ https://burntsushi.net/unwrap/#what-is-my-position
I fully agree with burntsushi that echelon is taking an extreme and arguably wrong stance. His sentiment becomes more and more correct as Rust continues to evolve ways to avoid unwrap as an ergonomic shortcut, but I don't think we are quite there yet for general use. There absolutely is code that should never panic, but that involves tradeoffs and design choices that aren't true for every project (or even the majority of them)
> We shouldn't be using unwrap() or expect() at all.
So the context of their comment is not some specific nuanced example. They made a blanket statement.
> Note that they're not criticizing the language. I read "Rust developers" in this context as developers using Rust, not those who develop the language and ecosystem.
I have the same interpretation.
> I think it's reasonable to question the use of unwrap() in this context. Taking a cue from your blog post^ under runtime invariant violations, I don't think this use matches any of your cases. They assumed the size of a config file is small, it wasn't, so the internet crashed.
Yes? I didn't say it wasn't reasonable to question the use of unwrap() here. I don't think we really have enough information to know whether it was inappropriate or not.
unwrap() is all about nuance. I hope my blog post conveyed that. Because unwrap() is a manifestation of an assertion on a runtime invariant. A runtime invariant can be arbitrarily complicated. So saying things like, "we shouldn't be using unwrap() or expect() at all" is an extreme position to carve out that is also way too generalized.
I stand by what I said. They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
That is their opinion, I disagree with it, but I don't think it's an insulting or invalid opinion to have. There are codebases that ban nulls in other languages too.
> They are factually mistaken in their characterization of the use of unwrap()/expect() in general.
It's an opinion about a stylistic choice. I don't see what fact there is here that could be mistaken.
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features.
The factually incorrect part of this is the statement that use of `unwrap()`, `expect()` and so on is caused by X or Y, where X is "lazy Rust developers" and Y is "Rust developers not utilizing the language's design features." But there are, factually, other causes than X or Y for use of `unwrap()`, `expect()` and so on. So stating that it is all caused by X or Y is factually incorrect. Moreover, X is 100% insulting when applied to any one specific individual. Y can be insulting when applied to any one specific individual.
Now this:
> We shouldn't be using unwrap() or expect() at all.
That's an opinion. It isn't factually incorrect. And it isn't insulting.
> unwrap(), expect(), bad math, etc. - this is all caused by lazy Rust developers or Rust developers not utilizing the language's design features
I just read that line as shorthand for large outages caused by misuse of unwrap(), expect(), bad math etc. - all caused by...
That's also an opinion, by my reading.
I assumed we were talking specifically about misuses, not all uses of unwrap(), or all bad bugs. Anyway, I think we're ultimately saying the same thing. It's ironic in its own way.
Change your API boundary, surface the discrepancy between your requirements and the potential failing case at the edges where it can be handled.
If you need the value, you need to handle the case that it’s not available explicitly. You need to define your error path(s)
Anything else leads to, well, this.
Your argument also implies that things like `slice[i]` are never okay.
The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …”
Enough said.
> The blog post doesn’t address the issue, it simply pretends it’s not a real problem.
It very explicitly addresses it! It even gives real examples.
> Also from the post: “If we were to steelman advocates in favor of this style of coding, then I think the argument is probably best limited to certain high reliability domains. I personally don’t have a ton of experience in said domains …” > > Enough said.
Ad hominem... I don't have experience working on, e.g., medical devices upon which someone's life depends. So the point of that sentence is to say, "yes, I acknowledge this advice may not apply there." You also cherry picked that quote and left off the context, which is relevant here.
And note that you said:
> I have to disagree that unwrap is ever OK.
That's an extreme position. It isn't caveated to only apply to certain contexts.
This is a failure caused by lazy Rust programming and not relying on the language's design features.
It's a shame this code can even be written. It is surprising and escapes the expected safety of the language.
I'm terrified of some dependency using unwrap() or expect() and crashing for something entirely outside of my control.
We should have an opt-in strict Cargo.toml declaration that forbids compilation of any crate that uses entirely preventable panics. The only panics I'll accept are those relating to memory allocation.
This is one of the sharpest edges in the language, and it needs to be smoothed away.
The problem starts with Rust stdlib. It panics on allocation failure. You expect Rust programmers to look at stdlib and not imitate it?
Sure, you can try to taboo unwrap(), but 1) it won't work, and 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
The real solution is to go back in time, bonk the Rust designers over the head with a cluebat, and have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Of course it will. I've built enormous systems, including an entire compiler, without once relying on the local language equivalent of `.unwrap()`.
> 2) it'll contort program design in places where failure really is a logic bug, not a runtime failure, and for which unwrap() is actually appropriate.
That's a failure to model invariants in your API correctly.
> ... have them ship a language that makes error propagation the default and syntactically marks infallible cleanup paths --- like C++ with noexcept.
Unchecked exceptions aren't a solution. They're a way to avoid taking the thought, time, and effort to model failure paths, and instead leave that inherent unaddressed complexity until a runtime failure surprises users. Like just happened to Cloudflare.
IMO making unwrap a clippy lint (or perhaps a warning) would be a decent start. Or maybe renaming unwrap.
A tenet of systems code is that every possible error must be handled explicitly and exhaustively close to the point of occurrence. It doesn’t matter if it is Rust, C, etc. Knowing how to write systems code is unrelated to knowing a systems language. Rust is a systems language but most people coming into Rust have no systems code experience and are “holding it wrong”. It has been a recurring theme I’ve seen with Rust development in a systems context.
C is pretty broken as a language but one of the things going for it is that it has a strong systems code culture surrounding it that remembers e.g. why we do all of this extra error handling work. Rust really needs systems code practice to be more strongly visible in the culture around the language.
How about indexing into a slice/map/vec? Should every `foo[i]` have an infallibility comment? Because they're essentially `get(i).unwrap()`.
* Graph/tree traversal functions that take a visitor function as a parameter
* Binary search on sorted arrays
* Binary heap operations
* Probing buckets in open-addressed hash tables
The smoltcp crate typically uses runtime checks to ensure slice accesses made by the library do not cause a panic. It's not exactly equivalent to GP's assertion, since it doesn't cover "every single slice access", but it at least covers slice accesses triggered by the library's public API. (i.e. none of the public API functions should cause a panic, assuming that the runtime validation after the most recent mutation succeeds).
Example: https://docs.rs/smoltcp/latest/src/smoltcp/wire/ipv4.rs.html...
Could you share some more details, maybe one fully concrete scenario? There are lots of techniques, but there's no one-size-fits-all solution.
The developer was lazy.
A lot of Rust developers are: https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
The details depend a lot on what you're doing and how you're doing it. Does the graph grow? Shrink? Do you have more than one? Do you care about programmer error types other than panic/UB?
Suppose, e.g., that your graph doesn't change sizes, you only have one, and you only care about panics/UB. Then you can get away with:
1. A dedicated index type, unique to that graph (shadow / strong-typedef / wrap / whatever), corresponding to whichever index type you're natively using to index nodes.
2. Some mechanism for generating such indices. E.g., during graph population phase you have a method which returns the next custom index or None if none exist. You generated the IR with those custom indexes, so you know (assuming that one critical function is correct) that they're able to appropriately index anywhere in your graph.
3. You have some unsafe code somewhere which blindly trusts those indices when you start actually indexing into your array(s) of node information. However, since the very existence of such an index is proof that you're allowed to access the data, that access is safe.
Techniques vary from language to language and depending on your exact goals. GhostCell [0] in Rust is one way of relegating literally all of the unsafe code to a well-vetted library, and it uses tagged types (via lifetimes), so you can also do away with the "only one graph" limitation. It's been awhile since I've looked at it, but resizes might also be safe pretty trivially (or might not be).
The general principle though is to structure your problem in such a way that a very small amount of code (so that you can more easily prove it correct) can provide promises that are enforceable purely via the type system (so that if the critical code is correct then so is everything else).
That's trivial by itself (e.g., just rely on option-returning .get operators), so the rest of the trick is to find a cheap place in your code which can provide stronger guarantees. For many problems, initialization is the perfect place (e.g., you can bounds-check on init and then not worry about it again) (e.g., if even bounds-checking on initialization is too slow then you can still use the opportunity at initialization to write out a proof of why some invariant holds and then blindly/unsafely assert it to be true, but you then immediately pack that hard-won information into a dedicated type so that the only place you ever have to think about it is on initialization).
[0] https://plv.mpi-sws.org/rustbelt/ghostcell/
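A minimal sketch of the index-type idea under those assumptions (one graph, fixed size; all names here are made up): the only way to obtain a `NodeId` is from the graph itself, so holding one is the proof that the index is valid.

```rust
// Keep the inner index private to this module so outside code can't forge one.
pub struct NodeId(usize);

pub struct Graph {
    nodes: Vec<String>, // node payloads; representation is just for illustration
}

impl Graph {
    pub fn ids(&self) -> impl Iterator<Item = NodeId> + '_ {
        // Every NodeId handed out comes from 0..len, and the graph never resizes.
        (0..self.nodes.len()).map(NodeId)
    }

    pub fn node(&self, id: &NodeId) -> &str {
        // No unwrap needed: the existence of `id` is the bounds proof.
        // (With resizing or multiple graphs this argument no longer holds.)
        &self.nodes[id.0]
    }
}
```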
For the 5% of cases that are too complex for standard iterators? I never bother justifying why my indexes are correct, but I don't see why not.
You very rarely need SAFETY comments in Rust because almost all the code you write is safe in the first place. The language also gives you the tool to avoid manual iteration (not just for safety, but because it lets the compiler eliminate bounds checks), so it would actually be quite viable to write these comments, since you only need them when you're doing something unusual.
So: first, identify code that cannot be allowed to panic. Within that code, yes, in the rare case that you use [i], you need to at least try to justify why you think it'll be in bounds. But it would be better not to.
There are a couple of attempts at getting the compiler to prove that code can't panic (e.g., the no-panic crate).
Unless the language addresses no-panic in its governing design or allows try-catch, not sure how you go about this.
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
This is ridiculous. We're probably going to start seeing more of these. This was just the first, big highly visible instance.
We should have a name for this similar to "my code just NPE'd". I suggest "unwrapped", as in, "My Rust app just unwrapped a present."
I think we should start advocating for the deprecation and eventual removal of the unwrap/expect family of methods. There's no reason engineers shouldn't be handling Options and Results gracefully, either passing the state to the caller or turning to a success or fail path. Not doing this is just laziness.
I want to ban crates that panic from my dependency chain.
The language could really use an extra set of static guarantees around this. I would opt in.
Which means banning anything that allocates memory and thousands of stdlib functions/methods.
I'm fine with allocation failures. I don't want stupid unwrap()s, improper slice access, or other stupid and totally preventable behavior.
There are things inside the engineer's control. I want that to not panic.
I don't want dependencies deciding to unwrap() or expect() some bullshit and that causing my entire program to crash because I didn't anticipate or handle the panic.
Code should be written, to the largest extent possible, to mitigate errors using Result<>. This is just laziness.
I want checks in the language to safeguard against lazy Rust developers. I don't want their code in my dependency tree, and I want static guarantees against this.
edit: I just searched unwrap() usage on Github, and I'm now kind of worried/angry:
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
A lot of this is just pure laziness.
Something that allows me to tag annotate a function (or my whole crate) as "no panic", and get a compile error if the function or anything it calls has a reachable panic.
This will allow it to work with many unmodified crates, as long as constant propagation can prove that any panics are unreachable. This approach will also allow crates to provide panicking and non panicking versions of their API (which many already do).
[0]: https://github.com/dtolnay/no-panic
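A small sketch of what the crate in [0] gives you today, as an approximation of that proposal; note it relies on the optimizer being able to eliminate the panic branch, so it can report false positives in unoptimized builds.

```rust
use no_panic::no_panic;

// If any panic path survives optimization, the build fails with a link error.
#[no_panic]
fn first_or_default(values: &[u32]) -> u32 {
    // Handled Option, so no panic path remains; `values[0]` or `.unwrap()`
    // here would make the function fail to link instead.
    values.first().copied().unwrap_or(0)
}

fn main() {
    println!("{}", first_or_default(&[7, 8, 9]));
}
```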
On the subject of this, I want more ability to filter out crates in our Cargo.toml. Such as a max dependency depth. Or a frozen set of dependencies that is guaranteed not to change so audits are easier. (Obviously we could vendor the code in and be in charge of our own destiny, but this feels like something we can let crate authors police.)
Look at how many lazy cases of this there are in Rust code [1].
Some of these are no doubt tested (albeit impossible to statically guarantee), but a lot of it looks like sloppiness or not leaning on the language's strong error handling features.
It's disappointing to see. We've had so much of this creep into the language that eventually it caused a major stop-the-world outage. This is unlikely to be the last time we see it.
[1] https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
A language DX feature I quite like is when dangerous things are labelled as such. IIRC, some examples of this are `accursedUnutterablePerformIO` in Haskell, and `DO_NOT_USE_OR_YOU_WILL_BE_FIRED_EXPERIMENTAL_CREATE_ROOT_CONTAINERS` in React.js.
I still think we should remove them outright or make production code fail to compile without a flag allowing them. And we also need tools to start cleaning up our dependency tree of this mess.
I think adoption would have played out very differently if there had only been some more syntactic sugar. For example, an easy syntax for saying: "In this method, any (checked) DeepException e that bubbles up should immediately be replaced by a new (checked) MylayerException(e) that contains the original one as a cause."
We might still get lazy programmers making systems where every damn thing goes into a generic MylayerException, but that mess would still be way easier to fix later than a hundred scattered RuntimeExceptions.
The problem is that any non-trivial software is composition, and encapsulation means most errors aren't recoverable.
We just need easy ways to propagate exceptions out to the appropriate reliability boundary, ie. the transaction/ request/ config loading, and fail it sensibly, with an easily diagnosable message and without crashing the whole process.
C# or unchecked Java exceptions are actually fairly close to ideal for this.
The correct paradigm is "prefer throw to catch" -- requiring devs to check every ret-val just created thousands of opportunities for mistakes to be made.
By contrast, a reliable C# or Java version might have just 3 catch clauses and handle errors arising below sensibly without any developer effort.
https://literatejava.com/exceptions/ten-practices-for-perfec...
1. In most cases developers don't want to handle `InterruptedException` or `IOException` and yet need to bubble them up. In that case the code is very verbose.
2. It makes lambdas and functions incompatible. So e.g. if you're passing a function to forEach, you're forced to wrap any checked exception in a runtime exception.
3. Due to (1) and (2), most people become lazy and do `throws Exception`, which negates most advantages of having checked exceptions in the first place.
In line-of-business apps (where Java is used the most), an uncaught exception is not a big deal. It will bubble up and get handled somewhere far up the stack (e.g. by the server logger) without disrupting other parts of the application. This reduces the utility of having every function declare InterruptedException / IOException when those hardly ever happen.
This is true, but the hate predated lambdas in Java.
In my experience, it actually is a big deal, leaving a wake of indeterminate state behind after the stack unwinds. The app then fails with heisenbugs later, raising more exceptions that get ignored, compounding the problem.
People just shrug off that unreliability as an unavoidable cost of doing business.
I think it comes down to a psychological or use-case issue: People hate thinking about errors and handling them, because it's that hard stuff that always consumes more time than we'd like to think. Not just digitally, but in physical machines too. It's also easier to put off "for later."
Exceptions force a panic on all errors, which is why they're supposed to be used in "exceptional" situations. To avoid exceptions when an error is expected, (eof, broken socket, file not found,) you either have to use an unnatural return type or accept the performance penalty of the panic that happens when you "throw."
In Rust, the stack trace happens at panic (unwrap), which is when the error isn't handled. IE, it's not when the file isn't found, it's when the error isn't handled.
What do you mean?
Exceptions do not force panic at all. In most practical situations, an exception unhandled close to where it was thrown will eventually get logged. It's kind of a "local" panic, if you will, that will terminate the specific function, but the rest of the program will remain unaffected. For example, a web server might throw an exception while processing a specific HTTP request, but other HTTP requests are unaffected.
Throwing an exception does not necessarily mean that your program is suddenly in an unsupported state, and therefore does not require terminating the entire program.
That's not what a panic means. Take a read through Go's panic/recover mechanism; it's similar to exceptions, but the semantics (with multiple return values) make it clear that panic is for exceptional situations. (I.e., panic isn't for "file not found," but instead it's for when code isn't written to handle "file not found.")
Even Rust has mechanisms to panic without aborting the process, although I will readily admit that I haven't used them and don't understand them: https://doc.rust-lang.org/std/panic/fn.resume_unwind.html
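For the curious, a minimal sketch of the mechanism being referred to: `std::panic::catch_unwind` stops an unwinding panic at a boundary instead of letting it take the whole process down. It doesn't work when panics are compiled as aborts, and it isn't meant as general-purpose error handling.

```rust
use std::panic;

fn main() {
    let result = panic::catch_unwind(|| {
        let empty: Vec<u32> = Vec::new();
        empty[0] // panics: index out of bounds
    });
    match result {
        Ok(v) => println!("got {v}"),
        Err(_) => eprintln!("one task panicked; the rest of the program carries on"),
    }
}
```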
When everyone uses runtime exceptions and doesn't account for exception handling in every possible code path, that's exactly what it means.
When you work with exceptions, the key is to assume that every line can throw unless proven otherwise, which in practice means almost all lines of code can throw. Once you adopt that mental model, things get easier.
It also makes errors part of the API contract, which is where they belong, because they are.
Can't HotSpot skip generating the stack trace when it knows the exception will be caught and the stack trace ignored?
Actually it can also just turn off the collection of stack traces entirely for throw sites that are being hit all the time. But most Java code doesn't need this because code only throws exceptions for exceptional situations.
In theory, theory and practice are the same. In practice...
You can't throw a checked exception in a stream; this fact actually underlines the key difference between an exception and a Result: a Result is in return position, while exceptions are a sort of side effect with their own control flow. Because of that, once your method throws an exception, or you are writing code in a try block that catches an exception, you become blind to further exceptions of that type, even if you might be able to (or be required to) fix those errors. Results have to be handled individually, and you get syntactic sugar to easily propagate them back up.
It is trivial to include a stack trace, but stack traces are really only useful for identifying where something occurred. What is generally superior is attaching context as you propagate back up, which happens trivially with judicious use of custom error types with From impls. Doing this means that the error message uniquely identifies the origin and the paths it passed through, without unimportant intermediate stack noise. With exceptions you would always need to catch each exception and rethrow a new exception containing the old one to add contextual information; then, to avoid catching too much, variables that will be initialized inside the try block have to be declared outside of it. So stack traces are basically only useful when you are doing Pokemon exception handling.
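A hedged sketch of that "context via custom error types plus From impls" style, with made-up names; each layer's error type says which step failed, without needing a stack trace.

```rust
use std::num::ParseIntError;

#[derive(Debug)]
enum ConfigError {
    UnreadableFile(std::io::Error),
    BadPort(ParseIntError),
}

impl From<std::io::Error> for ConfigError {
    fn from(e: std::io::Error) -> Self {
        ConfigError::UnreadableFile(e)
    }
}

impl From<ParseIntError> for ConfigError {
    fn from(e: ParseIntError) -> Self {
        ConfigError::BadPort(e)
    }
}

fn load_port(path: &str) -> Result<u16, ConfigError> {
    // Each `?` converts the underlying error into ConfigError via From,
    // so the caller learns which step failed and why.
    let text = std::fs::read_to_string(path)?;
    let port = text.trim().parse::<u16>()?;
    Ok(port)
}

fn main() {
    match load_port("port.txt") {
        Ok(p) => println!("port {p}"),
        Err(e) => eprintln!("config error: {e:?}"),
    }
}
```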
The same is required for any principled error handling.
It's not a checked exception without a stack trace.
Rust doesn't have Java's checked or unchecked exception semantics at the moment. Panics are more like Java's Errors (e.g. OOM error). Results are just error codes on steroids.
We are now discussing what can be done to improve code correctness beyond memory and thread safety. I am excited for what is to come.
The most useful thing exceptions give you is not static compile time checking; it's the stack trace, error message, causal chain, and ability to catch errors at the right level of abstraction. Rust's panics give you none of that.
Look at the error message Cloudflare's engineers were faced with:
That's useless, barely better than "segmentation fault". No wonder it took so long to track down what was happening. A proxy stack written in a managed language with exceptions would have given an error message like this:
and so on. It'd have been immediately apparent what went wrong. The bad configs could have been rolled back in minutes instead of hours. In the past I've been able to diagnose production problems based on stack traces so many times that I'd been expecting an outage like this ever since the trend away from providing exceptions in new languages in the 2010s. A decade ago I wrote a defense of the feature, and I hope we can now have a proper discussion about adding exceptions back to languages that need them (primarily Go and Rust):
https://blog.plan99.net/what-s-wrong-with-exceptions-nothing...
tl;dr: Capturing a backtrace can be a quite expensive runtime operation, so the environment variables allow either forcibly disabling this runtime performance hit or selectively enabling it in some programs.
By default it is disabled in release mode.
I am not sure that watching the trendy forefront successfully reach the 1990s and discuss how unwrapping an Option is potentially dangerous really warms my heart. I can't wait for the complete meltdown when they discover effect systems in 2040.
To be more serious, this kind of incident is yet another reminder that software development remains miles away from proper engineering, and even key providers like Cloudflare utterly fail at proper risk management.
Celebrating because there is now one popular language using static analysis for memory safety feels to me like being happy we now teach people to swim before a transatlantic boat crossing while we refuse to actually install lifeboats.
To me the situation has barely changed. The industry has been refusing to put strong reliability practices in place for decades, keeps significantly under-investing in tools that mitigate errors outside of a few fields where safety was already taken seriously before software was a thing, and keeps hiding behind the excuse that we need to move fast and safety is too complex and costly, while regulation remains extremely lenient.
I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
But yes, I wish I had learned more, and somehow stumbled upon all the good stuff, or been taught at university at least what Rust achieves today.
I think it has to be noted that Rust still delivers performance along with the safety it provides. So that's something, maybe.
> I mean, this Cloudflare outage probably cost millions of dollars of damage in aggregate between lost revenue and lost productivity. How much of that will they actually have to pay?
Probably nothing, because most paying customers of cloudflare are probably signing away their rights to sue Cloudflare for damages by being down for a while when they purchase Cloudflare's services (maybe some customers have SLAs with monetary values attached, I dunno). I honestly have a hard time suggesting that those customers are individually wrong to do so - Cloudflare isn't down that often, and whatever amount it cost any individual customer by being down today might be more than offset by the DDOS protection they're buying.
Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see. Should it be illegal under US law to use `unwrap` in Rust code? Should it be illegal for any single internet services company to have more than X number of customers? A lot of the internet also breaks when AWS goes down because many people like to use AWS, so maybe they should be included in this regulatory framework too.
We have collectively agreed to a world where software service providers have no incentive to be reliable, as they are shielded from the consequences of their mistakes, and somehow we see it as acceptable that software has a ton of issues and defects. The side effect is that research on actually lowering the cost of safety has little return on investment. It doesn't have to be so.
> Anyway if you want Cloudflare regulated to prevent this, name the specific regulations you want to see.
I want software providers to be liable for the damage they cause, and minimum quality regulation on par with an actual engineering discipline. I have always been astounded that nearly all software licences start with extremely broad limitation-of-liability provisions and people somehow feel fine with it. Try to extend that to any other product you regularly use in your life and see how that makes you feel.
How to do proper testing, formal methods, and resilient design have been known for decades. I would personally be more than okay with "let's move less fast and stop breaking things."
That we’re even having this discussion is a major step forward. That we’re still having this discussion is a depressing testament to how slowly the mainstream has adopted better ideas.
Zig is undergoing this meltdown. Shame it's not memory safe. You can only get so far in developing programming wisdom before Eternal September kicks in and we're back to re-learning all the lessons of history as punishment for the youthful hubris that plagues this profession.
A few ideas:
- It should not compile in production Rust code
- It should only be usable within unsafe blocks
- It should require explicit "safe" annotation from the engineer. Though this is subject to drift and become erroneous.
- It should be possible to ban the use of unsafe in dependencies and transitive dependencies within Cargo.
unwrap() should effectively work as a Result<> where the user must manually invoke a panic in the failure branch. Make special syntax if a match and panic is too much boilerplate.
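In today's syntax that proposal amounts to something like this sketch (with a made-up fallible call standing in for the real one): the failure branch and its panic are written out by hand instead of hiding behind `.unwrap()`.

```rust
fn parse_port(raw: &str) -> Result<u16, std::num::ParseIntError> {
    raw.trim().parse()
}

fn main() {
    let port = match parse_port("8080") {
        Ok(p) => p,
        // The panic is explicit and carries its own message.
        Err(e) => panic!("invalid port in config: {e}"),
    };
    println!("listening on port {port}");
}
```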
This is like an implicit null pointer exception that cannot be statically guarded against.
I want a way to statically block any crates doing this from my dependency chain.
How was I informed as a user? It's not in the type signature.
Sounds like I get to indeterminately crash at runtime and have a fun time debugging.
I don’t think you can ever completely eliminate panics, because there are always going to be some assumptions in code that will be surprisingly violated, because bugs exist. What if the heap allocator discovers the heap is corrupted? What if you reference memory that’s paged out and the disk is offline? (That one’s probably not turned into a panic, but it’s the same principle.)
Absent that there are hacks like no_panic[2]
[0] https://blog.yoshuawuyts.com/extending-rusts-effect-system/ [1] https://koka-lang.github.io/koka/doc/book.html#why-effects [2] https://crates.io/crates/no-panic
Software engineers tend to get stuck in software problems and thinking that everything should be fixed in code. In reality there are many things outside of the code that you can do to operate unreliable components safely.
There's also an assumption here that if the unwrap wasn't there, the caller would have handled the error properly. But if this isn't part of some common library at CF, then chances are the caller is the same person who wrote the panicking function in the first place. So if a new error variant they introduced was returned they'd probably still abort the thread either by panicking at that point or breaking out of the thread's processing loop.
It's not about whether you should ban unwrap() in production. You shouldn't. Some errors are logic bugs beyond which a program can't reasonably continue. The problem is that the language makes it too easy for junior developers (and AI!) to ignore non-logic-bug problems with unwrap().
Programmers early in their careers will do practically anything to avoid having to think about errors and they get angry when you tell them about it.
An unwrap should never make it to production IMHO. It's fine while prototyping, but once the project gets closer to production it's necessary to grep for `unwrap` in your code, replace those that can happen with proper error handling, and replace those that cannot happen with `expect`, with a clear justification of why they cannot happen unless there's a bug somewhere else.
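A hedged sketch of that workflow on a made-up example: the error that genuinely can happen gets real handling, while the "cannot happen unless this file has a bug" case gets an `expect` whose message states the invariant it relies on.

```rust
use std::net::IpAddr;

fn main() {
    // Can happen (user input): handle it, don't unwrap it.
    let arg = std::env::args().nth(1).unwrap_or_else(|| "127.0.0.1".to_string());
    let addr: IpAddr = match arg.parse() {
        Ok(a) => a,
        Err(e) => {
            eprintln!("invalid address {arg:?}: {e}");
            std::process::exit(2);
        }
    };

    // Cannot happen unless this file has a bug: say why.
    let loopback: IpAddr = "127.0.0.1"
        .parse()
        .expect("hard-coded literal is a valid IPv4 address");

    println!("{addr} {loopback}");
}
```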
unwrap isn't like that.
Not unlike people having a blind spot for Rust in general, no?
You might start with a basic timeline of what happened, then you'd start exploring: why did this change affect so many customers (this would be a line of questioning to find a potential root cause), why did it take so long to discover or recover (this might be multiple lines of questioning), etc.
That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
If the `.unwrap()` was replaced with `.expect("Feature config is too large!")` it would certainly make the outage shorter.
It wouldn't, not meaningfully. The outage was caused by a change in how they processed the queries. They had no way to observe the changes, nor canaries to see that the change was killing them. Plus, they would still need to manually feed a good file to, and restart, the services that had ingested bad configs.
`expect` would shave a few minutes; you would still spend hours figuring out and fixing it.
Granted, using expect is better, but it's not a silver bullet.
Or the good old:
1:
2:
3:
4:
Then it would have been idiomatic Rust code and wouldn't have failed at all. The function signature returned a `Result<(), (ErrorFlags, i32)>`.
Seems like it should have returned an Err((ErrorFlags, i32)) here. Case 2 or 3 above would have done nicely.
Removing unwrap() from Rust would have forced the proper handling of the function call and would have prevented this.
Unwrap() is Rust's original sin.
There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
> There's lots of useful code where `unwrap()` makes sense. On my team, we first try to avoid it (and there are many patterns where you can do this). But when you can't, we leave a comment explaining why it's safe.
I would prefer the boilerplate of a match / if-else / if let, etc. to call attention to it, if you absolutely must explicitly panic. Or better: just return an error Result.
It doesn't matter how smart your engineers are. A bad unwrap can sneak in through refactors, changing business logic, changing preconditions, new data, etc.
Then we need more safety semantics around panic behavior. A panic label or annotation that infects every call.
Moreover, I want a way of statically guaranteeing none of my dependencies do this.
That would be a fairly significant expansion of what `unsafe` means in Rust, to put it lightly. Not to mention that I think doing so would not really accomplish anything; marking unwrap() `unsafe` would not "surface `unwrap` usage" or "make any guarantees", as it's perfectly fine for safe functions to contain `unsafe` blocks with zero indication of that in the function signature.
I want an expansion of panic free behavior. We'll never get all the way there due to allocations etc., but this is the class of error the language is intended to fix.
This turned into a null pointer, which is exactly what Rust is supposed to quench.
I'll go as far as saying I would like to statically guarantee none of my dependencies use the unwrap() methods. We should be able to design libraries that provably avoid panics to the greatest extent possible.
Unwrap is an easy loss on a technicality.
Sure, and I'd hardly be one to disagree that a first-party method to guarantee no panics would be nice, but marking unwrap() `unsafe` is definitely not an effective way to go about it.
> but this is the class of error the language is intended to fix.
Is it? I certainly don't see any memory safety problems here.
> This turned into a null pointer, which is exactly what Rust is supposed to quench.
There's some subtlety here - Rust is intended to eliminate UB due to null pointer dereferences. I don't think Rust was ever intended to eliminate panics. A panic may still be undesirable in some circumstances, but a panic is not the same thing as unrestricted UB.
> We should be able to design libraries that provably avoid panics to the greatest extent possible.
Yes, this would be nice indeed. But again, marking unwrap() `unsafe` is not an effective way to do so.
dtolnay's no_panic is the best we have right now IIRC, and there are some prover-style tools in an experimental stage which can accomplish something similar. I don't think either of those are polished enough for first-party adoption, though.
This was bad code that should never have hit production; it is not a Rust language issue.
This is a null pointer. In Rust.
Unwrap needs to die. We should all fight to remove it.
Rust advertises memory safety (and other closely related things, like no UB, data race safety, etc.). I don't think it's made any promises about hard guarantees of other kinds of safety.
safe refers to memory safety.
Once again, if you write bad code, that's your fault, not the language's. This is a feature of Rust that was used incorrectly.
These folks are not choosing Rust for the memory safety guarantees. They're choosing Rust for being a fast language with a nice type system that produces "safe" code.
Rust is widely known for producing relatively defect-free code on account of its strong type system and ergonomics. Safety beyond memory safety.
Unwrap(), expect(), and their kin are a direct affront to this.
There are only two use cases for these: (1) developer laziness, or (2) the engineer spent time proving the method couldn't fail, but unfortunately isn't using language design features that allow this to be represented in the AST with static guarantees.
In both of these cases, the engineer should instead choose to (1) pass the Result<T,E> or Option<T> to the caller and let the caller decide what to do, (2) do the same, but change the type to be more appropriate to the caller, (3) handle it locally so the caller doesn't have to deal with it, (4) silently turn it into a success. That's it. That's idiomatic Rust.
This should be concerning to everyone:
https://github.com/search?q=unwrap%28%29+language%3ARust&typ...
I'm now panicked (hah) that some dependency of mine will unwrap something and panic at runtime. That's entirely invisible to users. It's extremely dangerous.
Today a billion people saw the result of this laziness. It won't be the last time. And hopefully it never happens in safety-critical applications like aircraft. But the language has no say in this because it isn't taking a stand against this unreasonably sharp edge yet. Hopefully it will. It's a (relatively) easy fix.
> That's too semantic IMHO. The failure mode was "enforced invariant stopped being true". If they'd written explicit code to fail the request when that happened, the end result would have been exactly the same.
Problem is, the enclosing function (`fetch_features`) returns a `Result`, so the `unwrap` on line #82 only serves as a shortcut a developer took due to assuming `features.append_with_names` would never fail. Instead, the routine likely should have worked within `Result`.
But it's a fatal error. It doesn't matter whether it's implicit or explicit, the result is the same.
Maybe you're saying "it's better to be explicit", as a broad generalization I don't disagree with that.
But that has nothing to do with the actual bug here, which was that the invariant failed. How they choose to implement checking and failing the invariant in the semantics of the chosen language is irrelevant.
Maybe the new config contains an important update. Who knows? Do we want to keep operating on the old config? Maybe, maybe not.
But operating on the old config when you don't want to is definitely worse.
Crashing on a config update is usually only done if it could cause data corruption when the configs aren't in sync. That's obviously not the case here, since the updates (although distributed in real time) are not coupled between hosts. Such systems are usually replicated state machines where config is totally ordered relative to other commands. Example: database schema and write operations (and even there, the way many databases are operated, the two aren't strongly coupled).
The real issue is further up the chain where the malformed feature file got created and deployed without better checks.
I do not think that if the bot detection model inside your big web proxy has a configuration error it should panic and kill the entire proxy and take 20% of the internet with it. This is a system that should fail gracefully and it didn't.
> The real issue
Are there single "real issues" with systems this large? There are issues being created constantly (say, unwraps where there shouldn't be, assumptions about the consumers of the database schema) that only become apparent when they line up.
The thing I dislike most about Nginx is that if you are using it as a reverse proxy for like 20 containers and one of them is down, the whole web server will refuse to start up.
Obviously making 19 sites unavailable just because one of them is caught in a crash loop isn't ideal. There is a workaround involving specifying the upstream through variables (in a regular non-Kubernetes Nginx container talking to other containers over an internal network, like Docker Compose or Docker Swarm), but if you try that approach you just trade the problem for a different error. Sadly, switching the redirect configuration away from the default also makes some apps go into a redirect loop and fail to load: mostly legacy ones, where Firefox shows something along the lines of "The page isn't redirecting properly". It sucks especially badly if you can't change the software that you just need to run, and suddenly your whole Nginx setup is brittle. Apache2 and Caddy don't have such an issue. That's to say that all software out there has some really annoying failure modes, even if Nginx is pretty cool otherwise.
But more generally you could catch the panic at the FL2 layer to make that decision intentional - missing logic at that layer IMHO.
But the bigger change is to make sure that config changes roll out gradually instead of all at once. That’s the source of 99% of all widespread outages
Another option is to make sure that config changes that fail to parse continue using the old config instead of resulting in an unusable service.
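A minimal sketch of that "keep the last known good config" shape, with a made-up `Config` type and validation rule; the real operational questions (how stale is acceptable, how loudly to alert) still need answers elsewhere.

```rust
#[derive(Clone, Debug)]
struct Config {
    feature_count: usize,
}

fn parse_config(raw: &str) -> Result<Config, String> {
    let feature_count: usize = raw.trim().parse().map_err(|e| format!("bad config: {e}"))?;
    if feature_count > 200 {
        return Err(format!("too many features: {feature_count}"));
    }
    Ok(Config { feature_count })
}

// Apply a new config only if it parses and validates; otherwise keep the one
// we already have and surface the failure instead of crashing.
fn reload(current: &mut Config, raw: &str) {
    match parse_config(raw) {
        Ok(next) => *current = next,
        Err(e) => eprintln!("rejecting new config, keeping last known good: {e}"),
    }
}

fn main() {
    let mut cfg = Config { feature_count: 60 };
    reload(&mut cfg, "120");  // accepted
    reload(&mut cfg, "9999"); // rejected; cfg stays at 120
    println!("{cfg:?}");
}
```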
Or even: the bot code crashing should itself generate alerts.
A canary deployment would have been automatically rolled back until the P0 incident was resolved.
All of this could probably have happened and been contained at their scale in less than a minute, as a deployment to 0.001% of the fleet would likely generate enough "omg the proxy cannot handle its config" alerts near immediately.
But ultimately it’s not the panic that’s the problem but a failure to specify how panics within FL2 layers should be handled; each layer is at least one team and FL2’s job is providing a safe playground for everyone to safely coexist regardless of the misbehavior of any single component
But as always such failures are emblematic of multiple things going wrong at once. You probably want to end up using both catch_unwind for the typical case and the supervisor for the case where there’s a segfault in some unsafe code you call or native library you invoke.
I also mention the fundamental tension of do you want to fail open or closed. Most layers should probably fail open. Some layers (eg auth) it’s safer to fail closed.
Maybe the validation code should've handled the larger size, but also the db query produced something invalid. That shouldn't have ever happened in the first place.
Agreed, that's also my takeaway.
I don't see the problem being "lazy programmers shouldn't have called .unwrap()". That's reductive. This is a complex system and complex system failures aren't monocausal.
The function in question could have returned a smarter error rather than panicking, but what then? An invariant was violated, and maybe this system, at this layer, isn't equipped to take any reasonable action in response to that invariant violation and dying _is_ the correct thing to do.
But maybe it could take smarter action. Maybe it could be restarted into a known good state. Maybe this service could be supervised by another system that would have propagated its failure back to the source of the problem, alerting operators that a file was being generated in such a way that violated consumer invariants. Basically, I'm describing a more Erlang model of failure.
Regardless, a system like this should be able to tolerate (or at least correctly propagate) a panic in response to an invariant violation.
The point of Option is that the crash path is the verbose, explicit one: it takes more code to check for NULL in C or nil in Go, whereas in Rust it takes more code to not check for the Err.
> This wasn't an attack, but a classic chain reaction triggered by “hidden assumptions + configuration chains” — permission changes exposed underlying tables, doubling the number of lines in the generated feature file. This exceeded FL2's memory preset, ultimately pushing the core proxy into panic.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
> Technical details: Even handling the unwrap correctly, an OOM would still occur. The primary issue was the lack of contract validation in feature ingest. The configuration system requires “bad → reject, keep last-known-good” logic.
> Why did it persist so long? The global kill switch was inadequate, preventing rapid circuit-breaking. Early suspicion of an attack also caused delays.
> Why not roll back software versions or restart?
> Rollback isn't feasible because this isn't a code issue—it's a continuously propagating bad configuration. Without version control or a kill switch, restarting would only cause all nodes to load the bad config faster and accelerate crashes.
> Why not roll back the configuration?
> Configuration lacks versioning and functions more like a continuously updated feed. As long as the ClickHouse pipeline remains active, manually rolling back would result in new corrupted files being regenerated within minutes, overwriting any fixes.
https://x.com/guanlandai/status/1990967570011468071
* For clarity, I am aware that the original tweets are written in Chinese, and they still have the stench of LLM writing all over them; it's not just the translation provided in the above comment.
> classic chain reaction triggered by “hidden assumptions + configuration chains”
"Classic/typical "x + y"", particularly when diagnosing an issue. This one is a really easy tell because humans, on aggregate, do not use quotation marks like this. There is absolutely no reason to quote these words here, and yet LLMs will do a combined quoted "x + y" where a human would simply write something natural like "hidden assumptions and configuration chains" without extraneous quotes.
> The configuration system requires “bad → reject, keep last-known-good” logic.
Another pattern with overeager usage of quotes is this ""x → y, z"" construct with very terse wording.
> This wasn't an attack, but a classic chain reaction
LLMs aggressively use "Not X, but Y". This is also a construct commonly used by humans, of course, but aside from often being paired with an em-dash, another tell is whether it actually contributes anything to the sentence. "Not X, but Y" is strongly contrasting and can add a dramatic flair to the thing being constrasted, but LLMs overuse it on things that really really don't need to be dramatised or contrasted.
> Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.
Two lists of three concepts back-to-back. LLMs enjoy, love, and adore this construct.
> Hats off to Cloudflare's engineers—those on the front lines putting out fires bear the brunt of such incidents.
This kind of completely vapid, feel-good word soup utilising a heroic analogy for something relatively mundane is another tell.
And more broadly speaking, there's a sort of verbosity and emptiness of actual meaning that permeates most LLM writing. This reads absolutely nothing like an engineer breaking down an outage. Like, the aforementioned line of... "Rust mitigates certain errors, but the complexity in boundary layers, data flows, and configuration pipelines remains beyond the language's scope. The real challenge lies in designing robust system contracts, isolation layers, and fail-safe mechanisms.". What is that actually communicating to you? It piles on technical lingo and high-level concepts in a way that is grammatically correct but contains no useful information for the reader.
Bad writing exists, of course. There's plenty of bad writing out there on the internet, and some of it will suffer from flaws like these even when written by a human, and some humans do like their em-dashes. But it's generally pretty obvious when the writing is taken on aggregate and you see recognisable pattern after pattern combined with em-dashes combined with shallowness of meaning combined with unnecessary overdramatisations.
Like GP I'm not very good at spotting these patterns yet, so explicit real-world examples go a long way.
The way they wrote the code means that having more than 200 features is a hard non-transient error - even if they recovered from it, it meant they'd have had the same error when the code got to the same place.
I'm sure when the process crashed, k8s restarted the pod or something - then it reran the same piece of code and crashed in the same place.
While I don't necessarily agree with crashing as a business strategy, I don't see what else it could have done other than either dropping the extra rules or allocating more memory, neither of which the original code was built to do (probably by design).
The code made the local hard assumption that there won't ever be more than 200 rules and its okay to crash if that count is exceeded.
If you design your code around an invariant never being violated (which is fine), you have to make it clear at a higher level that you did.
This isn't a Rust problem (though Rust does make it easy to do the wrong thing here imo)
That's not always foolproof, e.g. a freshly (re)started process doesn't have any prior state it can fall back to, so it just hard crashes. But restarts are going to be rate limited anyways, so even then there is time to mitigate the issue before it becomes a large scale outage
I don't like to use implicit unwrap. Even things that are guaranteed to be there, I treat as explicit (For example, (self.view?.isEnabled ?? false), in a view controller, instead of self.view.isEnabled).
I always redefine @IBOutlets from implicitly unwrapped optionals to regular optionals. I'm kind of a "belt & suspenders" type of guy.
In this particular case, I would rather crash. It’s easier to spot in a crash report and you get a nice stack trace.
Silent failure is ultimately terrible for users.
Note: for the things I control I try to very explicitly model state in such a way as I never need to force unwrap at all. But for things beyond my control like this situation, I would rather end the program than continue with a state of the world I don’t understand.
See my above/below comment.
A good tool for catching stuff during development is the humble assert()[0]. We can use precondition()[1] to do the same thing in ship code.
The main thing is to remain in control, as much as possible. Rather than let the PC leave the stack frame, throw the error immediately when it happens.
[0] https://docs.swift.org/swift-book/documentation/the-swift-pr...
[1] https://docs.swift.org/swift-book/documentation/the-swift-pr...
Agreed.
Unfortunately, crashes in iOS are “silent failures,” and are a loss of control.
What this practice does, is give me the option to handle the failure “noisily,” and in a controlled manner; even if just emitting a log entry, before calling a system failure. That can be quite helpful, in threading. Also, it gives me the option to have a valid value applied, if there’s a structural failure.
But the main reason that I do that with @IBOutlets, is that it forces me to acknowledge, throughout the rest of the code, that it’s an optional. I could always treat implicit optionals as if they were explicit, anyway. This just forces me to.
I have a bunch of practices that folks can laugh at, but my stuff works pretty effectively, and I sleep well.
Also, I have found App Store crash reports to be next to useless. TestFlight ones are a bit better.
But if I spend a lot of time doing it right the first time, we can avoid all kinds of heartbreak.
Crash early, crash often. Find the bugs and bad assumptions.
No it's not. Read my other comments.
ToMAYto, ToMAHto.
I have learned that it's a bad idea to trash other folks' methodologies without taking the time to understand why they do things, the way they do.
I have found dogma to be an impediment, in my own work. As I've gotten older, the sharp edges have been sanded off.
Have a great day!
Oh I am aware. They do it because
A) they don’t have a mental model of correct execution. Events just happen to them with a feeling of powerlessness. So rather than trying to form one, they just litter the code with cases for things that might happen.
B) they have grown up in bad organizations with bad incentives that penalize the appearance of making mistakes. So they learn to hide them.
For example there might be an initiative that rewards removing crashes in favor of silent errors.
> As I've gotten older, the sharp edges have been sanded off.
While there are certainly many things to admire about Rust, this is why I prefer Golang's "noisy" error handling. In golang that would be either an explicit `err` variable, which the compiler would then complain is unused, or a `val, _ := foo()`, where it's far more obvious that an error is being ignored.
(Renaming `unwrap` to `unwrapOrPanic` would probably help too.)
People's biggest complaints about golang's errors:
1. You have to _TYPE_OUT_ what to do on EVERY.SINGLE.ERROR. SOO BOORING!
2. They clutter up the code and make it look ugly.
Rust is so much cleaner and more convenient (they say)! Just add ?, or .unwrap()!
Well, with ".unwrap()", you can type it fast enough that you're on to the next problem before it occurs to your brain to think about what to do if there is an error. Whereas, in golang, by the time you type in, "if err != nil {", you've broken the flow enough that now you're much more likely to be thinking, "Hmm, could this ever fail? What should we do if it does?" That break in flow is annoying, but necessary.
And ".unwrap()" looks so unassuming, it's easy to overlook on review; that "panic()" looks a lot more dangerous, and again, would be more likely to trigger a reviewer into thinking, "Wait, is it OK if this thing panics? Is this really so unlikely to happen?"
Renaming it `.unwrap_or_panic()` would probably help with both.
1. Culturally, using `unwrap` is an omerta to Rust developers in the same way `panic` is an omerta to Go devs;
2. In the Rust projects I've seen there is usually a linter rule forbidding `unwrap` so you can't use it in production
Unfortunately none of the meanings Wikipedia knows [1] seems to fit this usage. Did you perhaps mean "taboo"?
I disagree that "unwrap()" seems as scary as "panic()", but I will certainly agree to sibling commenters have a point when they say that "bar, _ := foo()" is a lot less scary than "unwrap()".
[1] https://en.wikipedia.org/wiki/Omerta_(disambiguation)
- it's literally written out that you're assuming it to be Ok
- there are no indications that the `_` is an error: it could very well be some other return value from the function. in your example, it could be the number of appended features, etc
That's why Go's error handling is indeed noisy: it's noise, and you reduce noise by not handling errors. Rust's is terse yet explicit: if you have to add something, it's because you're doing something out of the ordinary. You explicitly spelled out that the error is being ignored.
Haven't used Go so maybe I'm missing some consideration, but I don't see how ", _" is more obvious than ".unwrap()". If anything it seems less clear, since you need to check/know the function's signature to see that it's an error being ignored (wouldn't be the case for a function like https://pkg.go.dev/math#Modf).
It may be that forcing handling at every call tends to make code verbose, and leaves devs desensitized to bad practice. And the diagnostic Rust provided seems pretty garbage.
There is bad practice here too -- config failure manifesting as request failure, lack of failing to safe, unsafe rollout, lack of observability.
Back to language design & error handling. My informed view is that robustness is best when only major reliability boundaries need to be coded.
This the "throw, don't catch" principle with the addition of catches on key reliability boundaries -- typically high-level interactions where you can meaningfully answer a failure.
For example, this system could have a total of three catch clauses "Error Loading Config" which fails to safe, "Error Handling Request" which answers 5xx, and "Socket Error" which closes the HTTP connection.
Rust has a lot of helpers to make it less verbose; even the error they demonstrate could have been written as `...code()?`, with the `?` operator propagating the error onwards.
However, I do acknowledge that writing error types is boring sometimes, so people don't bother to change their error types and just unwrap. But even in my dinky little apps for personal use I do a simple search for `unwrap` and make sure I have as few as possible.
The end result would've been the exact same if they "handled" the error: a bunch of 500s. The language being used doesn't matter if an invariant in your system is broken.
If anything, the "crash early" mentality may even be nefarious: instead of handling the error and keeping the old config, you would spin on trying to load a broken config on startup.
_In theory_ they could have used the old config, but maybe there are reasons that’s not possible in Cloudflare’s setup. Whether or not that’s an invariant violation or just an error that can be handled and recovered from is a matter of opinion in system design.
And crashing on an invariant violation is exactly the right thing to do rather than proceed in an undefined state.
At a previous job (cloud provider), we've had exactly this kind of issue, with exactly the same root cause. The entrypoint for the whole network had a set of rules (think a NAT gateway) that were reloaded periodically from the database. Someone rewrote that bit of plumbing from Python to Go. Someone else performed a database migration. Suddenly, the plumbing could not find the data, and pushed an empty file to prod. The rewrite lacked "if empty, do nothing and raise an alert", that the previous one had. I'll let you imagine what happened next :)
Them calling unwrap on a limit check is the real issue imo. Everything that takes in external input should assume it is bad input and should be fuzz tested imo.
In the end, what is the point of having a limit check if you are just going to unwrap on it?
Using the question mark operator [1] and adding in some anyhow::Context goes a long way towards being able to fail fast and return an Err rather than panicking.
Sure, you need to handle Results all the way up the stack, but it forces you to think about how those nested parts of your app will fail.
[1]: https://doc.rust-lang.org/rust-by-example/std/result/questio...
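A small sketch of that combination (`?` plus `anyhow::Context`), with a made-up file name: each `.context(...)` layer ends up in the printed error chain, which is roughly the "add context on the way up" idea.

```rust
use anyhow::{Context, Result};

fn load_feature_file(path: &str) -> Result<Vec<String>> {
    let text = std::fs::read_to_string(path)
        .with_context(|| format!("reading feature file {path}"))?;
    Ok(text.lines().map(str::to_owned).collect())
}

fn main() -> Result<()> {
    let features = load_feature_file("features.conf")
        .context("loading bot-management features")?;
    println!("loaded {} features", features.len());
    Ok(())
}
```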
This .unwrap() sounds too easy for what it does, certainly much easier than having an entire try..catch block with an explicit panic. Full disclosure: I don't actually know Rust.
Any project has to reason about what sort of errors can be tolerated gracefully and which cannot. Unwrap is reasonable in scenarios you expect to never be reached, because otherwise your code will be full of all sorts of possible permutations and paths that are harder to reason about and may cascade into extremely nuanced or subtle errors.
Rust also has a version of unwrap called "expect" where you provide a string explaining why the unwrap should never fail; that string is included in the panic message if it does. It's similar, but for pieces of code that are crucial it could be a good idea to require all 'unwraps' to instead be 'expects', so that people are at least forced to write down a reason why they believe the unwrap can never be reached.
The config bug reaching prod without this being caught and pinpointed immediately is the strange part.
And it took over an hour between when the problem started and when my sites went down. That is just crazy.
Panics aren't exceptions, any "panic" in Rust can be thought of as an abort of the process (Rust binaries have the explicit option to implement panics as aborts). Companies like Dropbox do exactly this in their similar Rust-based systems, so it wouldn't surprise me if Cloudflare does the same.
"Banning exceptions" wouldn't have done anything here, what you're looking for is "banning partial functions" (in the Haskell sense).
Rust's unwrap isn't the same as std::expected::value. The former panics - i.e. either aborts the program or unwinds depending on context and is generally not meant to be handled. The latter just throws an exception that is generally expected to be handled. Panics and exceptions use similar machinery (at least they can depending on compiler options) but they are not equivalent - for example nested panics in destructors always abort the program.
In code that isn't meant to crash, `unwrap` should be treated as a sign saying "I'm promising that this will never happen". But just like in C++, where you promise that pointers you dereference are valid and signed integers you add don't overflow, making promises like that is a necessary part of productive programming.
As usual: people problem, not a tech problem. In the last years a lot of strides have been made. But people will be people.
But now after we are past that and it has a lot of mind share, I'd say it's time to start tightening the bolts.
At some point machines will be better at coding, because machine code is ultimately a machine-instruction task.
Same as with chess: the engine is better than a human grandmaster because it's a solvable math field.
Coding is no different.
Might be worth noting that your description of chess is slightly incorrect. Chess technically isn't solved in the sense that the optimal move is known for any arbitrary position; it's just that chess engines are using what amounts to a fancy brute force for most of the game, and the combination of hardware and search algorithm produces a better result than the human brain does. As such, chess engines are still capable of making mistakes, even if actually exploiting them is a challenge.
"chess engines are still capable of making mistakes", I'm sorry no
inaccurate yes but not mistake
The thing is that there are no known general objective criteria for "best" and "bad" moves. The best we have so far is based on engine evaluations, but as I said before, that is because chess engines are better at searching the board's state space than humans, not because chess engines have solved chess in the mathematical sense. Engines are quite capable of misevaluating positions, as demonstrated quite well by the Top Chess Engine Championship [0], where one engine thinks it made a good move while the other thinks that move is bad, and this is especially the case when resources are limited.
The closest we are to solving chess are via tablebases, which are far from covering the entire state space and are basically as much of an exemplar of pure brute force as you can get.
> "chess engines are still capable of making mistakes", I'm sorry no
If you think chess engines are infallible, then why does the Top Chess Engine Championship exist? Surely if chess engines could not make mistakes, they would always agree on a position's evaluation and what move should be made, and therefore such an exercise would be pointless?
> Inaccurate, yes, but not a mistake.
From the perspective of attaining perfect play, an inaccuracy is a mistake.
[0]: https://en.wikipedia.org/wiki/Top_Chess_Engine_Championship
Are you playing chess or not?????? If you play chess, then it's obvious how to differentiate a bad move from the best move.
Yes, it is objective; these things are called "best moves" not without reason.
"If you think chess engines are infallible, then why does the Top Chess Engine Championship exist?"
To create better chess engines. Like, what are we even talking about here???? Are you saying that just because there are older, bad engines, this whole thing is pointless????
If you play chess up to a decent level, 1700+ (like me), you know that this argument is wrong, and I encourage you to learn chess to a decent level.
At that point you'll see that high-level chess is a brute-force game and therefore solvable math.
In a fascinating coincidence, there is a tonyhart7 on both chess.com and lichess, and they have been banned for cheating on both websites.
The key words in what I said are "general" and "objective". Yes, it's possible to determine "good" or "bad" moves in specific positions. There's no known method to determine "good" or "bad" moves in arbitrary positions, as would be required for chess to be considered strongly solved.
Furthermore, if it's "obvious" how to differentiate good and bad moves then we should never see engines blundering, right?
So (for example) how do you explain this game between Stockfish and Leela where Stockfish blunders a seemingly winning position [0]? After 37... Rdd8, both Stockfish and Leela think white is clearly winning (Stockfish's evaluation is +4.00, while Leela's evaluation is +3.81), but after 38. Nxb5, Leela's evaluation plummets to +0.34 while Stockfish's evaluation remains at +4.00. In the end, it turns out Leela was correct: after 40... Rxc6, Stockfish's evaluation also drops from +4.28 to 0.00 as it realizes that Leela has a forced stalemate.
Or this game also between Stockfish and Leela where Leela blunders into a forced mating sequence and doesn't even realize it for a few moves [1]?
Engines will presumably always play what they think is the "best" move, but clearly sometimes this "best" move is wrong. Evidently, this means differentiating "good" and "bad" moves is not always obvious.
> Yes it is objective, these thing called best move not without reason
If it's objective, then why is it possible for engines to disagree on whether a move is good or bad, as they do in the above example and others?
> to create better chess engine like what do even talking about here????
The ability to create better chess engines necessarily implies that chess engines can and do make mistakes, contrary to what you asserted.
> are you saying just because there are older bad engine that mean this thing is pointless ????
No. What I'm saying is that your explanation for why chess engines are better than humans is wrong. Chess engines are not better than humans because they have solved chess in the mathematical sense; chess engines are better than humans because they search the state space faster and more efficiently than humans (at least until you reach 7 pieces on the board).
> up until that point that you know high level chess is brute force games and therefore solvable math
"Solvable" and "solved" are two very different things. Chess is solvable, in theory. Chess is very far from being solved.
[0]: https://www.chess.com/computer-chess-championship#event=309&...
[1]: https://www.chess.com/computer-chess-championship#event=309&...
The root cause here was that a file was mildly corrupt (with duplicate entries, I guess). And there was a validation check elsewhere that said "THIS FILE IS TOO BIG".
But if that's a validation failure, well, failing is correct? What wasn't correct was that the failure reached production. What should have happened is that the validation should have been a unified thing and whatever generated the file should have flagged it before it entered production.
And that's not an issue with function return value API management. The software that should have bailed was somewhere else entirely, and even there an unwrap explosion (in a smoke test or pre-release pass or whatever) would have been fine.
Ideally every validation should have a well-defined failure path. In the case of a config file rotation, validation failure of the new config could mean keeping the old config and logging a high-priority error message. In the case of malformed user-provided data, it might mean dropping the request and maybe logging it for security analysis reasons. In the case of "pi suddenly equals 4" checks the most logical approach might be to intentionally crash, as there's obviously something seriously wrong and application state has corrupted in such a way that any attempt to continue is only going to make things worse.
But in all cases there's a reason behind the post-validation-failure behavior. At a certain point leaving it up to "whatever happens on .unwrap() failure" isn't good enough anymore.
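As a minimal sketch of that kind of failure path for a config reload (the types, names, and 200-entry limit here are all illustrative, not Cloudflare's code):

```rust
#[derive(Clone)]
struct Config {
    features: Vec<String>,
}

impl Config {
    // The size check lives next to the parser and has a defined failure path,
    // instead of being an unwrap deep inside the proxy.
    fn parse_and_validate(raw: &str) -> Result<Config, String> {
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > 200 {
            return Err(format!("{} features exceeds the limit of 200", features.len()));
        }
        Ok(Config { features })
    }
}

struct Engine {
    current: Config,
}

impl Engine {
    fn try_reload(&mut self, raw: &str) {
        match Config::parse_and_validate(raw) {
            Ok(cfg) => self.current = cfg,
            // Keep serving with the previous config and log a high-priority error.
            Err(e) => eprintln!("rejecting new config, keeping the old one: {e}"),
        }
    }
}
```

The point is simply that the "keep the old config" decision is written down explicitly, rather than being whatever a panic happens to do.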
it'd be kinda hard to amend the clippy lints to ignore coroutine unwraps but still pipe up on system ones. i guess.
edit: i think they'd have to be "solely-task-color-flavored" so definitely probably not trivial to infer
How so? “Parse, don’t validate” implies converting input into typed values that prevent representation of invalid state. But the parsing still needs to be done correctly. An unchecked unwrap really has nothing to do with this.
Average Go code has far fewer panics than Rust has unwraps, which are functionally equivalent.
It's not in the type system, but it's idiomatic
I'd prefer a loud crash over that.
In the original PHP code, everything worked; it just didn't properly check for bots.
The new Rust code crashed loudly and took down half of the internet.
Also, with a sharded system, why are they not rolling changes out slowly and monitoring?
> Bad data was only generated if the query ran on a part of the cluster which had been updated. As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
The file should be versioned and rollout of new versions should be staged.
(There is definitely a trade-off; often times in the security critical path, you want to go as fast as possible because changes may be blocking a malicious actor. But if you move too fast, you break things. Here, they had a potential poison input in the pathway for synchronizing this state and Murphy's Law suggests it was going to break eventually, so the question becomes "How much damage can we tolerate when it does?")
That feature file is generated every 5 minutes at all times; the change to permissions was rolled out gradually over the clickhouse cluster, and whether a bad version of that file was generated depended on whether the part of the cluster that had the bad permissions generated the file.
First multi-million dollar .unwrap() story.
I have been saying for years that Rust botched error handling in unfixable ways. I will go to the grave believing Rust fumbled.
The design of the Rust language encourages people to use unwrap() to turn foreseeable runtime problems into fatal errors. It's the path of least resistance, so people will take it.
Rust encourages developers to consider only the happy path. No wonder it's popular among people who've never had to deal with failure.
All of the concomitant complexity -- Result, ?, the test thing, anyhow, the inability for stdlib to report allocation failure -- is downstream of a fashion statement against exceptions Rust cargo-culted from Go.
The funniest part is that Rust does have exceptions. It just calls them panics. So Rust code has to deal with the ergonomic footgun of Result but pays anyway for the possibility of exceptions. (Sure, you can compile with panic=abort. You can't count on it.)
I could not be more certain that Rust should have been a language with exceptions, not Result, and that error objects are a gross antipattern we'll regret for decades.
(You usually want to make a function infallible if you're using your noexcept function as part of a cleanup path, or as part of a container interface that allows for more optimizations if it knows certain container operations are infallible.)
Rust makes infallibility the syntactic default and makes you write Result to indicate fallibility. People often don't want to color their functions this way. Guess what happens when a programmer is six levels deep in infallible-colored function calls and does something that can fail.
.unwrap()
Guess what, in Rust, is fallible?
Mutex acquire.
Guess what you need to do often on infallible cleanup paths?
Mutex acquire.
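To make that concrete, here's a small illustration (not from any real codebase): `std::sync::Mutex::lock()` returns a `Result` because the lock may have been poisoned by a panicking thread, and `Drop` has no way to report an error, so the path of least resistance is exactly the unwrap being criticized.

```rust
use std::sync::{Arc, Mutex};

struct Cleanup {
    state: Arc<Mutex<Vec<String>>>,
}

impl Drop for Cleanup {
    fn drop(&mut self) {
        // Drop can't return a Result, so lock() is "handled" by panicking.
        let mut guard = self.state.lock().unwrap();
        guard.clear();
    }
}
```

(You can sidestep poisoning with `lock().unwrap_or_else(|e| e.into_inner())`, but that has to be a deliberate choice at every call site.)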
If unwrap() were named UNWRAP_OR_PANIC(), it would be used much less glibly. Even more, I wish there existed a super strict mode when all places that can panic are treated as compile-time errors, except those specifically wrapped in some may_panic_intentionally!() or similar.
React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED comes to mind. I did have to reach for this once before, but it certainly works for keeping this out of example code, and it stops you from reading other implementations without the danger being very apparent.
At some point it was renamed to __CLIENT_INTERNALS_DO_NOT_USE_OR_WARN_USERS_THEY_CANNOT_UPGRADE which is much less fun.
Not for this guy:
https://github.com/reactjs/react.dev/issues/3896
There is already a try/catch around that code, which produces the Result type, which you can presumptuously .unwrap() without checking if it contains an error.
Instead, one should use the question mark operator, which immediately returns the error from the current function if a Result is an error. This is exactly like rethrowing an exception, but only requires typing one character, the "?".
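As a generic illustration (not Cloudflare's code), here is the same fallible call handled three ways:

```rust
use std::num::ParseIntError;

fn parse_port(s: &str) -> Result<u16, ParseIntError> {
    let port: u16 = s.parse()?; // propagate the error to the caller, like a rethrow
    Ok(port)
}

fn parse_port_or_die(s: &str) -> u16 {
    s.parse().unwrap() // panic on bad input: the "presumptuous" option
}

fn parse_port_with_default(s: &str) -> u16 {
    s.parse().unwrap_or(8080) // swallow the error and fall back to a default
}
```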
It's way less book-keeping with exceptions, since you, intentionally, don't have to write code for that exceptional behavior, except where it makes sense to. The return by value method, necessarily, implements the same behavior, where handling is bubbled up to the conceptually appropriate place, through returns, but with much more typing involved. Care is required for either, since not properly bubbling up an exception can happen in either case (no re-raise for exceptions, no return after handling for return).
With return values, you can trivially ignore an error.
In the wild, I've seen far more ignoring of returned errors, because of the mechanical burden of having to type out handling at every function call. This is backed by decades of writing libraries. I've tried to implement libraries without exceptions, which was admittedly my own cargo-cult preference long ago, but ignoring errors was so prevalent among the users of all those libraries that I now always include a "raise" boolean that defaults to True on anything that would return an error value, to force exceptions, and their handling, as the default behavior.
> In big projects you can basically never know when or how something can fail.
How is this fundamentally different than return value? Looking at a high level function, you can't know how it will fail, you just know it did fail, from the error being bubbled up through the returns. The only difference is the mechanism for bubbling up the error.
Maybe some water is required for this flame war. ;)
That is the main reason why zig doesn’t have exceptions.
What I suspect you mean, because it's a better argument, is:
which is fair, although how often is it really the right thing to let all the errors from 4 independent sources flow together and then get picked apart after the fact by inspecting `e`? It's an easier life, but it's also one where subtle problems constantly creep in without the compiler having any visibility into them at all. A function or a keyword would interrupt that and make it less tempting
How many times can you truly prove that an `unwrap()` is correct and that you also need that performance edge?
Ignoring the performance aspect, which is often pulled out of a hat, to prove such a thing you need to be aware of the inner workings of the call giving you a `Result`. That knowledge is only valid at the time of writing your `unwrap()`, and won't necessarily hold later.
Also, aren't you implicitly forcing whoever changes the function to check for every smartass dev that decided to `unwrap` at their callsite? That's bonkers.
If I were Cloudflare I would immediately audit the codebase for all uses of unwrap (or similar rust panic idioms like expect), ensure that they are either removed or clearly documented as to why it's worth crashing the program there, and then add a linter to their CI system that will fire if anyone tries to check in a new commit with unwrap in it.
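For what it's worth, clippy already has lints for this; a sketch of what that policy might look like in code, assuming clippy runs in CI with warnings denied:

```rust
// Crate root: deny the panicking conversions everywhere except tests.
#![cfg_attr(not(test), deny(clippy::unwrap_used, clippy::expect_used))]

use std::collections::HashMap;

fn lookup(map: &HashMap<String, u32>, key: &str) -> u32 {
    // Without the allow and its written justification, `cargo clippy -- -D warnings`
    // fails the build on the unwrap below.
    #[allow(clippy::unwrap_used)] // invariant: callers insert `key` before calling this
    let value = *map.get(key).unwrap();
    value
}
```

(On newer toolchains the same policy can also live in Cargo.toml under a `[lints.clippy]` table so it applies to the whole workspace.)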
So the point of unwrap() is not to prove anything. Like an assertion it indicates a precondition of the function that the implementer cannot uphold. That's not to say unwrap() can't be used incorrectly. Just that it's a valid thing to do in your code.
Note that none of this is about performance.
Returning a Result by definition means the method can fail.
No more than returning an int by definition means the method can return -2.
What? Returning an int does in fact mean that the method can return -2. I have no idea what your argument is with this, because you seem to be disagreeing with the person while actually agreeing with them.
Some call points to a function that returns an int will never return -2.
Sometimes you know things the type system does not know.
Except maybe Haskell.
Also, exception handling is hard and lame. We don't need exceptions, just add a "match" block after every line in your program.
Rust compiler is a god of sorts, or at least a law of nature haha
Way to comment and go instantly off topic
The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.
The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.
It has somewhat regularly saved us from disaster in the past.
Edit: Similar to Crowdstrike, the bot detector should have fallen-back to its last-known-good signature database after panicking, instead of just continuing to panic.
If you want to say that systems that light up hundreds of customers, or propagate new reactive bot rules, or notify a routing system that a service has gone down are intrinsically too complicated, that's one thing. By all means: "don't build modern systems! computers are garbage!". I have that sticker on my laptop already.
But like: handling these problems is basically the premise of large-scale cloud services. You can't just define it away.
https://fly.io/blog/a-foolish-consistency/
https://fly.io/blog/corrosion/
> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:
1. Push out files A/B to ensure the old file is not removed.
2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error.
This seems like pretty basic SRE stuff.
Even if you want this data to be very fresh you can probably afford to do something like:
1. Push out data to a single location or some subset of servers.
2. Confirm that the data is loaded.
3. Wait to observe any issues. (Even a minute is probably enough to catch the most severe issues.)
4. Roll out globally.
Even if the servers weren't crashing, it is possible that a bad set of parameters results in far too many false positives, which might as well be a complete failure.
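A toy sketch of the wave-based rollout described in the list above; `push_to` and `healthy` are hypothetical stand-ins for real deployment and monitoring hooks:

```rust
use std::{thread, time::Duration};

fn staged_rollout(waves: &[Vec<&str>]) -> Result<(), String> {
    for wave in waves {
        for server in wave {
            push_to(server);
        }
        // Wait and observe before widening the blast radius; even a minute
        // catches a config that crashes its consumers on load.
        thread::sleep(Duration::from_secs(60));
        if !wave.iter().all(|s| healthy(s)) {
            return Err(format!("wave {wave:?} unhealthy, halting rollout"));
        }
    }
    Ok(())
}

fn push_to(server: &str) {
    println!("pushing feature file to {server}");
}

fn healthy(_server: &str) -> bool {
    true // in reality: query error rates / crash metrics for that server
}
```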
Sometimes you have smart people in the room who dig deeper and fish it out, but you cannot always rely on that.
I'm also suspicious that
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
from the blog had a lot more to do with the issue than perhaps the narrative is letting on.
My best guess is too many alerts firing without a clear hierarchy and without a way to separate cause from effect. It's a typical challenge, but I wish they would shed some light on that. And it's a bit concerning that improving observability is not part of their follow-up steps.
> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input
> Enabling more global kill switches for features
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
> Reviewing failure modes for error conditions across all core proxy modules
Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming CloudFlare has such boundaries at all. How are they going to contain the blast radius in the future?
This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.
Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.
I wonder why clickhouse is used to store the feature flags here, as it has its own duplication footguns[0] which could also have easily led to a query blowing up 2-3x in size. oltp/sqlite seems more suited, but I'm sure they have their reasons
[0] https://clickhouse.com/docs/guides/developer/deduplication
Also, the link you provided is for eventual deduplication at the storage layer, not deduplication at query time.
It’s not a terrible idea, in that you can test the exact database engine binary in CI, and it’s (by definition) not a single point of failure.
I love sqlite for some things, but it's not The One True Database Solution.
In this case, the older proxy's "fail-closed" categorization of bot activity was obviously better than the "fail-crash", but every global change needs to be carefully validated to have good characteristics here.
Having a mapping of which services are downstream of which other service configs and versions would make detecting global incidents much easier too, by making the causative threads of changes more apparent to the investigators.
People think of configuration updates (or state updates, call them what you will) as inherently safer than code updates, but history (and today!) demonstrates that they are not. Yet even experienced engineers will allow changes like these into production unattended -- even ones who wouldn't dare let a single line of code go live without being subject to the full CI/CD process.
The point here remains: consider every change to involve risk, and architect defensively.
And I hope fly.io has these mechanisms as well :-)
https://fly.io/blog/corrosion/
If the exact same thing happens again at Cloudflare, they'll be fair game. But right now I feel people on this thread are doing exactly, precisely, surgically and specifically the thing Richard Cook and the Cook-ites try to get people not to do, which is to see complex system failures as predictable faults with root causes, rather than as part of the process of creating resilient systems.
† https://how.complexsystems.fail/
Complex system failures are not monocausal! Complex systems are in a continuous state of partial failure!
Fires happen every day. Smoke alarms go off, firefighters get called in, incident response is exercised, and lessons from the situation are learned (with resulting updates to the fire and building codes).
Yet even though this happens, entire cities almost never burn down anymore. And we want to keep it that way.
As Cook points out, "Safety is a characteristic of systems and not of their components."
No matter what architecture, processes, software, frameworks, and systems you use, or how exhaustively you plan and test for every failure mode, you cannot 100% predict every scenario and claim "cellular architecture fixes this". This includes making 100% of all failures "contained". Not realistic.
Cellular architecture within a region is the next level and is more difficult, but is achievable if you adhere to the same principles that prohibit inter-regional coupling:
https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
https://docs.aws.amazon.com/wellarchitected/latest/reducing-...
Amazon has had multi-region outages due to pushing bad configs, so it’s extremely difficult to believe whatever you are proposing solves that exact problem by relying on multi-regions.
Come to think of it, Cloudflare’s outage today is another good counterexample.
Customers survive incidents on a daily basis by failing over across regions (even in the absence of an AWS regional failure, they can fail due to a bad deployment or other cause). The reason you don’t hear about it is because it works.
This has to sting a bit after that post.
Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)
And since it seems this is hosted by Atlassian, this would be up to Atlassian.
IDK Atlassian Statuspage clientele, but it's possible Cloudflare is much larger than usual.
The report actually seems to confirm this - it was indeed a crash on ingesting the bad config. However I'm actually surprised that the long duration didn't come from "it takes a long time to restart the fleet manually" or "tooling to restart the fleet was bad".
The problem mostly seems to have been "we didn't know what's going on". A look into the proxy logs would hopefully have shown the stacktrace/unwrap, and metrics about incoming requests would hopefully have shown that there was no abnormal amount of requests coming in.
At the bare minimum they could've used an expect("this should never happen, if it does database schema is incorrect").
The whole point of errors as values is preventing this kind of thing.... It wouldn't have stopped the outage but it would've made it easy to diagnose.
If anyone at cloudflare is here please let me in that codebase :)
Unwrap gives you a stack trace, while a returned Err doesn't, so simply using a Result for that line of code could have been even harder to diagnose.
`unwrap_or_default()` or other ways of silently eating the error would be less catastrophic immediately, but could still end up breaking the system down the line, and likely make it harder to trace the problem to the root cause.
The problem is deeper than an unwrap(), related to handling rollouts of invalid configurations, but that's not a 1-line change.
The problem is that they didn't surface a failure case, which means they couldn't handle rollouts of invalid configurations correctly.
The use of `.unwrap()` isn't superficial at all -- it hid an invariant that should have been handled above this code. The failure to correctly account for and handle those true invariants is exactly what caused this failure mode.
There needs to be something at the top level that can handle a crashing process.
Or can a unwrap be stopped?
This is just a normal Tuesday for languages with Exception and try/catch.
Assuming something similar to Sentry is in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined, clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
The issue here is about the system as a whole not any line of code.
There are plenty of resources, yet it's somehow never enough. You do tons of pretty amazing things with pretty amazing tools that also have notable shortcomings.
You're surround by smart people who do lots of great work, but you also end up in incident reviews where you find facepalm-y stuff. Sometimes you even find out it was a known corner case that was deemed too unlikely to prioritize.
The last incident for my team that I remember dealing with there ended up with my coworker and I realizing the staging environment we'd taken down hours earlier was actually the source of data for a production dashboard, so we'd lost some visibility and monitoring for a bit.
I've also worked at Facebook (pre-Meta days) and at Datadog, and I'd say it was about the same. Most things are done quite well, but so much stuff is happening that you still end up with occasional incidents that feel like they shouldn't have happened.
I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There are lots of good reasons to make a system fail hard. I'd lean towards an `expect` here, but whatever.
That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.
Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it were the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20/20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.
Also, it seems like the initial ClickHouse changes could've been tested much better, but I'm sure the CF team realizes that.
On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.
I don't use Rust, but a lot of Rust people say if it compiles it runs.
Well Rust won't save you from the usual programming mistake. Not blaming anyone at cloudflare here. I love Cloudflare and the awesome tools they put out.
end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.
there's no bad language - just occasional hiccups from us users who use those tools.
Unwrapping is a very powerful and important assertion to make in Rust whereby the programmer explicitly states that the value within will not be an error, otherwise panic. This is a contract between the author and the runtime. As you mentioned, this is a human failure, not a language failure.
Pause for a moment and think about what a C++ implementation of a globally distributed network ingress proxy service would look like - and how many memory vulnerabilities there would be… I shudder at the thought… (n.b. nginx)
This is the classic example of how, when something fails, people over-index on the failure cause while under-indexing on the quadrillions of memory accesses that went off without a single hitch thanks to the borrow checker.
I postulate that whatever the cost of this Cloudflare outage in millions or hundreds of millions of dollars, it has been paid for many times over by the savings from safe memory access.
See: https://en.wikipedia.org/wiki/Survivorship_bias
Well, no, most Rust programmers misunderstand what the guarantees are because they keep parroting this quote. Obviously the language does not protect you from logic errors, so saying "if it compiles, it works" is disingenuous, when really what they mean is "if it compiles, it's probably free of memory errors".
It's a common thing I've experienced and seen a lot of others say that the stricter the language is in what it accepts the more likely it is to be correct by the time you get it to run. It's not just a Rust thing (although I think Rust is _stricter_ and therefore this does hold true more of the time), it's something I've also experienced with C++ and Haskell.
So no, it's not a guarantee, but that quote was never about Rust's guarantees.
Even more now after this outage.
But it's a fact that "if it compiles it runs" is often associated with Rust, in HN at least. A quick Algolia search tells me that.
I mean, that's an unfalsifiable statement, not really fair. C is used to successfully launch spaceships.
Whereas we have a real Rust bug that crashed a good portion of the internet for a significant amount of time. If this was a C++ service everyone would be blaming the language, but somehow Rust evangelicals are quick to blame it on "unidiomatic Rust code".
A language that lets this easily happen is a poorly designed language. Saying you need to ban a commonly used method in all production code is broken.
Consider that the set of possible failures enabled by language design should be as small as possible.
Rust's set is small enough while also being productive. Until another breakthrough in language design as impactful as the borrow checker is invented, I don't imagine more programmers will be able to write such a large amount of safe code.
Disagree. Rust is at least giving you an "are you sure?" moment here. Calling unwrap() should be a red flag, something that a code reviewer asks you to explain; you can have a linter forbid it entirely if you like.
No language will prevent you from writing broken code if you're determined to do so, and no language is impossible to write correct code in if you make a superhuman effort. But most of life happens in the middle, and tools like Rust make a huge difference to how often a small mistake snowballs into a big one.
No one treats it like that and nearly every Rust project is filled with unwraps all over the place even in production system like Cloudflare's.
If you haven't read the Rust Book at least, which is effectively Rust 101, you should not be writing Rust professionally. It has a chapter explaining all of this.
It would be better if that were the other way round: the linter forbids it unless you ask it not to. It's never wrong to allow users to shoot themselves in the foot, but it should be explicit.
This is not a Rust problem. Someone consciously chose to NOT handle an error, possibly thinking "this will never happen". Then someone else consciously reviewed (I hope so) a PR with an unwrap() and let it slide.
Now it might be that it was tested, but then ignored or deprioritised by management...
as they say in the post, these files get generated every 5 minutes and rolled out across their fleet.
so in this case, the thing farther up the callstack is a "watch for updated files and ingest them" component.
that component, when it receives the error, can simply continue using the existing file it loaded 5 minutes earlier.
and then it can increment a Prometheus metric (or similar) representing "count of errors from attempting to load the definition file". that metric should be zero in normal conditions, so it's easy to write an alert rule to notify the appropriate team that the definitions are broken in some way.
that's not a complete solution - in particular it doesn't necessarily solve the problem of needing to scale up the fleet, because freshly-started instances won't have a "previous good" definition file loaded. but it does allow for the existing instances to fail gracefully into a degraded state.
in my experience, on a large enough system, "this could never happen, so if it does it's fine to just crash" is almost always better served by a metric for "count of how many times a thing that could never happen has happened" and a corresponding "that should happen zero times" alert rule.
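A sketch of that pattern, assuming the `prometheus` and `once_cell` crates (the crate choice is illustrative; any metrics library with a monotonic counter works the same way), with made-up names throughout:

```rust
use once_cell::sync::Lazy;
use prometheus::{register_int_counter, IntCounter};

// One counter per "this can never happen" branch, plus an alert rule that
// pages whenever it is non-zero.
static FEATURE_FILE_LOAD_ERRORS: Lazy<IntCounter> = Lazy::new(|| {
    register_int_counter!(
        "bot_mgmt_feature_file_load_errors_total",
        "Failed attempts to load the bot management feature file"
    )
    .expect("metric registration runs once at startup")
});

fn ingest(raw: &str, current: &mut Vec<String>) {
    match parse(raw) {
        Ok(features) => *current = features,
        Err(e) => {
            FEATURE_FILE_LOAD_ERRORS.inc(); // alerting fires on > 0
            eprintln!("keeping the feature file loaded 5 minutes ago: {e}");
        }
    }
}

fn parse(raw: &str) -> Result<Vec<String>, String> {
    let features: Vec<String> = raw.lines().map(str::to_owned).collect();
    if features.len() > 200 {
        Err(format!("{} features exceeds the limit", features.len()))
    } else {
        Ok(features)
    }
}
```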
Panics should be logged, and probably grouped by stack trace for things like prometheus (outside of process). That handles all sorts of panic scenarios, including kernel bugs and hardware errors, which are common at cloudflare scale.
Similarly, mitigating by having rapid restart with backoff outside the process covers far more failure scenarios with far less complexity.
One important scenario your approach misses is “the watch config file endpoint fell over”, which probably would have happened in this outage if 100% of servers went back to watching all of a sudden.
Sure, you could add an error handler for that too, and for prometheus is being slow, and an infinite other things. Or, you could just move process management and reporting out of process.
1. At startup, load the last known good config.
2. When signaled, load the new config.
3. When that passes validation, update the last-known-good pointer to the new version.
That way something like this makes the crash recoverable on the theory that stale config is better than the service staying down. One variant also recorded the last tried config version so it wouldn’t even attempt to parse the latest one until it was changed again.
For Cloudflare, it’d be tempting to have step #3 be after 5 minutes or so to catch stuff which crashes soon but not instantly.
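A rough sketch of that last-known-good pattern on disk (the paths and the validation rule are made up):

```rust
use std::fs;
use std::io;
use std::path::Path;

// Startup loads `last_known_good`; new candidates are only promoted to that
// path once they pass validation (and, per the suggestion above, optionally
// only after surviving in service for a few minutes).
fn promote_if_valid(candidate: &Path, last_known_good: &Path) -> io::Result<()> {
    let raw = fs::read_to_string(candidate)?;
    if validate(&raw) {
        fs::copy(candidate, last_known_good)?;
    } else {
        eprintln!("candidate config failed validation; keeping {last_known_good:?}");
    }
    Ok(())
}

fn validate(raw: &str) -> bool {
    // Placeholder for real checks: size limits, schema, duplicate entries, ...
    !raw.is_empty() && raw.lines().count() <= 200
}
```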
Anecdotally I can write code for several hours, deploy it to a test sandbox without review or running tests and it will run well enough to use it, without silly errors like null pointer exceptions, type mismatches, OOBs etc. That doesn't mean it's bug-free. But it doesn't immediately crash and burn either. Recently I even introduced a bug that I didn't immediately notice because careful error handling in another place recovered from it.
Do you grok what the issue was with the unwrap, though...?
Idiomatic Rust code does not use that. The fact that it's allowed in a codebase says more about the engineering practices of that particular project/module/whatever. Whoever put the `unwrap` call there had to contend with the notion that it could panic and they still chose to do it.
It's a programmer error, but Rust at least forces you to recognize "okay, I'm going to be an idiot here". There is real value in that.
The "no unwrap" rule is common in most production codebases. Chill.
could have been tight deadline, managerial pressure or just the occasional slip up.
I haven’t worked in Rust codebases, but I have never worked in a Go codebase where a `panic` in such a location would make it through code review.
Is this normal in Rust?
I can imagine that this could easily lead to less visibility into issues.
What would you propose to fix it? The fixed cost of being DDoS-proof is in the hundreds of millions of dollars.
Hell, I would be very curious to know the costs to keep HackerNews running. They probably serve more users than my current client.
People want to chase the next big thing to write it on their CV, not architect simple systems that scale. (Do they even need to scale?)
I never said serving millions of requests is more expensive. Protecting your servers is more expensive.
> Hell, I would be very curious to know the costs to keep HackerNews running. They probably serve more users than my current client.
HN uses Cloudflare. You're making my point for me. If you included the fixed costs that Cloudflare's CDN/proxy is giving to HN incredibly cheaply, then running HN at the edge with good performance (and protecting it from botnets) would costs hundreds of millions of dollars.
> People want to chase the next big thing to write it on their CV, not architect simple systems that scale. (Do they even need to scale?)
Again, attacking your own straw men here.
Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
From the ping output, I can see HN is using m5hosting.com. This is why HN was up yesterday, even though everything on CF was down.
> Writing high-throughput web applications is easier than ever. Hosting them on the open web is harder than ever.
Writing proper high-throughput applications was never easy and will never be. It is a little bit easier because we have highly optimized tools like nginx or nodejs, so we can offload the critical parts. And hosting is "harder than ever" only if you complicate the matter, which is a quite common pattern these days. I've seen people running monstrosities to serve some html & js in the name of redundancy. You'd be surprised how much a single bare-metal box (hell, even a proper VM from DigitalOcean or Vultr) can handle.
"Single" means "you only need one," not that there is only one.
Having the feature table pivoted (with 200 feature1, feature2, etc columns) meant they had to do meta queries to system.columns to get all the feature columns which made the query sensitive to permissioning changes (especially duplicate databases).
A Crowdstrike style config update that affects all nodes but obviously isn't tested in any QA or staged rollout strategy beforehand (the application panicking straight away with this new file basically proves this).
Finally an error with bot management config files should probably disable bot management vs crash the core proxy.
I'm interested here why they even decided to name Clickhouse as this error could have been caused by any other database. I can see though the replicas updating causing flip / flopping of results would have been really frustrating for incident responders.
The solution to that problem wasn't better testing of database permutations or a better staging environment (though in time we did do those things). It was (1) a watchdog system in our proxies to catch arbitrary deadlocks (which caught other stuff later), (2) segmenting our global broadcast domain for changes into regional broadcast domains so prod rollouts are implicitly staged, and (3) a process for operators to quickly restore that system to a known good state in the early stages of an outage.
(Cloudflare's responses will be different than ours, really I'm just sticking up for the idea that the changes you need don't follow obviously from the immediate facts of an outage.)
Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.
Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
However, I have a question from a release deployment process perspective. Why was this issue not detected during internal testing? I didn't find the RCA covering this aspect. Doesn't Cloudflare have an internal test stage as part of its CI/CD pipeline? Looking at the description of the issue, it should have been immediately detected in an internal staging test environment.
I get it, don’t pick languages just because they are trendy, but if any company’s use case is a perfect fit for Rust it’s cloudflare.
but Rust's type system did catch this error - and then the author decided it was fine to panic if this error happened
> You won't see Go or Java developers making such strong claims about their preferred languages.
yess no Java developer ever said that OOP will solve world hunger
The issue is that it wasn't fine to panic, thus Rust did not catch this error.
This simply means the exception handling quality of your new FL2 is non-existent and is not on par with, or logically equivalent to, FL.
I hope it was not because of AI driven efficiency gains.
> Eliminating the ability for core dumps or other error reports to overwhelm system resources
but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled the available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.
> As well as returning HTTP 5xx errors, we observed significant increases in latency of responses from our CDN during the impact period. This was due to large amounts of CPU being consumed by our debugging and observability systems, which automatically enhance uncaught errors with additional debugging information.
(Just above https://blog.cloudflare.com/18-november-2025-outage/#how-clo...)
They are escape hatches. Without those your language would never take off.
But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.
---
Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.
Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.
---
I have advocated for this before but IMO Rust at this point will benefit greatly from disallowing those unsafe APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines will break overnight. But in the interim, maybe a rustc flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky API _at compile time_ though I believe this might be a Gargantuan effort and is likely never happening (sadly).
In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.
Bubbling up the error or None does not make the program correct. Panicking may be the only reasonable thing to do.
If panicking is guaranteed because of some input mistake to the system, your failure is in testing.
I am not trashing on them, I've made such mistakes in the past, but I do expect more from them is all.
And you will not believe how many alerts I got for the "impossible" errors.
I do agree there was not too much that could have been done, yes. But they should have invested in more visibility and be more thorough. I mean, hobbyist Rust devs seem to do that better.
It was just a bit disappointing for me. As mentioned above, I'd understand and sympathise with many other mistakes but this one stung a bit.
I'm just pushing back a bit on the idea that unwrap() is unsafe - it's not, and I wouldn't even call it a foot gun. The code did what it was written to do, when it saw the input was garbage it crashed because it couldn't make sense of what to do next. That's a desirable property in reliable systems (of course monitoring that and testing it is what makes it reliable/fixable in the first place).
Using those should be done in an extremely disciplined manner. I agree that there are many legitimate uses but in the production Rust code I've seen this has rarely been the case. People just want to move on and then forget to circle back and add proper error handling. But yes, in this case that's not quite true. Still, my point that an APM alert should have been raised on the "impossible" code path before panicking, stands.
If you think about it, it's not really different from handling the bubbled-up error inside of Rust. You don't just `?` your Results and have your errors go away; they just move up the chain.
I don’t think the infrastructure has been as fully recovered as they think yet…
For something so critical, why aren't you using lints to identify and ideally deny panic inducing code. This is one of the biggest strengths of using Rust in the first place for this problem domain.
I'm pretty surprised that Cloudflare let an unwrap into prod that caused their worst outage in 6 years.
1. https://doc.rust-lang.org/std/option/enum.Option.html#recomm...
I don't know enough about Cloudflare's situation to confidently recommend anything (and I certainly don't know enough to dunk on them, unlike the many Rust experts of this thread) but if I was in their shoes, I'd be a lot less interested in eradicating `unwrap` everywhere and more in making sure than an errant `unwrap` wouldn't produce stable failure modes.
But like, the `unwrap` thing is all programmers here have to latch on to, and there's a psychological self-soothing instinct we all have to seize onto some root cause with a clear fix (or, better yet for dopaminergia, an opportunity to dunk).
A thing I really feel in threads like this is that I'd instinctively have avoided including the detail about an `unwrap` call --- I'd have worded that part more ambiguously --- knowing (because I have a pathological affinity for this community) that this is exactly how HN would react. Maybe ironically, Prince's writing is a little better for not having dodged that bullet.
It's one thing to not want to be the one to armchair it, but that doesn't mean that one has to suppress their normal and obvious reactions. You're allowed to think things even if they're kitsch, you too are human, and what's kitsch depends and changes. Applies to everyone else here by extension too.
But I do feel strongly that the expect pattern is a highly useful control and that naked unwraps almost always indicate a failure to reason about the reliability of a change. An unwrap in their core proxy system indicates a problem in their change management process (review, linting, whatever).
This reads to me more like the error type returned by append with names is not (ErrorFlags, i32) and wasn't trivially convertible into that type so someone left an unwrap in place on an "I'll fix it later" basis, but who knows.
Surely an unwrap_or_default() would have been a much better fit -- if fetching features fails, continue processing with an empty set of rules vs. stopping the world.
Wonder why these old grey beards chose to go with that.
Afaik, Go and Java are the only languages that make you pause and explicitly deal with these exceptions.
unwrap() implicitly panic-ed, right?
I suppose another way to think about it is that Result<T, E> is somewhat analogous to Java's checked exceptions - you can't get the T out unless you say what to do in the case of the E/checked exception. unwrap() in this context is equivalent to wrapping the checked exception in a RuntimeException and throwing that.
They just sell proxies, to whoever.
Why are they the only company doing ddos protection?
I just don't get it.
If that's true, is there a way to tell (easily) whether a site is using cloudflare or not?
Just ping the host and see if the ip belongs to CF.
https://www.cloudflare.com/en-ca/ips/
Companies seem to place a lot of trust is configs being pushed automatically without human review into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before pushing to production systems?
Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.
I wrote a book on feature stores for O'Reilly. The bad query they wrote in ClickHouse could also have been caused by another error - duplicate rows in materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could hit the same bug again due to a bug in feature ingestion - and given they hacked together their feature store, it is not beyond the bounds of possibility.
Reference: https://www.oreilly.com/library/view/building-machine-learni...
i’m a little confused on how this was initially confused for an attack though?
is there no internal visibility into where 5xx’s are being thrown? i’m surprised there isn’t some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that over an attack.
also a bit taken aback that .unwrap()’s are ever allowed within such an important context.
would appreciate some insight!
2. Attacks that make it through the usual defences make servers run at rates beyond their breaking point, causing all kinds of novel and unexpected errors.
Additionally, attackers try to hit endpoints/features that amplify severity of their attack by being computationally expensive, holding a lock, or trigger an error path that restarts a service — like this one.
I'm impressed they were able to corral people this quickly.
What could have prevented this failure?
Cloudflare's software could have included a check that refused to generate the feature file if its size was higher than the limit.
A testcase could have caught this.
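A sketch of what such a generator-side guard and test might look like (the shared limit constant and all names are assumptions, not Cloudflare's actual code):

```rust
const MAX_FEATURES: usize = 200; // assumed limit shared between generator and proxy

fn generate_feature_file(features: &[String]) -> Result<String, String> {
    if features.len() > MAX_FEATURES {
        return Err(format!(
            "refusing to publish {} features (limit {MAX_FEATURES})",
            features.len()
        ));
    }
    Ok(features.join("\n"))
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn duplicated_rows_are_rejected_not_published() {
        // Simulate the duplicate-row bug doubling the feature count.
        let doubled: Vec<String> = (0..2 * MAX_FEATURES).map(|i| format!("feature{i}")).collect();
        assert!(generate_feature_file(&doubled).is_err());
    }
}
```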
It sounds like the change could've been rolled out more slowly, halted when the incident started and perhaps rolled back just in case.
All that said, to have an outage reported turned around practically the same day, that is this detailed, is quite impressive. Here's to hoping they make their changes from this learning, and we don't see this exact failure mode again.
i think this is happening way too frequently
meanwhile VPS, dedicated servers hum along without any issues
i don't want to use kubernetes, but if we have to build mission-critical systems, it doesn't seem like building on cloudflare is going to cut it
So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error catching that would fallback and send logs/alerts to their internal monitoring system and this is why half of the internet went down?
There never was an unbound "select all rows from some table" without a "fetch first N rows only" or "limit N"
If you knew that this design is rigid, why not leverage the query to actually do it?
What am I missing ?
Anyway regardless of which language you use to construct a SQL query, you're not obligated to put in a max rows
Excuse me, what did you just say? Who decided on “Cloudflare's importance in the Internet ecosystem”? Some see it differently, you know; there's no need for that self-assured arrogance of an inseminating alpha male.
These folks weren't operating for charity. They were highly paid so-called professionals.
Who will be held accountable for this?
> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.
also cloudflare:
> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.
Cloudflare's status page: https://www.cloudflarestatus.com/
Cloudflare Dashboard: https://dash.cloudflare.com/
Unclear to me if it's an Atlassian-managed deployment they have, or if it's self-managed, I'm not familiar with Statuspage and their website isn't helping. Though if it's managed, I'm not sure how they can know for sure there's no interdependence. (Though I guess we could technically keep that rabbit hole going indefinitely.)
Sounds like the ops team had one hell of a day.
Even worse - the small botnet that controls everything.
Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?
Gonna use that one at $WORK.
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
> As we wrote before, we believe Blackbird Tech's dangerous new model of patent trolling — where they buy patents and then act as their own attorneys in cases — may be a violation of the rules of professional ethics.
https://blog.cloudflare.com/patent-troll-battle-update-doubl...
ChatGPT didn't invent the em dash, some people were always using it. But yeah, it's often one of the signs of AI.
And here is the query they used ** (OK, so it's not exactly):
someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.
** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less their queries (but I do have an idea).
more edits: OK, apparently it's described later in the post as a query against ClickHouse's table metadata table, and because users were granted access to an additional database that was actually the backing store of the one they normally worked with, some row-level-security type of thing doubled up the rows. Not sure why querying system.columns is part of a production-level query, though; seems overly dynamic.
Cloudflare is very cheap at these prices.
What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.
> but my use case is especially important
Take a chill pill. Probably it isn't.
> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail
A configuration error can cause internet-scale outages. What an era we live in
Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?
I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!
I think it's quite rare for any company to have exactly the same scale and size of storage in stage as in prod.
We’re like a millionth the size of cloudflare and we have automated tests for all (sort of) queries to see what would happen with 20x more data.
Mostly to catch performance regressions, but it would work to catch these issues too.
I guess that doesn’t say anything about how rare it is, because this is also the first company at which I get the time to go to such lengths.
In this case the database table in question seems modest in size (the features for ML), so naively one would think they could at least have kept stage features always in sync with prod, but it could be that they didn't consider that 55 rows vs 60 rows or similar could be a breaking point given a certain specific bug.
It is much easier to test with 20x data if you don't have the amount of data cloudflare probably handles.
Either way, you don’t need to do it on every commit, just often enough that you catch these kinds of issues before they go to prod.
Cloudflare doesn’t run in AWS. They are a cloud provider themselves and mostly run on bare metal. Where would these extra 100k physical servers come from?
I also found the "remediation and follow up" section a bit lacking, not mentioning how, in general, regressions in query results caused by DB changes could be caught in future before they get widely rolled out.
Even if a staging env didn't have a production-like volume of data to trigger the same failure mode of a bot management system crash, there's also an opportunity to detect that something has gone awry if there were tests that the queries were returning functionally equivalent results after the proposed permission change. A dummy dataset containing a single http_requests_features column would suffice to trigger the dupe results behaviour.
In theory there's a few general ways this kind of issue could be detected, e.g. someone or something doing a before/after comparison to test that the DB permission change did not regress query results for common DB queries, for changes that are expected to not cause functional changes in behaviour.
Maybe it could have been detected with an automated test suite of the form "spin up a new DB, populate it with some curated toy dataset, then run a suite of important queries we must support and check the results are still equivalent (after normalising row order etc) to known good golden outputs". This style of regression testing is brittle, burdensome to maintain and error prone when you need to make functional changes and update what the "golden" outputs are - but it can give a pretty high probability of detecting that a DB change has caused unplanned functional regressions in query output, and you can find out about this in a dev environment or CI before a proposed DB change goes anywhere near production.
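A bare-bones sketch of that style of test; `run_query` is a hypothetical stand-in for a client call against a freshly provisioned scratch database, and the query and golden output are invented:

```rust
#[cfg(test)]
mod golden {
    const FEATURE_QUERY: &str =
        "SELECT name FROM system.columns WHERE table = 'http_requests_features'";
    const GOLDEN: &[&str] = &["feature_a", "feature_b"]; // curated known-good output

    fn normalise(mut rows: Vec<String>) -> Vec<String> {
        rows.sort(); // row order is not part of the contract; duplicates still fail
        rows
    }

    #[test]
    fn feature_query_matches_golden_output() {
        let rows = run_query(FEATURE_QUERY);
        let golden: Vec<String> = GOLDEN.iter().map(|s| s.to_string()).collect();
        // A permissions change that duplicates rows makes this comparison fail
        // in CI, long before the file generator sees the bad results.
        assert_eq!(normalise(rows), normalise(golden));
    }

    // Hypothetical helper: execute against the scratch DB seeded with the toy dataset.
    fn run_query(_sql: &str) -> Vec<String> {
        vec!["feature_a".into(), "feature_b".into()]
    }
}
```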
...
(I'd pick Haskell, cause I'm having fun with it recently :P)
The last thing we need here is for more of the internet to sign up for Cloudflare.
This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.
Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.
The internet is dead.
Reputationally this is extremely embarrassing for Cloudflare, but imo they seem to be getting their feet back on the ground. I was surprised to see not just one, but two apologies to the internet. This just cements how professional and dedicated the Cloudflare team is to ensuring a stable, resilient internet, and how embarrassed they must have been.
A reputational hit for sure, but outcome is lessons learned and hopefully stronger resilience.
No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.
> The standard procedures the managers tried first failed to bring the network back up to speed and for nine hours, while engineers raced to stabilize the network, almost 50% of the calls placed through AT&T failed to go through.
> Until 11:30pm, when network loads were low enough to allow the system to stabilize, AT&T alone lost more than $60 million in unconnected calls.
> Still unknown is the amount of business lost by airline reservations systems, hotels, rental car agencies and other businesses that relied on the telephone network.
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collap...
Lots of things have the sky in common. Maybe comet-induced ice ages...
In the pre-digital era, the East India Company dwarfs every other company by considerable margins in any metric: commerce controlled, global shipping, communication traffic, private army size, % of GDP, % of workforce employed.
The default throughout history was the large, consolidated organization, like say Bell Labs, or Standard Oil before that, and so on; only for brief periods have we enjoyed the benefits of true capitalism.
[1] Although I suspect either AWS or MS/Azure recent down-times in the last couple of years are likely higher
AWS very likely has Cloudflare beat in commerce responsibility. Amazon is equal to ~2.3% of US GDP by itself.
Best post mortem I've read in a while, this thing will be studied for years.
A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as other have already pointed out, that's a very unsafe use of Rust, should've never made it to production.
Big tech is a fucking joke.
My dude, everything is a footgun if you hold it wrong enough
This is the first significant outage that has involved Rust code, and as you can see, .unwrap() is known to carry the risk of a panic and should never be used in production code.
It's fair to be upset at their decision making - use that to renegotiate your contract.
https://blog.cloudflare.com/18-november-2025-outage/#:~:text...
From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.
This is literally the CrowdStrike bug, in a CDN. This is the most basic, elementary, day 0 test you could possibly invent. Forget the other things they fucked up. Their app just crashes with a config file, and nobody evaluates it?! Not every bug is preventable, but an egregious lack of testing is preventable.
This is what a software building code (like the electrical code's UL listings that prevent your house from burning down from untested electrical components) is intended to prevent. No critical infrastructure should be legal without testing, period.