Definitely been seeing a handful of 50x errors this morning. Fortunately seems like a partial outage but definitely annoying (and can sometimes indicate worse trouble coming)
Yes. Many times. Kubernetes upgrade during maintenance schedule borks up entire cluster, yet everything is green on status page. Support case under enterprise support plan took almost 6 hours to get it resolved.
"Rarity" is a distinction without merit in this particular case; the important thing to note is that (most) clouds don't guarantee _any_ availability of a single zone. A system which stashes all of its infrastructure in one zone only is expected to be impacted by issues with that cloud, while a multi-zone setup spanning a region is generally "soft-guaranteed" to be resilient to normal operations / failures.
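To put rough numbers on the single-zone vs multi-zone point, here is a back-of-the-envelope sketch. It assumes zone failures are independent (they often aren't, e.g. a bad config push can hit every zone at once), and the 99% per-zone figure is a made-up illustration, not any provider's published number.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` zones is up.

    Assumes zone failures are statistically independent, which is an
    idealization; correlated failures make the real number worse.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    return 1.0 - (1.0 - zone_availability) ** zones


if __name__ == "__main__":
    # Hypothetical per-zone availability, for illustration only.
    per_zone = 0.99
    for n in (1, 2, 3):
        print(f"{n} zone(s): {combined_availability(per_zone, n):.6f}")
```

Under those assumptions, going from one zone to two is where most of the gain comes from; each further zone adds less.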
I worked at PagerDuty, so definitely not selling availability theater. We did multi-cloud / multi-region for many years, and the story is not so simple. Development is all about trade-offs, and deciding what risk you are OK with. Multi-cloud provided a relatively small amount of value (given how incredibly unlikely whole-cloud outages are, even full-region outages are quite rare) at the expense of 2x implementation overhead, 2x exposure to random cloud-specific operational events, and the need to develop for the common denominator of functionality, which leaves out a LOT of interesting cloud offerings. In the end, it ended up just not being worth it, and moving to the single-cloud multi-region config provided enough reliability even for the company where reliability is the primary differentiator.
In my current job as a technical due diligence advisor, I frequently recommend multi-AZ setup but specifically not multi-region, because the former is easy and worthwhile while the latter carries a lot more operational overhead (you become much more sensitive to various latencies and network jitters) and you now need to think about things like synchronous vs async replication, etc. Much better to focus dev effort on the product, rather than eke out an additional .001% of availability (unless availability is a super critical component).
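For a sense of scale on that ".001%", a quick sketch converting availability percentages into minutes of downtime per year. It assumes a 365-day year; the function names are just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime per year at a given availability percentage."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


def minutes_gained(extra_percent: float) -> float:
    """Minutes of uptime per year bought by `extra_percent` more availability."""
    return extra_percent / 100.0 * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} min/year down")
    print(f"+0.001% -> {minutes_gained(0.001):.2f} min/year gained")
```

So an additional 0.001% works out to roughly five minutes a year, which is the number to weigh against the extra engineering effort.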
I've maintained a large multi-cloud architecture in the past. The problem is they really hit you hard on egress costs. Of course the motivation is obvious: they want to keep you locked in. I did like that it gave us stronger leverage in contract renewals, but that was about it. The IaC was much more complicated and required more people/areas of knowledge. So it's definitely a tradeoff.
You are correct that it's "better" though if your goal is to have as many 9's of uptime as possible.
I currently have the strong opinion that for many mid-sized orgs with 250+ engineers it can be more resilient to go back to bare metal, or at least VM-only, in two or three local data centers. Yes, you need to know that they do their job well. But it will probably also reduce a lot of devops overhead...
There are multiple companies that help you with that by running tunnels via Direct Interconnect (Direct Connect in AWS), so that you "only" pay 2c/GB for data egressing from your VPC via this tunnel.
Yes, Direct Connect I have quite a bit of experience with. The costs add up in weird ways. If you want to spend on it though, multi-cloud is extremely resilient, and my preferred architecture if money and talent are no object.
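To make the "costs add up" point concrete, a tiny sketch of the egress arithmetic. The 2c/GB rate is the figure quoted above; the monthly volume is a made-up example, and port fees, partner fees, and tiered pricing are deliberately left out.

```python
def monthly_egress_cost(gb_per_month: float, rate_per_gb: float) -> float:
    """Flat-rate egress cost in dollars; ignores port fees and pricing tiers."""
    if gb_per_month < 0 or rate_per_gb < 0:
        raise ValueError("volume and rate must be non-negative")
    return gb_per_month * rate_per_gb


if __name__ == "__main__":
    # Hypothetical cross-cloud replication volume: 50 TB/month (decimal TB).
    volume_gb = 50_000
    cost = monthly_egress_cost(volume_gb, 0.02)
    print(f"{volume_gb} GB at $0.02/GB = ${cost:,.2f}/month")
```

Even at the discounted rate, anything that replicates data continuously between clouds turns into a recurring line item that scales with traffic.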
Probably because it's hard to form long-term memories when you're sleep-deprived :/
Seems to be some hardware problem at least in us-east1
https://status.cloud.google.com/incidents/8cY8jdUpEGGbsSMSQk...
edit: Never mind, it's down for me now as well.
- SSO issues;
- Google workspace tools not loading;
current time: 2025-07-18T15:35:43+00:00 12h35 GMT-3
Anyone who says otherwise is selling availability theater
Too many whole-cloud outages due to a bad config in the last 2 months (GCP x2, cloudflare x2)
Whole-cloud outages are pretty damn rare. The recent GCP issues are an exception to the general rule.
I’d posit that the complexity of a multi-cloud setup is generally going to reduce your service’s reliability more than relying on a single cloud does.
Really?
AWS (EC2) does: https://aws.amazon.com/compute/sla/?did=sla_card&trk=sla_car... so does GCP (GCE): https://cloud.google.com/compute/sla?hl=en and so does OVH: https://us.ovhcloud.com/legal/sla/public-cloud/
Are none of those three part of "most clouds"? What cloud platform do you use?
B2B customers don’t care if the other sites are also down, your SLA is affected with them, and they will want compensation.