Public cloud outages are *your* problem

Every time I see news of a public cloud outage a little part of me groans on the inside because it means a slew of news articles will be written about how you can’t trust telco workloads to the public cloud.

For example*:

Reading through articles like this, I invariably find myself shaking my head at the lack of understanding about the public cloud. While the authors like to write about “the public cloud” being “down,” they fail to make the key distinction that it’s not the *entire* public cloud (or even all of the services) suffering from loss of service. Readers need to understand that during “public cloud outages” other public cloud regions are still up and running like nothing’s happened. So it’s time to set the record straight about outages in the public cloud.

The bad old days

In the early days of the public cloud (I’m talking 2006–2012), there were significant outages, affecting the services of global companies like Netflix and Apple. Remember the 2011 Amazon Elastic Compute Cloud (EC2) outage? Or the day that Google Talk, Twitter, and Azure all went down at once? Back then, I would have agreed that the public cloud wasn’t ready for telco workloads. But since then, the public cloud has become ready for carrier-grade workloads. So, what’s changed?

In short: the hyperscalers. They have been working hard and spending BIG BUCKS to improve the resiliency options they offer to customers. Because they serve a huge variety of customers with a wide variety of workload needs, they allow each customer to configure resiliency according to their workload’s needs. Which means resiliency is now YOUR problem.

Shared responsibility model for resiliency

The hyperscalers employ what they call a shared responsibility model for resiliency. For example, AWS commits to the resiliency of its infrastructure (the hardware, software, networking, and facilities that run the services), and it makes commercially reasonable efforts to ensure these services meet or exceed its contractual service level agreements (SLAs). But everything else becomes the responsibility of the customer. With a service such as Amazon EC2, for instance, the customer has to perform all of the necessary resiliency configuration and management tasks. Customers, not AWS, are responsible for managing the resiliency of their data, including backup, versioning, and replication strategies. Azure has a similar approach, as does Google Cloud.

Net net: you are in control of your resiliency strategy. You decide if your workloads need to run across multiple Availability Zones (AZs) in a single region as part of a high availability strategy, or not. You can design a multi-AZ architecture if you need to protect workloads from issues like power outages, lightning strikes, tornadoes, earthquakes, or other disasters. Depending on the workload criticality, you can use more of the resiliency options, or fewer. The benefit of this approach is that you have all of this built-in and ready to use at a moment’s notice as a service of the public cloud. Use more options and your workload will be more resilient, but you’ll spend more. On the other hand, if you don’t need it, you don’t pay for it. As they say: Your Mileage May Vary (YMMV).

Your DR strategy

Figuring out your approach to each workload will be key to your move and your costs in the cloud:

  • The most cost-effective and simplest way to run your workloads is in one availability zone in one region, but this approach leaves you the most exposed. This is a good tactic for when you are experimenting with the public cloud or with non-critical workloads that can tolerate some downtime.
  • You’ll pay more to run on multiple availability zones in a single region, but it will give you good protection and it is a great approach for most workloads. With AZs separated by up to 100 kilometers (60 miles), this path is decently resilient, making it a great middle ground.
  • It’s more expensive (as well as operationally complex) to run workloads in multiple regions with the same hyperscaler, so do that only when absolutely necessary for mission-critical systems that really can’t be down for more than a few minutes.
  • Lastly, some people suggest going multi-region across multiple hyperscalers (say, both AWS and Azure for the same workload). I say no. This approach requires your team to learn two different hyperscalers, making it incredibly operationally complex and super duper expensive to do, plus it doesn’t give you that much more coverage assurance than the previous approach. Not worth it. I’d do the previous approach first, and if you get that right, then try this approach if you must.

Table 1: Cost vs. Difficulty of DR Strategies

Disaster Recovery (DR) Strategy | Cost | Difficulty level
One availability zone, one region | 💰 | 🍰
Multiple availability zones, one region | 💰💰 | 🛠️
Multiple regions | 💰💰💰💰 | 🛠️ 🛠️
Multiple hyperscalers | 🏦 | 😩
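
To put rough numbers on the trade-off between the strategies above, here is a minimal back-of-the-envelope sketch in Python. It assumes each location (an AZ or a region) fails independently and that failover is instant, which real deployments won't fully achieve, so treat the results as an optimistic upper bound. The 99.5% per-location figure and the function names are illustrative assumptions, not a published SLA or anything from a hyperscaler's tooling.

```python
# Back-of-the-envelope availability math for running a workload in
# one or more locations (AZs or regions).
# ASSUMPTIONS (not from any provider): locations fail independently,
# failover is instant, and the per-location availability is made up.

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(per_location: float, locations: int) -> float:
    """Availability of a workload that stays up as long as at least
    one of `locations` independent locations is up."""
    if not 0.0 <= per_location <= 1.0:
        raise ValueError("per_location must be between 0 and 1")
    if locations < 1:
        raise ValueError("locations must be at least 1")
    # The workload is down only when every location is down at once.
    return 1.0 - (1.0 - per_location) ** locations


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    assumed_per_location = 0.995  # hypothetical figure for illustration
    for n in (1, 2, 3):
        a = composite_availability(assumed_per_location, n)
        print(f"{n} location(s): {a:.6%} available, "
              f"about {downtime_minutes_per_year(a):.1f} minutes down per year")
```

Under these assumptions each extra location shrinks the expected downtime sharply, which is why a second AZ buys so much more than a second hyperscaler. Correlated failures, such as a region-wide control plane issue or your own bad deployment hitting every location at once, break the independence assumption and eat into that gain.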

Don’t blame the cloud vendors

The ability to avoid outages is available to public cloud customers. Next time you read a write-up about an internet service going down that blames the public cloud provider, remind yourself that what it really means is that the affected customers decided not to spend money on resiliency, or had an operational snafu and didn’t set things up right. But don’t blame the cloud vendors. They’ve invested hundreds of billions of dollars building out 69 regions in 39 countries to give you the power to avoid outages.

So, if you’re putting workloads on the public cloud (as you should be), then avoiding outages is your problem. Luckily for you, public cloud technology continues to become more and more resilient, and outages are becoming less significant. This shit is more than battle-tested, so use it to your advantage and GO CLOUD!

* Where are the articles about the on-premises outages at Rogers and KDDI, where their whole systems went down for most of a day to several days? (Service restoration varied for Rogers customers from 19 hours to several days. KDDI’s outage lasted two days.) Had those services been in the public cloud, the telcos might have been able to manage a failover to another availability zone or region in seconds. Tell me again: which approach is riskier?
