How to survive a cloud outage
Amazon’s infamous cloud outage in April brought down a number of popular Web sites, including foursquare and Reddit – but many of Amazon’s enterprise cloud customers were able to weather the storm without experiencing downtime.
They architected their systems for resiliency by using multiple availability zones, having hot backups in traditional data centers, or having a backup cloud provider set up and ready to go in case of a problem.
Silicon Valley-based photosharing company SmugMug stayed up through the outage even as its peers failed. That was partly because it avoided the use of Amazon’s Elastic Block Storage – the particular service component that went down.
But the company also spread its systems across several Amazon data centers – what Amazon calls “availability zones.”
Other companies would have stayed up as well if they had distributed their applications, says SmugMug CEO Don MacAskill. He also recommends using multiple Amazon regions, which are more isolated from one another than availability zones. Amazon does charge extra for using multiple zones, however, so that cost needs to be taken into account.
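The distribution MacAskill describes can be sketched in a few lines. Here is a minimal illustration in Python; the zone names and the placement helper are hypothetical stand-ins, not a real AWS API call:

```python
import itertools
from collections import Counter

# Hypothetical zone names, for illustration only; a real deployment
# would query the provider for the zones available in a region.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def place_instances(count, zones):
    """Round-robin placement: losing any single zone takes out
    only about 1/len(zones) of total capacity."""
    cycle = itertools.cycle(zones)
    return [next(cycle) for _ in range(count)]

placement = place_instances(7, ZONES)
counts = Counter(placement)
# Each zone ends up with either 2 or 3 of the 7 instances,
# so a single-zone failure costs at most 3/7 of capacity.
```

The same idea extends to regions: keep independent copies of the placement in each region so that even a region-wide failure leaves the others serving traffic.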
SmugMug relies heavily on Amazon, using its cloud-based Simple Storage Service (S3) to store customer photos and videos. SmugMug also uses many instances of the Elastic Compute Cloud (EC2). But instead of using Elastic Block Storage – which is attached to individual EC2 instances, and often used to store operational data – the company still uses traditional data centers.
That has its own downsides – the week of Amazon’s outage, for example, the company lost a core router, its backup, and a core master database server. “I wish I didn’t have to deal with routers or database hardware failures anymore, which is why we’re still marching towards the cloud,” MacAskill says.
And, despite the outage, the cloud-based services that he gets from Amazon are still better than what SmugMug could have on its own, he adds, and better than other cloud service providers. “We’re very committed to them,” he says.
Israel-based startup Kitely Ltd. used only one of Amazon’s availability zones – but, fortunately, not the one that went down.
However, the company plans to learn from the experience. “We intend to split all of our services across multiple availability zones,” Kitely CTO Oren Hurvitz says.
Kitely, which runs cloud-based virtual meeting and collaboration environments based on the OpenSim platform, also performs continuous checks to ensure that all of its services are up and running.
“Our system is designed with the assumption that any service might stop working at any time,” he says. “If we discover that a server is not responding then we terminate it and start a new server instead.”
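The pattern Hurvitz describes, assume anything can fail, detect it, and replace it, looks roughly like the sketch below. The `Server` class and `reconcile` helper are invented for illustration; a real system would call the cloud provider's API to terminate and launch instances, and the health check would probe over the network:

```python
class Server:
    """Toy stand-in for a cloud instance; real code would wrap
    the provider's terminate/launch API calls."""
    def __init__(self, server_id):
        self.server_id = server_id
        self.healthy = True

    def responds(self):
        # A real health check would hit the instance over the network.
        return self.healthy

def reconcile(fleet, next_id):
    """One pass of the continuous check: drop servers that do not
    respond and launch replacements to keep the fleet at full size."""
    target_size = len(fleet)
    survivors = [s for s in fleet if s.responds()]
    while len(survivors) < target_size:
        survivors.append(Server(next_id))  # launch a replacement
        next_id += 1
    return survivors, next_id

fleet = [Server(i) for i in range(4)]
fleet[2].healthy = False                   # simulate an unresponsive server
fleet, next_id = reconcile(fleet, next_id=4)
# The fleet is back to 4 responsive servers; the dead one was replaced.
```

Running this pass on a timer, rather than waiting for an operator, is what keeps the design honest about the "any service might stop working at any time" assumption.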
Another company unaffected by the outage because it used multiple availability zones was Mashery, which manages APIs for more than 100 companies, including Best Buy, Hoovers and The New York Times. But Mashery also has another backup plan – a traditional data center.
“We very early on realized that there could be a service problem where Amazon would be entirely unavailable, and we decided that we needed fail-over infrastructure,” Mashery CEO Oren Michels says. “We have dedicated hardware with Internap.”
Atlanta-based Internap Network Services Corp. provides not only a hot backup site for Mashery but also a production environment for customers that need lower latency than possible with a cloud, or services delivered in geographic areas where Amazon is not available.
“We maintain plenty of infrastructure on both sides to handle peak load,” he says.
When Mashery was first building its cloud infrastructure two years ago, Amazon was the only real player in town. Backing up to another cloud was not an option back then – but it might be possible now.
“We’re definitely keeping our eye on it,” he says. “But if it ain’t broke, don’t fix it. Amazon has worked amazingly well for us. Likewise, Internap has been a great partner and continues to provide us the services we need.”
Internap has even lowered its prices to stay competitive, he adds, though price isn’t the major factor in his decision-making.
“We have a hundred huge brands as customers,” he says. “It’s more expensive to lose customers in case their stuff goes down. Our customers pay us to solve their API problems, and that includes that we stay up if there’s an outage.”
Companies that are just making the transition to the cloud often use traditional data centers as backups at the start of the process, says Rob Enderle, an analyst at research firm Enderle Group.
“You can have a set of lesser resources that are on stand-by that you can failover to,” he says. “Often, that’s whatever you had before you moved to the cloud. You can fail-over to a lower-performing technology and still hold your customers.”
Companies that have some applications running in a traditional data center and some running in the cloud may be able to double up, he says, and use the same disaster recovery site for both, since the odds are low that Amazon would go down at the same exact time as the traditional data center.
But he warned against trusting one set of cloud services as a backup for another set running on the same cloud.
“A redundant service might use some of the same resources as the primary service,” he says. “Care should be taken to ensure that redundancies are, in fact, redundant and not simply a different name for overlapping hardware and software.”
Secondary cloud providers
Using a cloud service provider as a backup for a traditional data center is typically more cost-effective than the other way around.
That’s because with a cloud service provider, you pay for computing cycles. When the backup is not in use, customers need only keep the minimum computing power running to enable a quick switch-over, then add server capacity as needed.
With a traditional data center, enough servers have to be available to handle peak workload, even if they are rarely used. That translates to hardware costs as well as power and staffing requirements – typically a traditional backup center would double total computing costs, while a cloud backup would only add a fraction.
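The cost difference is easy to see with back-of-the-envelope numbers. The figures below are invented for illustration, not drawn from any of the companies quoted:

```python
# Invented illustrative figures; not from the article's sources.
primary_monthly = 40_000   # monthly cost of the primary production site

# A traditional hot-backup site must mirror peak capacity,
# so it roughly doubles total spend.
traditional_total = primary_monthly * 2

# A cloud standby keeps only a small warm footprint running
# (here, 10% of capacity) and scales up during an actual failover.
warm_fraction = 0.10
cloud_total = primary_monthly * (1 + warm_fraction)
```

Under these assumptions the cloud standby adds 10% to the monthly bill instead of 100% – the "fraction" versus "double" gap described above.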
For example, Web-based disk encryption vendor AlertBoot, headquartered in Las Vegas, used to pay $50,000 a month just for electricity, AlertBoot CEO Tim Maliyil says.
“We had two physical data centers at one point — and you can’t believe how happy we were to shut it down,” he says. “Now, two clouds, bandwidth and hosting is $16,000 a month. There was so much waste of electricity and capacity. The cloud really minimized our costs and ongoing expenses.”
Transitioning to cloud providers wasn’t difficult, because AlertBoot was already using virtualization software from VMware in its traditional data center, he says. The two cloud providers the company picked are SunGard and OpSource, both of which use VMware technology as well. (Systems integrator Dimension Data announced recently that it plans to acquire OpSource.)
Switching from one cloud provider to another now takes just a minute or two, he says, and the backup cloud can ramp up quickly to handle increased load. The switch-over itself is handled by a service from Zeus Technology, a U.K. vendor that helps companies move applications from one cloud to another.
Maliyil says his company selected these vendors because they are known for enterprise-level reliability. “For the kind of business we’re in, and our customers’ [lack of] tolerance for failure, we’ve steered away from the Amazon infrastructure,” he says.
Another vendor that helps companies manage services running on multiple clouds is rPath, which has more than 90 corporate customers, mostly large enterprises and ISPs, including ADM, Fujitsu, Qualcomm and EMC.
The company currently deploys to 16 types of image formats, which are snapshots of applications that run in cloud environments. Adding another cloud to the list typically takes less than a week, says Jake Sorofman, rPath’s chief marketing officer. “It’s fairly trivial for us.”
The company currently supports Amazon EC2, VMware, Citrix Xen, Microsoft Hyper-V, Rackspace and several other formats. Once an application is in the rPath system, it takes as little as 15 minutes to generate a new image and deploy it to a new cloud, he says.
However, architecting an application for the rPath system in the first place can take a little longer. “The process of packaging a new application for our platform could take from a couple of hours to a couple of days, depending on its complexity,” he says. “But we have a professional services team that does that work for customers if they choose.”
“There’s a fairly extensive list of complete stacks that have already been modeled using our technology, and can be leveraged,” he says.
And having the option to move applications between clouds does more than just provide backup options for companies, he says – it also allows companies to get the best possible deals from their providers.
“There is an arbitrage opportunity that comes with having choice,” he says. “Being able to optimize where workloads are running based on performance, policy and price. And, to the extent that you can easily move a workload between Amazon, Rackspace or other environments, you have leverage over your service providers because you have eliminated lock-in.”