Watching Hurricane Sandy from a safe distance reminded me that our online culture is vulnerable to a long list of threats: floods, windstorms, wildfires, and cyber-attacks can all turn our wonderful handhelds, laptops, and desktops into useless bricks of plastic and steel. Disasters are inevitable, but planning and preparedness can reduce or avoid the damage.
In IT, disaster planning and response are called “failover” and “disaster recovery.” When a system fails over, alternate resources automatically replace compromised resources. Backup generators that take over when the power grid fails are a familiar form of failover. Disaster recovery is the execution of plans to restore service quickly when a disaster occurs. Laptop and desktop users exercise a form of disaster recovery when they restore data from backups after a hard drive fails.
The cloud presents both new opportunities for failover and disaster recovery and new challenges when disaster strikes. Enterprises today have plans to fail over to a cloud if their data center is compromised. Many individuals already benefit from cloud backup services that automatically back up their disks and restore them in a few clicks.
But what happens when the cloud is the compromised resource? This can happen. Hurricane Sandy brought down a number of data centers in its path. Popular websites like Huffington Post and Gawker went offline when their providers failed. Given that New York is a center for Internet content, it surprised me that more sites were not affected.
Sandy was bad, but it could be worse if a massive cloud provider like AWS (Amazon Web Services) or Google suddenly went out of service. The big providers take pains to reduce their vulnerability. Their services are distributed; they avoid single points of failure by locating data centers all over the world. A disaster like Sandy might take down one center, but other centers in distant locations can take over the load. In addition, part of data center planning is to locate facilities where disasters are less likely, away from flood zones and close to power supplies.
Enterprise Disaster Preparations
Despite providers' best efforts, sometimes services are degraded or unavailable, and enterprises must cope with the problem themselves.
One way enterprises can protect themselves is to use the Availability Zones (AZs) offered by cloud providers like AWS and Google. AZs are blocks of cloud resources that are unlikely to fail simultaneously. Exactly how AZs are implemented depends on the provider. AWS also offers Elastic Load Balancing by Region, with each Region representing a geographic area like the Western United States or Ireland. By judiciously distributing cloud deployments across different Regions and AZs, consumers can exercise some control over their vulnerability to provider failure.
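To make the idea of spreading a deployment across zones concrete, here is a minimal sketch. It assumes a simple round-robin placement policy, and the zone labels and function names are hypothetical illustrations, not identifiers from AWS, Google, or any other provider.

```python
from collections import Counter
from itertools import cycle


def place_replicas(replica_count, zones):
    """Assign replicas to zones round-robin so they are spread as evenly as possible."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(replica_count)]


def surviving_replicas(placement, failed_zone):
    """Count the replicas still running if one zone fails entirely."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone labels, not real provider identifiers.
zones = ["region-a/zone-1", "region-a/zone-2", "region-b/zone-1"]
placement = place_replicas(6, zones)
print(Counter(placement))
print(surviving_replicas(placement, "region-a/zone-1"))
```

With six replicas over three zones, losing any single zone still leaves four replicas running. A real placement policy would also weigh cost, latency, and data locality, which this sketch ignores.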
Case Example: Netflix
However, it takes more than careful deployment for consumers of cloud services to avoid service interruptions.
Netflix is a good example. It depends on AWS to provide its popular on-demand streaming media service to its customers. If AWS is not available, Netflix customers get error messages instead of movies, and Netflix loses both immediate revenue for the unavailable content and future revenue from customers who can easily switch to a competitor when Netflix does not deliver. No surprise that Netflix has invested in failover and disaster recovery to minimize these consequences.
Netflix has thus far successfully minimized the effects of AWS outages with automated tools. One technique it uses is “zone evacuation,” in which it rapidly moves services from one zone to another. Using this technique and others, Netflix was able to dodge the bullet during the AWS service disruption on October 22. Fortunately, AWS stayed up through Sandy, but Netflix had prepared for the worst.
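The general shape of a zone evacuation can be sketched in a few lines. This is not Netflix's tooling, whose details are not described here; it is a toy model that assumes a deployment can be represented as a mapping from zone names to the services running in each, and that displaced services should go to the least-loaded healthy zone. All names are hypothetical.

```python
def evacuate_zone(deployment, failed_zone):
    """Return a new deployment with failed_zone's services moved to other zones.

    deployment maps a zone name to the list of services running there.
    Each displaced service goes to whichever remaining zone is least loaded.
    """
    # Copy the healthy zones so the caller's deployment is left unchanged.
    targets = {zone: list(services)
               for zone, services in deployment.items() if zone != failed_zone}
    if not targets:
        raise RuntimeError("no healthy zone left to evacuate into")
    for service in deployment.get(failed_zone, []):
        least_loaded = min(targets, key=lambda zone: len(targets[zone]))
        targets[least_loaded].append(service)
    # Record the evacuated zone as empty rather than dropping it.
    targets[failed_zone] = []
    return targets


# Hypothetical services and zones for illustration only.
before = {"zone-1": ["api", "search"], "zone-2": ["billing"], "zone-3": []}
after = evacuate_zone(before, "zone-1")
print(after)
```

In practice the hard parts are the ones this sketch leaves out: detecting the failure, redirecting traffic, and making sure the target zones have the capacity and data to take over.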
Standards in the Mix
Of course, zone evacuation will not help when an entire provider fails. This is unlikely, but I can think of at least two situations that might cause a provider-wide disaster. First, we have heard about cyber warfare lately. A single cloud provider with tightly connected, similar operating software in all its data centers is vulnerable to an attack that could bring down an entire system in a short time. Second, legal action or fiat could abruptly shut down a provider. I admit that both of these approach the limits of plausibility, but before last week, few imagined that a windstorm could fill New York subways with seawater.
Is there a response to the total failure of a cloud provider? Can providers be evacuated the way Netflix evacuates zones? The answer is yes. With interoperable standard management interfaces, such as CIMI (Cloud Infrastructure Management Interface) or OCCI (Open Cloud Computing Interface), cloud consumers like Netflix can consider automated movement of deployments from one cloud provider to another using technology similar to moving them from one AZ to another. As standard management interfaces are adopted, computing will progress toward a rock-solid universal cloud infrastructure that will become the foundation of the next revolution in information technology.
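The value of a common management interface is easiest to see in code. The sketch below is an assumption-laden illustration: the CloudProvider class and its methods are hypothetical and do not reproduce the CIMI or OCCI APIs, and the in-memory provider merely stands in for real clouds so the example can run on its own.

```python
from abc import ABC, abstractmethod


class CloudProvider(ABC):
    """Hypothetical provider-neutral management interface (not CIMI or OCCI itself)."""

    @abstractmethod
    def list_machines(self): ...

    @abstractmethod
    def create_machine(self, name, template): ...

    @abstractmethod
    def delete_machine(self, name): ...


class InMemoryProvider(CloudProvider):
    """Stand-in for a real provider so the sketch can run without network access."""

    def __init__(self, label):
        self.label = label
        self.machines = {}

    def list_machines(self):
        return dict(self.machines)

    def create_machine(self, name, template):
        self.machines[name] = template

    def delete_machine(self, name):
        self.machines.pop(name, None)


def evacuate_provider(source, target):
    """Recreate every machine from source on target, then release the source copy."""
    moved = []
    for name, template in source.list_machines().items():
        target.create_machine(name, template)
        source.delete_machine(name)
        moved.append(name)
    return moved


# Hypothetical providers and machine templates for illustration only.
primary = InMemoryProvider("primary")
standby = InMemoryProvider("standby")
primary.create_machine("web-1", {"cpu": 2, "memory_gb": 4})
print(evacuate_provider(primary, standby))
print(standby.list_machines())
```

Because evacuate_provider only talks to the shared interface, the same code works for any pair of providers that implement it, which is the point of a standard. If the source provider has failed completely, the machine descriptions would have to come from a copy kept elsewhere, and real migrations must also move data and network configuration.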
For a more detailed description of CIMI and OCCI, see my book Cloud Standards.