Microsoft to reduce Azure outages with Project Tardigrade

Reliability is a critical metric for cloud vendors, particularly as competition for customers converges around AWS, Microsoft and Google.

One of the ways Microsoft is working to reduce Azure outages is Project Tardigrade, which can preserve virtual machines if their underlying host platform fails.

“Hardware has spontaneous failures that are transient in nature,” said Azure CTO Mark Russinovich in a talk at the Build conference in Seattle. “We also have issues with the software that runs on the server, where it might be a memory leak or a corruption [that] shows up because of a hardware issue, and we’ve got a crash of the operating system or hypervisor.”

Project Tardigrade emerged from Microsoft Research, and was first presented in a technical paper in 2015. It is named after tardigrades, micro-animals also referred to as water bears, which are known for their ability to survive in the most extreme conditions — even the vacuum of outer space and volcanoes.

“We want our servers to be like [the] tardigrade,” Russinovich said. “Today if the host platform goes down, VMs die and need to be rebooted. Tardigrade freezes the VMs in RAM, their states are preserved and then the OS reboots underneath them.”

Russinovich didn’t specify when Project Tardigrade will become a production element within Azure, but he conducted a live demo of it onstage at Build. He touched only briefly on the effort, but the Tardigrade technical paper provides more details on its implementation.

“To achieve this efficiently, we use lightweight virtual machine replication,” the authors wrote. “A lightweight virtual machine is a process sandboxed so that its external dependencies are completely encapsulated, enabling it to be migrated across machines.”
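The replication idea the authors describe can be illustrated with a toy sketch. This is not the paper's protocol or anything Azure runs: the `LightweightVM` and `Replicator` classes and their methods are hypothetical names invented here, and the "machines" are simulated inside one Python process. The sketch only shows why fully encapsulated state makes a process easy to checkpoint on one machine and resume on another.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class LightweightVM:
    """Hypothetical stand-in for a sandboxed process with fully encapsulated state."""
    state: dict = field(default_factory=dict)

    def apply(self, key, value):
        # All mutable state lives in one dictionary, so nothing external is needed to resume.
        self.state[key] = value

    def checkpoint(self):
        # Deep copy so later changes on the primary do not leak into the snapshot.
        return copy.deepcopy(self.state)

    @classmethod
    def restore(cls, snapshot):
        return cls(state=copy.deepcopy(snapshot))


class Replicator:
    """Keeps the latest checkpoint of a primary on a simulated backup machine."""

    def __init__(self, primary):
        self.primary = primary
        self.backup_snapshot = primary.checkpoint()

    def sync(self):
        # Assumption: the whole state is shipped each time; a real system would be smarter.
        self.backup_snapshot = self.primary.checkpoint()

    def fail_over(self):
        # The primary is assumed lost; resume from the last replicated checkpoint.
        self.primary = LightweightVM.restore(self.backup_snapshot)
        return self.primary


vm = LightweightVM()
replicator = Replicator(vm)
vm.apply("requests_served", 10)
replicator.sync()
vm.apply("requests_served", 11)  # changed after the last sync, so not replicated
recovered = replicator.fail_over()
print(recovered.state)  # {'requests_served': 10}
```

In this toy, any update made after the last `sync()` is lost on failover. How a production system closes that gap is outside what the article describes.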

Cloud outages a multifaceted challenge

While Project Tardigrade focuses on isolated hardware failures, this is just one aspect of reliability in the cloud. Last week, a botched DNS migration caused interruptions in services such as Office 365 and Xbox Live for several hours.

The problem resulted from the combination of two separate errors and would not have occurred if only one had happened, Microsoft said in a root-cause analysis statement. One of the errors was made by Microsoft engineers and the other by “an artifact of automation from prior maintenance,” the company said.

One of the more serious Azure outages occurred in September due to severe weather, which caused a power surge that shut down hardware systems.

While Azure has many regions around the world, Microsoft lags behind AWS and Google with respect to availability zones, which provide separate and redundant physical facilities within a region for higher reliability. More availability zones might have prevented or lessened Azure’s September outage.

While Microsoft is trying to catch up on Azure’s availability zones, human error such as that behind last week’s service disruptions may be unavoidable from time to time.

“This is still one of the biggest buckets of root causes for major cloud errors,” said Deepak Mohan, an analyst at IDC. “This is also why automation of operations is such a huge focus at most of the cloud leaders, both in terms of internal operations and customers’ oversight of their own resources.”

Cloud providers are a long way from completely addressing the question of human error, but investment in automation and autonomy of deployments, which Project Tardigrade addresses, is a priority across providers such as Microsoft, Mohan added.
