by Tim Greene

Executive Editor

Facebook outage was a series of unfortunate events

News Analysis

Oct 05, 2021 | 5 mins

Data Center, Enterprise Storage, Network Management Software

A badly written command, a buggy audit tool, a DNS system that hobbled efforts to restore the network, and tight data-center security all contributed to Facebook’s seven-hour Dumpster fire.

Credit: Ben Watts

Facebook says the root cause of its outage Monday was a routine maintenance job gone awry, which first crashed the entire Facebook backbone network and then rendered its DNS servers unavailable.

To make matters worse, the loss of DNS made it impossible for Facebook engineers to remotely access the devices they needed to reach in order to bring the network back up, so they had to go into the data centers to manually restart systems.

That slowed things down, and the data centers’ own safeguards, which make tampering hard for anybody, slowed them down even more. “They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” according to a Facebook blog post written by Santosh Janardhan, the company’s vice president of engineering and infrastructure.

It took time, but once the systems were restored, the network came back up.

Restoring the customer-facing services that run over the network was another lengthy process because turning them up all at once could cause another round of crashes. “Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” Janardhan wrote.
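The staged approach Janardhan describes can be illustrated with a minimal Python sketch. It is not Facebook's procedure: the function names, the step count, and the health check are all hypothetical, and the only idea it borrows from the blog is that load is brought back in increments rather than all at once.

```python
from typing import Callable


def ramp_schedule(target_mw: float, steps: int) -> list[float]:
    """Split a full restore into equal cumulative load levels (in MW)."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [target_mw * i / steps for i in range(1, steps + 1)]


def staged_restore(
    target_mw: float,
    steps: int,
    is_healthy: Callable[[float], bool],
) -> float:
    """Raise load one level at a time; stop at the last level that stayed healthy.

    `is_healthy` is a hypothetical probe that reports whether electrical
    systems and caches tolerate the given load level.
    """
    restored = 0.0
    for level in ramp_schedule(target_mw, steps):
        if not is_healthy(level):
            break  # hold here instead of pushing the full load back at once
        restored = level
    return restored


if __name__ == "__main__":
    # Assumed numbers: a 30 MW dip restored in four steps, with a probe
    # that (for illustration only) starts failing above 20 MW.
    print(ramp_schedule(30.0, 4))
    print(staged_restore(30.0, 4, lambda mw: mw <= 20.0))
```

The trade-off is the one the article notes: a stepwise restore takes longer, but each step limits how much can go wrong at once.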

In all, Facebook was down for seven hours and five minutes.

Routine-maintenance foul-up

The outage began at 11:39 a.m. EDT, while Facebook was taking just part of the backbone network offline for maintenance. “During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” Janardhan wrote.

That wasn’t the plan, and Facebook even had a tool in place to sort out commands that might cause such a catastrophic failure, but it didn’t work. “Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command,” according to Janardhan.
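Facebook has not published how its audit tool works, so the following Python sketch only illustrates the general idea of a pre-execution check: refuse any command that would leave too few backbone links up. The `Command` type, the link names, and the `min_links_up` threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """A hypothetical maintenance command and the links it would disable."""
    description: str
    links_to_disable: frozenset[str]


def audit(command: Command, all_links: frozenset[str], min_links_up: int = 1) -> bool:
    """Return True only if enough backbone links would remain up afterward."""
    remaining = all_links - command.links_to_disable
    return len(remaining) >= min_links_up


if __name__ == "__main__":
    backbone = frozenset({"link-a", "link-b", "link-c"})  # made-up link names
    partial = Command("drain one link for maintenance", frozenset({"link-a"}))
    total = Command("mistaken global command", backbone)
    print(audit(partial, backbone))  # allowed: two links stay up
    print(audit(total, backbone))    # blocked: nothing would stay up
```

A check like this is only as good as its own correctness, which is the point of the incident: the safeguard existed, but a bug kept it from stopping the command.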

Once that happened, the DNS was doomed.

DNS was a single point of failure

An automated response to the backbone crash seems to be what took down the DNS, according to Angelique Medina, head of product marketing at Cisco ThousandEyes, which monitors and analyzes internet traffic and outages.

DNS, the Domain Name System, responds to queries about how to translate Web names into IP addresses, and Facebook hosts its own DNS nameservers. “They have an architecture where their DNS service is scaled up or down in relation to server availability,” Medina says. “And when server availability went to zero because the network went down, they decommissioned all their DNS servers.”

That decommissioning was accomplished by Facebook’s DNS nameservers sending route-withdrawal messages to the internet’s Border Gateway Protocol (BGP) routers, which store the routes used to reach specific IP addresses. Routes are routinely advertised to these routers to keep them current on how to direct traffic appropriately.

Facebook DNS servers’ route-withdrawal messages disabled the advertised routes to themselves, making it impossible for BGP routers to send traffic their way. “The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan wrote.
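The failure mode is easier to see in a toy model. The Python sketch below is not BGP and not Facebook's code: `RouteTable` stands in for what routers know, `DnsNode` for a nameserver that advertises a route only while it can reach the backbone, and the prefix `192.0.2.0/24` is a documentation-only address range chosen for illustration.

```python
class RouteTable:
    """Toy stand-in for the routes BGP routers hold (not real BGP)."""

    def __init__(self) -> None:
        self._origins: dict[str, set[str]] = {}

    def advertise(self, prefix: str, origin: str) -> None:
        self._origins.setdefault(prefix, set()).add(origin)

    def withdraw(self, prefix: str, origin: str) -> None:
        origins = self._origins.get(prefix, set())
        origins.discard(origin)
        if not origins:
            # No node advertises this prefix any more, so the route is gone.
            self._origins.pop(prefix, None)

    def reachable(self, prefix: str) -> bool:
        return prefix in self._origins


class DnsNode:
    """Hypothetical nameserver that ties its route to backbone health."""

    def __init__(self, name: str, prefix: str, routes: RouteTable) -> None:
        self.name = name
        self.prefix = prefix
        self.routes = routes
        self.running = True  # the DNS process itself never stops in this model

    def health_check(self, backbone_up: bool) -> None:
        if backbone_up:
            self.routes.advertise(self.prefix, self.name)
        else:
            self.routes.withdraw(self.prefix, self.name)


if __name__ == "__main__":
    routes = RouteTable()
    nodes = [DnsNode(f"dns-{i}", "192.0.2.0/24", routes) for i in range(3)]
    for node in nodes:
        node.health_check(backbone_up=True)
    print(routes.reachable("192.0.2.0/24"))   # routes are advertised
    for node in nodes:
        node.health_check(backbone_up=False)  # backbone crash hits every node
    print(routes.reachable("192.0.2.0/24"), all(n.running for n in nodes))
```

Withdrawing a route is a sensible reaction when one node is unhealthy. When every node sees the same failure at the same moment, the same rule removes the whole service from the internet while the servers keep running.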

Even if the DNS servers had still been accessible from the internet, Facebook customers would have lost service because the network they were trying to reach had crashed. Unfortunately for Facebook, its own engineers also lost access to the DNS servers, which their remote management platforms needed in order to reach the downed backbone systems.

“They don’t use their DNS service just for their customer-facing Web properties,” Medina says. “They also use it for their own internal tools and systems. By taking it down completely, that prevented their network operators or engineers from gaining access to the systems they needed to in order to fix the problem.”

That meant that rather than fix things from a management console, the engineers had to lay hands on data-center devices to bring them back up, one by one.

A more robust architecture would have dual DNS services so one could back up the other. For example, Amazon, whose AWS offers a DNS service, uses two external services, Dyn and UltraDNS, for its DNS, according to Medina.
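As a rough illustration of why two independent DNS services help, the Python sketch below tries each provider in turn and returns the first answer it gets. The providers are simulated with plain functions, and the hostname and address are made up; a real deployment would rely on resolvers querying multiple authoritative nameservers rather than on code like this.

```python
from typing import Callable, Optional, Sequence

# A resolver takes a hostname and returns an address, returns None if it has
# no record, or raises ConnectionError if the provider cannot be reached.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(name: str, resolvers: Sequence[Resolver]) -> Optional[str]:
    """Return the first answer from any provider, skipping unreachable ones."""
    for resolver in resolvers:
        try:
            address = resolver(name)
        except ConnectionError:
            continue  # this provider is down; try the next one
        if address is not None:
            return address
    return None


def unreachable_provider(name: str) -> Optional[str]:
    """Simulates a DNS service whose routes have been withdrawn."""
    raise ConnectionError("no route to nameserver")


def backup_provider(name: str) -> Optional[str]:
    """Simulates an independent second DNS service (hypothetical records)."""
    records = {"www.example.com": "192.0.2.10"}
    return records.get(name)


if __name__ == "__main__":
    print(resolve_with_fallback("www.example.com", [unreachable_provider, backup_provider]))
    print(resolve_with_fallback("www.example.com", [unreachable_provider]))
```

The second call shows the single-provider case: once the only service is unreachable, there is nothing left to ask.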

Lessons to learn

The incident reveals what networking best practices suggest is a shortcoming of the Facebook architecture. “Why was their DNS effectively a single point of failure here?” Medina says. Even without an underlying backbone failure, a failure of the single DNS system could by itself trigger an outage, “so I think having redundant DNS is a big takeaway.”

Another general observation is one that Medina has made about other service-provider outages. “Often times with these outages there are so many interdependencies within their network that one small issue in one part of their overall service architecture experiences an issue, and then it has sort of this cascading effect,” she says.

“A lot of companies are leveraging a lot of internal services, and in doing that there can be unforeseen consequences. That may be more for the technical folks [to analyze], but I do think it’s worth pointing out.”

Tim Greene was executive editor of Network World.