By Ann Bednarz, Executive Editor

Global Microsoft cloud-service outage traced to rapid BGP router updates

News Analysis
Jan 30, 2023 | 5 mins
Microsoft | Microsoft 365 | Networking

Unstable border-gateway routing tables led to high packet loss, leaving Microsoft customers unable to reach Teams, Outlook, SharePoint, and other services and resulting in a 'really poor experience,' according to ThousandEyes.

Outages that made Microsoft Azure and multiple Microsoft cloud services widely unavailable for 90 minutes on Jan. 25 can be traced to the cascading effects of repeated, rapid readvertising of BGP prefixes, according to a ThousandEyes analysis of the incident.

The Cisco-owned network intelligence company traced the Microsoft outage to an external BGP change by Microsoft that affected service providers. (Read more about network and infrastructure outages in our top 10 outages of 2022 recap.)

Multiple Microsoft BGP prefixes were withdrawn completely and then almost immediately readvertised, ThousandEyes said. The Border Gateway Protocol (BGP) determines the routes that traffic takes across the internet, and its best-path selection algorithm determines the optimal routes to use for traffic forwarding.
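To make that selection logic concrete, here is a minimal, hypothetical Python sketch, not Microsoft's or any router vendor's actual implementation. Real BGP weighs local preference, origin, MED and other attributes as well; this sketch keeps only one late tie-breaker, shortest AS path, which is why a route learned from a direct peer normally beats a route learned through a transit provider. The prefix and AS numbers are illustrative (AS8075 is Microsoft's network; 203.0.113.0/24 and AS64500 are documentation values standing in for real ones).

    # Illustrative only: a heavily simplified stand-in for BGP best-path selection.
    DIRECT_PEERING = (8075,)          # AS path learned from a direct peer
    VIA_TRANSIT    = (64500, 8075)    # AS path learned through a transit provider

    def best_path(candidates):
        """Pick the route with the shortest AS path (one of BGP's later tie-breakers)."""
        return min(candidates, key=len)

    print(best_path([DIRECT_PEERING, VIA_TRANSIT]))   # -> (8075,): the direct path wins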

The withdrawal of BGP routes prior to the outage appeared largely to impact direct peers, ThousandEyes said. With a direct path unavailable during the withdrawal periods, the next best available path would have been through a transit provider. Once direct paths were readvertised, the BGP best-path selection algorithm would have chosen the shortest path, resulting in a reversion to the original route. 

These readvertisements repeated several times, causing significant route-table instability. “This was rapidly changing, causing a lot of churn in the global internet routing tables,” said Kemal Sanjta, principal internet analyst at ThousandEyes, in a webcast analysis of the Microsoft outage. “As a result, we can see that a lot of routers were executing best path selection algorithm, which is not really a cheap operation from a power-consumption perspective.”
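As a rough, hypothetical illustration of that churn (a sketch built on the example prefix and AS numbers above, not a reconstruction of Microsoft's actual announcements), each withdraw-and-readvertise cycle forces neighboring routers to rerun best-path selection, so the winning path bounces between the transit route and the direct route:

    # Hypothetical sketch of route flapping. Each cycle removes the direct route,
    # forcing a best-path rerun that falls back to transit, then restores it,
    # forcing another rerun that reverts to the direct path.
    DIRECT  = (8075,)           # direct peering path (AS8075 = Microsoft)
    TRANSIT = (64500, 8075)     # longer path through an example transit provider

    rib = {"203.0.113.0/24": {DIRECT, TRANSIT}}    # routes currently known for the prefix

    def best_path(prefix):
        return min(rib[prefix], key=len)           # simplified tie-breaker: shortest AS path

    for cycle in range(3):                         # several rapid withdraw/readvertise cycles
        rib["203.0.113.0/24"].discard(DIRECT)      # withdrawal: only the transit path remains
        print(f"cycle {cycle}, after withdrawal:     ", best_path("203.0.113.0/24"))
        rib["203.0.113.0/24"].add(DIRECT)          # readvertisement: shortest path wins again
        print(f"cycle {cycle}, after readvertisement:", best_path("203.0.113.0/24"))

Every one of those flips means another full best-path run on every router carrying the prefix, which is the churn and extra computational load Sanjta describes.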

More importantly, the routing changes caused significant packet loss, leaving customers unable to reach Microsoft Teams, Outlook, SharePoint, and other applications. “Microsoft was volatilely switching between transit providers before installing best path, and then it was repeating the same thing again, and that’s never good for the customer experience,” Sanjta said.

In addition to the rapid changes in traffic paths, there was a large-scale shift of traffic through transit provider networks that was difficult for the service providers to absorb, which explains the levels of packet loss that ThousandEyes documented.

“Given the popularity of Microsoft services such as SharePoint, Teams and other services that were affected as part of this event, they were most likely receiving pretty large amounts of traffic when the traffic was diverted to them,” Sanjta said. Depending on the routing technology these ISPs were using – for example, software-defined networking or MPLS traffic engineering enabled by the network-control protocol RSVP – “all of these solutions required some time to react to an influx of a large amount of traffic. And if they don’t have enough time to react to the influx of large amounts of traffic, obviously, what you’re going to see is overutilization of certain interfaces, ultimately resulting in drops.”
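A back-of-the-envelope sketch of that failure mode, using made-up numbers rather than anything measured during the incident: if diverted traffic pushes a transit link's offered load past its capacity before the provider's traffic engineering can react, the excess has nowhere to go and shows up as packet loss.

    # Hypothetical numbers only: a transit link sized for its normal load suddenly
    # receives Microsoft-bound traffic diverted off the withdrawn direct paths.
    link_capacity_gbps = 100.0
    baseline_load_gbps = 70.0
    diverted_load_gbps = 55.0

    offered = baseline_load_gbps + diverted_load_gbps
    excess = max(0.0, offered - link_capacity_gbps)
    loss_pct = 100.0 * excess / offered

    print(f"{offered:.0f} Gbps offered on a {link_capacity_gbps:.0f} Gbps link "
          f"-> roughly {loss_pct:.0f}% of packets dropped until traffic is re-engineered")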

The resulting heavy packet loss “is something that would definitely be observed by the customers, and it would reflect itself in a really poor experience,” Sanjta said.

As for the cause of the connectivity disruptions, ThousandEyes said the scope and rapidity of changes indicate an administrative change, likely involving automation technology, that caused a destabilization of global routes to Microsoft’s prefixes.

“Given the rapidity of these changes in the routing table, we think that some of this was caused by automated action on the Microsoft side,” Sanjta said. “Essentially, we think that there was certain automation that kicked in, that did something that was unexpected from a traffic-engineering perspective, and it repeated itself several times.”

The bulk of the service disruptions lasted approximately 90 minutes, although ThousandEyes said it spotted residual connectivity issues the following day.

What Microsoft has said about the outage

Microsoft said it will publish a final post-incident review with more details, likely within the next two weeks, once it completes its internal investigation.

Based on what Microsoft has said so far, a network configuration change caused the outage, which it first acknowledged in a tweet at 7:31 AM UTC on the Microsoft 365 Status Twitter account: “We’re investigating issues impacting multiple Microsoft 365 services.”

Roughly 90 minutes later, the Twitter account posted: “We’ve isolated the problem to networking configuration issues, and we’re analyzing the best mitigation strategy to address these without causing additional impact.” And at 9:26 UTC: “We’ve rolled back a network change that we believe is causing impact. We’re monitoring the service as the rollback takes effect.”

Microsoft shared more details in a preliminary post-incident review published via its Azure status page.

“Between 07:05 UTC and 12:43 UTC on 25 January 2023, customers experienced issues with networking connectivity, manifesting as long network latency and/or timeouts when attempting to connect to resources hosted in Azure regions, as well as other Microsoft services including Microsoft 365 and Power Platform. While most regions and services had recovered by 09:00 UTC, intermittent packet loss issues were fully mitigated by 12:43 UTC. This incident also impacted Azure Government cloud services that were dependent on Azure public cloud.”

A change made to the Microsoft WAN impacted connectivity, Microsoft determined:

“As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them. The command that caused the issue has different behaviors on different network devices, and the command had not been vetted using our full qualification process on the router on which it was executed.”