Denise Dubie
Senior Editor

Network connectivity issues are leading cause of IT service outages

Analysis
Apr 04, 2024 · 8 mins
Data Center, Network Management Software, Network Security

Uptime Institute finds networking problems are the root cause of most IT service-related issues, while power is the primary source of data center downtime, and cyberattacks are becoming a significant contributor to outages.


Networking and connectivity issues are the leading cause of IT service-related outages, according to Uptime Institute’s annual outage analysis, while power is the most common culprit when looking specifically at data center outages.

According to the Uptime Institute Data Center Resiliency Survey 2024, 31% of 442 respondents pointed to networking and connectivity issues as the most common cause of IT service-related outages, followed by IT system/software issues, which 22% of respondents identified as a root cause. Other common causes of IT service-related outages include power (18%), third-party IT services (10%), and cooling (7%).

Uptime revisited some of the biggest publicly reported outages and conducted surveys on both IT service-related outages and data center downtime to determine which factors most affect enterprise networks and data centers. According to Uptime’s Annual Outage Analysis 2024, the leading causes of publicly reported IT service outages are:

  • IT (software/configuration): 23%
  • Network (software/configuration): 22%
  • Power: 11%
  • Cyberattack/ransomware: 11%
  • Fiber: 10%
  • Fire: 9%
  • Cooling: 6%
  • Network (cabling): 4%
  • Provider/partner issue: 2%
  • Capacity/demand: 1%
  • Other: 1%

“We have identified that IT software is the single biggest cause. But if we add network software and configuration to fiber connectivity, that becomes the biggest single cause,” said Andy Lawrence, executive director at Uptime Institute Research, during a webinar sharing the report results.
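The arithmetic behind Lawrence's point can be checked with a short Python sketch. The percentages are copied from the list above; the grouping of "Network (software/configuration)" with "Fiber" follows the quote, while the helper name `combined_share` and the variable names are illustrative assumptions, not part of Uptime's methodology.

```python
# Shares of publicly reported IT service outages, in percent,
# copied from the list above (Uptime's Annual Outage Analysis 2024).
causes = {
    "IT (software/configuration)": 23,
    "Network (software/configuration)": 22,
    "Power": 11,
    "Cyberattack/ransomware": 11,
    "Fiber": 10,
    "Fire": 9,
    "Cooling": 6,
    "Network (cabling)": 4,
    "Provider/partner issue": 2,
    "Capacity/demand": 1,
    "Other": 1,
}

# Assumption: this grouping mirrors the quote, which adds network
# software/configuration to fiber connectivity.
connectivity = ["Network (software/configuration)", "Fiber"]


def combined_share(shares, labels):
    """Sum the percentage shares of the listed categories (hypothetical helper)."""
    return sum(shares[label] for label in labels)


network_total = combined_share(causes, connectivity)
it_share = causes["IT (software/configuration)"]

print(f"Network software/configuration plus fiber: {network_total}%")
print(f"IT software/configuration alone: {it_share}%")
```

Taken together, network software/configuration and fiber account for 32% of the publicly reported outages, compared with 23% for IT software/configuration on its own.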

Uptime Institute’s Annual Outage Analysis 2024 features data that incorporates responses from the Uptime Intelligence Annual Global Data Center Survey conducted in Q2 and Q3 of 2023 with 850 respondents; the Uptime Intelligence Data Center Resiliency Survey conducted in Q1 2024 also with 850 respondents; and the Uptime Intelligence Public Outage Tracking report that monitored more than 750 outages between 2016 and 2023.

Uptime analysts said that overall outage frequency and severity continue to decline, but cyber-related incidents are increasing—and “are responsible for many of the most severe outages … causing extensive and serious disruption,” the report states.

“We’ve seen that [cyberattack/ransomware] is a fast-growing component accounting for 11% of serious outages. One of the notable features of a ransomware attack is they usually last days, some have lasted weeks. And in a few rare instances, the company involved has never recovered their business, so that does open up a new, very serious category,” Lawrence explained.

The data collected revealed a key point about how cyberattacks are hitting differently today versus several years ago. According to Uptime, most of the control systems used in data centers are now IP-enabled, making them more susceptible to attack—and more likely to be included in an outage. In the past, OT systems, or operational technology, would use their own private serial communications, separate from the corporate network. Network security becomes more critical with IP-enabled OT systems because if bad actors gain access they can shut down operations.

“While the main IP systems have patches that come out on a regular basis to patch security issues, a lot of these equipment chillers, generators, building management systems, and things of that nature don’t get patched that often for security and their security features are typically not that robust or advanced. They typically rely on the network being secure as being the first and main line of defense,” said Chris Brown, chief technical officer at Uptime Institute.

Outage severity is improving

The research firm noted that most operators reported having no or negligible outages in the past three years, meaning the organizations didn’t incur major damages due to the downtime. When asked to classify their outages, 41% said they experienced a negligible outage, which Uptime defined as “recordable outages but little or no obvious impact on services.” Another 32% reported minimal outages, in which services were disrupted with minimal effect on users, customers, or reputation. Less than one-fifth (17%) experienced a significant outage, defined as downtime that disrupted customer or user services and had some reputational or compliance impact but minimal or no financial effect.

Six percent pointed to serious outages, which included disruption of service or operations, financial losses, compliance breaches, safety concerns, and reputational damage—with customer losses possible. And 4% said they experienced severe outages that resulted in a major or damaging disruption of services or operations. These severe outages include large financial losses and possible safety issues, compliance breaches, customer losses, and reputational damage.
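As a rough check on how these severity figures fit together, the following Python sketch tallies the five categories reported above. The split into "lower impact" and "higher impact" groups is an assumption made here for illustration, not an Uptime classification.

```python
# Severity breakdown of reported outages, in percent, as cited above.
severity = {
    "negligible": 41,
    "minimal": 32,
    "significant": 17,
    "serious": 6,
    "severe": 4,
}

# Assumption: these two groupings are illustrative only.
lower_impact = severity["negligible"] + severity["minimal"]
higher_impact = severity["serious"] + severity["severe"]

print(f"Total across categories: {sum(severity.values())}%")
print(f"Negligible or minimal: {lower_impact}%")
print(f"Serious or severe: {higher_impact}%")
```

Under that grouping, 73% of reported outages fall into the two lowest severity categories and 10% into the two highest.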

“There is no question that the data seems to show that the outage severity is improving. In other words, a lower proportion falls into that very severe category of serious, or severe that means our financial reputation, or other extreme consequences,” Lawrence explained.

Uptime pointed to a few public outages that severely impacted organizations. For instance, the U.S. Federal Aviation Administration experienced an outage attributed to an IT software configuration error: mistakenly deleted files in a pilot-alert system affected more than 30,000 flights and hit the stocks of major airlines. Australian telecommunications provider Optus experienced a costly outage due to a network issue that caused transport delays, resulted in banking issues, and cut hospital phone lines for 12 hours, impacting more than 10 million users and 400,000 businesses. Another example was a ransomware attack on Dish Network in which cybercriminals encrypted critical data, disrupting services for nearly 300,000 users and causing the company’s share value to drop by more than 6%.

Power issues persist

Despite improved data center design and redundancy, power continues to be identified as the top contributor to data center outages, according to Uptime. Uptime’s surveys found that 30% of respondents experienced an outage directly caused by a power problem. Among those, 42% pointed to uninterruptible power supply (UPS) failure as the leading cause of power issues. Another 30% cited problems with the transfer switch to a generator, which continues to be problematic for organizations. Generator failures accounted for 28% of power-related outages, and close to one-fifth (18%) said the failure of a transfer switch between paths (A/B) led to a power outage.

“Everything requires power, and power is so binary, and tolerance to power fluctuations can be very small,” Brown said. “The one thing that most people forget about is testing. They’ll have redundant systems, but they don’t test those on a regular basis,” Brown said. “It’s important to test these systems, and it’s important to test them under real-world conditions.”

Uptime also found positive news: the data shows that more organizations are increasing their physical site redundancy. Some 39% of enterprise respondents reported increased redundancy for power and 37% said the same for cooling. Colocation and data center providers also increased their power (35%) and cooling (33%) redundancy, while 37% of cloud/hosting/SaaS providers increased power redundancy and 33% increased cooling redundancy.

Human error a frequent contributor to outages

While communications and cloud providers can take some of the blame for some publicly reported outages, nearly 40% of respondents were able to connect an outage directly to human error. For instance, 48% of those reporting outages said that data center staff failing to follow procedures led to an outage. Another 45% pointed to incorrect staff processes or procedures as a cause, and 23% cited installation issues as the source of outage-causing human error. Other human-related causes include:

  • In-service issues: 20%
  • Insufficient staff: 15%
  • Preventative maintenance frequency issues: 14%
  • Data center design or omissions: 10%

“It’s really worth noting that human error is at least a component either directly or indirectly, in almost every outage out there, or at least the majority of outages involve some degree of human error. If a system has been built or installed or put together by a human, it will inherently have some capacity for failure,” said Douglas Donnellan, research analyst at Uptime Institute.