After the CrowdStrike outage, some enterprise IT teams are rethinking their assumptions about how use of the cloud affects application reliability.
Early on July 19, just minutes after data security giant CrowdStrike released what was supposed to be a security update, enterprises started losing Windows endpoints, and we ended up with one of the worst and most widespread IT outages of all time. There’s been a lot said about the why and the how. But how much of that reflects what enterprises think about the outage and what they believe they need to do? We’ve been told that enterprises are rethinking their cloud strategy. Is that true, and what are they planning to do?
One thing is clear: Enterprises believe this was CrowdStrike’s problem. Only 21 of the enterprises I contacted thought Microsoft was even a contributor, and none thought Microsoft was primarily to blame.
CrowdStrike made two errors, enterprises say. First, CrowdStrike didn't account for how sensitive its Falcon endpoint client is to the tabular data that tells it how to look for security issues. As a result, an update to that data triggered a latent condition in the client, one that had existed all along but had never been properly tested, and crashed it. Second, rather than doing a limited release of the new data file, which would almost certainly have caught the problem and limited its impact, CrowdStrike pushed it out to its entire user base.
All program logic is data-dependent in that the paths through the software are determined by the data it’s processing. You can’t say you’ve tested unless you’ve exercised all these paths. Of 89 enterprise development managers who shared comments with me, all said they had to deal with this in their own testing, and they’d expect a software supplier to be even more careful than an end user. Still, they understand how it could happen. One said they’d heard the software bug had been in the Falcon client for over a year and just hadn’t been hit yet.
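A toy sketch makes the point. This is not CrowdStrike's actual code; it's a hypothetical client whose behavior is steered by tabular data it receives, with a buggy branch that sits dormant until a data update finally exercises it:

```python
# Toy illustration of a latent, data-dependent bug (hypothetical code,
# not CrowdStrike's): a client processes rows of tabular "detection"
# data, and one code path has never been exercised by any data shipped
# so far -- so no test has ever hit it.

def apply_rule(rule: dict) -> str:
    """Process one row of tabular detection data."""
    if rule["kind"] == "pattern":
        return f"scan for {rule['pattern']}"
    # Latent path: assumes every non-pattern rule carries a 'target'
    # field. Nothing enforces that, so this is a crash waiting for
    # the right (wrong) data to arrive.
    return f"watch {rule['target']}"

# Every rule shipped and tested so far happens to be a 'pattern' rule,
# so all tests pass and the latent path is never executed:
old_data = [{"kind": "pattern", "pattern": "bad.exe"}]
assert all(apply_rule(r).startswith("scan") for r in old_data)

# A new data file introduces a rule shape the tests never covered:
new_data = [{"kind": "behavior"}]  # no 'target' key
crashed = False
try:
    for r in new_data:
        apply_rule(r)
except KeyError:
    crashed = True  # the untested path fails on first contact with real data
print(crashed)
```

The bug was in the client from day one; only the data update made it reachable, which is consistent with the report that the flaw sat in the Falcon client for over a year.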
Where things get a bit murky is whether the CrowdStrike failure should have caused Windows systems (over eight million of them) to crash and resist remote recovery. All of the 21 enterprises who said they believed Microsoft had contributed to the problem thought Microsoft’s Windows should not have responded to the CrowdStrike error in the way it did. The 37 who didn’t hold Microsoft accountable pointed out that security software necessarily has a unique ability to interact with the Windows kernel software, and this means it can create a major problem if there’s an error.
But while enterprises aren’t convinced that Microsoft contributed to the problem, over three-quarters think Microsoft could contribute to reducing the risk of a recurrence. Nearly as many said they believed Windows was more prone than other operating systems to the kind of problem CrowdStrike’s bug created, a view held by 80 of the 89 development managers, many of whom said that Apple’s MacOS or Linux didn’t pose the same risk and that neither was impacted by the problem.
Misjudging cloud’s impact on application reliability
But what does this all mean with regard to things like cloud usage?
Enterprises said they are re-examining their use of the cloud as a means of improving application reliability. In fact, the number who said they believed they’d misjudged the cloud’s value in that area increased from less than 15% before the CrowdStrike event to 35% immediately after it, and to 55% by early August. The biggest factor in that growth was the realization that massive endpoint faults could take down their operation, and no cloud backup would be effective. The fault forced enterprises to examine just how the cloud impacts application reliability.
Let’s say you have a data center application linked to a Windows PC device. Let’s say that each is likely to be down one percent of the time. You want to improve reliability with the addition of a cloud front-end, and let’s say that it’s also down one percent of the time. What’s your reliability? It depends on whether the cloud and data center are able to back each other up. If they can’t, the chances all three will be up is 0.99 cubed, or 97%, which is less than it would have been without the cloud. But, if the cloud and data center can back each other up, then both would have to fail to take your application down. The chances of both cloud and data center failing is 1% times 1% or 0.0001, which is one in ten thousand, and application reliability is improved.
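The arithmetic above is quick to check. The 99% figures are the example's assumptions, not anyone's measured uptime, and failures are treated as independent:

```python
# Reliability arithmetic for the example: endpoint, cloud front-end,
# and data center, each assumed up 99% of the time, independently.
up = 0.99

# No mutual backup: all three components must be up (serial chain).
serial = up ** 3
print(f"{serial:.4f}")  # 0.9703 -- adding the cloud made things worse

# Cloud and data center back each other up: the back end is down only
# if BOTH fail, but the endpoint remains a single point of failure.
both_fail = (1 - up) * (1 - up)   # 0.0001, one in ten thousand
redundant = up * (1 - both_fail)
print(f"{redundant:.4f}")         # 0.9899 -- endpoint risk now dominates
```

Note that even in the redundant case, overall availability is pinned near 99% by the endpoint, which is the point the article returns to below.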
The same thing has to be considered in multi-cloud. Of 110 enterprises who commented on the reliability impact of multi-cloud, 108 said it made applications more reliable. Does it? It depends. If two clouds back each other up, the risk of failure is indeed lower, just like in my cloud/data-center example above. But many enterprises admitted that at least some of their applications needed both clouds because components relied on features specific to each cloud. Now they both need to be up, and so multi-cloud actually reduced reliability!
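The same two-cloud arithmetic, again under the illustrative assumption of 99% uptime per cloud with independent failures:

```python
# Multi-cloud reliability under the same assumed 99% uptime per cloud.
up = 0.99

# Clouds back each other up: the app is down only when both are down.
redundant = 1 - (1 - up) ** 2
print(f"{redundant:.4f}")  # 0.9999

# Components depend on features specific to each cloud: both must be up.
coupled = up * up
print(f"{coupled:.4f}")    # 0.9801 -- lower than a single cloud's 0.99
```

The difference is entirely in how the clouds are composed: redundancy multiplies failure probabilities, while cross-cloud dependencies multiply uptimes.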
What this proves is that enterprises may be deluding themselves about the cloud and reliability, overall. The cloud isn’t always going to improve reliability any more than it always lowers costs. There’s no substitute for knowing what you’re doing, especially in the area of managing reliability. Instincts are a poor substitute for a tutorial in probability and statistics.
But let’s go back to my cloud reliability calculation. Yes, the chance of both cloud and data center failing is one in ten thousand, but the chance of the endpoint failing in that example is one in a hundred. Endpoint risk is clearly the bigger problem, so what can enterprises do about it?
Of the 138 enterprises who commented on the problem, the suggestion made most often was to teach key people at each location how to do a “safe boot” of their systems, because this was all that was really needed to quickly resolve the CrowdStrike problem. The second-place recommendation was “use a browser interface” on the endpoint device rather than an application. In fact, 44 enterprises said they used browser application access and were able to operate normally if they had something other than Windows endpoints to fall back on. Most often the other endpoint choice was a phone or tablet, but some (13) had Mac or Linux desktop systems they could use during the outage. In addition, you can use any number of simple devices to run a browser, like a Chromebook, and simple devices are less likely to fall prey to the sort of problem CrowdStrike had, or even to need specialized endpoint security tools.
So, should you be “rethinking your cloud strategy”? Maybe what’s really needed is rethinking the endpoint strategy. The second-place recommendation above could mean that doing more in the cloud would reduce risk, because the real problem here is that sophisticated devices acting as user on-ramps to applications are harder to fix remotely, and local people lack the skills to do the job themselves. Simplifying endpoints can open up a multiplicity of endpoint options, as it did for many enterprises, and that would make the kind of failure CrowdStrike created little more than an inconvenience. Don’t panic; properly used, the cloud is still your friend.