Enterprises typically have three options: Set up a secondary data center, go with an external DR service provider, or leverage the public cloud for recovery.
When designing a disaster recovery (DR) plan, one of the first decisions you’ll need to make is determining where you’ll recover your operations if catastrophe strikes your primary data center. I’ve been helping organizations plan for cyber recovery for many years, and in my experience, companies typically have three main options for their DR site. Here’s a look at the pros, cons, costs and risks of each approach.
Roll your own disaster recovery site
The first option is to set up your own secondary DR data center in a different location from your primary site. Many large enterprises go this route; they build out DR infrastructure that mirrors what they have in production so that, at least in theory, it can take over instantly.
The appeal here lies in control. Since you own and operate the hardware, you dictate compatibility, capacity, security controls and every other aspect. You're not relying on any third party. The downside, of course, lies in cost. All of that redundant infrastructure sitting idle doesn't come cheap. You need to purchase, install and maintain a second set of servers, storage, networking gear and more. Real estate, power, cooling – all the data center basics become a duplicate expense.
Not only that, but since the DR site is a full-scale replica, any time you add or change hardware in production, you need to do the same in the DR environment. And if you think about it, maintaining a DR site is like having a pool no one swims in: You still have to constantly clean and treat the water, trim the bushes, etc. It’s work that keeps ops resources busy but provides little tangible day-to-day value.
The danger here is that over time, the DR site becomes an afterthought. Changes stack up, configurations drift, and when disaster strikes, you find out the hard way that your recovery process wasn't as turnkey as expected. Avoiding this takes disciplined change management to keep the two sites in sync, and regular DR testing is a must. Even a lightweight automated drift check, like the sketch below, can catch divergence before it becomes a recovery failure.
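As an illustration, here's a minimal sketch of the kind of drift check an ops team might script. Everything in it – the file list, the paths, the approach of hashing config files pulled from both sites – is a hypothetical example, not a prescribed tool:

```python
# Minimal sketch: flag configuration drift between a production site and
# a DR site. Paths and file names are hypothetical placeholders; assumes
# both sites' configs have been synced or mounted locally for comparison.
import hashlib
from pathlib import Path

CONFIG_FILES = ["nginx.conf", "haproxy.cfg", "sysctl.conf"]  # hypothetical examples

def fingerprint(root: str) -> dict[str, str]:
    """Return a SHA-256 hash for each tracked config file under root."""
    hashes = {}
    for name in CONFIG_FILES:
        path = Path(root) / name
        hashes[name] = (
            hashlib.sha256(path.read_bytes()).hexdigest()
            if path.exists()
            else "<missing>"
        )
    return hashes

def report_drift(prod_root: str, dr_root: str) -> None:
    """Print any config file whose contents differ between the two sites."""
    prod, dr = fingerprint(prod_root), fingerprint(dr_root)
    for name in CONFIG_FILES:
        if prod[name] != dr[name]:
            print(f"DRIFT: {name} differs between production and DR")

if __name__ == "__main__":
    report_drift("/configs/prod", "/configs/dr")  # hypothetical paths
```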
Use a third-party disaster recovery service
The second approach is to engage an external DR service provider to furnish and manage a recovery site on your behalf. Companies like SunGard built their business around this model. The appeal lies in offloading responsibility. Rather than build out your own infrastructure, you essentially reserve DR data center capacity with the provider.
SunGard and others construct and operate large recovery sites purpose-built to host client infrastructure and data. When disaster strikes, you show up on their doorstep ready to restore systems and resume operations. Costs are generally lower than running your own DR facility since providers spread infrastructure expenses across a shared client base.
However, this method isn't without risk. The biggest pitfall is the shared nature of the model: when a major incident impacts a broad area, multiple clients may be vying for the same DR resources simultaneously. If the provider underestimated demand or overcommitted capacity, your recovery may be delayed or degraded.
And while shared hardware can work fine for testing, an actual recovery will likely require configurations that closely match production. So while providers tout flexibility and customization, your options may be limited if you have very specific needs or unique environments.
Recover in the public cloud
The third option for housing your DR infrastructure is leveraging the public cloud. Market leaders like AWS and Azure offer seemingly limitless capacity that can scale to meet even huge demands when disaster strikes. Just as with normal operations, your compute and storage needs during recovery are available on demand.
The public cloud's native scalability protects against the oversubscription risk inherent in traditional DR providers. Barring a massive region-wide service disruption (which is rare), cloud providers hold enough spare capacity to absorb spikes in client resource requests.
Another advantage here is cost and consumption flexibility. You only pay for the cloud infrastructure used during testing or actual recovery events. During normal “DR idle” mode where you’re just replicating backups, expenses are minimized. And unlike traditional providers, cloud capacity scales up or down to precisely match your needs – there’s no guessing about what you might require someday.
This all promotes more rigorous and regular DR testing. Since spinning cloud servers up and down is fast and economical, tests that were once cost- or resource-prohibitive become feasible. You can validate that your latest backups actually boot without sinking capital into idle hardware. Testing frequency improves, and overall DR preparedness follows suit.
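To make that concrete, here's a minimal sketch of what an automated "does the backup boot?" test could look like on AWS using boto3. The region, AMI ID and instance type are placeholder assumptions; the pattern is simply launch, wait for the running state, tear down:

```python
# Minimal sketch: automated cloud DR boot test on AWS via boto3.
# Assumes boto3 is installed and AWS credentials are configured.
import boto3

REGION = "us-east-1"                     # hypothetical DR region
BACKUP_AMI_ID = "ami-0123456789abcdef0"  # hypothetical image built from latest backup

def test_backup_boots() -> bool:
    """Launch an instance from the backup image, confirm it reaches the
    'running' state, then tear it down so you pay for minutes, not months."""
    ec2 = boto3.client("ec2", region_name=REGION)
    resp = ec2.run_instances(
        ImageId=BACKUP_AMI_ID,
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    instance_id = resp["Instances"][0]["InstanceId"]
    try:
        # Block until EC2 reports the instance as running (or the waiter times out).
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
        return True
    finally:
        # Always clean up so the test never leaves billable resources behind.
        ec2.terminate_instances(InstanceIds=[instance_id])

if __name__ == "__main__":
    print("Backup image booted:", test_backup_boots())
```

Because the instance lives only for the duration of the check, a test like this costs pennies – which is exactly why cloud DR testing can happen weekly instead of annually.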
The cloud also lends itself well to infrastructure automation and scripting. With infrastructure-as-code tools, you can predefine server configs, resource provisioning logic, network topology and more. When it's time to invoke DR, the entire cloud build-out happens nearly automatically from your templates and preferences, with little or no manual intervention.
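In practice, a DR runbook can boil down to submitting a predefined template to the provider's provisioning service. The sketch below uses AWS CloudFormation via boto3 as one example of this approach; the stack name and template file are hypothetical, and a real template would declare the full recovery environment (networks, instances, DNS and so on):

```python
# Minimal sketch: invoke DR from an infrastructure-as-code template
# using AWS CloudFormation via boto3. Names and paths are placeholders.
import boto3

def invoke_dr(template_path: str = "dr_environment.yaml") -> None:
    """Stand up the DR environment from a predefined template, then wait
    for provisioning to finish."""
    cf = boto3.client("cloudformation", region_name="us-east-1")  # hypothetical region
    with open(template_path) as f:
        template_body = f.read()
    # Submit the template; CloudFormation creates every declared resource.
    cf.create_stack(StackName="dr-recovery", TemplateBody=template_body)
    # Block until the whole stack reports CREATE_COMPLETE.
    cf.get_waiter("stack_create_complete").wait(StackName="dr-recovery")
    print("DR environment is up.")
```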
The catch with cloud-based disaster recovery
Recovering operations in the public cloud isn't without caveats. The most obvious drawback compared to the alternatives is that you always need a working network path back to the cloud provider's data center. If the disaster also knocks out local internet connectivity, cloud access stalls.
Sometimes internet redundancy already exists, but if not, teams may need to pursue options like satellite network links to maintain that cloud lifeline when disaster strikes – yet another cost and another layer of setup complexity.
The other primary consideration is managing DR data replication into the cloud account ahead of time. To launch production servers on demand later, you need to continuously push backup data and VM images into cloud storage. The replication software, network bandwidth and cloud storage to house all this data don't come free.
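Here's a rough sketch of that pre-staging step, assuming AWS S3 as the target and a local directory of nightly backup artifacts (both names are placeholders, not values from any real environment):

```python
# Minimal sketch: push local backup artifacts into cloud object storage
# ahead of time so they're already in the account when DR is invoked.
import boto3
from pathlib import Path

BUCKET = "example-dr-backups"           # hypothetical S3 bucket
BACKUP_DIR = Path("/backups/nightly")   # hypothetical local staging directory

def replicate_backups() -> None:
    """Upload every file under the backup directory to S3, preserving
    the relative directory layout as the object key."""
    s3 = boto3.client("s3")
    for path in BACKUP_DIR.rglob("*"):
        if path.is_file():
            key = str(path.relative_to(BACKUP_DIR))
            s3.upload_file(str(path), BUCKET, key)
            print(f"replicated {key}")

if __name__ == "__main__":
    replicate_backups()
```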
Finally, some applications and databases that run smoothly on-prem may hit hiccups in a cloud infrastructure environment. Allocating time to test finicky systems is key to surfacing design flaws that could impede DR readiness when it counts most.
Adopting a culture of resilience
Constructing a comprehensive disaster recovery plan, no matter the approach, demands significant time, resources and ongoing budget. But skimping in this area creates business risk that far outweighs the savings. Data from the U.S. Federal Emergency Management Agency shows that around 40% of small businesses close permanently after suffering a disaster because they simply aren't prepared to recover.
My advice after helping countless companies design resilient cyber recovery programs boils down to this: Comprehensive readiness isn't a project with a completion date. It's an ingrained business philosophy. Appoint DR planning owners, run failure-scenario exercises, clearly define policies and document procedures. Make readiness checks and tests routine. Treat disaster preparation as a standard cost of doing business in an age full of risk and uncertainty. A culture centered on resilience in the face of disaster or disturbance is key to survival and success.