The existence of backup copies or replicas isn't enough if you don't understand how to efficiently restore data and systems at scale.
Ransomware has moved disaster recovery planning from the back room to the front table. It’s understandable to think you might not get hit by a hurricane, tornado, or earthquake. But thinking your company probably won’t get hit with ransomware is like thinking you’re probably going to win the lottery – neither assumption is true. The odds of a company being hit by ransomware are extremely high, which means you need a solid disaster recovery plan wrapped in an incident response plan.
If my decades in this industry have taught me anything, it’s that merely having backups or replicated copies does not constitute having a bona fide DR plan. I’ve endured critical system recoveries firsthand, and I’ve witnessed what happens when assumptions are made and planning is incomplete. Recovery preparation is nonnegotiable – especially in today’s ransomware-ridden landscape.
Define recovery infrastructure upfront
A proper DR plan is built on the assumption that you’ll probably need to restore data and systems on entirely new infrastructure. The original system could be unavailable because it needs to be retained for forensic reasons, or its hardware is beyond repair. Counting on reusing the same physical servers after an attack or disaster is reckless. It simply might not be possible, so you must be prepared to start a recovery with all new hardware.
That means you need to procure standby equipment or failover hosts ahead of time for temporary deployment post-event. Leveraging cloud infrastructure is one of the best ways to do this, since you can create the configuration up front, but you only pay for it when you need it. What you don’t want is to find yourself scrambling to purchase replacement servers amidst chaos. It will only magnify the crisis. So, know where replacement systems will come from – well before you actually need them.
Safeguard failover environments ASAP
Another thing that people fail to take into account is how to back up the DR site once a failover has happened. Once a cutover to a DR site takes place, whether on-premises or in the cloud, backup of the new environment must begin straight away. The last thing you want is to successfully recover critical systems only to have a second ransomware wave hit the DR location before any backups of it exist. Just like acquiring recovery hardware, this is best planned upfront, well before any recovery begins. Decide how you will back up your recovery site and design it ahead of time, ideally so that backups start automatically when you fail over. Never consider the recovery finished until routine backups are running again.
Define tangible recovery timeframe objectives
If you ask the typical business lead what kind of recovery requirement they have, they will say they need an immediate recovery with zero loss of data (i.e. an RTO and RPO of zero). But then they balk at the astronomical costs required to satisfy that requirement. This means you must have the difficult but mandatory discussions about genuine recovery time and data loss tolerance thresholds well before disaster strikes. The more you can get C-suite stakeholders to calibrate organizational priorities and balance business continuity needs with actual IT abilities and cost constraints, the more successful your recoveries will be. Remember that “successful” is relative; you will be successful if you meet the objectives set for you by the business. If those objectives are both realistic and properly funded, you’re on your way to success.
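To make those objectives concrete, here is a minimal Python sketch of how the results of a recovery test might be scored against agreed targets. RPO is treated as the maximum tolerable window of lost data and RTO as the maximum tolerable downtime. The names Objectives and evaluate_recovery, and the timestamps in the usage note, are illustrative assumptions rather than part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    """Targets agreed with the business (hypothetical structure)."""
    rpo: timedelta  # maximum tolerable data loss window
    rto: timedelta  # maximum tolerable downtime


def evaluate_recovery(objectives, last_good_backup, outage_start, service_restored):
    """Compare what a recovery (or a test) achieved against the objectives."""
    # Data written after the last good backup and before the outage is lost.
    data_loss = outage_start - last_good_backup
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= objectives.rpo,
        "rto_met": downtime <= objectives.rto,
    }
```

For example, with an RPO of 4 hours and an RTO of 8 hours, a test where the last good backup was taken 3 hours before the outage and service returned 10 hours after it would meet the RPO and miss the RTO. That is exactly the kind of result worth taking back to the business before a real event.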
Document and test to overcome doubt
Even with defined objectives, documented and proven runbooks are indispensable in the heat of battle. Simply having backup copies or replicas is fruitless without understanding how to efficiently restore data and systems at scale. You need step-by-step procedures for recovering assets in order of criticality, including system dependencies. The documentation must also include a complete inventory of the recovery environment, a contact list of all staff and vendors, and an escalation process for dealing with issues. Repeated testing under simulated scenarios lets responders practice using this documentation, so that it actually helps when it counts most.
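Recovering "in order of criticality, including system dependencies" is essentially a dependency-aware sort. The Python sketch below shows one way to derive such an order: a system becomes eligible only once everything it depends on has been restored, and among eligible systems the most critical tier goes first. The function name recovery_order, the tier numbering (1 = most critical), and the sample systems are assumptions made for illustration.

```python
import heapq


def recovery_order(systems, depends_on):
    """Return a restore order that respects dependencies.

    systems:    dict of system name -> criticality tier (1 = most critical)
    depends_on: dict of system name -> set of systems that must be restored first
    Assumes every name in depends_on also appears in systems.
    """
    # Count unmet dependencies and record who is waiting on each system.
    remaining = {name: len(depends_on.get(name, ())) for name in systems}
    dependents = {name: [] for name in systems}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    # Systems with no unmet dependencies are ready; lowest tier pops first.
    ready = [(tier, name) for name, tier in systems.items() if remaining[name] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for child in dependents[name]:
            remaining[child] -= 1
            if remaining[child] == 0:
                heapq.heappush(ready, (systems[child], child))

    if len(order) != len(systems):
        raise ValueError("circular dependency between systems")
    return order
```

With DNS at tier 1, a database at tier 2 that needs DNS, an ERP system at tier 1 that needs the database, and a wiki at tier 3, this yields DNS, database, ERP, wiki. Note that this simple version only compares the systems that are ready at each moment; a low-tier system that a critical system depends on should be given a tier that reflects that dependency.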
Embrace automation. Start somewhere
While manual procedures might dominate early DR efforts, you should continuously strive for increased automation. The more recovery tasks that can be scripted or triggered programmatically, the higher the probability of success. However, don’t let the best be the enemy of the good. The pursuit of full automation should not deter having some, even nascent, DR plan in place. Anything is better than nothing. Document everything first, then selectively automate over time based on testing feedback. Baby steps come before marathon results.
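One way to picture "document everything first, then selectively automate" is a runbook in which every step is written down and only some steps have a script attached yet. The following Python sketch assumes a hypothetical Step structure and a confirm_manual callback that asks a human operator to confirm a manual step; none of these names come from a real product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    description: str
    # None means the step is documented but still performed by hand.
    action: Optional[Callable[[], None]] = None


def run_runbook(steps, confirm_manual):
    """Walk the runbook in order, running scripts where they exist."""
    log = []
    for number, step in enumerate(steps, start=1):
        if step.action is not None:
            step.action()
            log.append((number, step.description, "automated"))
        elif confirm_manual(step.description):
            log.append((number, step.description, "manual"))
        else:
            raise RuntimeError(f"step {number} not completed: {step.description}")
    return log


def automation_coverage(steps):
    """Fraction of runbook steps that already have a script attached."""
    if not steps:
        return 0.0
    return sum(1 for step in steps if step.action is not None) / len(steps)
```

Tracking the coverage figure from one test to the next gives a simple measure of progress, while the manual steps keep the plan usable from day one.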
Leverage the cloud
For many organizations today, the cloud offers more flexibility for replicating data, and especially for provisioning temporary DR infrastructure. Cloud services make it easy to spin test environments up and down, which makes it much easier to rehearse disaster scenarios. Even companies using on-premises infrastructure should consider cloud-based replication of backups for DR purposes. For most, the cloud already features prominently in cyber resilience preparations; make sure it covers disaster readiness too.
Test until it hurts, then test again
Having DR documentation is good, but not if the first time you test it is when you need it. Besides helping to develop muscle memory, the other purpose of testing is to unmask gaps in the process. Proactively force failures through simulated disasters. Expect occasional setbacks when you try out new capabilities or procedures.
Twenty-nine years ago, I received a phone call in my wife’s hospital room – mere hours after my daughter’s arrival – asking me to come help with a bungled restoration at work. Because I had a fully documented process, I was able to just hang up and focus on my new child. You never know when a fully documented recovery process will be necessary, so now’s the time to begin the process.