For some time, disaster recovery plans were more about checking a box than having a viable plan that IT could easily find, follow, and execute during an unexpected outage. The importance of DR plans was always understood, but it had become increasingly difficult to keep plans up to date with frequent changes across IT.
As businesses accelerate their digital transformation, they have become more reliant on IT to rapidly respond to their changing needs. This means implementing new technology and infrastructure and delivering new applications and services.
But it also means maintaining app availability, managing legacy environments, and ensuring security. In other words, IT is the backbone of every organization and must be tightly aligned with strategic business goals.
With these increased demands comes an increase in the complexity of hybrid IT environments. It is more challenging for IT teams, which often operate in silos, to gain a deep understanding of their landscape and how their applications work across data centers and cloud platforms.
When an outage occurs, can they easily identify all the affected resources and the order in which they need to be recovered? Do they know the recovery time objective (RTO) for each app, and can they meet it? And what about apps with regulatory compliance or organizational requirements: can those be met?
Aim for resilience – but know how to recover
At TDS, we agree with Gartner in their updated ITRO Market Guide that “even if an organization was able to theoretically avert, absorb and abate 100% of known hazards, unknown hazards persist, which would require recovery. Hence, although ‘always on’ is the ideal, the ability to recover is an absolute imperative.” If nothing else, 2020 demonstrated the need to prepare for the unexpected and to be able to recover. It is time for organizations to rethink their resilience capabilities, because there will always be threats for which they have not prepared.
IT teams have no shortage of policies and tools to help them become more resilient and to predict, prevent, adapt to, and recover from potential threats. Because resilience is fluid and never completely attainable, these products cannot guarantee it. And while each purpose-built tool provides critical data and insight into a specific area of risk (such as app defects, security vulnerabilities, and performance thresholds), the tools were not designed to work together. So although IT may have a lot of useful information about risks, the data is not aggregated into a comprehensive picture that shows how resilient systems are across the enterprise.
The ability to recover from any outage, regardless of its source, requires that IT and business work together to ensure the following are in place:
An understanding of recovery capabilities and needs
Most organizations require a combination of the recovery functionality offered by cloud providers and the capabilities of in-house IT DR teams. IT and business units need to define a level of criticality for each application and map the appropriate recovery method to each tier.
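As a rough illustration of tiering, the sketch below maps each application's criticality tier to a recovery method and an RTO target. The tier numbers, method names, RTO values, and application names are all hypothetical and are not taken from any product or standard; a real mapping would come out of the discussion between IT and the business units.

```python
from dataclasses import dataclass

# Hypothetical tiers: the recovery methods and RTO targets (in minutes)
# are illustrative assumptions only.
RECOVERY_BY_TIER = {
    1: ("hot standby failover", 15),
    2: ("cloud provider replication", 240),
    3: ("restore from backup", 1440),
}


@dataclass(frozen=True)
class App:
    name: str
    tier: int  # criticality agreed between IT and the business unit


def recovery_plan(app: App) -> dict:
    """Look up the recovery method and RTO target for an app's tier."""
    method, rto_minutes = RECOVERY_BY_TIER[app.tier]
    return {"app": app.name, "method": method, "rto_minutes": rto_minutes}


# Example applications (hypothetical).
apps = [App("payments", 1), App("intranet-wiki", 3)]
plans = [recovery_plan(app) for app in apps]

for plan in plans:
    print(plan)
```

Keeping the mapping in one table makes it easy to review with business owners and to change a tier's method without touching each application record.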
Access to a comprehensive and accurate view of IT
IT teams need to know how their applications work and which other assets they depend on to run across hosting sites and technology stacks. Teams also need quick access to business facts and requirements so that RTOs are met and any security or compliance actions are taken, such as following proper procedures for accessing or moving customer data, or notifying customers in the case of a breach. And because IT is a highly dynamic environment, this data needs to be current and easily accessible.
Ability to quickly identify blast radius of outage
When a full site outage occurs, failover to a backup site keeps the business running until the cause can be determined and addressed. But when a service outage affects one or more applications, IT needs a way to quickly identify the blast radius, that is, all other systems impacted by the outage. This requires managing and maintaining IT with a business-focused, app-centric approach that provides teams with visual maps of relationships across applications, infrastructure, and other devices. A current view of your environment in which dependencies are easy to track visually enables you to identify every application affected during an outage, as well as all the assets each one needs to run.
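One way to think about blast radius is as a graph traversal: starting from the failed asset, follow the "is depended on by" relationships until no new systems turn up. The following is a minimal sketch under that assumption. The asset names and the dependency map are made up for illustration, and this is not a description of how any particular tool computes it.

```python
from collections import deque

# Hypothetical dependency map: each asset lists what it needs in order to run.
DEPENDS_ON = {
    "web-portal": ["auth-service", "orders-api"],
    "orders-api": ["orders-db"],
    "auth-service": ["ldap"],
    "reporting": ["orders-db"],
    "orders-db": [],
    "ldap": [],
}


def blast_radius(depends_on: dict, failed: str) -> set:
    """Return every asset that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from an asset to its dependents.
    dependents = {}
    for asset, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, set()).add(asset)

    # Breadth-first walk over the inverted map.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in impacted:
                impacted.add(dependent)
                queue.append(dependent)
    return impacted


affected = blast_radius(DEPENDS_ON, "orders-db")
print(sorted(affected))
```

In this example, an outage of `orders-db` reaches `orders-api` and `reporting` directly and `web-portal` indirectly, while `auth-service` and `ldap` are unaffected. The result is only as good as the dependency data behind it, which is why that data needs to be kept current.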
Dynamic generation of runbooks and ability to orchestrate execution of highly complex workstreams
Systems need to be restored in the right order and without bringing other systems down. The ability to create runbooks on demand enables IT to generate workstreams for all affected systems, in the right order, without causing further disruption. These workstreams are often a combination of automated and manual processes.
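Restoring systems "in the right order" amounts to a topological sort of the affected systems: anything a system depends on is brought back before the system itself. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The asset names and dependency map are hypothetical, and a real runbook would attach automated or manual tasks to each step.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each asset lists what it needs in order to run.
DEPENDS_ON = {
    "web-portal": ["auth-service", "orders-api"],
    "orders-api": ["orders-db"],
    "auth-service": ["ldap"],
    "reporting": ["orders-db"],
    "orders-db": [],
    "ldap": [],
}


def recovery_order(depends_on: dict, affected: set) -> list:
    """Order the affected assets so dependencies are restored first."""
    # Keep only edges between affected assets; healthy systems are
    # assumed to be available already and need no recovery step.
    subgraph = {
        asset: [dep for dep in depends_on.get(asset, []) if dep in affected]
        for asset in sorted(affected)
    }
    # TopologicalSorter treats the mapped values as predecessors, so
    # static_order() yields each dependency before its dependents.
    return list(TopologicalSorter(subgraph).static_order())


# Assets affected by a hypothetical outage of orders-db.
affected = {"orders-db", "orders-api", "reporting", "web-portal"}
order = recovery_order(DEPENDS_ON, affected)

for step, asset in enumerate(order, start=1):
    print(f"Step {step}: recover {asset}")
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependency data contains a cycle, which is a useful signal that the map needs correcting before a runbook is generated from it.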
Teams need the ability to track the progress of workstream execution in real time and to take alternate action if bottlenecks or other problems occur, so that RTOs and other requirements are still met. When multiple systems are impacted simultaneously, runbooks need to be able to handle moving assets to different targets, using different transport tools. Often, this requires orchestrating recovery across multiple cloud platforms. And because IT environments are dynamic and volatile, runbooks need to include tasks that run before assets are moved to a recovery location and confirm that the target environment is properly configured and will support the assets being moved.
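A pre-move check of the target environment can be as simple as comparing what an asset requires with what the target reports. The sketch below assumes numeric capacity figures with hypothetical keys (`vcpus`, `memory_gb`, `disk_gb`); real checks would also cover network, licensing, and configuration, and would draw their data from whatever inventory or cloud API is in use.

```python
def preflight(requirements: dict, target: dict) -> list:
    """Return a list of problems; an empty list means the target looks ready."""
    problems = []
    for key, needed in requirements.items():
        available = target.get(key)
        if available is None:
            # The target did not report this resource at all.
            problems.append(f"{key}: not reported by target")
        elif available < needed:
            problems.append(f"{key}: need {needed}, target has {available}")
    return problems


# Hypothetical requirements for one asset and the capacity a target reports.
requirements = {"vcpus": 8, "memory_gb": 32, "disk_gb": 500}
target = {"vcpus": 16, "memory_gb": 16}

problems = preflight(requirements, target)
if problems:
    print("Do not move yet:")
    for problem in problems:
        print(" -", problem)
else:
    print("Target is ready.")
```

Running a task like this as the first step of a runbook lets the team pick an alternate target early instead of discovering a shortfall partway through recovery.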
Proactive IT DR teams have a variety of tools that can help them minimize risk and become more resilient. Cloud, backup and DR vendors offer solutions that deliver overlapping functionality for recovery, and are expanding reporting and monitoring capabilities for applications that span multiple data centers and public clouds.
No-code tools are easier to use and enable faster execution. Teams can access the information they need more efficiently, which reduces complexity during recovery, improves confidence in execution, and helps them meet required recovery targets and better prepare for various threats.
Even with all this in place, a state of 100% resilience is unattainable since nobody can prepare for or anticipate every possible risk. Recovery, therefore, is an absolute imperative.
Our TransitionManager platform captures business, IT, regulatory, security and other data for each asset so that IT teams can more efficiently make decisions about resilience. By integrating with cloud assessment, performance monitoring, security, and other systems, TransitionManager extracts more value out of your existing tools by making them work together.