Business Continuity and Disaster Recovery or BCDR

Basic Theory

Business Continuity and Disaster Recovery (BCDR) components are:

  • High Availability (HA). Protects against a catastrophic application failure through redundancy, which is usually achieved within a single Data Centre.
  • Backup. Protects against data corruption, deletion or loss by keeping a copy of the data in a consistent state.
  • Disaster Recovery (DR). Covers a Data Centre failure or data corruption, and is the process of returning a system to a functional state.

Causes of IT Disasters

HA And DR Business Objectives

  • keep the business running during software failures or regional outages

Backup Business Objectives

  • defend against human errors
  • protect against malware and ransomware

BCDR Key Metrics

  • RPO (recovery point objective) – the amount of time, e.g., 30 minutes, 4 hours, etc., for which it is tolerable to lose data should a disruptive event occur. RPO largely determines the frequency of data replication required.
  • RTO (recovery time objective) – the window of time between a disruptive event and a return to operational status. RTO largely determines the class of equipment and size of connection to your provider that is necessary to meet your recovery objectives.
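
To make the two metrics concrete, here is a minimal Python sketch that checks a planned setup against its objectives. The function names and all the numbers are illustrative assumptions, not part of any product; the only idea taken from the definitions above is that the worst-case data loss equals the replication interval, and that the recovery duration must fit inside the RTO.

```python
from datetime import timedelta


def worst_case_data_loss(replication_interval: timedelta) -> timedelta:
    """Worst case: the disaster hits just before the next replication run."""
    return replication_interval


def meets_objectives(replication_interval, recovery_duration, rpo, rto):
    """Return (rpo_ok, rto_ok) for a planned setup."""
    rpo_ok = worst_case_data_loss(replication_interval) <= rpo
    rto_ok = recovery_duration <= rto
    return rpo_ok, rto_ok


# Hypothetical numbers: replicate every 15 minutes, recovery takes 2 hours.
result = meets_objectives(
    replication_interval=timedelta(minutes=15),
    recovery_duration=timedelta(hours=2),
    rpo=timedelta(minutes=30),
    rto=timedelta(hours=1),
)
print(result)  # (True, False): the RPO is met, the RTO is not
```

In this example more frequent replication would not help; only a faster recovery path brings the setup within its RTO.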

RTO & RPO

High Availability

Cloud PaaS and SaaS services in most cases provide this redundancy out of the box, and it is often baked into the respective service SLAs.

Custom solutions built on top of IaaS usually require more effort. There are 2 major groups of failures to consider: software failures and hardware failures. To mitigate software failures, a common solution is to run multiple instances of the software behind a Load Balancer that routes the requests. To mitigate hardware failures, the instances have to run on different hardware. Azure does this with the help of VM availability sets. An availability set spreads your virtual machines across multiple fault domains and update domains.

A Fault Domain has the following characteristics:

  • A fault domain defines a group of virtual machines that share a common power source and network switch.
  • Each fault domain contains some racks, and each rack contains virtual machines.
  • Each fault domain shares a power supply and a network switch.
  • If a fault domain fails, all the resources in it become unavailable.
  • Place your VMs so that each fault domain gets one web server, one database server, and so on.

An Update Domain has the following characteristics:

  • All VMs within one update domain reboot together.
  • Update domains are used for patching the virtual machines.
  • Only one update domain is updated at a time.
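
The effect of spreading VMs across domains can be sketched in a few lines of Python. This is a toy model, not Azure's placement logic: the round-robin assignment, the function names and the domain counts (3 fault domains, 5 update domains) are assumptions made for illustration.

```python
def assign_domains(vm_names, fault_domains=3, update_domains=5):
    """Round-robin each VM into a (fault domain, update domain) pair."""
    placement = {}
    for index, name in enumerate(vm_names):
        placement[name] = (index % fault_domains, index % update_domains)
    return placement


def survivors_after_fault(placement, failed_fault_domain):
    """VMs still running when one fault domain loses power or network."""
    return sorted(
        name for name, (fd, _ud) in placement.items() if fd != failed_fault_domain
    )


vms = ["web-0", "web-1", "web-2", "web-3"]
placement = assign_domains(vms)
print(placement)
print(survivors_after_fault(placement, failed_fault_domain=0))
```

With four web servers and three fault domains, losing fault domain 0 takes down two of them and leaves the other two serving requests.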

Fault Domain Isolation

More information can be found in the article High availability for applications built on Microsoft Azure.

High Availability Implementation Options

When implementing High Availability, there are 3 major variations to consider:

  • Active-Active. Multiple instances of the software actively serve requests, and a Load Balancer orchestrates the distribution of the requests. This approach often has no downtime, provided the Load Balancer is smart enough to reroute requests that failed because of an instance failure to another instance. The maximum target SLA for this option is usually 99.99%.
  • Active-Passive. Multiple instances of the software are ready to serve requests, but only one of them does. A Load Balancer actively checks the health of all the instances and, when the active instance fails, selects a new active instance to serve the requests. This approach usually has some downtime while the Load Balancer selects a new active instance. The maximum target SLA for this option is usually 99.9%.
  • Passive. There is only one active instance. The redundant instances are switched off and are turned on only when the currently active instance fails. The switch is usually done manually, which implies substantial downtime. The maximum target SLA for this option is usually 99%.

To achieve an SLA greater than 99.99%, Disaster Recovery (GEO redundancy) is usually required.
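
To see what these SLA targets mean in practice, the following Python sketch converts an availability percentage into the downtime it permits per year. The helper name is an assumption, and the calculation ignores leap years.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours per year a service may be down while still meeting the SLA."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(sla)
    print(f"{sla}% -> {hours:.2f} hours/year ({hours * 60:.0f} minutes)")
```

The three targets work out to roughly 87.6 hours, 8.8 hours and 53 minutes of downtime per year, which is why each extra "nine" usually calls for a more automated failover.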

High Availability for Stateful Services

When a service is stateless, switching to a new instance is a matter of starting the new instance and directing the load to it. However, when a service is stateful, that is not enough. To run, a new stateful service instance has to get some version of the state. Which version it gets is a question of the acceptable data loss, and that is usually a very important question.

There are different solutions which vary from a substantial data loss (high RPO) to no data loss at all. They are based on the data replication process.

Note that data replication is not a substitute for a data backup, because replication copies the damage along with the data in the following potential disasters:

  • human errors, like accidental data deletion, software errors, etc
  • malware and ransomware

Replication Level Options

There are four options available to choose from when it comes to the level of replication. Usually one level of replication is used for all applications in the same tier. A solution can quickly become overcomplicated when levels are mixed unnecessarily.

  • Application-level. In this case, the application has its own replication mechanism that is optimal for its particular case. This approach offers the benefit of low RTO and RPO, but it requires maintaining and patching the OS to ensure it works properly at the time of failover. Databases are an ideal fit for this replication.
  • Guest OS-level replication. This approach replicates data at the block level to a target machine. This solution offers one-click failover, but it incurs licence costs and requires an agent on the source machine.
  • SAN/LUN level replication. This solution replicates an entire SAN or LUN and all the VMs on it. It replicates both physical and virtual machines, but is not public cloud friendly and is not hardware-, SAN- or hypervisor-agnostic.
  • Hypervisor-level replication. This replication approach protects VMs at the virtual machine disk format file level rather than at the SAN/LUN or storage volume level, so it avoids the challenges associated with array-based replication. This replication is fully agnostic to storage source and destination, natively supporting all storage platforms and the full breadth of capabilities made possible by virtualization. However, this replication does not work well for physical machines.

Application Level Replication

Application level solutions can be divided into 2 major architecture groups:

  • Master-Slave (Active-Passive). One instance is active, all the rest are passive. In this architecture the bias is towards consistency, because each piece of data has exactly one owning master. Systems that support ACID are implemented using this architecture, but when a master goes down they are either unavailable or provide read-only access to a slave instance.
  • Master-Master (Active-Active). All instances are active. In this architecture the bias is towards availability, but it is very hard to preserve absolute consistency. Conflict resolution can become intractable as the number of nodes involved rises and the required latency decreases. Some conflicts may require manual intervention to resolve.

Replication Architectures

Application level replication has several solutions that offer different RPO & RTO.

  • Synchronous replication. Data is written to both the primary and the secondary instance at the same time, and a write is acknowledged only after both have it. So the data remains identical and current on both, with only a small margin of error. The recovery point objective of synchronous replication is almost zero.
  • Asynchronous replication. Data is written to the primary instance first and then copied to the secondary instance. Data arrives at the replication target with a delay, ranging from nearly instantaneous to minutes or even hours. Asynchronous replication is considered effective for replication to a geographically separate data centre, because it tolerates network latency and limited bandwidth. For this reason, some storage experts call this method “store and forward”. Some variations optimize the network traffic by replicating in batches.
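
The difference in RPO between the two modes can be shown with a small in-memory model. This is only a sketch under simplifying assumptions: the class and function names are hypothetical, "replication" is a plain dictionary copy, and a failure is simulated by simply comparing the primary with the replica.

```python
class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Acknowledge only after the replica has the write too.
            self.replica.data[key] = value
        else:
            # Acknowledge immediately; ship the write later.
            self.pending.append((key, value))

    def flush(self):
        """Ship queued writes as one batch (store and forward)."""
        for key, value in self.pending:
            self.replica.data[key] = value
        self.pending.clear()


def lost_writes_on_failover(synchronous):
    """Keys present on the primary but missing on the replica at failure."""
    replica = Replica()
    primary = Primary(replica, synchronous)
    primary.write("a", 1)
    primary.flush()
    primary.write("b", 2)  # the primary fails right after this write
    return sorted(set(primary.data) - set(replica.data))


print(lost_writes_on_failover(synchronous=True))   # []
print(lost_writes_on_failover(synchronous=False))  # ['b']
```

In the asynchronous case the write that was still queued is lost on failover, which is exactly the data loss window the RPO has to cover.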

Backup

Backup of data is a required part of a DR solution. Backups are simple: they are a copy of the systems and data that can be used to bring a failed system back online. However, backups do not necessarily include the infrastructure to restore to. You may have a copy of the data or systems, but no infrastructure to run those systems or process that data. In that case, using the backups when you need them may be very time consuming.

An ongoing service backup solution usually consists of 2 parts:

  • Full Service Backup. A snapshot of a service at a specific point in time.
  • Incremental Service Change Backups. A list of backed up changes since the full service backup.

During a system restore, it is required to restore a full service backup and then all the subsequent incremental change backups. To limit the number of incremental change backups during the restore and lower the RTO, there are 2 options:

  • do a new full service backup from time to time
  • merge incremental service change backups
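
The restore procedure and the merge option can be sketched with dictionaries standing in for service state. Everything here is an illustrative assumption: the function names, the key/value model, and the convention that a value of None marks a deleted key.

```python
def restore(full_backup, incrementals):
    """Apply incremental change sets, oldest first, on top of a full backup."""
    state = dict(full_backup)
    for changes in incrementals:
        for key, value in changes.items():
            if value is None:  # None marks a deleted key in this sketch
                state.pop(key, None)
            else:
                state[key] = value
    return state


def merge_incrementals(incrementals):
    """Collapse several change sets into one; later changes win."""
    merged = {}
    for changes in incrementals:
        merged.update(changes)
    return merged


full = {"a": 1, "b": 2}
incs = [{"a": 10}, {"b": None, "c": 3}, {"a": 11}]
print(restore(full, incs))
print(restore(full, [merge_incrementals(incs)]))
```

Both calls produce the same state, but the second applies one change set instead of three, which is the reason merging shortens the restore.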

Disaster Recovery

Disaster recovery has 2 major scenarios:

  • High availability. This is GEO redundancy, which allows a service to stay available even in case of a Data Centre failure. Switching between Data Centres usually requires DNS record changes. There are cloud services, like Traffic Manager in Azure, which help to automate this process.
  • System and data restore after corruption. This usually requires manual intervention and, as a result, substantial service downtime.
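
The automated switch between Data Centres in the first scenario boils down to "send traffic to the highest-priority endpoint that is still healthy". The Python sketch below shows only that idea; it is not the Traffic Manager API, and the endpoint names, the priority field and the health check callback are all hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the highest-priority healthy endpoint, or None if all are down."""
    for endpoint in sorted(endpoints, key=lambda e: e["priority"]):
        if is_healthy(endpoint["name"]):
            return endpoint["name"]
    return None


endpoints = [
    {"name": "dc-primary", "priority": 1},
    {"name": "dc-secondary", "priority": 2},
]
down = {"dc-primary"}  # simulate a Data Centre failure
print(pick_endpoint(endpoints, lambda name: name not in down))
```

A real service would run the health probes on a schedule and publish the result through DNS, so the switch is also bounded by the DNS record TTL.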

Comparing High Availability and Disaster Recovery, there is a big difference:

  • Disaster Recovery focuses on people and the processes to execute the necessary procedures
  • High Availability focuses on the technology design and implementation

Azure Services

Azure has 2 services which help to build Backup and Disaster Recovery solutions to support BCDR:

  • Azure Site Recovery
  • Azure Backup

At the moment both these services are joined under one Azure resource named ‘Recovery Services Vault’. To protect a resource with Site Recovery, the resource should be located in a region different from that of the ‘Recovery Services Vault’ resource, because during a region outage the ‘Recovery Services Vault’ instance could also be unavailable.

Azure Site Recovery

The Site Recovery service replicates entire site data: physical servers, VMs and even network services. In this manner, an entire site has redundancy for high availability, failover and failback. With Site Recovery, business continuity is achieved through replication in either an on-premises to cloud model or a cloud to cloud model.

The biggest benefits are:

  • cost
    • saving on a DR infrastructure
    • saving on building and maintaining a DR solution
    • saving on DR instance licences (no need to pay for additional licences for MS and some other software products)
  • faster TTM (time to market) than custom DR implementation
  • less complex and, as a result, less expertise is required to implement and test a DR

ASR is, first of all, compute replication software which does VM level replication.

Microsoft DR Stack

Application DR with ASR

The following source compute instances are supported (there are different requirements for each):

  • physical Windows and Linux servers
  • Hyper-V VMs
  • VMware VMs
  • AWS workloads
  • Azure VMs

The following destination compute instances are supported:

  • Hyper-V host
  • VMware Site
  • Azure

Migrate VMware & Physical Servers to Azure

Migrate Hyper-V to Azure

ASR Features:

  • Availability – 99.9%
  • Security: data encrypted in transit and at rest
  • Failover: automated and manual failover and failback are supported
  • RPO (recovery point)
    • up to 30 seconds for Hyper-V (Hyper-V replica)
    • continuous replication for VMware

ASR integrates with:

  • SQL Always On
  • Active Directory
  • DNS
  • Exchange
  • SAP
  • Oracle Data Guard
  • etc.

Failover Modes:

  • planned failover – involves shifting the production site to the replication site. Performs a complete failover and recovery according to your recovery plans in a proactive, planned manner. Non-replicated changes are applied to the replica virtual machine before it is brought online, ensuring zero data loss
  • unplanned failover – involves shifting the production site to the replication site. Used when a primary site experiences an unexpected incident, such as a power outage.
  • test failover – has no impact to production

Recovery customization is achieved through the definition of recovery plans, which can include:

  • a sequenced failover, where necessary (based on recovery groups)
  • a parallel failover, where necessary (based on recovery groups)
  • running a script
  • performing manual actions
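
The shape of such a recovery plan can be sketched as an ordered list of steps. This is a conceptual model only, not the ASR API: the step dictionaries, the `fail_over` placeholder and the machine names are hypothetical. Steps run in sequence, while the machines inside one group are failed over in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def fail_over(machine):
    """Placeholder for failing over one machine; returns a log entry."""
    return f"failed over {machine}"


def run_recovery_plan(plan):
    """Run steps in order; machines inside one group fail over in parallel."""
    log = []
    for step in plan:
        if step["type"] == "group":
            with ThreadPoolExecutor() as pool:
                # map() yields results in input order, so the log is stable.
                log.extend(pool.map(fail_over, step["machines"]))
        elif step["type"] == "script":
            log.append(step["run"]())
        elif step["type"] == "manual":
            log.append(f"waiting for operator: {step['instruction']}")
    return log


plan = [
    {"type": "group", "machines": ["sql-1", "sql-2"]},
    {"type": "script", "run": lambda: "ran DNS update script"},
    {"type": "manual", "instruction": "verify database consistency"},
    {"type": "group", "machines": ["web-1", "web-2"]},
]
for entry in run_recovery_plan(plan):
    print(entry)
```

Putting the database group before the web group mirrors the usual dependency order: the data tier has to be up before the front end starts serving requests.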

ASR solution moving parts. The required configuration differs between scenarios:

  • Config Server: The centralized on-premises appliance coordinating VMware and physical server replication
  • Process Server(s): Caching, compression, and encryption for VMWare and physical server bi-directional replication during.
  • Mobility Service: Captures block level changes in memory on each protected VMware or physical machine. Supports filesystem (Linux) and application (Windows) level consistency across multiple servers in a consistency group.
  • ASR Provider and the ASR Agent: Used for replicating and controlling replication of Hyper-V VMs
  • HRL Files: Files that are used to track the delta replication changes that occur after the initial replication. Replication intervals are defined in the replication policy.

ASR DR planning compulsory steps:

  • clarify RTO and RPO goals
  • check source machine eligibility
  • define storage requirements: size and IOPS
  • check network bandwidth (for the initial and delta replication)
  • define network reconfiguration: subnets, IPs, load-balancing, integration with Azure Traffic Manager
  • calculate operational price

ASR has a useful tool, Azure Site Recovery Deployment Planner, built to help with DR planning.
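
For a first rough estimate of the bandwidth step, before running any tool, a back-of-the-envelope calculation is often enough. The Python sketch below is such an estimate and not the Deployment Planner's method: the function names and the sample sizes are assumptions, and it ignores compression, protocol overhead and competing traffic.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float) -> float:
    """Hours to move data_gb over a link of bandwidth_mbps (megabits/s)."""
    megabits = data_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    return megabits / bandwidth_mbps / 3600


def can_keep_up(daily_churn_gb: float, bandwidth_mbps: float) -> bool:
    """True if a day's changed data can be replicated within 24 hours."""
    return transfer_hours(daily_churn_gb, bandwidth_mbps) <= 24


initial = transfer_hours(data_gb=2000, bandwidth_mbps=100)
print(f"initial replication: {initial:.1f} hours")
print(can_keep_up(daily_churn_gb=200, bandwidth_mbps=100))
```

With these sample numbers the initial replication of 2 TB over a 100 Mbps link takes almost two days, while the daily delta fits comfortably.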

ASR additional possibilities:

  • use ASR to migrate workloads from on-premises, other clouds or other Azure regions with
    • zero data loss during migration
    • near-zero downtime for users
    • the ability to test the application in the new cloud before migration

Billing:

  • per protected instance
  • used storage capacity and transactions to store:
    • replicas
    • retention and recovery points
  • outbound data transfer
  • Azure infrastructure when a DR happens

A related DR service is Azure Backup. It keeps the data safe and recoverable by backing it up to Azure.

Azure Backup

The service has the enhanced security features below:

  • Backup data is retained for 14 days after a delete operation, so if anyone accidentally or maliciously deletes your backups, you can still recover them
  • Ensures that there is more than one recovery point in case of attacks
  • Sends alerts & notifications for critical operations with backups
  • Requires a security PIN for critical operations

Backup is supported from the 2 sources below:

  • Azure Cloud
  • On premises

Azure Cloud supported backup services:

  • VM
  • SQL Server in Azure VM
  • Azure File Share

Azure Backup Architecture

On premises supported features and the required backup tools are:

  • Microsoft Azure Recovery Services (MARS). Used for simple file backups:
    • Files & Folders
    • System State. Backs up operating system files, so you can recover when a computer starts, but system files and the registry are lost.

Files & Folders Backup

  • Microsoft Azure Backup Server (MABS). Used for VM & app backup
    • Hyper-V VM
    • VMware VM
    • MS SQL Server
    • MS SharePoint
    • MS Exchange
    • Bare Metal Recovery. Backs up operating system files and all data on critical volumes (except user data). By definition, a BMR backup includes a system state backup. It provides protection when a computer won’t start and you have to recover everything.
  • System Center (SC) & System Center Data Protection Manager (SCDPM) & MARS. For Windows Server 2016, with the additional benefits below, which reduce cost and overall backup TCO.
    • 3X faster backups
    • 50% reduction in storage

Note that MABS has as many options as SCDPM.

Application Backup

Summary

Both services are valuable and are used for different purposes.

Azure Backup has a larger RPO, because the amount of data a backup solution needs to process is usually much higher, which also leads to a longer RTO. It provides wide variability in RTO & RPO. For example, RPO is usually around one day for VMs and around 15 minutes for databases.

Azure Site Recovery is different. It has a smaller RPO and a shorter RTO, because it usually needs only operational recovery data. The DR copy can be behind by a few seconds or minutes, so it is more in sync with the source. At the same time, using DR data for long-term retention is not recommended because of the fine-grained data capture.
