Module - 4: 3.1 Introduction To Business Continuity
MODULE – 4
Introduction to Business Continuity: Information Availability, BC Terminology, BC Planning
Life Cycle, Failure Analysis, Business Impact Analysis, BC Technology Solutions, Backup
and Archive: Backup Purpose, Backup Considerations, Backup Granularity, Recovery
Considerations, Backup Methods, Backup Architecture, Backup and Restore Operations,
Backup Topologies, Backup in NAS Environments
RBT: L1, L2
Notes:
Business continuity (BC) is an integrated and enterprise wide process that includes all activities
(internal and external to IT) that a business must perform to mitigate the impact of planned and
unplanned downtime.
BC entails preparing for, responding to, and recovering from a system outage that adversely
affects business operations. It involves proactive measures, such as business impact analysis, risk
assessments, deployment of BC technology solutions (backup and replication), and reactive
measures, such as disaster recovery and restart, to be invoked in the event of a failure.
The goal of a BC solution is to ensure the “information availability” required to conduct vital
business operations.
Information availability (IA) refers to the ability of the infrastructure to function according to
business expectations during its specified time of operation. Information availability ensures that
people (employees, customers, suppliers, and partners) can access information whenever they
need it. Information availability can be defined in terms of three parameters:
1. Reliability
2. Accessibility
3. Timeliness
1. Reliability: This reflects a component’s ability to function without failure, under stated
conditions, for a specified amount of time.
2. Accessibility: This is the state within which the required information is accessible at the
right place, to the right user. The period of time during which the system is in an
accessible state is termed system uptime; when it is not accessible it is termed system
downtime.
3. Timeliness: Defines the exact moment or the time window (a particular time of the day,
week, month, and/or year as specified) during which information must be accessible. For
example, if online access to an application is required between 8:00 am and 10:00 pm
each day, any disruptions to data availability outside of this time slot are not considered
to affect timeliness.
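As a small, hypothetical illustration of the timeliness check described above (the 8:00 am to 10:00 pm window is taken from the example; the function and outage times are assumed purely for illustration):

```python
from datetime import time

# Required access window from the timeliness example: 8:00 am to 10:00 pm.
WINDOW_START = time(8, 0)
WINDOW_END = time(22, 0)

def affects_timeliness(outage_start: time, outage_end: time) -> bool:
    """Return True if an outage (within a single day) overlaps the access window."""
    return outage_start < WINDOW_END and outage_end > WINDOW_START

print(affects_timeliness(time(23, 30), time(23, 45)))  # False: outside the window
print(affects_timeliness(time(9, 0), time(9, 30)))     # True: inside the window
```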
As illustrated in Fig 3.1 above, the majority of outages are planned. Planned outages are
expected and scheduled, but still cause data to be unavailable.
➢ Information availability (IA) relies on the availability of physical and virtual components
of a data center. Failure of these components might disrupt IA. A failure is the
termination of a component’s capability to perform a required function. The component’s
capability can be restored by performing an external corrective action, such as a manual
reboot, a repair, or replacement of the failed component(s).
➢ Proactive risk analysis performed as part of the BC planning process considers the
component failure rate and average repair time, which are measured by MTBF and
MTTR:
→ Mean Time Between Failure (MTBF): It is the average time available for a system or
component to perform its normal operations between failures.
→ Mean Time To Repair (MTTR): It is the average time required to repair a failed
component. MTTR includes the total time required to do the following activities:
Detect the fault, mobilize the maintenance team, diagnose the fault, obtain the spare
parts, repair, test, and restore the data.
Fig 3.2 illustrates the various information availability metrics that represent system uptime
and downtime.
IA is the time period that a system is in a condition to perform its intended function upon
demand. It can be expressed in terms of system uptime and downtime and measured as the
amount or percentage of system uptime:
IA = system uptime / (system uptime + system downtime)
In terms of MTBF and MTTR, IA could also be expressed as
IA = MTBF / (MTBF + MTTR)
Uptime per year is based on the exact timeliness requirements of the service; this calculation
leads to the number of “9s” representation for availability metrics.
Table 3-1 lists the approximate amount of downtime allowed for a service to achieve certain
levels of 9s availability. For example, a service that is said to be “five 9s available” is available
for 99.999 percent of the scheduled time in a year (24 × 365).
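The availability arithmetic above can be illustrated with a short sketch (the outage figures below are assumed purely for illustration): it derives MTBF and MTTR from a simple outage log, computes IA = MTBF / (MTBF + MTTR), and lists the downtime allowed per year for each level of “9s”.

```python
# Illustrative outage records: (hours of operation before a failure, hours to repair).
# These figures are assumed for the example only.
outages = [(720.0, 4.0), (1100.0, 2.0), (900.0, 6.0)]

mtbf = sum(up for up, _ in outages) / len(outages)    # Mean Time Between Failures
mttr = sum(rep for _, rep in outages) / len(outages)  # Mean Time To Repair
ia = mtbf / (mtbf + mttr)                             # IA = MTBF / (MTBF + MTTR)
print(f"MTBF = {mtbf:.1f} h, MTTR = {mttr:.1f} h, IA = {ia * 100:.3f}%")

# Downtime allowed per year (24 x 365 hours) for various levels of "9s".
hours_per_year = 24 * 365
for nines in range(1, 6):
    availability = 1 - 10 ** (-nines)                 # e.g. 3 nines -> 0.999
    downtime_h = hours_per_year * (1 - availability)
    print(f"{nines} nines: {downtime_h:8.2f} hours "
          f"(~{downtime_h * 60:.1f} minutes) of downtime per year")
```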
3.1.2 BC Terminology
This section defines common terms related to BC operations which are used in this module to
explain advanced concepts:
➢ Disaster recovery: This is the coordinated process of restoring systems, data, and the
infrastructure required to support key ongoing business operations in the event of a disaster.
It is the process of restoring a previous copy of the data and applying logs or other necessary
processes to that copy to bring it to a known point of consistency. Once all recoveries are
completed, the data is validated to ensure that it is correct.
➢ Disaster restart: This is the process of restarting business operations with mirrored
consistent copies of data and applications.
➢ Recovery-Point Objective (RPO): This is the point in time to which systems and data must
be recovered after an outage. It defines the amount of data loss that a business can endure. A
large RPO signifies high tolerance to information loss in a business. Based on the RPO,
organizations plan for the minimum frequency with which a backup or replica must be made.
For example, if the RPO is six hours, backups or replicas must be made at least once in 6
hours. Fig 3.3 (a) shows various RPOs and their corresponding ideal recovery strategies. An
organization can plan for an appropriate BC technology solution on the basis of the RPO it
sets. For example:
→ RPO of 24 hours: This ensures that backups are created on an offsite tape drive every
midnight. The corresponding recovery strategy is to restore data from the set of last
backup tapes. (A short sketch after the RTO examples below shows how RPO and RTO
drive these choices.)
➢ Recovery-Time Objective (RTO): The time within which systems and applications must be
recovered after an outage. It defines the amount of downtime that a business can endure and
survive. Businesses can optimize disaster recovery plans after defining the RTO for a given
system. For example, if the RTO is two hours, then use a disk backup because it enables a
faster restore than a tape backup. However, for an RTO of one week, tape backup will likely
meet requirements. Some examples of RTOs and the recovery strategies to ensure data
availability are listed below (refer to Fig 3.3 (b)):
→ RTO of 72 hours: Restore from backup tapes at a cold site.
→ RTO of 12 hours: Restore from tapes at a hot site.
→ RTO of a few hours: Use a data vault to a hot site.
→ RTO of a few seconds: Cluster production servers with bidirectional mirroring, enabling
the applications to run at both sites simultaneously.
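As a rough sketch of how RPO and RTO drive these choices (the thresholds are taken from the examples above; the function names and cut-off values are illustrative, not a standard API):

```python
def min_backups_per_day(rpo_hours: float) -> float:
    """Minimum backup/replica frequency implied by the RPO.
    For example, an RPO of 6 hours requires at least 24 / 6 = 4 copies per day."""
    return 24 / rpo_hours

def recovery_strategy(rto_hours: float) -> str:
    """Map an RTO to a recovery strategy, using the example tiers above."""
    if rto_hours < 0.01:      # on the order of seconds
        return "clustered servers with bidirectional mirroring at both sites"
    if rto_hours <= 4:        # a few hours
        return "data vault to a hot site"
    if rto_hours <= 12:
        return "restore from tapes at a hot site"
    return "restore from backup tapes at a cold site"

print(min_backups_per_day(6))   # 4.0 backups or replicas per day
print(recovery_strategy(12))    # restore from tapes at a hot site
print(recovery_strategy(72))    # restore from backup tapes at a cold site
```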
The BC planning lifecycle includes five stages shown below (Fig 3.4):
1. Establishing objectives
2. Analyzing
→ Collect information on data profiles, business processes, infrastructure support,
dependencies, and frequency of using business infrastructure.
→ Identify critical business needs and assign recovery priorities.
→ Create a risk analysis for critical areas and mitigation strategies.
→ Conduct a Business Impact Analysis (BIA).
→ Create a cost and benefit analysis based on the consequences of data unavailability.
3. Designing and developing
→ Define the team structure and assign individual roles and responsibilities. For example,
different teams are formed for activities such as emergency response, damage assessment,
and infrastructure and application recovery.
→ Design data protection strategies and develop infrastructure.
→ Develop contingency scenarios.
→ Develop emergency response procedures.
→ Detail recovery and restart procedures.
4. Implementing
→ Implement risk management and mitigation procedures that include backup, replication, and
management of resources.
→ Prepare the disaster recovery sites that can be utilized if a disaster affects the primary data
center.
→ Implement redundancy for every resource in a data center to avoid single points of
failure.
5. Training, testing, assessing, and maintaining
→ Train the employees who are responsible for backup and replication of business-critical data
on a regular basis or whenever there is a modification in the BC plan.
→ Train employees on emergency response procedures when disasters are declared.
→ Train the recovery team on recovery procedures based on contingency scenarios.
→ Perform damage assessment processes and review recovery plans.
→ Test the BC plan regularly to evaluate its performance and identify its limitations.
→ Assess the performance reports and identify limitations.
→ Update the BC plans and recovery/restart procedures to reflect regular changes within the
data center.
➢ A single point of failure refers to the failure of a component that can terminate the
availability of the entire system or IT service.
➢ Fig 3.5 depicts a system setup in which an application, running on a VM, provides an
interface to the client and performs I/O operations.
➢ The client is connected to the server through an IP network, the server is connected to
the storage array through a FC connection, an HBA installed at the server sends or
receives data to and from a storage array, and an FC switch connects the HBA to the
storage port
➢ In a setup where each component must function as required to ensure data
availability, the failure of a single physical or virtual component causes the failure of the
entire data center or an application, resulting in disruption of business operations.
➢ In this example, failure of a hypervisor can affect all the running VMs and the virtual
network, which are hosted on it.
➢ There can be several similar single points of failure identified in this example. A VM, a
hypervisor, an HBA/NIC on the server, the physical server, the IP network, the FC switch,
the storage array ports, or even the storage array could be a potential single point of
failure. To avoid single points of failure, it is essential to implement a fault-tolerant
mechanism.
➢ Data centers follow stringent guidelines to implement fault tolerance for uninterrupted
information availability. Careful analysis is performed to eliminate every single point of
failure.
➢ The example shown in Fig 3.6 represents the system in Fig 3.5 with enhancements to the
infrastructure that mitigate its single points of failure (a simple availability calculation
after this list illustrates the effect):
• Configuration of redundant HBAs at a server to mitigate single HBA failure
• Configuration of NIC (network interface card) teaming at a server allows protection
against single physical NIC failure. It allows grouping of two or more physical NICs
and treating them as a single logical device. NIC teaming eliminates the single point
of failure associated with a single physical NIC.
• Configuration of redundant switches to account for a switch failure
• Configuration of multiple storage array ports to mitigate a port failure
• RAID and hot spare configuration to ensure continuous operation in the event of disk
failure
• Implementation of a redundant storage array at a remote site to mitigate local site
failure
• Implementing server (or compute) clustering, a fault-tolerance mechanism whereby
two or more servers in a cluster access the same set of data volumes. Clustered
servers exchange a heartbeat to inform each other about their health. If one of the
servers or hypervisors fails, the other server or hypervisor can take up the workload.
• Implementing a VM Fault Tolerance mechanism ensures BC in the event of a server
failure. This technique creates duplicate copies of each VM on another server so that
when a VM failure is detected, the duplicate VM can be used for failover. The two
VMs are kept in synchronization with each other in order to perform successful
failover.
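To see why eliminating single points of failure matters, the following sketch (with assumed per-component availability figures) compares the end-to-end availability of a single serial path, as in Fig 3.5, with a design in which each component is duplicated, as in Fig 3.6. Components in series must all be up, whereas a redundant pair fails only if both copies fail.

```python
from math import prod

# Assumed per-component availability figures, purely for illustration.
components = {
    "server/HBA": 0.999,
    "FC switch": 0.9995,
    "storage array port": 0.9995,
    "storage array": 0.9999,
}

# Single path (Fig 3.5): every component is a single point of failure, so the
# service is available only when all components are available simultaneously.
single_path = prod(components.values())

# Redundant design (Fig 3.6): each component is duplicated, so a stage fails
# only when both copies of that component fail at the same time.
redundant_path = prod(1 - (1 - a) ** 2 for a in components.values())

print(f"Single path availability:    {single_path:.6f}")
print(f"Redundant path availability: {redundant_path:.6f}")
```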
➢ In LAN-based backup, the clients, backup server, storage node, and backup device are
connected to the LAN (see Fig 3.8). The data to be backed up is transferred from the
backup client (source), to the backup device (destination) over the LAN, which may
affect network performance.
➢ This impact can be minimized by adopting a number of measures, such as configuring
separate networks for backup and installing dedicated storage nodes for some
application servers.
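A rough, back-of-the-envelope sketch of why backup traffic on a shared LAN matters (all figures below are assumed for illustration): it estimates the backup window for a given amount of data at a given effective throughput.

```python
def backup_window_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Estimate the time to move data_gb gigabytes at the given effective
    throughput (MB/s), ignoring compression, deduplication, and protocol overhead."""
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600

# 2 TB of client data over a shared 1 GbE LAN at ~80 MB/s effective throughput.
print(f"{backup_window_hours(2048, 80):.1f} hours")   # roughly 7.3 hours
# The same data over a dedicated backup network sustaining ~300 MB/s.
print(f"{backup_window_hours(2048, 300):.1f} hours")  # roughly 1.9 hours
```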
➢ The emergence of low-cost disks as a backup medium has enabled disk arrays to be
attached to the SAN and used as backup devices. A tape backup of these data backups
on the disks can be created and shipped offsite for disaster recovery and long-term
retention.
➢ The mixed topology uses both the LAN-based and SAN-based topologies, as shown in
Fig 3.10. This topology might be implemented for several reasons, including cost,
server location, reduction in administrative overhead, and performance considerations.
➢ There are two approaches for performing a backup in a virtualized environment: the
traditional backup approach and the image-based backup approach.
➢ In the traditional backup approach, a backup agent is installed either on the virtual
machine (VM) or on the hypervisor.
➢ Fig 3.17 shows the traditional VM backup approach.
➢ If the backup agent is installed on a VM, the VM appears as a physical server to the
agent. The backup agent installed on the VM backs up the VM data to the backup
device. The agent does not capture VM files, such as the virtual BIOS file, VM swap
file, logs, and configuration files. Therefore, for a VM restore, a user needs to manually
re-create the VM and then restore data onto it.
➢ If the backup agent is installed on the hypervisor, the VMs appear as a set of files to the
agent. So, VM files can be backed up by performing a file system backup from a
hypervisor. This approach is relatively simple because it requires the agent only
on the hypervisor instead of on all the VMs.
➢ The traditional backup method can cause high CPU utilization on the server being
backed up.
➢ Therefore, the backup should be performed when server resources are idle or during a
period of low activity on the network.
➢ In addition, enough resources should be allocated to manage the backup on each server
when a large number of VMs are in the environment.
➢ Image-based backup operates at the hypervisor level and essentially takes a snapshot of
the VM.
➢ It creates a copy of the guest OS and all the data associated with it (snapshot of VM disk
files), including the VM state and application configurations. The backup is saved as a
single file called an “image,” and this image is mounted on a separate physical
machine, the proxy server, which acts as a backup client.
➢ The backup software then backs up these image files normally (see Fig 3.18).
➢ This effectively offloads the backup processing from the hypervisor and transfers the
load to the proxy server, thereby reducing the impact on the VMs running on the
hypervisor.
➢ Image-based backup enables quick restoration of a VM.
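The image-based flow can be outlined as a sequence of steps. The sketch below is illustrative Python against a hypothetical hypervisor/backup API; names such as create_snapshot, mount_image, and backup_files are placeholders, not a real product interface.

```python
def image_based_backup(hypervisor, vm_name, proxy, backup_target):
    """Illustrative outline of an image-based VM backup; all API calls are
    hypothetical placeholders, not a real hypervisor or backup product API."""
    # 1. Take a snapshot of the VM at the hypervisor level. This captures the
    #    guest OS, the VM state, and application configurations as disk files.
    snapshot = hypervisor.create_snapshot(vm_name)

    # 2. Mount the snapshot image on a separate physical machine (the proxy
    #    server), which acts as the backup client and offloads backup
    #    processing from the hypervisor.
    image = proxy.mount_image(snapshot)

    # 3. Back up the image files to the backup device as a normal file backup.
    backup_target.backup_files(image.files)

    # 4. Unmount the image and release the snapshot once the backup completes.
    proxy.unmount_image(image)
    hypervisor.delete_snapshot(snapshot)
```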