Failover Cluster: Download The Authoritative Guide: Download The Authoritative Guide
Failover Cluster: Download The Authoritative Guide: Download The Authoritative Guide
Download the authoritative guide: Data Center Guide: Optimizing Your Data Center Strategy
Download the authoritative guide: Cloud Computing: Using the Cloud for Competitive
Advantage
A failover cluster is a set of computer servers that work together to provide either high availability (HA)
or continuous availability (CA). If one of the servers goes down, another node in the cluster can
assume its workload with either minimum or no downtime through a process referred to as failover.
Some failover clusters use physical servers only, whereas others involve virtual machines (VMs).
The main purpose of a failover cluster is to provide either CA or HA for applications and services. Also
referred to as fault tolerant (FT) clusters, CA clusters allow end users to keep utilizing applications and
services without experiencing any timeouts if a server fails. With HA clusters, on the other hand, a
user might undergo a brief interruption in service, but the system will recover automatically with no
data loss and minimum downtime.
A cluster is made up of two or more nodes, or servers, which are generally connected through physical
cables in addition to software. Other kinds of clustering technology can be used for purposes such as
load balancing, storage, and concurrent or parallel processing. Some implementations combine
failover clusters with additional clustering technology.
To protect your data, a dedicated network connects the failover cluster nodes, providing essential CA
or HA backup.
How Failover Clusters Work
While CA failover clusters are designed for 100 percent availability, HA clusters attempt 99.999
percent availability, also known as “five nines,” for downtime amounting to no more than 5.26 minutes
yearly. As a trade off for their greater availability, though, CA clusters are more costly to implement,
due to increased hardware requirements.
In a simple two-node configuration, for example, if Node 1 fails, Node 2 uses the heartbeat connection
to recognize the failure and then configures itself as the active node. Clustering software installed on
every node in the cluster makes sure than clients connect to an active node.
Some cluster management software provides HA for virtual machines (VMs) by pooling them and the
physical servers they reside on into a cluster. If failure occurs, the VMs on the failed host are restarted
on alternate hosts.
Shared storage does pose a risk as a potential single point of failure. However, the use of RAID 6
together with RAID 10 can help to ensure that service will continue even if two hard drives fail.
If all servers are plugged into the same power grid, electrical power can represent another single point
of failure. Yet the nodes can be safeguarded by equipping each with a separate uninterruptible power
supply (UPS).
CA requires the organization to use formatted computer equipment, plus a secondary UPS. CA
systems can also compensate for many different sorts of failures.
A fault tolerant system can automatically detect a failure of not just a hard drive but a computer
processor unit, I/O subsystem, power supply, or network component, for instance. The failure point
can be immediately identified, and a backup component or procedure can take its place instantly
without interruption in service.
In a CA failover cluster, the operating system (OS) is outfitted with an interface permitting a software
programmer to do checkpoints of critical data at predetermined points in a transaction.
Clustering software can also be used to group together two or more servers to act as a single virtual
server. You can also create many other CA failover setups. For example, a cluster might be configured
so that if one of the virtual servers fails, the others respond by temporarily removing the virtual server
from the cluster. It then automatically redistributes the workload among the remaining servers until the
downed server is ready to go online again.
An alternative to CA failover clusters is use of a “double” hardware server in which all physical
components are duplicated. Calculations are done independently and simultaneously on the same
hardware system. Yet this option can be even more expensive.
These “double” hardware systems perform synchronization by using a dedicated node that keeps tabs
on the results coming from both physical servers. Stratus, a maker of these specialized fault tolerant
hardware servers, promises that system downtime won’t amount to more than 32 seconds each year.
However, the cost of one Stratus server with dual CPUs for each synchronized module is estimated at
approximately $160,000 per synchronized nodule.
Many other types of organizations also use either CA clusters or fault tolerant computers for mission
critical applications, such as businesses in the fields of manufacturing, logistics, and retailing.
Applications include e-commerce, order management, and employee time clock systems, for example.
For clustering applications and services requiring only “five nines” uptime, though, high availability
clusters are generally regarded as adequate.
Disaster Recovery
Disaster recovery is another practical application for failover clusters. Of course, it’s highly advisable
for failover servers to be housed at remote sites in the event that a disaster such as a fire or flood
takes out all physical hardware and software in the primary data center.
In Windows Server 2016 and 2019, for example, Microsoft provides Storage Replica, a technology
allowing replication of volumes between servers for disaster recovery. The technology includes a
“stretch failover” feature for failover clusters spanning two geographic sites.
By stretching failover clusters, organizations can replicate among multiple data centers. If a disaster
strikes at one location, all data continues to exist on failover servers at other sites.
Database Replication
According to Microsoft, the company originally introduced Windows Server Failover Cluster (WSFC) in
Windows Server 2016 mainly to protect “mission-critical” applications such as its SQL Server database
and Microsoft Exchange communications server.
Other database providers, too, offer failover cluster technology for database replication. MySQL
Cluster, for example, includes a heartbeat mechanism for instant failure detection, typically within one
second, to other nodes in the cluster, with no service interruptions to clients. A geographic replication
feature enables databases to be mirrored to remote locations.
WFSC includes Microsoft’s previous Cluster Shared Volume (CSV) technology to provide a
consistent, distributed namespace for accessing shared storage from all nodes. In addition, WSFC
supports CA file share storage for SQL Server and Microsoft Hyper-V cluster VMs. It also supports HA
roles running on physical servers and Hyper-V cluster VMs. Here is a Hyper-V cluster diagram.