An Introduction To SQL Server Clustering
An Introduction To SQL Server Clustering
By : Brad McGehee
If your mission-critical SQL Server should experience a motherboard failure, how long will it be
down? One hour, four hours, a day, or longer? How much will this cost your business in lost
sales or productivity? And perhaps even more important to you, what will it do to your stress
level?
Being a SQL Server DBA can be demanding and stressful, especially as the success of your
company is often a function of your SQL Server’s uptime. While we, as DBAs, have some
control over the uptime of our SQL Servers, we don’t have full control. There is not much we
As you may already be aware, there is one way to help boost your SQL Server’s uptime, and
that is by clustering SQL Servers. This way, should one SQL Server fail in the cluster, another
clustered server will automatically take over, keeping downtime to minutes, instead of hours
or more.
The purpose of this article is to introduce you to SQL Server clustering, along with its pros and
cons. If you are considering clustering SQL Server to help reduce potential downtime, this
What is Clustering?
Clustering can be best described as a technology that automatically allows one physical server
to take over the tasks and responsibilities of another physical server that has failed. The
obvious goal behind this, given that all computer hardware and software will eventually fail, is
to ensure that users running mission-critical applications will have little or no downtime when
such a failure occurs. Downtime can be very expensive, and our goal as DBA is to help reduce
it as much as possible.
More specifically, clustering refers to a group of two or more servers (generally called nodes)
that work together and represent themselves as a single virtual server to a network. In other
words, when a client connects to clustered SQL Servers, it thinks there is only a single SQL
Server, not more than one. When one of the nodes fails, its responsibilities are taken over by
another server in the cluster, and the end-user notices little, if any differences before, during,
Microsoft added clustering features to its operating system when they introduced Windows NT
Server 4.0 Enterprise Edition several years ago. The actual clustering feature was called MSCS
(Microsoft Clustering Server). While some brave folks actually put the software into
production, I personally avoided it as it was not as dependable as Microsoft led you to believe.
Also, about this same time, SQL Server 6.5 Enterprise Edition was released, allowing it to be
clustered. This was a very crude attempt at clustering SQL Server that was rarely
Later, when SQL Server 7.0 became available, there were major improvements in SQL Server
clustering, but no way was it still good enough, as Windows NT 4.0 Server Enterprise MSCS
was still being used at its foundation, and it just wasn’t good enough for maintaining high-
availability servers.
Fortunately, Microsoft’s second attempt at clustering, now called the Microsoft Cluster Service,
in Windows 2000 Advanced Server and Windows 2000 Datacenter Server, is much better.
While I won’t claim it is perfect, at least I now am willing to use it to cluster mission-critical
SQL Servers. Cluster Service not only works well when clustering SQL Server 7.0, it is even
better if you want to cluster SQL Server 2000, which has greatly enhanced clustering support.
One very important aspect of clustering that often gets overlooked is that it is not a complete
backup system for your applications. It is only one part of a multi-part strategy required to
The main benefits that clustering provides is the ability to recover from failed server hardware
(excluding the shared disk) and failed software, such as failed services or a server lockup. It is
not designed to protect data, to protect against a shared disk array from failing, to prevent
hack attacks, to protect against network failure, or to prevent SQL Server from other potential
You will also need to purchase a shared disk array (more on this later) that offers redundancy,
make tape backups, put the server behind a firewall, make sure your network connections
have redundancy, use battery backup, and locate the server in a secure facility, among many
other steps you can take. So don’t think that clustering is all you need for creating a highly
called Active/Active or an Active/Passive cluster. Each has its own pros and cons. Let’s look at
An Active/Active SQL Server cluster means that SQL Server is running on both nodes of a two-
way cluster. Each copy of SQL Server acts independently, and users see two different SQL
Servers. If one of the SQL Servers in the cluster should fail, then the failed instance of SQL
Server will failover to the remaining server. This means that then both instances of SQL
As you can imagine, if two instances have to run on one physical server, performance can be
An Active/Passive SQL Server cluster refers to a SQL Server cluster where only one instance of
SQL Server is running on one of the physical servers in the cluster, and the other physical
server does nothing, other that waiting to takeover should the primary node should fail.
From a performance perspective, this is the better solution. On the other hand, this option
makes less productive use of your physical hardware, which means this solution is more
expensive.
overall it will provide better performance. Assuming you have the budget, this is what I
recommend.
Two- or Four-Node Clustering?
SQL Server can be clustered using two nodes (using Windows 2000 Advanced Server), or it
can be clustered using more than two nodes (using Windows 2000 Datacenter). Since I don’t
personally have any experience is three or four node clustering, I won’t be discussing it here.
But for the most part, what I say about two-node clustering also applies to three- or four-node
clustering.
cluster node, one of the SQL Servers is referred to as the primary node, and the second one is
referred to as the secondary node. In an Active/Passive cluster design, SQL Server will run on
the primary node, and should the primary node fail, then the secondary node will take over.
When you build a two-node cluster using Windows 2000 Advanced Server and Microsoft
Clustering Service, each node must be connected to a shared disk array using either SCSI
Typically, this shared disk array is a stand-alone unit that houses a RAID 5 or RAID 10 disk
array. All of the shared data in the cluster must be stored on this disk array, otherwise when a
failover occurs, the secondary node in the cluster cannot access it. As I have already
mentioned earlier, clustering does not help protect data or the shared disk array that it is
stored on. Because of this, it is very important that you select a shared disk array that is very
Besides both servers being connected to a shared disk array, both nodes of the cluster are
also connected to each other via a private network. This private network is used for each node
to keep track of the status of the other node. For example, if the primary node experiences a
hardware failure, the secondary node will detect this and will automatically initiate a failover.
So how do clients who are accessing SQL Server know what to do when a failover occurs in a
cluster? This is the cleverest part of Microsoft Cluster Service. Essentially what happens in a
SQL Server cluster is that you assign SQL Server its own virtual name and virtual TCP/IP
address. This name and address is shared by both of the servers in the cluster.
Typically, a client will connect to the SQL Server cluster using the virtual name used by the
cluster. And as far as a client is concerned, there is only one physical SQL Server, not two.
Assuming that the primary node of the SQL Server cluster is the node running SQL Server on
an Active/Passive cluster design, then the primary node will respond to the client’s requests.
But if the primary node fails, and failover to the secondary node occurs, the cluster will still
retain the same SQL Server virtual name and TCP/IP address, although now a new physical
During the failover period, which can last several minutes (the exact amount of time depends
on the number and sizes of the databases on SQL Server, and how active they are), clients
will be unable to access SQL Server, so there is a small amount of downtime when failover
occurs.
How the client software reacts to the failover process depends on the software. Some software
will just wait the failover out, and when the failover has completed, it will continue just as
nothing had happened. Some software will present a message box on the screen, describing a
lost connection. Other client software will not know what to do, and users may have to exit,
and then reload the client before they can access SQL Server again.
As part of the testing process when implementing a SQL Server cluster, it is important to find
out how all of the client software that connects to SQL Server reacts to a failover. This way,
you can inform your users of what to expect, so they are better able to deal with it when it
does occur.
Once a failover occurs, you will want to find out what caused the failover, and then take the
necessary action and correct the problem. Once the problem has been fixed, the next step is
to failover SQL Server back to the primary node from the secondary node. You can schedule to
Implementing SQL Server clustering is a big decision, and one fraught with many gochas.
Before you undertake such a large and important project, you will want to carefully evaluate
the pros and cons of clustering, which include, but are not limited to:
required.
It allows you to perform upgrades without forcing users off the system for extended
periods of time.
maintenance.
Failing back is quick, and can be done whenever the primary server if fixed and put
back on-line.
In some cases, clustering can be used to increase the scalability of an application. For
example, if a current cluster is getting too busy, another server could be added to the
cluster to expand the resources and help boost the performance of the application.
Cons of Clustering
More expensive than other failover alternatives, such as log shipping or stand-by
servers.
One SQL Server 7.0 Enterprise or SQL Server 2000 Enterprise Licenses for
To cluster more than two nodes, you will need the following:
Two or More SQL Server 7.0 Enterprise or SQL Server 2000 Enterprise Licenses
I want to emphasize that you always want to go with the latest service packs, as many
following:
Two servers with a minimum of 256MB RAM and a single Pentium III CPU.
One shared disk array that supports RAID 5 or 10, either SCSI or fibre channel.
Each server must have at least one local SCSI hard disk on its own SCSI controller.
Each server must have a SCSI or fiber channel adapter to talk to the shared disk
array. The shared disk array cannot use the SCSI controller used by the local hard disk
or CD-ROM.
Each server must have two PCI network cards (one for the private connection and one
How you size the physical servers (CPUs, RAM, amount of shared disk array) is very similar to
how you would size a non-clustered server if you plan to use an Active/Passive configuration.
But, if you intend to use an Active/Active configuration, then ideally each physical server
needs to be sized to run all instances of SQL Server that run on both the primary and
secondary nodes, should failover occur and both instances of SQL Server have to run on the
Ideally, both physical servers should be identical in hardware, drivers, software, and
configuration. There are some exceptions to this allowed, but I would not recommend making
them. The closer each physical server is to each other, the less problems you will have.
Another very important consideration when selecting clustering hardware is that is must be on
supported system? Before Microsoft will support your cluster, all of the cluster hardware you
select (servers, cards, shared array, etc.) must have been tested as an entire system and
approved by Microsoft. If the system you purchase isn’t an approved system, then Microsoft
will not help you if you call them, even if you pay them for support.
Even if all of the individual components of your cluster system have been approved
individually by Microsoft for clustering, if they have not all been tested as a single system,
then you will still not receive any support. That’s why you need to check out the HCL carefully
before you order your equipment to ensure that you purchase an approved clustering system.
Unfortunately, Microsoft’s HCL website is often outdated. What this means is that as new
equipment comes out (and we always seem to want to use the latest and the greatest), it may
not be added to their approved cluster systems for months. You can of course take the risk of
selecting equipment that is not part of an approved Microsoft system (hoping that it will
eventually be tested and certified by Microsoft), but should you have a problem and call
This topic could fill a book, but unfortunately, there is not enough room here for that. So what
I will do here is just cover the basics so you have some feel for what you face.
Based on my experience with SQL Server clustering, you should keep the following in mind:
Buy an approved cluster system and have it delivered to you 4-8 weeks before you
plan to implement it. This will give you lots of time to learn about it and work out any
the game to help you plan your implementation. If you have never done this before,
you will need all the help you can get. Unfortunately, there is not a lot of good
information available on clustering, so you will have to fend for yourself more than you
usually do.
Once your cluster is up and running, test it thoroughly as possible using the same
databases and clients you intend to use on it when you start production.
Develop a plan to regularly test the cluster to ensure that it will failover as expected.
Create a disaster recovery plan for what you intend to do should you loose all the
Implementing and managing a cluster is complex. You will want to assign a team of top
clustering that you didn’t have before. And I hope I haven't scared you away from SQL Server
clustering.
While SQL Server clustering is not an easy challenge, it is often a worthwhile one. After
installing two SQL Server clusters myself, and currently working on a third, I feel that SQL
Server clustering is very valuable for many organizations, and in fact it has reduced my stress
own level somewhat. Now I don’t have to worry (as much) about extended periods of down
time. Now that the clusters are up, they haven’t presented any problems and are purring
away.
http://www.sql-server-performance.com/clustering_intro1.asp
http://www.sql-server-performance.com/clustering_resources.asp
http://support.microsoft.com/kb/q273673/