Spectrum Scale Stretched Cluster Best Practices


Spectrum Scale Expert Talks

Episode 2: Best Practices for building a stretched cluster

Show notes: www.spectrumscaleug.org/experttalks
Join our conversation: www.spectrumscaleug.org/join
About the user group
• Independent, work with IBM to develop events
• Not a replacement for PMR!
• Email and Slack community
• www.spectrumscaleug.org/join

#SSUG
Agenda

• What is a Spectrum Scale stretched cluster?


• Components of a stretched cluster
• Quorum concerns with stretched clusters
• File system replication with failure groups
• File system descriptor quorum
• Bringing this all together
• Some final considerations

Introduction

… or, what is a stretched cluster and why would I want one?

Spectrum Scale business resiliency through replicated redundancy

Application layer
• Active/Passive: Individual applications may have their own means of doing asynchronous replication (Aspera, rsync – these need to run regularly, maybe via cron).
• Active/Active: Individual applications may have their own means of doing synchronous replication.

File system layer
• Active/Passive: Spectrum Scale AFM-DR
• Active/Active: Spectrum Scale failure groups and replication – a “stretched cluster”

Block layer
• Active/Passive: Block synchronous replication configured for active/passive; block point-in-time copy
• Active/Active: Block synchronous replication configured for active/active
What is a Spectrum Scale “stretched cluster”?
[Diagram: application nodes (Spectrum Scale clients) at Site A and Site B actively sharing one Spectrum Scale file system over a WAN]

This is a single Spectrum Scale cluster configured using nodes and storage from two data centers.
• In other words, it is “stretched” between two sites, connected by a WAN.

File systems in such a cluster are available to systems at both sites and may be actively used concurrently by both sites.

File systems may judiciously use “failure group” replication to ensure both sites have a current instance of all the data.

Careful design can ensure one site remains active, even if the other site (or the link) fails.

The result is an active/active, highly available, synchronously replicated Spectrum Scale file system.
How are stretched file systems being used today?
Financial services
• Major hedge fund for financial HPC storage (30–40 miles between sites)

Automotive
• Major manufacturer using a stretched cluster for HPC storage (about 40 miles)

Life sciences
• Major hospitals using stretched clusters for critical patient documents

How far? About 50–100 miles, but it depends very much on the WAN link and the workload…
Stretched clusters

… and why node quorum matters

Cluster and file system layers in Spectrum Scale

[Diagram: the layers of Spectrum Scale – the file system layer (file systems, storage pools, “disks”, FS/token managers) on top of the cluster layer (cluster manager, nodes, NSDs, volumes/LUNs/vdisks, storage)]

The cluster layer is the physical layer – resources (systems, volumes) and logical abstractions (nodes, NSDs).

The file system layer creates the namespaces and manages their associated storage abstractions.

Stretched clusters enable stretched file systems.
Spectrum Scale clusters

[Diagram: four nodes connected by a LAN and a SAN]

A Spectrum Scale cluster is a group of Spectrum Scale systems, or nodes, configured into a single administrative grouping:
• All nodes have a common view of the data.
• The nodes are tightly coupled, trusting each other’s authentication of users.
• A cluster can have several Spectrum Scale file systems.
• A cluster may share a file system with an authenticated remote cluster.

Importantly, the cluster ensures all nodes in the cluster have a consistent view of the Spectrum Scale file systems, even when more than one node is actively accessing a file system (or even the same file).
Cluster manager and the active cluster

[Diagram: four nodes connected by a LAN and a SAN]

Many nodes may be configured to be part of the cluster. The active cluster is the set of nodes currently communicating with each other and sharing resources.

A cluster manager keeps track of which nodes are currently part of the active cluster, controls access to managed cluster resources, and maintains the configuration of the cluster.

To keep the cluster manager from being a single point of failure, this is a role that may run on any one of several cluster nodes.

Because failures happen, the cluster manager may need to expel “dead” nodes and fence them from resources like shared disks:
• Perhaps a node has failed
• Perhaps a node’s network has failed

Generally a disk lease is used to grant a node access to the disk volumes (NSDs) managed by the cluster. Disk lease renewal also functions as a heartbeat, so the cluster manager knows which nodes are part of the active cluster.
Node quorum

[Diagram: four nodes, three of them quorum nodes. After a partition, one of the two quorum nodes that can still communicate becomes the new cluster manager and both remain in the active cluster; the other nodes are expelled from the active cluster until they ask to rejoin.]

Several nodes are designated as quorum nodes, which elect the cluster manager from amongst themselves.

With node quorum, a simple majority of the quorum nodes must be active and communicating to choose a cluster manager.
• Generally choose a small odd number of quorum nodes, like 3, 5, or possibly 7. More quorum nodes lengthen recovery time – there is no benefit to choosing more than 7.

To keep the file system consistent and prevent data loss, there must never be more than one cluster manager!
• Such a condition would be called “split-brain”, also known as disaster.
Checking and changing quorum status

To see which nodes in the cluster are quorum nodes:
mmlscluster

To determine the current quorum state, use:
mmgetstate -aLs

To designate a node as a quorum node:
mmchnode --quorum -N NODENAME

To designate a node as no longer being a quorum node:
mmchnode --nonquorum -N NODENAME

# mmgetstate -aLs

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state   Remarks
-------------------------------------------------------------------------------------
      1       a-scale01     2        3          7        active      quorum node
      2       b-scale01     2        3          7        active      quorum node
      3       tiebreak      2        3          7        active      quorum node
      4       a-scale02     2        3          7        active
      5       a-scale03     2        3          7        active
      6       b-scale02     2        3          7        active
      7       b-scale03     2        3          7        active

Summary information
---------------------
Number of nodes defined in the cluster:          7
Number of local nodes active in the cluster:     7
Number of remote nodes joined in this cluster:   0
Number of quorum nodes defined in the cluster:   3
Number of quorum nodes active in the cluster:    3
Quorum = 2, Quorum achieved
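As a quick sketch using the node names from the example output above, the three quorum nodes of a stretched cluster (one per site plus the tiebreaker) could be designated in one step, with the remaining nodes explicitly left as non-quorum:

mmchnode --quorum -N a-scale01,b-scale01,tiebreak
mmchnode --nonquorum -N a-scale02,a-scale03,b-scale02,b-scale03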
Checking cluster manager status

To determine which node is currently the cluster manager (omit -c to also show file system managers):
mmlsmgr -c

To move the cluster manager function to a particular quorum node:
mmchmgr -c NODENAME

# mmlsmgr
file system      manager node
---------------- ------------------
fs1              10.0.200.11 (a-scale01)

Cluster manager node: 10.0.200.11 (a-scale01)

# mmchmgr -c b-scale01
Appointing node 10.0.200.21 (b-scale01) as cluster manager
Node 10.0.200.21 (b-scale01) has taken over as cluster manager

# mmlsmgr -c
Cluster manager node: 10.0.200.21 (b-scale01)
Clusters can be… s t r e t c h e d

[Diagram: Site 1 and Site 2, each with a quorum node (Q) on its LAN, connected by a WAN; a “tiebreaker quorum node” (Q) sits outside both sites]

This is still a single Spectrum Scale cluster, but in two parts or “sites”, separated or “stretched” with a WAN link.
• If one site fails, the other site should still be available.
• If the WAN link between the sites fails, one site should still be available – we accept that the other site will fail.

Each site has the same number of quorum nodes – but to choose a cluster manager, we need more than half the quorum nodes.

This requirement is met using an additional quorum node that is not part of either site.
• This is typically called the “tiebreaker quorum node”.
• If the tiebreaker quorum node is down, the two sites can form quorum without it.
Stretched file systems,
failure groups, and replication

… because redundancy and repetition can be good things

Replication and Failure Groups

[Diagram: the blocks of fileA and fileB, with replication factor 2, spread across Nsd01 (failure group 1), Nsd02 (failure group 2), and Nsd03 (failure group 3)]

A storage pool is a class of storage device. Every NSD (disk volume) is assigned to a pool when it is added to a file system.

A failure group indicates the failure domain of an NSD (often linked to its location). Every NSD is assigned to a failure group when it is added to a file system.

For every file system, there is a metadata replication factor and a default data replication factor – these may be 1, 2, or 3 (but no higher than the maximums set for the file system).

Every file has a storage pool and a replication factor, r, associated with it:
• Generally r will be the default data replication factor, but it may be adjusted for an individual file (up to the maximum data replication factor for the file system).
• Every block of every file has r instances (“replicas”), each in the same pool, but in different failure groups.
• This is not disk mirroring, but the effect is similar.
• Every block of a file is in the same storage pool.

Failure groups are useful to separate fault-tolerant regions (storage server, rack, row, data center, etc.).

Judicious use of replication enables updating file system components (NSD servers, disk firmware, etc.) while the file system remains active.
Creating failure groups

When adding disks to a file system, be sure to include a failureGroup clause in the stanza. This example adds two disks to each of failure groups 1 and 2:

%nsd:
nsd=d1
device=/dev/dm-2
servers=scale01,scale02
failureGroup=1
%nsd:
nsd=d2
device=/dev/dm-4
servers=scale02,scale01
failureGroup=2
%nsd:
nsd=d3
device=/dev/dm-6
servers=scale01,scale02
failureGroup=1
%nsd:
nsd=d4
device=/dev/dm-7
servers=scale02,scale01
failureGroup=2

mmcrnsd -F fs1-new.stanza
mmadddisk fs1 -F fs1-new.stanza

The file system also must be configured to support replication – the maximum data replication (-R) and maximum metadata replication (-M) cannot be changed later.

mmcrfs fs1 -F fs1.stanza -m 2 -M 3 -r 2 -R 3 \
  -Q yes -A yes -i 4k -S relatime --filesetdf -k all \
  -T /scale/fs1
Stretched clusters enable… s t r e t c h e d file systems

[Diagram: Site A and Site B, each with a quorum node (Q), its own SAN, and its own failure group (failure group 1 at Site A, failure group 2 at Site B), connected by a WAN; a “tiebreaker quorum node” (Q) sits outside both sites]

Because the cluster is stretched, all file systems are visible from nodes in both sites.

A stretched file system uses replication and failure groups to ensure all metadata and data has a replica at both sites.
• If a site should fail, the other site will be able to continue to read and write to the local failure group.
• When the failed site returns, the recovered failure group will be updated to reflect changes it missed.

A stretched cluster may have both stretched and “unstretched” file systems.
• An “unstretched” file system – one whose storage is only at one site – will become unavailable if that site goes down.
Synchronous writes

[Diagram: NSD servers and storage for failureGroup 1 and failureGroup 2, connected by Ethernet across the WAN]

Writes are synchronous – all replicas are written in parallel. (OS buffering helps mitigate the WAN performance penalty.)

Updating files also requires a log record (metadata) written before the data is written – this is also synchronous.

When a site fails, its disks are placed in the stopped state. However, as files are written, storage is still allocated on the stopped disks.

Other conditions may affect replication – the -K replication strictness flag affects these cases. Leave this at the default of whenpossible (see appendix for more details).

Missed updates are checked and corrected when failed disks are brought back online.
Which replica is read?

[Diagram: failureGroup 1 NSD servers on one subnet (example: 192.168.10.0/24) and failureGroup 2 NSD servers on another (example: 192.168.20.0/24), connected by Ethernet across the WAN]

You can avoid the WAN latency penalty when reading data by setting readReplicaPolicy.

The readReplicaPolicy controls how Scale chooses a replica when a node reads a file block:

DEFAULT – Any replica may be used to satisfy the request.

local – If a replica can be obtained through a direct connection (SAN), use that one first. If a replica is on an NSD server on the same layer 3 network (subnet) as the requesting node, use that replica. Finally, any replica may be used.

fastest – Based on disk statistics, use the “fastest” disk.

Consider aligning failure group strategy to networks, to facilitate using the local setting.

The fastest setting is best used when intersite latency is large relative to disk seek times. Related tuning parameters:
• fastestPolicyMinDiffPercent
• fastestPolicyNumReadSamples
• fastestPolicyMaxValidPeriod
• fastestPolicyCmpThreshold
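A minimal sketch of switching to the local read policy (readReplicaPolicy is a cluster configuration attribute; verify in the documentation whether your code level requires a daemon restart for the change to take effect):

mmchconfig readReplicaPolicy=local
mmlsconfig readReplicaPolicy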
Limitations of failure groups

[Diagram: a cosmic ray zaps a disk in failureGroup 1; the read request returns an error (T10-PI or parity read check), and the read is retried from another failure group across the WAN]

Metadata blocks are written with checksums, so if a replica has a URE, another replica is used.

Data blocks, unless written to Spectrum Scale RAID disks (e.g., an ESS), have no checksums, so UREs can corrupt data.
• Defend against this by using RAID6 and enabling T10-PI or parity read checks; or use an ESS.
• Note that nsdCksumTraditional=yes will enable data block checksums on network transfers, not on the disk storage itself.

Do not replicate to thin-provisioned storage without using appropriate thinDiskType NSD specifications. If an underlying volume unexpectedly fills, log recovery may also fail, leaving the file system offline.
Checking the replication status of file systems and files
Use mmlsfs to check on the replication capabilities of the file system.

Use mmlsattr to check the replication status of a file.

Use mmchfs to change the default replication of a file system (as permitted by the “maximum” settings), followed by mmrestripefs.

The policy engine is able to change the replication status of a file, but it cannot determine which failure groups may be used. (The mmchattr command can also change the replication factor of a file.)

# mmlsfs fs1 -m -M -r -R
flag                value                    description
------------------- ------------------------ -----------------------------------
 -m                 2                        Default number of metadata replicas
 -M                 2                        Maximum number of metadata replicas
 -r                 2                        Default number of data replicas
 -R                 2                        Maximum number of data replicas

# mmlsattr /scale/fs1/big*
replication factors
metadata(max) data(max) file              [flags]
------------- --------- ------------------------
      2 (  2)   2 (  2) /scale/fs1/big
      2 (  2)   1 (  2) /scale/fs1/bigger [unbalanced]
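As a sketch of that workflow – raising the defaults, re-replicating existing files, and adjusting a single file – with the caveat that mmrestripefs -R is my recollection of the option that applies the new defaults to existing files (verify against the command reference):

mmchfs fs1 -m 2 -r 2
mmrestripefs fs1 -R
mmchattr -r 2 /scale/fs1/big
mmlsattr /scale/fs1/big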
Recovering from failed disks

[Diagram: inode table, an inode, an indirect block, and data blocks flagged with missed updates]

To start disks after failure and correct any missed updates, use:
mmchdisk FSNAME start -a

Recovery uses the “PIT” mechanism to split the work over multiple nodes.
• By default, this work is done using all nodes in the cluster.
• The defaultHelperNodes configuration setting limits this work to a subset of nodes.

Inodes are scanned to determine which have the “missed update” flag set.
• These represent files with blocks needing updates.
• The data block pointers are scanned to determine which have missed updates.
• The updated block is copied into the missed block, and the “missed update” flags are cleared.
File system descriptor quorum
…an annoying detail

… or, why two failure groups are never enough!

A third failure group for File System Descriptor quorum

[Diagram: Site A (failure group 1) and Site B (failure group 2) connected by a WAN; the tiebreaker quorum node holds failure group 3]

The file system descriptor is a data structure describing attributes and disks in the file system.

Only a very few disks of a file system have current file system descriptors (see appendix), but a majority must be available to mount the file system.

When there are at least 3 disks, then 3 active replicas are maintained, each in a different failure group if there are at least 3 failure groups.

Each of our main failure groups will receive a copy of the active file system descriptor.
• Put a third failure group on the tiebreaker quorum node, so if we have node quorum, we also have file system descriptor quorum.
Setting up descriptor-only disks

A disk may be offered as a candidate for holding only a file system descriptor by indicating its usage as descOnly in the stanza file used to add it to the file system.

descOnly only offers the disk as a candidate for an active file system descriptor.
• It alone is not sufficient to guarantee an active file descriptor is written there!
• In other words, it repels data and metadata, but doesn’t attract an active file system descriptor.
• Other rules, like being the only candidate disk in a third or fifth failure group, can force the issue.

A descOnly disk can be small (128 MiB is enough).

These are completely different than “tiebreaker quorum disks” – do not configure them as such!

%nsd:
nsd=fs1desc3
device=/dev/sdb
servers=tiebreak
usage=descOnly
failureGroup=3
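A minimal sketch of creating that NSD and joining it to fs1 as the third failure group; tiebreak-desc.stanza is a hypothetical file name for a stanza file containing the stanza above:

mmcrnsd -F tiebreak-desc.stanza
mmadddisk fs1 -F tiebreak-desc.stanza
mmlsdisk fs1 -L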
Operations on file system descriptors

To determine which disks have active file system descriptors:
mmlsdisk FSNAME -L

To move the file system descriptor off a disk, first suspend it (at which point another candidate is chosen), then resume it.

# mmlsdisk fs1 -L
 disk       driver sector failure holds    holds                            storage
 name       type   size   group   metadata data  status availability disk id pool   remarks
 ---------- ------ ------ ------- -------- ----- ------ ------------ ------- ------ -------
 a01a       nsd    512    1       Yes      Yes   ready  up           1       system
 b01a       nsd    512    2       Yes      Yes   ready  up           2       system desc
 a02a       nsd    512    1       Yes      Yes   ready  up           3       system desc
 b02a       nsd    512    2       Yes      Yes   ready  up           4       system
 a03a       nsd    512    1       Yes      Yes   ready  up           5       system
 b03a       nsd    512    2       Yes      Yes   ready  up           6       system
 a01b       nsd    512    1       Yes      Yes   ready  up           7       system
 b01b       nsd    512    2       Yes      Yes   ready  up           8       system
 a02b       nsd    512    1       Yes      Yes   ready  up           9       system
 b02b       nsd    512    2       Yes      Yes   ready  up           10      system
 a03b       nsd    512    1       Yes      Yes   ready  up           11      system
 b03b       nsd    512    2       Yes      Yes   ready  up           12      system
 a01c       nsd    512    1       Yes      Yes   ready  up           13      system
 b01c       nsd    512    2       Yes      Yes   ready  up           14      system
 a02c       nsd    512    1       Yes      Yes   ready  up           15      system
 b02c       nsd    512    2       Yes      Yes   ready  up           16      system
 a03c       nsd    512    1       Yes      Yes   ready  up           17      system
 b03c       nsd    512    2       Yes      Yes   ready  up           18      system
 fs1desc3   nsd    512    3       No       No    ready  up           19      system desc
 Number of quorum disks: 3
 Read quorum value:      2
 Write quorum value:     2
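A hedged sketch of moving an active descriptor off a disk, using b01a from the output above purely as an example; while the disk is suspended another candidate is chosen, and the “desc” remark in mmlsdisk shows where it moved:

mmchdisk fs1 suspend -d "b01a"
mmchdisk fs1 resume -d "b01a"
mmlsdisk fs1 -L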
A Spectrum Scale stretched cluster

… bringing this all together

Recipe: What goes into a stretched cluster? [Critical]
[Diagram: Site A and Site B application nodes and quorum nodes (Q) sharing the file system over a WAN (items 1 and 2), plus a tiebreaker quorum node with a descOnly disk for each stretched file system (item 3)]

For a stretched cluster, we need:
1. All disks in each of two sites are assigned to that site’s failure group – and the total capacity of each failure group should be the same. The file system must use 2-way replication and whenpossible replication strictness.
2. Each site has 1 or possibly 2 quorum nodes (must be the same at both sites).
3. A reliable, high-bandwidth, low-latency (ideally less than about 10 ms) WAN link between both sites, as well as to the tiebreaker quorum node.
4. A tiebreaker quorum node (ideally outside either data center) is part of the cluster, as either the third or fifth quorum node.
   • For each stretched file system in the cluster, the tiebreaker quorum node needs a small disk (or even a partition), about 128 MiB, joined to the file system as a third failure group.
   • Keep the cluster manager function off the tiebreaker quorum node.
Recipe: What goes into a stretched cluster? [Best practice]
[Diagram: the same stretched-cluster layout as the previous slide]

Some additional best practices:
• Design the network so each site is a different layer 3 subnet, allowing the use of readReplicaPolicy=local.
• Do not assign the manager role to the tiebreaker quorum node – it typically is not well enough connected to the sites to be a suitable token or file system manager node.
• Learn how to use node classes, and create node classes to help manage site-specific node roles (for example Aquorum/Bquorum, Aces/Bces, Ansd/Bnsd).
• Choose a set of nodes that will be enlisted for PIT workers and define defaultHelperNodes (see the sketch after this list).
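A minimal sketch of those last two practices, reusing node names from the earlier example output; the class names and the quoting of the defaultHelperNodes value are assumptions to verify against the mmcrnodeclass and mmchconfig documentation:

mmcrnodeclass Ansd -N a-scale01,a-scale02,a-scale03
mmcrnodeclass Bnsd -N b-scale01,b-scale02,b-scale03
mmchconfig defaultHelperNodes="a-scale02,a-scale03,b-scale02,b-scale03"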
Stretched cluster with Elastic Storage System (ESS)

[Diagram: an ESS at each site, each with its own EMS, provisioning and service networks, and a local InfiniBand fabric; CES and quorum (Q) nodes on Ethernet spanning the WAN; a tiebreaker quorum node with a descOnly disk for each file system]

Typically we want each site with an ESS to have its own management system (EMS).

A word of caution: we are using the word “cluster” in two different ways.
• Set up separate xCAT clusters at each site.
• Configure everything as a single Spectrum Scale cluster (before establishing recovery groups).

Federate performance monitoring, so the GUI shows performance of the entire Spectrum Scale cluster.

Ideally keep the quorum function off ionodes (perhaps placing it on CES nodes).

Use Ethernet for the daemon network, to span the WAN. (It’s possible to bridge IPoIB traffic, but don’t expect performance.)

Each site can have local InfiniBand fabrics (make sure the verbsPorts specifications put each into a separate fabric).
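As a hedged sketch of keeping the two InfiniBand fabrics distinct, the third field of a verbsPorts entry is a fabric number; the adapter name and the Ansd/Bnsd node classes are assumptions:

mmchconfig verbsPorts="mlx5_0/1/1" -N Ansd
mmchconfig verbsPorts="mlx5_0/1/2" -N Bnsd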
After a failure…

Site B nodes are down, but we still have enough quorum nodes to be quorate.

mmlsdisk shows that site B (failure group 2) disks are marked down.

From site A nodes, the file system remains fully functional – applications can both read and write data.

# mmgetstate -aL

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state   Remarks
-------------------------------------------------------------------------------------
      1       a-scale01     2        2          7        active      quorum node
      2       b-scale01     0        0          7        unknown     quorum node
      3       tiebreak      2        2          7        active      quorum node
      4       a-scale02     2        2          7        active
      5       a-scale03     2        2          7        active
      6       b-scale02     0        0          7        unknown
      7       b-scale03     0        0          7        unknown

# mmlsdisk fs1
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
a01a nsd 512 1 Yes Yes ready up system
b01a nsd 512 2 Yes Yes ready down system
a02a nsd 512 1 Yes Yes ready up system
b02a nsd 512 2 Yes Yes ready down system
a03a nsd 512 1 Yes Yes ready up system
b03a nsd 512 2 Yes Yes ready down system
a01b nsd 512 1 Yes Yes ready up system
b01b nsd 512 2 Yes Yes ready down system
a02b nsd 512 1 Yes Yes ready up system
b02b nsd 512 2 Yes Yes ready down system
a03b nsd 512 1 Yes Yes ready up system
b03b nsd 512 2 Yes Yes ready down system
a01c nsd 512 1 Yes Yes ready up system
b01c nsd 512 2 Yes Yes ready down system
a02c nsd 512 1 Yes Yes ready up system
b02c nsd 512 2 Yes Yes ready down system
a03c nsd 512 1 Yes Yes ready up system
b03c nsd 512 2 Yes Yes ready down system
fs1desc3 nsd 512 3 No No ready up system

# ls /scale/fs1
big big2 bigger testset

Failback

After site B nodes are back up, we find the nodes are healthy, other than their disks being down.

Disks will remain down until explicitly started.

Meanwhile, even on the site B nodes, the replicated file system is fully accessible.

# mmhealth cluster show node

Component   Node        Status    Reasons
------------------------------------------------------------------------------------------
NODE        a-scale01   HEALTHY   -
NODE        tiebreak    HEALTHY   -
NODE        b-scale01   HEALTHY   disk_down
NODE        a-scale03   HEALTHY   -
NODE        a-scale02   HEALTHY   -
NODE        b-scale03   HEALTHY   disk_down
NODE        b-scale02   HEALTHY   disk_down

# mmlsdisk fs1
disk driver sector failure holds holds storage
name type size group metadata data status availability pool

------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
a01a nsd 512 1 Yes Yes ready up system
b01a nsd 512 2 Yes Yes ready down system
a02a nsd 512 1 Yes Yes ready up system
b02a nsd 512 2 Yes Yes ready down system
a03a nsd 512 1 Yes Yes ready up system
b03a nsd 512 2 Yes Yes ready down system
a01b nsd 512 1 Yes Yes ready up system
b01b nsd 512 2 Yes Yes ready down system
a02b nsd 512 1 Yes Yes ready up system
b02b nsd 512 2 Yes Yes ready down system
a03b nsd 512 1 Yes Yes ready up system
b03b nsd 512 2 Yes Yes ready down system
a01c nsd 512 1 Yes Yes ready up system
b01c nsd 512 2 Yes Yes ready down system
a02c nsd 512 1 Yes Yes ready up system
b02c nsd 512 2 Yes Yes ready down system
a03c nsd 512 1 Yes Yes ready up system
b03c nsd 512 2 Yes Yes ready down system
fs1desc3 nsd 512 3 No No ready up system

# ls /scale/fs1
big big2 bigger testset

Recovery

Start all disks at once:
mmchdisk fs1 start -a

Note that there is no need to restripe the file system after disks are started – the mmchdisk command will update the disks as it brings them online.

# mmchdisk fs1 start -a
mmnsddiscover: Attempting to rediscover the disks. This may take a while ...
mmnsddiscover: Finished.
tiebreak: Rediscovered nsd server access to fs1desc3.
a-scale01: Rediscovered nsd server access to a01b.
a-scale03: Rediscovered nsd server access to a03a.
a-scale03: Rediscovered nsd server access to a03b.
a-scale02: Rediscovered nsd server access to a02a.
b-scale03: Rediscovered nsd server access to b03a.
a-scale02: Rediscovered nsd server access to a02b.
b-scale02: Rediscovered nsd server access to b02a.
b-scale02: Rediscovered nsd server access to b02b.

b-scale01: Rediscovered nsd server access to b01a.
b-scale01: Rediscovered nsd server access to b01b.
a-scale01: Rediscovered nsd server access to a01c.
a-scale01: Rediscovered nsd server access to a01a.
a-scale03: Rediscovered nsd server access to a03c.
a-scale02: Rediscovered nsd server access to a02c.
b-scale03: Rediscovered nsd server access to b03b.
b-scale03: Rediscovered nsd server access to b03c.
b-scale02: Rediscovered nsd server access to b02c.
b-scale01: Rediscovered nsd server access to b01c.
Scanning file system metadata, phase 1 ...
100 % complete on Fri May 8 12:52:52 2020
Scan completed successfully.
Scanning file system metadata, phase 2 ...
100 % complete on Fri May 8 12:52:52 2020
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
100 % complete on Fri May 8 12:52:52 2020
Scan completed successfully.
Scanning file system metadata, phase 5 ...
100 % complete on Fri May 8 12:52:53 2020
Scan completed successfully.
Scanning user file metadata ...
100.00 % complete on Fri May 8 12:52:53 2020 ( 93184 inodes with total 3717 MB data processed)
Scan completed successfully.

After recovery

All disks are now marked as available.


# mmlsdisk fs1
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
a01a nsd 512 1 Yes Yes ready up system
b01a nsd 512 2 Yes Yes ready up system
a02a nsd 512 1 Yes Yes ready up system
b02a nsd 512 2 Yes Yes ready up system
a03a nsd 512 1 Yes Yes ready up system
b03a nsd 512 2 Yes Yes ready up system
a01b nsd 512 1 Yes Yes ready up system
b01b nsd 512 2 Yes Yes ready up system
a02b nsd 512 1 Yes Yes ready up system
b02b nsd 512 2 Yes Yes ready up system
a03b nsd 512 1 Yes Yes ready up system
b03b nsd 512 2 Yes Yes ready up system
a01c nsd 512 1 Yes Yes ready up system
b01c nsd 512 2 Yes Yes ready up system
a02c nsd 512 1 Yes Yes ready up system
b02c nsd 512 2 Yes Yes ready up system
a03c nsd 512 1 Yes Yes ready up system
b03c nsd 512 2 Yes Yes ready up system
fs1desc3 nsd 512 3 No No ready up system

Some limitations and mitigations
[Diagram: the stretched-cluster layout with Site A, Site B, and the tiebreaker quorum node]

A stretched file system provides active/active high availability, but it is still a single Spectrum Scale cluster with file systems stretched across sites.
• Thus the entire cluster, or any stretched file system, is subject to a misconfiguration or software error destroying everything.
• Use callbacks to save configuration state after each change – see, for example:
  /usr/lpp/mmfs/samples/mmsdrbackup.sample
• Use data protection for the entire cluster.
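A hedged sketch of enabling that sample as the mmsdrbackup user exit (the /var/mmfs/etc destination is the conventional location for user exits; review and adapt the copied script before relying on it):

cp /usr/lpp/mmfs/samples/mmsdrbackup.sample /var/mmfs/etc/mmsdrbackup
chmod +x /var/mmfs/etc/mmsdrbackup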
Options for the tiebreaker quorum node
[Diagram: the stretched-cluster layout with the tiebreaker quorum node holding a descOnly disk for each file system]

Ideally use an actual third site.
• Perhaps it can go into an off-premises cloud VM, such as in a public cloud?
• Be sure that if either storage site fails, the other will still have access to the tiebreaker node.

Sometimes it is unavoidable to have the tiebreaker node in one of your main sites.
• Put it into the site you “favor” – the primary site – this is the site that should stay up if the WAN link fails.
• If possible, isolate the tiebreaker node from common failures that may affect the rest of the primary site – put it on different power circuits, network switches, etc.
• The tiebreaker node can be a virtual machine.
• Keep in mind that if the favored site fails, the other site will fail too – but you can get Spectrum Scale back up quickly.
• Consider preparing a standby tiebreaker node at the secondary site.

Remember you need a little storage for the file system descriptor NSDs.
Standby tiebreaker node?
[Diagram: Site A holds the tiebreaker quorum node with a descOnly NSD for each file system; Site B holds a standby tiebreaker node with an unused NSD for each file system]

When it is unavoidable for the tiebreaker quorum node to reside in the primary site…
• In the event of an unplanned failure of the primary site, the secondary site will also go down. This is a quorum emergency.
• A standby quorum node in the secondary site can facilitate the recovery process.
• For a planned outage of the primary site, the quorum and file system descriptor functions can be migrated, before the outage, from the normal tiebreaker node to the standby tiebreaker node.

The standby quorum node should be actively part of the Spectrum Scale cluster. Standby file system descriptor disks should already be NSDs.
• Do not make this a quorum node until needed.
• Do not add the NSDs to the file systems until needed.
Final considerations

… and the last word

The intersite network
[Diagram: Site A and Site B sharing the file system over a WAN, with the tiebreaker quorum node holding a descOnly disk for each file system]

The WAN link needs sufficient bandwidth for the expected workload.
• Pay attention to congestion from other users of this link.
• Pay attention to bonding (and its misuse).

Latency needs to be minimized, but what is tolerable depends on the workload. Latency is a function of distance, as well as other factors. Up to 300 km is regularly tested.

Generally I recommend the daemon network be Ethernet (you can still use an InfiniBand fabric for local traffic).
Where does the cluster manager run?

[Diagram: quorum nodes A and B at the two sites and quorum node C as the tiebreaker, connected by the WAN]

Make sure the cluster manager runs on a quorum node at one of the sites, not at a third-site tiebreaker quorum node.
• If the WAN link between sites fails, the site with the cluster manager will remain up.
• In particular, if one site is “primary”, run it there.
• Callbacks could be used to automate keeping the cluster manager off the tiebreaker node (see the sketch below).
• We want to avoid a situation where the intersite link fails but the tiebreaker is still visible to both sites. If it were the cluster manager, it would try to keep both sites active, but nodes will be requesting expels when they find they can’t communicate cross-site.

This probably limits us to 2-way stretched clusters, even though 3-way replication is supported.
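One way to implement that guard is a small script, run from a callback or cron, that moves the cluster manager whenever it lands on the tiebreaker; this is only a sketch, and the node names tiebreak and a-scale01 are taken from the earlier examples:

#!/bin/bash
# Move the cluster manager back to a site quorum node if it is currently the tiebreaker.
current=$(/usr/lpp/mmfs/bin/mmlsmgr -c | awk '{print $NF}' | tr -d '()')
if [ "$current" = "tiebreak" ]; then
    /usr/lpp/mmfs/bin/mmchmgr -c a-scale01
fi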
Token management
[Diagram: the cluster/file system layers figure and the stretched-cluster layout, repeated from earlier slides]

While token management is a file system layer concept, the token managers are common cluster resources.
• Nodes will use token managers on both sides of the WAN link – workloads with bursts of token activity will see latency.
• Increasing maxStatCache can reduce token traffic after the cache is warm.

If there are no designated manager nodes, the file system manager will be on a quorum node and will be the only token manager for that file system.
• If one site is actually the primary site for all file systems, put all the token managers there.
• If a single node is sufficient to manage token traffic, it may be possible to have no manager nodes, and perhaps use callbacks to ensure an appropriate quorum node is chosen for each file system’s file system manager.
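A hedged sketch of raising maxStatCache on the client nodes (the value and the clientNodes node class are arbitrary examples; as I recall, the change takes effect when GPFS is restarted on the affected nodes):

mmchconfig maxStatCache=20000 -N clientNodes
mmlsconfig maxStatCache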
Interactions with some other Spectrum Scale features

Protocol nodes
• If the two sites are in different address spaces, then CES floating addresses cannot float between sites.
• High latency between sites will impact performance of SMB cluster file locking.

Transparent Cloud Tiering
• Be sure to place TCT gateway nodes at both sites.

AFM home
• Typically requires protocol nodes, and for high availability, both sites will need these protocol nodes. Make sure AFM mappings at the cache site can use either site’s protocol nodes. If the home CES addresses don’t cross sites, the cache may disconnect if a site goes down. (Here be dragons. Leverage callbacks at the cache site.)

AFM cache
• All nodes communicate with AFM gateway nodes, so write latency is magnified.
Conclusions: Spectrum Scale stretched cluster architecture
[Diagram: the stretched-cluster layout – Site A, Site B, the WAN, and the tiebreaker quorum node]

Experience shows the Spectrum Scale stretched cluster architecture is a solid architecture for business resiliency.
Thank you!
Please help us to improve Spectrum Scale with your feedback.
• If you get a survey in email or a popup from the GUI, please respond.
• We read every single reply.

Spectrum Scale User Group

The Spectrum Scale (GPFS) User Group is free to join and open to all using, interested in using, or integrating IBM Spectrum Scale.

The format of the group is as a web community with events held during the year, hosted by our members or by IBM.

See our web page for upcoming events and presentations of past events. Join our conversation via mail and Slack.

www.spectrumscaleug.org
Appendix
Tuning quorum

… and what makes a good quorum node?

Tuning disk leasing

failureDetectionTime – How many seconds it takes to detect that a node is down (default is 35 seconds, the same duration as a disk lease).

leaseRecoveryWait – When a node fails, wait until known leases expire, then wait this many seconds before starting recovery. Default is 35 seconds. The intent is to give “in-flight” I/O time to get through controllers on to disks.

usePersistentReserve – Enables SCSI persistent reserve for disk fencing. Please check the documentation for guidance.

minMissedPingTimeout, maxMissedPingTimeout – Set the range in which the calculated “missed ping timeout” (MPT) may fall (default between 3 and 60).
• The default MPT is leaseRecoveryWait - 5.
• After a lease expires, the cluster manager will ping the node, waiting MPT seconds for a response. After that, the node is expelled.

totalPingTimeout – Nodes responding to ICMP pings but not sending heartbeats will be declared dead after this timeout (default 120 seconds).
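A quick sketch for inspecting these settings, plus a hedged example of changing one – note that some of these attributes can only be changed while GPFS is down, so check the mmchconfig documentation first:

mmlsconfig failureDetectionTime
mmlsconfig leaseRecoveryWait
mmlsconfig minMissedPingTimeout
mmchconfig leaseRecoveryWait=35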
Considerations in choosing quorum nodes

Choose the smallest odd number of nodes that meets the necessary redundancy constraints.
• For a stretched cluster, this would typically be 3 or maybe 5 quorum nodes.
• Each site would have the same number (1 or 2) of quorum nodes.
• A tiebreaker quorum node is also needed, typically at a third site, or somehow independent of either site failing.
• Only use tiebreaker disk quorum for small clusters, where all quorum nodes directly access the tiebreaker disks.

Choose reliable nodes!
• Network usage is light, but critical.
• The quorum function needs the local file system /var/mmfs to be responsive (not blocked by other system I/O to the same disk or controller).
• An ESS ionode is a poor choice for a quorum node, since its internal NVMe is on the same disk controller as the local OS disk.

Try to isolate quorum nodes from common failure modes:
• Put them in separate racks, on separate circuits, etc.
Components of a resilient cluster

[Diagram: NSD client/application nodes on a LAN, served by two NSD servers (A and B) that are multipathed to the shared disks]

A resilient cluster uses multiple techniques to survive local failures:
• Multiple links between disks and NSD servers (or application nodes), using multiple controllers and multiple HBAs – multipath is handled by the OS.
• Multiple NSD servers (or application nodes) connected to each disk.
• If this is not an ESS:
  • Use RAID6 volumes.
  • Enable T10-PI or parity read checks – Spectrum Scale replication will not protect you from UREs (except with metadata).
• Plan a robust quorum configuration, using tiebreaker disks if needed.
• Consider the use of multiple failure groups to protect against larger failures.
  • Particularly ensure that a single failure won’t break file system descriptor quorum.
Appendix
File system replication strictness

Synchronous writes and strictness

When all failure groups are on-line, replication proceeds as expected.

When a site experiences a failure, its disks are placed in the stopped state. Writes will still allocate space in the stopped failure group, and the disks will receive the missed updates when started.

Other conditions may prevent normal allocation of replicas in a failure group. The -K replication enforcement parameter controls what happens in these cases:
• no – If at least one replica can be allocated, the write completes successfully.
• whenpossible – If enough failure groups are online, all replicas must be allocated to report success. If not enough are available, do not enforce replication. This is the default and is generally the correct setting for a stretched cluster.
• always – All required replicas must be allocated; else the write fails with ENOSPC.

In all cases, the replication factor of files is set.

All disks in a failure group are stopped:
• no: allocates in all failure groups
• whenpossible: allocates in all failure groups
• always: allocates in all failure groups

All disks in a failure group are suspended:
• no: only allocates in the unsuspended failure groups (3)
• whenpossible: only allocates in the unsuspended failure groups (2,3)
• always: fails with out of space (2)

One failure group runs out of space:
• no: allocates in failure groups with space (1,3)
• whenpossible: fails with out of space
• always: fails with out of space

(1) Failure is silent
(2) Warning when you place disks in this state
(3) Recovery requires mmrestripefs -r
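A minimal sketch of checking and (hedged) setting the strictness on an existing file system; -K can also be given to mmcrfs at creation time:

mmlsfs fs1 -K
mmchfs fs1 -K whenpossible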
Appendix
File system descriptor quorum

… or, why two failure groups are never enough!

File system descriptor quorum

[Diagram: failure group 1 with two disks holding active FSDesc replicas, failure group 2 with one disk holding an FSDesc replica. Even with replication, this file system will not mount if only failure group 2 is active!]

The file system descriptor is a data structure describing attributes and disks in the file system.
• Every disk in a file system has a replica of the file system descriptor, starting at sector 8.
• However, when there are more than 3 disks in a file system, only several replicas are guaranteed to be active and up to date.
  • If there are at least 5 failure groups, then 5 active replicas are maintained, each in a different failure group.
  • Otherwise, if there are at least 3 disks, then 3 active replicas are maintained, each in a different failure group if there are at least 3 failure groups.
  • Otherwise an active replica is maintained on each disk (there are only 1 or 2 disks).
  • Note that storage pools are ignored when choosing disks.

For each file system, a majority of its active file system descriptor replicas must be available for that file system to be mounted – this is file system descriptor quorum.
Setting up descriptor-only disks

A disk may be offered as a candidate for holding only a file system descriptor by indicating its usage as descOnly in the stanza file used to add it to the file system.

descOnly only offers the disk as a candidate for an active file system descriptor.
• It alone is not sufficient to guarantee an active file descriptor is written there!
• In other words, it repels data and metadata, but doesn’t attract an active file system descriptor.
• Other rules, like being the only candidate disk in a third or fifth failure group, can force the issue.

A descOnly disk can be small (128 MiB is enough).

These are completely different than “tiebreaker quorum disks”!

%nsd:
nsd=fs1desc3
device=/dev/sdb
servers=tiebreak
usage=descOnly
failureGroup=3
Operations on file system descriptors

To determine which disks have active file system descriptors:
mmlsdisk FSNAME -L

To move the file system descriptor off a disk, first suspend it (at which point another candidate is chosen), then resume it.

# mmlsdisk fs1 -L
 disk       driver sector failure holds    holds                            storage
 name       type   size   group   metadata data  status availability disk id pool   remarks
 ---------- ------ ------ ------- -------- ----- ------ ------------ ------- ------ -------
 a01a       nsd    512    1       Yes      Yes   ready  up           1       system
 b01a       nsd    512    2       Yes      Yes   ready  up           2       system desc
 a02a       nsd    512    1       Yes      Yes   ready  up           3       system desc
 b02a       nsd    512    2       Yes      Yes   ready  up           4       system
 a03a       nsd    512    1       Yes      Yes   ready  up           5       system
 b03a       nsd    512    2       Yes      Yes   ready  up           6       system
 a01b       nsd    512    1       Yes      Yes   ready  up           7       system
 b01b       nsd    512    2       Yes      Yes   ready  up           8       system
 a02b       nsd    512    1       Yes      Yes   ready  up           9       system
 b02b       nsd    512    2       Yes      Yes   ready  up           10      system
 a03b       nsd    512    1       Yes      Yes   ready  up           11      system
 b03b       nsd    512    2       Yes      Yes   ready  up           12      system
 a01c       nsd    512    1       Yes      Yes   ready  up           13      system
 b01c       nsd    512    2       Yes      Yes   ready  up           14      system
 a02c       nsd    512    1       Yes      Yes   ready  up           15      system
 b02c       nsd    512    2       Yes      Yes   ready  up           16      system
 a03c       nsd    512    1       Yes      Yes   ready  up           17      system
 b03c       nsd    512    2       Yes      Yes   ready  up           18      system
 fs1desc3   nsd    512    3       No       No    ready  up           19      system desc
 Number of quorum disks: 3
 Read quorum value:      2
 Write quorum value:     2
File system descriptors and pools

[Diagram: failure group 1 with FSDesc replicas on both a system-pool disk and a data-pool disk; failure group 2 likewise]

Even though file system descriptors are metadata, they are not bound to the system pool.

Disks receiving active file system descriptors are chosen based solely on failure group, irrespective of the pool.
• If two disks are in the same failure domain, assign them to the same failure group, even if they are in different storage pools.
• If you ignore this rule, you may find file system descriptors are not where you expected them to be – generally discovered when your stretched cluster didn’t stay available during a site failure!

There is no way to force the active file system descriptors to be maintained on a specific class of storage.
Appendix
Dealing with “quorum emergencies”

… because if anything can go wrong, it will!

What is a quorum emergency?
[Diagram: the stretched-cluster layout with the tiebreaker quorum node holding a descOnly disk for each file system]

Sometimes we lose both a site and the tiebreaker quorum node, or too many file system descriptor disks.
• It isn’t unusual to have the tiebreaker node in one of the sites themselves and hope for the best!
• It isn’t uncommon for someone to forget the file system descriptor disk for a file system (like the CES shared root).

To get back into production, we must:
• Restore cluster quorum
• Restore file system descriptor quorum for all file systems we need online
Emergency recovery (CCR configuration)

Shut down Spectrum Scale on SURVIVORS:
mmshutdown -N SURVIVORS

Remove the quorum function from the DOWNQUORUM nodes (comma separated):
mmchnode --nonquorum --force -N DOWNQUORUM
(You will be prompted to confirm this operation!)

For each file system FSNAME, migrate active file system descriptors away from the failure groups that are down (DOWNFG1, DOWNFG2):
mmfsctl FSNAME exclude -G DOWNFG1
mmfsctl FSNAME exclude -G DOWNFG2

Start up Spectrum Scale on the surviving nodes.
Failback from a quorum emergency

This will require another outage of the cluster.

First restart Spectrum Scale on the nodes that survived, without mounting the file systems (-A no). Do not bring up Spectrum Scale on the nodes that had failed.

Restore the failed tiebreaker site as a quorum node, then bring up Spectrum Scale on it.

Restore the quorum function to the failed quorum nodes, then bring back up all the failed nodes.

Create a file of all the NSDs attached to the nodes that had failed. Use it to again enable them to be file system descriptor candidates.
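As a hedged sketch of that last step, assuming mmfsctl include is the counterpart of the exclude used during the emergency, and with failed-site.nsds as a hypothetical file listing those NSDs:

mmfsctl fs1 include -F failed-site.nsds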

