Spectrum Scale Stretched Cluster Best Practices
Episode 2:
#SSUG
Agenda
Active/Passive and Active/Active replication options, by layer:

Application layer
• Active/Passive: Individual applications may have their own means of doing asynchronous replication (Aspera, rsync); these need to run regularly (maybe use cron).
• Active/Active: Individual applications may have their own means of doing synchronous replication.

File system layer
• Active/Passive: Spectrum Scale AFM-DR
• Active/Active: Spectrum Scale failure groups and replication, the “stretched cluster”

Block layer
• Active/Passive: Block synchronous replication, configured for active/passive; block point-in-time copy
• Active/Active: Block synchronous replication, configured for active/active
Financial services
• Major hedge fund for financial HPC storage (30-40 miles between sites)
Automotive
• Major manufacturer using a stretched cluster for HPC storage (about 40 miles)
Life sciences
• Major hospitals using stretched clusters for critical patient documents
[Diagram: a Spectrum Scale file system spanning Site A and Site B over a WAN.]
Layers of Spectrum Scale

The cluster layer is the physical resources (systems, volumes) and logical abstractions (nodes, NSDs).

The file system layer creates the namespaces and manages their associated storage abstractions.

Stretched clusters enable stretched file systems.

[Diagram: the file system layer (file system, storage pool, “disks”, FS/token managers) built on the cluster layer (cluster manager, nodes, NSDs, volumes/LUNs/vdisks, storage, LAN, SAN).]
• A cluster may share a file system with an authenticated
remote cluster.
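A hedged sketch of how such a cross-cluster mount is typically set up (the cluster names, key files, and contact nodes below are hypothetical, and key exchange between the administrators is assumed):

# On the cluster that owns the file system: authorize the remote cluster and grant access
mmauth add clusterB.example.com -k clusterB_key.pub
mmauth grant clusterB.example.com -f fs1
# On the accessing cluster: define the remote cluster and file system, then mount it
mmremotecluster add clusterA.example.com -n a-scale01,b-scale01 -k clusterA_key.pub
mmremotefs add rfs1 -f fs1 -C clusterA.example.com -T /scale/rfs1
mmmount rfs1 -a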
To prevent the cluster manager from being a single point of failure, this is a role that may run on one of several cluster nodes.
Because failures happen, the cluster manager may need to expel “dead” nodes and fence them from resources like shared disks:
• Perhaps a node has failed
• Perhaps a node’s network has failed
• Generally choose a small odd number of quorum nodes, like 3, 5, or possibly 7. More quorum nodes lengthen recovery time; there is no benefit to choosing more than 7.
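As a minimal sketch of assigning these roles (the node names are the same example names used elsewhere in this deck), quorum designation is done with mmchnode:

mmchnode --quorum -N a-scale01,b-scale01,tiebreak   # one quorum node per site plus the tiebreaker
mmchnode --nonquorum -N a-scale02                   # remove the quorum role from a node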
To see which nodes in the cluster are quorum nodes, use mmlscluster, or:

# mmgetstate -aLs
 Node number  Node name   Quorum  Nodes up  Total nodes  GPFS state   Remarks
-------------------------------------------------------------------------------------
      1       a-scale01      2        3          7        active      quorum node
      2       b-scale01      2        3          7        active      quorum node
      3       tiebreak       2        3          7        active      quorum node
      4       a-scale02      2        3          7        active

To move the cluster manager role to another quorum node (mmlsmgr -c shows the current cluster manager):

mmchmgr -c NODENAME

# mmchmgr -c b-scale01
Appointing node 10.0.200.21 (b-scale01) as cluster manager
Node 10.0.200.21 (b-scale01) has taken over as cluster manager
A storage pool is a class of storage device. Every NSD (disk volume) is assigned to a pool when it is added to a file system.

A failure group indicates the failure domain of an NSD (often linked to its location). Every NSD is assigned to a failure group when it is added to a file system.

For every file system, there is a metadata replication factor and a default data replication factor; these may be 1, 2, or 3 (but no higher than the maximums set for the file system).

Every file has a storage pool and a replication factor, r, associated with it:
• Generally r will be the default data replication factor, but it may be adjusted for an individual file (up to the maximum data replication factor for the file system).
• Every block of every file has r instances (“replicas”), each in the same pool, but in different failure groups.
• This is not disk mirroring, but the effect is similar.
• Every block of a file is in the same storage pool.

[Diagram: the blocks of a file with replication factor 2, each block stored on both Nsd01 (failure group 1) and Nsd02 (failure group 2).]
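As a small illustration (the file path is hypothetical), the per-file replication factor can be inspected with mmlsattr and changed with mmchattr:

mmlsattr /scale/fs1/somefile             # show the file's current and maximum replication factors
mmchattr -m 2 -r 2 /scale/fs1/somefile   # set metadata and data replication of this file to 2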
Example NSD stanza file (fs1-new.stanza):

%nsd:
  nsd=d2
  device=/dev/dm-4
  servers=scale02,scale01
  failureGroup=2
%nsd:
  nsd=d3
  device=/dev/dm-6
  servers=scale01,scale02
  failureGroup=1
%nsd:
  nsd=d4
  device=/dev/dm-7
  servers=scale02,scale01
  failureGroup=2

mmcrnsd -F fs1-new.stanza
mmadddisk fs1 -F fs1-new.stanza

The file system also must be configured to support replication: the maximum data replication (-R) and metadata replication (-M) cannot be changed later.

mmcrfs fs1 -F fs1.stanza -m 2 -M 3 -r 2 -R 3 \
  -Q yes -A yes -i 4k -S relatime --filesetdf -k all \
  -T /scale/fs1
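The default replication factors (unlike the maximums) can be raised later; a hedged sketch using the fs1 example above:

mmlsfs fs1 -m -M -r -R    # show default and maximum metadata/data replication
mmchfs fs1 -m 2 -r 2      # raise the defaults used for newly created files
mmrestripefs fs1 -R       # optionally re-replicate existing files to match the new defaults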
A stretched cluster may have both stretched and “unstretched” file systems.
• An “unstretched” file system – one whose storage is only at one site – will become unavailable if that site goes down.
When a site fails, its disks are placed in the stopped state.

[Diagram: NSD servers and disks in failureGroup 1 and failureGroup 2; the failed site’s disks are shown as stopped.]
Other conditions may affect replication; the -K replication strictness flag controls these cases. Leave this at the default of whenpossible (see the appendix for more details).
You can avoid the WAN latency penalty when reading data by setting readReplicaPolicy:

• default – Any replica may be used to satisfy the request.
• local – If a replica is on a locally attached disk, use that one first. If a replica is on an NSD server on the same layer 3 network (subnet) as the requesting node, use that replica. Finally, any replica may be used.

Consider aligning failure group strategy to networks, to facilitate using the local setting.

• fastest – Best used when intersite latency is large relative to disk seek times. Tuning parameters:
  • fastestPolicyMinDiffPercent
  • fastestPolicyNumReadSamples
  • fastestPolicyMaxValidPeriod
  • fastestPolicyCmpThreshold

[Diagram: NSD servers for failureGroup 1 and failureGroup 2 on separate subnets (for example 192.168.10.0/24 and 192.168.20.0/24), connected by a WAN.]
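A sketch of how these policies are set cluster-wide with mmchconfig (the values shown are illustrative, not tuning recommendations):

mmchconfig readReplicaPolicy=local         # prefer a locally attached or same-subnet replica
mmchconfig readReplicaPolicy=fastest       # or: read from the disk measured to be fastest
mmchconfig fastestPolicyNumReadSamples=5   # example tuning knob for the fastest policy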
Replication protects against a disk that returns an error, but not against a disk that silently returns corrupt data. To protect against corrupt data, use storage that performs parity read checks; or use an ESS.
To start disks after a failure and correct any missed updates, use:

mmchdisk FSNAME start -a

Recovery uses the “PIT” mechanism to split the work over multiple nodes:
• By default, this work is done using all nodes in the cluster.
• The defaultHelperNodes configuration setting limits this work to a subset of nodes (see the sketch below).

Inodes are scanned to determine which have the “missed update” flag set:
• These represent files with blocks needing updates.
• The data block pointers are scanned to determine which have missed updates.
• The updated block is copied into the missed block, and the “missed update” flags are cleared.

[Diagram: inode table, an inode with an indirect block, and data blocks flagged as missing updates.]
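A brief sketch of constraining that work and restarting the disks (node names are placeholders):

mmchconfig defaultHelperNodes="a-scale01,a-scale02,b-scale01,b-scale02"   # limit the helper nodes used for PIT work
mmchdisk fs1 start -a                                                     # then start the stopped disks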
mmlsdisk FSNAME -L

disk         driver  sector  failure  holds     holds                        disk
name         type    size    group    metadata  data   status  availability  id  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  --  ------------
a03a         nsd     512     1        Yes       Yes    ready   up             5  system
b03a         nsd     512     2        Yes       Yes    ready   up             6  system
a01b         nsd     512     1        Yes       Yes    ready   up             7  system
b01b         nsd     512     2        Yes       Yes    ready   up             8  system
a02b         nsd     512     1        Yes       Yes    ready   up             9  system
b02b         nsd     512     2        Yes       Yes    ready   up            10  system
• Configure everything as a single Spectrum Scale cluster (before establishing recovery groups).
• Federate performance monitoring, so the GUI shows the performance of the entire Spectrum Scale cluster.
• Use Ethernet for the daemon network, to span the WAN.

[Diagram: an ESS building block at each site (EMS, provisioning and service networks, InfiniBand fabrics 1 and 2 for local traffic), CES protocol nodes, a tiebreaker quorum node, and the WAN-spanning Ethernet daemon network.]
With the disks at site B marked down, the file system remains mounted and usable:

# mmlsdisk fs1
disk         driver  sector  failure  holds     holds
name         type    size    group    metadata  data   status  availability  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  ------------
b03a         nsd     512     2        Yes       Yes    ready   down          system
a01b         nsd     512     1        Yes       Yes    ready   up            system
b01b         nsd     512     2        Yes       Yes    ready   down          system
a02b         nsd     512     1        Yes       Yes    ready   up            system
b02b         nsd     512     2        Yes       Yes    ready   down          system
a03b         nsd     512     1        Yes       Yes    ready   up            system
b03b         nsd     512     2        Yes       Yes    ready   down          system
a01c         nsd     512     1        Yes       Yes    ready   up            system
b01c         nsd     512     2        Yes       Yes    ready   down          system
a02c         nsd     512     1        Yes       Yes    ready   up            system
b02c         nsd     512     2        Yes       Yes    ready   down          system
a03c         nsd     512     1        Yes       Yes    ready   up            system
b03c         nsd     512     2        Yes       Yes    ready   down          system
fs1desc3     nsd     512     3        No        No     ready   up            system

# ls /scale/fs1
big  big2  bigger  testset

Meanwhile, even on the site B nodes, the replicated data is still accessible:

# mmlsdisk fs1
disk         driver  sector  failure  holds     holds
name         type    size    group    metadata  data   status  availability  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  ------------
a01a         nsd     512     1        Yes       Yes    ready   up            system

# ls /scale/fs1
big  big2  bigger  testset
Note that there is no need to restripe the file system; starting the disks corrects the missed updates, and the NSD servers rediscover access:

b-scale01: Rediscovered nsd server access to b01a.
b-scale01: Rediscovered nsd server access to b01b.
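If disk paths do not come back on their own, a small sketch (my assumption, not stated in the slides) of asking the NSD servers to rediscover them:

mmnsddiscover -a -N all    # rediscover NSD paths on all nodes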
[Diagram: Site A and Site B plus a third site with a tiebreaker quorum node (Q) holding a descOnly disk for each file system.]
The standby quorum node should be actively part of the Spectrum Scale cluster, and the standby file system descriptor disks should already be NSDs.
• Do not make this a quorum node until needed.
• Do not add the NSDs to the file systems until needed.

[Diagram: Site A and Site B, a tiebreaker quorum node with a descOnly NSD for each file system, and a standby tiebreaker node with an unused NSD for each file system.]
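A sketch of activating the standby tiebreaker if the original tiebreaker site is lost; the node, stanza file, and NSD names here are hypothetical:

mmchnode --quorum -N tiebreak2             # promote the standby node to quorum
# stanza prepared in advance, e.g.:  %nsd: nsd=fs1desc4 usage=descOnly failureGroup=4
mmadddisk fs1 -F fs1-standby-desc.stanza   # add its descOnly NSD to the file system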
Latency needs to be minimized, but what is tolerable depends on workload.

Latency is a function of distance, as well as other factors. Up to 300 km is regularly tested.

Generally I recommend the daemon network be Ethernet (you can still use an InfiniBand fabric for local traffic; see the sketch below).
This probably limits us to 2-way stretched clusters, even though 3-way replication is supported.
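One way to keep intra-site traffic on a faster local network while the daemon network spans the WAN is the subnets setting; this mechanism is my assumption (the slides do not spell it out), and the subnet values are the examples from the earlier readReplicaPolicy diagram:

mmchconfig subnets="192.168.10.0 192.168.20.0"   # nodes sharing one of these subnets prefer it for traffic between themselves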
While token management is a file system layer function, the token managers are cluster resources.

[Diagram: file system layer (file system, storage pool, “disks”, FS/token managers) and cluster layer (volumes/LUNs), with application nodes (Spectrum Scale clients) at both sites accessing the Spectrum Scale file system. Fragments: “... cache is warm.”; “If there are no token managers, the file system ...”]
Interactions with some other Spectrum Scale features
www.spectrumscaleug.org
Appendix
Tuning quorum
Choose the smallest odd number of quorum nodes that meets the necessary redundancy constraints:
• For a stretched cluster, this would typically be 3 or maybe 5 quorum nodes.
• Each site would have the same number (1 or 2) of quorum nodes.
• A tiebreaker quorum node is also needed, typically at a third site, or otherwise independent of either site failing.
• Only use tiebreaker disk quorum for small clusters, where all quorum nodes directly access the tiebreaker disks.

Choose reliable nodes!
• Network usage is light, but critical.
• The quorum function needs the local file system /var/mmfs to be responsive (not blocked by other system I/O to the same disk or controller).
• An ESS I/O node is a poor choice for a quorum node, since its internal NVMe is on the same disk controller as the local OS disk.

Try to isolate quorum nodes from common failure modes:
• Put them in separate racks, on separate circuits, etc.
When all failure groups are online, replication proceeds as expected.

When a site experiences a failure, its disks are placed in the stopped state. Writes will still allocate space in the stopped failure group, and the disks will receive the updates when started.

Other conditions may prevent normal allocation of replicas in a failure group. The -K replication enforcement parameter controls what happens in these cases (a sketch of checking and changing -K follows the table):

• no – If at least one replica can be allocated, the write completes successfully.
• whenpossible – If enough failure groups are online, all replicas must be allocated to report success. If not enough are available, do not enforce replication. This is the default and is generally the correct setting for a stretched cluster.
• always – All required replicas must be allocated; else the write fails with ENOSPC.

Behavior by condition and -K setting:

All disks in a failure group are stopped
• no: allocates in all failure groups
• whenpossible: allocates in all failure groups
• always: allocates in all failure groups

All disks in a failure group are suspended
• no: only allocates in unsuspended failure groups (3)
• whenpossible: only allocates in unsuspended failure groups (2)(3)
• always: fails with out of space (2)

One failure group runs out of space
• no: allocates in failure groups with space (1)(3)
• whenpossible: fails with out of space
• always: fails with out of space

In all cases, the replication factor of the files is set.

Notes: (1) Failure is silent. (2) Warning when you place disks in this state. (3) Recovery requires mmrestripefs -r.
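For completeness, a sketch of checking and changing -K on an existing file system (fs1 is the example file system used earlier):

mmlsfs fs1 -K                 # show the current replication enforcement setting
mmchfs fs1 -K whenpossible    # restore the default if it has been changed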
The file system descriptor is a data structure describing attributes and disks in the file system.
• Every disk in a file system has a replica of the file system descriptor.

mmlsdisk FSNAME -L

disk         driver  sector  failure  holds     holds                        disk
name         type    size    group    metadata  data   status  availability  id  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  --  ------------
a03a         nsd     512     1        Yes       Yes    ready   up             5  system
b03a         nsd     512     2        Yes       Yes    ready   up             6  system
a01b         nsd     512     1        Yes       Yes    ready   up             7  system
b01b         nsd     512     2        Yes       Yes    ready   up             8  system
a02b         nsd     512     1        Yes       Yes    ready   up             9  system
b02b         nsd     512     2        Yes       Yes    ready   up            10  system
Even though file system descriptors are metadata, they are not bound to the system pool.
• If you ignore this rule, you may find file system descriptors are not where you expected them to be, generally when your stretched cluster didn’t stay available during a site failure!
• There is no way to force the active file system descriptors to be maintained on a specific class of storage.

[Diagram: FSDesc replicas placed on both system-pool and data-pool disks in failure group 1.]