Spectrum Scale Stretched Cluster Best Practices
Episode 2:
#SSUG
Agenda
Active/Passive and Active/Active replication options, by layer:

Application layer
• Active/Passive: Individual applications may have their own means of doing asynchronous replication (Aspera, rsync); these need to run regularly (maybe use cron).
• Active/Active: Individual applications may have their own means of doing synchronous replication.

File system layer
• Active/Passive: Spectrum Scale AFM-DR
• Active/Active: Spectrum Scale failure groups and replication, the “stretched cluster”

Block layer
• Active/Passive: Block synchronous replication, configured for active/passive; block point-in-time copy
• Active/Active: Block synchronous replication, configured for active/active
Financial services
• Major hedge fund for financial HPC storage (30-40 miles between sites)
Automotive
• Major manufacturer using a stretched cluster for HPC storage (about 40 miles)
Life sciences
• Major hospitals using stretched clusters for critical patient documents
[Diagram: a Spectrum Scale file system spanning Site A and Site B over a WAN.]
Layers of Spectrum Scale

The cluster layer is the physical resources (systems, volumes) and logical abstractions (nodes, NSDs).

The file system layer creates the namespaces and manages their associated storage abstractions.

Stretched clusters enable stretched file systems.

[Diagram: the file system layer (file system, storage pool, “disks”, FS/token managers) built on the cluster layer (cluster manager, nodes, NSDs, volumes/LUNs/vdisks, storage, LAN, SAN).]
• A cluster may share a file system with an authenticated
remote cluster.
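A hedged sketch of how such a cross-cluster mount is typically set up (the cluster names, key files, and contact nodes below are hypothetical, and key exchange between the administrators is assumed):

# On the cluster that owns the file system: authorize the remote cluster and grant access
mmauth add clusterB.example.com -k clusterB_key.pub
mmauth grant clusterB.example.com -f fs1
# On the accessing cluster: define the remote cluster and file system, then mount it
mmremotecluster add clusterA.example.com -n a-scale01,b-scale01 -k clusterA_key.pub
mmremotefs add rfs1 -f fs1 -C clusterA.example.com -T /scale/rfs1
mmmount rfs1 -a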
To prevent the cluster manager from being a single point of failure, this is a role that may run on one of several cluster nodes.
Because failures happen, the cluster manager may need to expel “dead” nodes and fence them from resources like shared disks:
• Perhaps a node has failed
• Perhaps a node’s network has failed
• Generally choose a small odd number of quorum nodes, like 3, 5, or possibly 7. More quorum nodes lengthen recovery time; there is no benefit to choosing more than 7.
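As a minimal sketch of assigning these roles (the node names are the same example names used elsewhere in this deck), quorum designation is done with mmchnode:

mmchnode --quorum -N a-scale01,b-scale01,tiebreak   # one quorum node per site plus the tiebreaker
mmchnode --nonquorum -N a-scale02                   # remove the quorum role from a node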
To see which nodes in the cluster are quorum nodes, use mmlscluster, or:

# mmgetstate -aLs
 Node number  Node name   Quorum  Nodes up  Total nodes  GPFS state   Remarks
-------------------------------------------------------------------------------------
      1       a-scale01      2        3          7        active      quorum node
      2       b-scale01      2        3          7        active      quorum node
      3       tiebreak       2        3          7        active      quorum node
      4       a-scale02      2        3          7        active

To move the cluster manager role to another quorum node (mmlsmgr -c shows the current cluster manager):

mmchmgr -c NODENAME

# mmchmgr -c b-scale01
Appointing node 10.0.200.21 (b-scale01) as cluster manager
Node 10.0.200.21 (b-scale01) has taken over as cluster manager
A storage pool is a class of storage device. Every NSD (disk volume) is assigned to a pool when it is added to a file system.

A failure group indicates the failure domain of an NSD (often linked to its location). Every NSD is assigned to a failure group when it is added to a file system.

For every file system, there is a metadata replication factor and a default data replication factor; these may be 1, 2, or 3 (but no higher than the maximums set for the file system).

Every file has a storage pool and a replication factor, r, associated with it:
• Generally r will be the default data replication factor, but it may be adjusted for an individual file (up to the maximum data replication factor for the file system).
• Every block of every file has r instances (“replicas”), each in the same pool, but in different failure groups.
• This is not disk mirroring, but the effect is similar.
• Every block of a file is in the same storage pool.

[Diagram: the blocks of a file with replication factor 2, each block stored on both Nsd01 (failure group 1) and Nsd02 (failure group 2).]
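As a small illustration (the file path is hypothetical), the per-file replication factor can be inspected with mmlsattr and changed with mmchattr:

mmlsattr /scale/fs1/somefile             # show the file's current and maximum replication factors
mmchattr -m 2 -r 2 /scale/fs1/somefile   # set metadata and data replication of this file to 2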
Example NSD stanza file (fs1-new.stanza):

%nsd:
  nsd=d2
  device=/dev/dm-4
  servers=scale02,scale01
  failureGroup=2
%nsd:
  nsd=d3
  device=/dev/dm-6
  servers=scale01,scale02
  failureGroup=1
%nsd:
  nsd=d4
  device=/dev/dm-7
  servers=scale02,scale01
  failureGroup=2

mmcrnsd -F fs1-new.stanza
mmadddisk fs1 -F fs1-new.stanza

The file system also must be configured to support replication: the maximum data replication (-R) and metadata replication (-M) cannot be changed later.

mmcrfs fs1 -F fs1.stanza -m 2 -M 3 -r 2 -R 3 \
  -Q yes -A yes -i 4k -S relatime --filesetdf -k all \
  -T /scale/fs1
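The default replication factors (unlike the maximums) can be raised later; a hedged sketch using the fs1 example above:

mmlsfs fs1 -m -M -r -R    # show default and maximum metadata/data replication
mmchfs fs1 -m 2 -r 2      # raise the defaults used for newly created files
mmrestripefs fs1 -R       # optionally re-replicate existing files to match the new defaults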
A stretched cluster may have both stretched and “unstretched” file systems.
• An “unstretched” file system – one whose storage is only at one site – will become unavailable if that site goes down.
When a site fails, its disks are placed in the stopped state.

[Diagram: NSD servers and disks in failureGroup 1 and failureGroup 2; the failed site’s disks are shown as stopped.]
Other conditions may affect replication; the -K replication strictness flag controls these cases. Leave this at the default of whenpossible (see the appendix for more details).
You can avoid the WAN latency penalty when reading data by setting readReplicaPolicy:

• default – Any replica may be used to satisfy the request.
• local – If a replica is on a locally attached disk, use that one first. If a replica is on an NSD server on the same layer 3 network (subnet) as the requesting node, use that replica. Finally, any replica may be used.

Consider aligning failure group strategy to networks, to facilitate using the local setting.

• fastest – Best used when intersite latency is large relative to disk seek times. Tuning parameters:
  • fastestPolicyMinDiffPercent
  • fastestPolicyNumReadSamples
  • fastestPolicyMaxValidPeriod
  • fastestPolicyCmpThreshold

[Diagram: NSD servers for failureGroup 1 and failureGroup 2 on separate subnets (for example 192.168.10.0/24 and 192.168.20.0/24), connected by a WAN.]
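A sketch of how these policies are set cluster-wide with mmchconfig (the values shown are illustrative, not tuning recommendations):

mmchconfig readReplicaPolicy=local         # prefer a locally attached or same-subnet replica
mmchconfig readReplicaPolicy=fastest       # or: read from the disk measured to be fastest
mmchconfig fastestPolicyNumReadSamples=5   # example tuning knob for the fastest policy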
Replication protects against a disk that returns an error, but not against a disk that silently returns corrupt data. To protect against corrupt data, use storage that performs parity read checks; or use an ESS.
To start disks after a failure and correct any missed updates, use:

mmchdisk FSNAME start -a

Recovery uses the “PIT” mechanism to split the work over multiple nodes:
• By default, this work is done using all nodes in the cluster.
• The defaultHelperNodes configuration setting limits this work to a subset of nodes (see the sketch below).

Inodes are scanned to determine which have the “missed update” flag set:
• These represent files with blocks needing updates.
• The data block pointers are scanned to determine which have missed updates.
• The updated block is copied into the missed block, and the “missed update” flags are cleared.

[Diagram: inode table, an inode with an indirect block, and data blocks flagged as missing updates.]
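A brief sketch of constraining that work and restarting the disks (node names are placeholders):

mmchconfig defaultHelperNodes="a-scale01,a-scale02,b-scale01,b-scale02"   # limit the helper nodes used for PIT work
mmchdisk fs1 start -a                                                     # then start the stopped disks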
mmlsdisk FSNAME -L

disk         driver  sector  failure  holds     holds                        disk
name         type    size    group    metadata  data   status  availability  id  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  --  ------------
a03a         nsd     512     1        Yes       Yes    ready   up             5  system
b03a         nsd     512     2        Yes       Yes    ready   up             6  system
a01b         nsd     512     1        Yes       Yes    ready   up             7  system
b01b         nsd     512     2        Yes       Yes    ready   up             8  system
a02b         nsd     512     1        Yes       Yes    ready   up             9  system
b02b         nsd     512     2        Yes       Yes    ready   up            10  system
• Configure everything as a single Spectrum Scale cluster (before establishing recovery groups).
• Federate performance monitoring, so the GUI shows the performance of the entire Spectrum Scale cluster.
• Use Ethernet for the daemon network, to span the WAN.

[Diagram: an ESS building block at each site (EMS, provisioning and service networks, InfiniBand fabrics 1 and 2 for local traffic), CES protocol nodes, a tiebreaker quorum node, and the WAN-spanning Ethernet daemon network.]
With the disks at site B marked down, the file system remains mounted and usable:

# mmlsdisk fs1
disk         driver  sector  failure  holds     holds
name         type    size    group    metadata  data   status  availability  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  ------------
b03a         nsd     512     2        Yes       Yes    ready   down          system
a01b         nsd     512     1        Yes       Yes    ready   up            system
b01b         nsd     512     2        Yes       Yes    ready   down          system
a02b         nsd     512     1        Yes       Yes    ready   up            system
b02b         nsd     512     2        Yes       Yes    ready   down          system
a03b         nsd     512     1        Yes       Yes    ready   up            system
b03b         nsd     512     2        Yes       Yes    ready   down          system
a01c         nsd     512     1        Yes       Yes    ready   up            system
b01c         nsd     512     2        Yes       Yes    ready   down          system
a02c         nsd     512     1        Yes       Yes    ready   up            system
b02c         nsd     512     2        Yes       Yes    ready   down          system
a03c         nsd     512     1        Yes       Yes    ready   up            system
b03c         nsd     512     2        Yes       Yes    ready   down          system
fs1desc3     nsd     512     3        No        No     ready   up            system

# ls /scale/fs1
big  big2  bigger  testset

Meanwhile, even on the site B nodes, the replicated data is still accessible:

# mmlsdisk fs1
disk         driver  sector  failure  holds     holds
name         type    size    group    metadata  data   status  availability  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  ------------
a01a         nsd     512     1        Yes       Yes    ready   up            system

# ls /scale/fs1
big  big2  bigger  testset
Note that there is no need to restripe the file system; starting the disks corrects the missed updates, and the NSD servers rediscover access:

b-scale01: Rediscovered nsd server access to b01a.
b-scale01: Rediscovered nsd server access to b01b.
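If disk paths do not come back on their own, a small sketch (my assumption, not stated in the slides) of asking the NSD servers to rediscover them:

mmnsddiscover -a -N all    # rediscover NSD paths on all nodes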
[Diagram: Site A and Site B plus a third site with a tiebreaker quorum node (Q) holding a descOnly disk for each file system.]
The standby quorum node should be actively part of the Spectrum Scale cluster, and the standby file system descriptor disks should already be NSDs.
• Do not make this a quorum node until needed.
• Do not add the NSDs to the file systems until needed.

[Diagram: Site A and Site B, a tiebreaker quorum node with a descOnly NSD for each file system, and a standby tiebreaker node with an unused NSD for each file system.]
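A sketch of activating the standby tiebreaker if the original tiebreaker site is lost; the node, stanza file, and NSD names here are hypothetical:

mmchnode --quorum -N tiebreak2             # promote the standby node to quorum
# stanza prepared in advance, e.g.:  %nsd: nsd=fs1desc4 usage=descOnly failureGroup=4
mmadddisk fs1 -F fs1-standby-desc.stanza   # add its descOnly NSD to the file system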
Latency needs to be minimized, but what is tolerable depends on workload.

Latency is a function of distance, as well as other factors. Up to 300 km is regularly tested.

Generally I recommend the daemon network be Ethernet (you can still use an InfiniBand fabric for local traffic; see the sketch below).
This probably limits us to 2-way stretched clusters, even though 3-way replication is supported.
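One way to keep intra-site traffic on a faster local network while the daemon network spans the WAN is the subnets setting; this mechanism is my assumption (the slides do not spell it out), and the subnet values are the examples from the earlier readReplicaPolicy diagram:

mmchconfig subnets="192.168.10.0 192.168.20.0"   # nodes sharing one of these subnets prefer it for traffic between themselves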
While token management is a file system layer function, the token managers are cluster resources.

[Diagram: file system layer (file system, storage pool, “disks”, FS/token managers) and cluster layer (volumes/LUNs), with application nodes (Spectrum Scale clients) at both sites accessing the Spectrum Scale file system. Fragments: “... cache is warm.”; “If there are no token managers, the file system ...”]
Interactions with some other Spectrum Scale features
www.spectrumscaleug.org
Appendix
Tuning quorum
Choose the smallest odd number of quorum nodes that meets the necessary redundancy constraints:
• For a stretched cluster, this would typically be 3 or maybe 5 quorum nodes.
• Each site would have the same number (1 or 2) of quorum nodes.
• A tiebreaker quorum node is also needed, typically at a third site, or otherwise independent of either site failing.
• Only use tiebreaker disk quorum for small clusters, where all quorum nodes directly access the tiebreaker disks.

Choose reliable nodes!
• Network usage is light, but critical.
• The quorum function needs the local file system /var/mmfs to be responsive (not blocked by other system I/O to the same disk or controller).
• An ESS I/O node is a poor choice for a quorum node, since its internal NVMe is on the same disk controller as the local OS disk.

Try to isolate quorum nodes from common failure modes:
• Put them in separate racks, on separate circuits, etc.
When all failure groups are online, replication proceeds as expected.

When a site experiences a failure, its disks are placed in the stopped state. Writes will still allocate space in the stopped failure group, and the disks will receive the updates when started.

Other conditions may prevent normal allocation of replicas in a failure group. The -K replication enforcement parameter controls what happens in these cases (a sketch of checking and changing -K follows the table):

• no – If at least one replica can be allocated, the write completes successfully.
• whenpossible – If enough failure groups are online, all replicas must be allocated to report success. If not enough are available, do not enforce replication. This is the default and is generally the correct setting for a stretched cluster.
• always – All required replicas must be allocated; else the write fails with ENOSPC.

Behavior by condition and -K setting:

All disks in a failure group are stopped
• no: allocates in all failure groups
• whenpossible: allocates in all failure groups
• always: allocates in all failure groups

All disks in a failure group are suspended
• no: only allocates in unsuspended failure groups (3)
• whenpossible: only allocates in unsuspended failure groups (2)(3)
• always: fails with out of space (2)

One failure group runs out of space
• no: allocates in failure groups with space (1)(3)
• whenpossible: fails with out of space
• always: fails with out of space

In all cases, the replication factor of the files is set.

Notes: (1) Failure is silent. (2) Warning when you place disks in this state. (3) Recovery requires mmrestripefs -r.
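For completeness, a sketch of checking and changing -K on an existing file system (fs1 is the example file system used earlier):

mmlsfs fs1 -K                 # show the current replication enforcement setting
mmchfs fs1 -K whenpossible    # restore the default if it has been changed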
The file system descriptor is a data structure describing attributes and disks in the file system.
• Every disk in a file system has a replica of the file system descriptor.

mmlsdisk FSNAME -L

disk         driver  sector  failure  holds     holds                        disk
name         type    size    group    metadata  data   status  availability  id  storage pool
------------ ------- ------  -------  --------  -----  ------  ------------  --  ------------
a03a         nsd     512     1        Yes       Yes    ready   up             5  system
b03a         nsd     512     2        Yes       Yes    ready   up             6  system
a01b         nsd     512     1        Yes       Yes    ready   up             7  system
b01b         nsd     512     2        Yes       Yes    ready   up             8  system
a02b         nsd     512     1        Yes       Yes    ready   up             9  system
b02b         nsd     512     2        Yes       Yes    ready   up            10  system
Even though file system descriptors are metadata, they are not bound to the system pool.
• If you ignore this rule, you may find file system descriptors are not where you expected them to be, generally when your stretched cluster didn’t stay available during a site failure!
• There is no way to force the active file system descriptors to be maintained on a specific class of storage.

[Diagram: FSDesc replicas placed on both system-pool and data-pool disks in failure group 1.]