Gpfsworkshop2010 Tutorial v17 2
Programming, Configuration, Environment and Performance Perspectives Tutorial for GPFS versions 3.3 and earlier
TBDs:
- Add SoFS example under NAS in Taxonomy
- Add SoFS to section on NFS and CNFS
- Add snapshots
- Add GPFS SNMP support
- Add guidelines on where to use NFS
- Add pros and cons for using GPFS for home directories
OS Commands: see pp. 79-80 in the Concepts, Planning, and Installation Guide.
OS Calls: see pp. 80-83 in the Concepts, Planning, and Installation Guide.
GPFS Command Processing: see pp. 83-84 in the Concepts, Planning, and Installation Guide.
GPFS Port Usage: see pp. 122-124 of the Advanced Admin Guide.
Biographical Sketch Dr. Ray Paden is currently an HPC Technical Architect with worldwide scope in IBM's Deep Computing organization, a position he has held since June, 2000. His particular areas of focus include HPC storage systems, performance optimization and cluster design. Before joining IBM, Dr. Paden worked as a software engineer doing systems programming and performance optimization for 6 years in the oil industry. He also served in the Computer Science Department at Andrews University for 13 years, including 4 years as department chair. He has a Ph.D. in Computer Science from the Illinois Institute of Technology. He has done research and published papers in the areas of parallel algorithms and combinatorial optimization, performance tuning, file systems, and computer education. He has served in various capacities on the planning committee for the Supercomputing conference since 2000. He is currently a member of ACM, IEEE and Sigma Xi. As a professor, he won awards for excellence in both teaching and research. He has also received the Outstanding Innovation Award from IBM.
Day 2
Session 1 (8:30 AM - 10:00 AM, 10:30 AM - NOON)
- Review written exercise from previous day
- GPFS System Administration
- GPFS Configuration Example
- Optional GPFS lab exercise: install and configure GPFS
Session 2 (1:30 PM - 3:00 PM, 3:30 PM - 5:00 PM)
- Specific topics selected based on attendee interests; topics include
1. GPFS planning and design (intended for customers who have purchased GPFS)
2. Information Life Cycle Management (ILM) and HSM Product Integration
3. Clustered NFS (CNFS)
4. SoNAS and SoFS
5. Snapshots
6. Disaster Recovery
7. SNMP Support
8. Miscellaneous Best Practices
9. GPFS Roadmap (requires NDA)
COMMENT: The material in this slide set is detailed and comprehensive; it requires 3 full days to cover it in its entirety (including the hands-on lab). However, this tutorial is generally covered in 2 days at customer sites by including only the material relevant to the customer.
Example #2 - Elaborate
[Diagram: two x3550 nodes and four NSD servers (NSD Server-01 through NSD Server-04) on an IB switch; each NSD server has GbE, IB 4xDDR and 2xFC8 (or TbE / 2 x FC4) connections to a DS3400 (Controller-A/Controller-B, ESM-A/ESM-B) with EXP3000 drawers of 12 SAS or SATA disks and a 60-disk SAS or SATA drawer.]
COMMENT:
These diagrams are intended to illustrate the range of possibilities for configurations that could be used for a test system to do the lab exercise. There are many other possibilities, including the use of "internal" SCSI or SAS drives.
Ideally, this config needs 160 x SAS drives or 300 x SATA drives.
NOTE: Administrative GbE network not shown.
Tutorial Objectives
Conceptual understanding of GPFS
With a conceptual understanding and a man page, a sysadm can do anything!
Targeted Audience
system administrators
systems and application programmers
system architects
computer center managers
Requirements
cluster experience in keeping with one of the previous backgrounds
1. Introduction
Parallel Clustered
user data and metadata flow between all nodes and all disks in parallel; 1 to 1000's of nodes under a common rubric
User data and metadata flows between all nodes and all disks in parallel
Multiple tasks distributed over multiple nodes simultaneously access file data
Multi-task applications access common files in parallel
Files span multiple disks
File system overhead operations are distributed and done in parallel
Provides a consistent global name space across all nodes of the cluster
The promise of parallel I/O is increased performance and robustness in a cluster and it naturally maps to the architecture of a cluster.
The challenge of parallel I/O is that it is a more complex model of I/O to use and manage.
What is GPFS?
General: supports a wide range of applications and configurations
Cluster: from large (4000+ nodes in a multi-cluster) to small (only 1 node) clusters
Parallel: user data and metadata flow between all nodes and all disks in parallel
HPC: supports high performance applications
Flexible: tuning parameters allow GPFS to be adapted to many environments
Capacity: from high (4+ PB) to low capacity (only 1 disk)
Global: works across multiple nodes, clusters and labs (i.e., LAN, SAN, WAN)
Heterogeneous: native GPFS on AIX, Linux and Windows as well as NFS and CIFS; works with almost any block storage device
Shared disk: all user data and metadata are accessible from any disk to any node
RAS: reliability, accessibility, serviceability
Ease of use: GPFS is not a black box, yet it is relatively easy to use and manage
Basic file system features: POSIX API, journaling, both parallel and non-parallel access
Advanced features: ILM, integration with tape, disaster recovery, SNMP, snapshots, robust NFS support, hints
What is GPFS?
Typical Example
Aggregate Performance and Capacity
- Data rate: streaming rate < 5 GB/s, 4 KB transaction rate < 40,000 IOP/s
- Usable capacity < 240 TB
IB LAN*
[Diagram: client nodes x3550-01 through x3550-64 on the IB LAN, served by NSD Server-01 through NSD Server-04; each NSD server has GbE, IB 4xDDR and 2xFC8 connections to five 60-disk drawers.]
LAN Configuration
Performance scales linearly in the number of storage servers
Add capacity without increasing the number of servers
Add performance by adding more servers and/or storage
Inexpensively scale out the number of clients
Though not shown, a cluster like this will generally include an administrative GbE network.
What is GPFS?
Another Typical Example
Aggregate Performance and Capacity
- Data rate: streaming rate < 5 GB/s, 4 KB transaction rate < 40,000 IOP/s
- Usable capacity < 240 TB
FC8 SAN
[Diagram: SAN-attached client nodes connected over FC8 to five 60-disk drawers.]
SAN Configuration
Performance scales linearly in the number of servers
Add capacity without increasing the number of servers
Add performance by adding more servers and/or storage
GPFS is not a client/server file system like NFS, CIFS (Samba) or AFS/DFS with a single file server.
A GPFS node can be an NFS or CIFS server, but GPFS treats these services like any other application.
[Diagram: clients on a LAN accessing a single file server (e.g., NFS or Samba) which is SAN-attached to its storage.]
GPFS avoids the bottlenecks introduced by centralized file and/or metadata servers.
Today
GPFS is a general purpose clustered parallel file system tunable for many workloads on many configurations.
[Pictures: Winterhawk, BlueGene/P, iDataPlex, P6 p595, BladeCenter/H]
EDA (Electronic Design Automation), General Business, National Labs, Petroleum, SMB (Small and Medium sized Business), Universities, Weather Modeling
[Diagram: a spectrum from small clusters to LARGE clusters, and from a smaller number of big nodes ("herd of elephants") to a larger number of small nodes ("army of ants").]
1. Conventional I/O
2. Asynchronous I/O
3. Networked File Systems
4. Network Attached Storage (NAS)
5. Basic Clustered File Systems
6. SAN File Systems
7. Multi-component Clustered File Systems
8. High Level Parallel I/O
Conventional I/O
Used generally for "local file systems"
the basic, "no frills, out of the box" file system
If they are a native FS, they are integrated into the OS (e.g., caching is done via the VMM)
Examples: ext3, JFS, NTFS, ReiserFS, XFS
Asynchronous I/O
Abstractions allowing multiple threads/tasks to safely and simultaneously access a common file
Parallelism is available if it is supported in the base file system
Included in the POSIX 4 standard
- not necessarily supported on all Unix operating systems
- non-blocking I/O built on top of a base file system
Examples:
commonly available under real-time operating systems
supported today on various "flavors" of standard Unix: AIX, Solaris, Linux (starting with 2.6)
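A minimal sketch (not from the original slides) of the non-blocking style described above, using the POSIX AIO calls; the file name and buffer size are illustrative:

/* Post an asynchronous read, overlap it with computation, then collect the result. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[65536];
    struct aiocb cb;
    int fd = open("myfile", O_RDONLY);     /* illustrative file name          */
    if (fd < 0) return 1;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    aio_read(&cb);                         /* returns immediately              */
    /* ... overlap computation with the outstanding I/O ... */
    while (aio_error(&cb) == EINPROGRESS)
        ;                                  /* or block in aio_suspend()        */
    ssize_t n = aio_return(&cb);           /* bytes read, or -1 on error       */
    close(fd);
    return (n < 0);
}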
uses the POSIX I/O API, but not its semantics
traditional NFS configurations are limited by a "single server" bottleneck
while NFS is not designed for parallel file access, by placing restrictions on an application's file access and/or doing non-parallel I/O, it may be possible to get "good enough" performance
NFS clients are available for Windows, but the POSIX to NTFS mapping is awkward
GPFS provides a high availability version of NFS called Clustered NFS
[Diagram: LAN clients accessing a single file & metadata server attached through a SAN fabric to a storage controller with arrays A1-A8.]
COMMENT: Traditionally, a single NFS/CIFS file server manages both user data and metadata operations which "gates" performance/scaling and presents a single point of failure risk. Products (e.g., CNFS) are available that provide multiple server designs to avoid this issue.
Generally based on Ethernet LANs
Is this just a subclass of the networked file systems level?
Examples
NetApp (also rebranded as the IBM N series)
Provides excellent performance for IOPS and transaction processing workloads with favorable temporal locality.
file system overhead operations are distributed and done in parallel; there are no single server bottlenecks
some FS's allow a single component architecture where file clients and file servers are combined
yields very good scaling for asynchronous applications
file clients access file data through file servers via the LAN
Examples: GPFS (IBM), GFS (Sistina/Red Hat), IBRIX Fusion
File system overhead operations are distributed across the entire cluster and done in parallel; they are not concentrated in any given place. There is no single server bottleneck. User data and metadata flow between all nodes and all disks via the file servers.
All disks connected to all file client/server nodes via the SAN
file data accessed via the SAN, not the LAN
removes need for expensive LAN where high BW is required (e.g., IB, Myrinet)
File system protocol is concentrated in the metadata server and is not done in parallel; all file client/server nodes must coordinate file access via the metadata server. There are generally no client-only nodes in this type of cluster, and hence large scaling is not needed.
Multi-component architecture
Lustre: file clients, file servers, metadata server
Panasas: file clients, director blade
director blade encapsulates file service, metadata service, storage controller operations
file clients access file data through file servers or director blades via the LAN
Examples: Lustre, Panasas
Lustre: Linux only; Panasas: Linux and Windows
object oriented disks
Lustre emulates object oriented disks Panasas uses actual OO disks; user can only use Panasas disks
Do OO disks really add value to the FS? Other FS's efficiently accomplish the same thing at a higher level.
[Diagram: a Lustre-style cluster with file clients on a LAN, metadata servers (concentrated protocol management) and file servers in front of a SAN fabric and storage controller (arrays A1-A8), contrasted with a Panasas cluster where each director blade combines metadata server, file server and storage controller functions in front of its own disks.]
While different in many ways, Lustre and Panasas are similar in that they both have concentrated file system overhead operations (i.e., protocol management). The Panasas design, however, scales the number of protocol managers proportionally to the number of disks and is less of a bottleneck than for Lustre.
Requires significant source code modification for use in legacy codes, but it has the advantage of being a standard (e.g., syntactic portability). Examples: IBM MPI, MPICH, OpenMPI, Scali MPI
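A minimal MPI-IO sketch (not from the original slides) of the striped-write idea used later in this tutorial: each task writes its own records at explicit offsets in one shared file. The file name and record/loop sizes are illustrative:

/* Each task writes 64 records of bsz bytes, interleaved by rank across one file. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, ntask, bsz = 16384;
    char buf[16384];
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntask);
    memset(buf, rank, bsz);                              /* fill this task's record   */
    MPI_File_open(MPI_COMM_WORLD, "myfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    for (int k = 0; k < 64; k++)
    {
        /* record k of this task lands at offset (k*ntask + rank) * bsz */
        MPI_Offset off = ((MPI_Offset)k * ntask + rank) * bsz;
        MPI_File_write_at(fh, off, buf, bsz, MPI_CHAR, MPI_STATUS_IGNORE);
    }
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}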
Clusters are intended to provide cost effective performance scaling. Thus it is imperative that I/O and computational performance keep pace with each other.
A cluster designed to perform TFLOP calculations must be able to access up to 100's of GB of data per second. A cluster is no faster than its slowest component (large or small!)
Anecdote: 200 years ago, when a tree fell across the road and your ox wasn't big enough to move it out of the way, you didn't go grow a bigger ox; you got more oxen.
Rear Admiral Grace Murray Hopper Computer Pioneer 1906-1992
I/O can, but need not be a large contributor to f in clusters. I call the inefficiency represented by the term f in Amdahl's law "Amdahl inefficiency" or "Amdahl overhead". Consider a job on a 32 node/64 CPU Linux Cluster (LC). This job, when executed on a single node accessing a local scratch disk, devotes 10% of its job time writing to a file. By contrast, the LC writes via NFS to a single file server preventing parallel I/O operation. Assume the following...
the file server is the same "out of the box" Linux system used for the sequential test
the Ethernet connection rate used for NFS exceeds the sequential job's write rate
the number crunching and file reading phases of the job run perfectly parallel (i.e., are small enough to be ignored)
In other words, the writes are sequentialized. What are the speedup and efficiency values for this job?
Number of Tasks 8 16 32 64
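One way to sketch the answer (an illustration only, assuming Amdahl's law with the serialized write phase as the only serial fraction, f = 0.1): speedup S(n) = 1 / (f + (1 - f)/n) and efficiency E(n) = S(n)/n, giving approximately
S(8) = 4.7, E(8) = 59%; S(16) = 6.4, E(16) = 40%; S(32) = 7.8, E(32) = 24%; S(64) = 8.8, E(64) = 14%.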
Let's examine a simple disk I/O program and modify it to do parallel disk I/O so that we can better appreciate the tasks that a parallel file system must do and that GPFS does to allow a programmer to do parallel I/O safely.
1. mapping function (i.e., locating the proper data across multiple disks over different nodes)
2. message passing (i.e., shipping data between a client node/task and a disk located on a remote server)
3. caching system (e.g., coherence, aging, swapping, "data-shipping", etc.)
4. parallel programming model (e.g., data striping, data decomposition, node to disk access patterns)
5. critical section programming
6. performance tuning
7. maintain state information
8. provide an API
That's a lot of work!
GPFS has 100's of KLOCs
Provide a parallel I/O system conforming to the POSIX API standard. This allows you to write an application code to access one file without worrying too much about what the other tasks are doing.
You can't be blind, but you can focus on application needs without worrying too much about system issues. If the code is sequential, you can get the full benefits of a parallel file system without worrying at all about it!!
tid = spawn_task(ntask);
fd = open(fid_out, O_WRONLY | O_CREAT, 0777);
for (k = tid; k < nrec; k += ntask)
{
    do_something(buf, bsz);              /* generate this task's record       */
    soff = (offset_t)k * (offset_t)bsz;  /* records are interleaved by task   */
    llseek(fd, soff, SEEK_SET);
    write(fd, buf, bsz);
}
close(fd);
return 0;
}

NOTE: The proper way to open the file is to set a barrier and use O_TRUNC; however, this seems to work OK for this simple example. If O_TRUNC is used without a barrier or some equivalent timing primitive, records written to the file before subsequent tasks open the file will be "clobbered".
In reality you will need more than this, but it will be application oriented ONLY!
Which disk/file to write to (there is only one file seen by all tasks)
If some other task/job has opened the file
If somebody else is writing to the file right now
Cache coherence
If it's portable ... it's POSIX compliant!
int main()
{
    int fd, k, nrec = 1024, bsz = 16384, ntask = 2, tid;
    char *fid_out = "myfile", buf[bsz];
    offset_t soff;                            /* 64 bit seek offset            */
    tid = spawn_task(ntask);
    fd = open(fid_out, O_RDWR | O_CREAT, 0777);
    while ((soff = find_record()))            /* assume soff%bsz == 0          */
    {
        /* critical section begin */
        llseek(fd, soff, SEEK_SET);
        read(fd, buf, bsz);
        for (k = tid; k < bsz; k += ntask)    /* each task updates its own     */
            buf[k] = do_something(...);       /* interleaved indices           */
        llseek(fd, soff, SEEK_SET);
        write(fd, buf, bsz);
        /* critical section end */
    }
    close(fd);
    return 0;
}
Task 0, node J acquires the lock
Task 0, node J reads record N from disk
Task 0, node J modifies buf[] at indices 0, 2, 4, 6, 8, ...
Task 0, node J writes record N to local cache
Task 0, node J releases the lock
Task 1, node K acquires the lock
Task 1, node K reads record N from disk
Task 1, node K modifies buf[] at indices 1, 3, 5, 7, 9, ...
Node J flushes its cache
Task 1, node K writes record N to local cache
Task 1, node K releases the lock
Node K flushes its cache, clobbering Task 0's modifications!
Task 1 does not know that record N is still in node J's cache
This scenario is quite possible under NFS, for example, since it is not cache coherent (after all, it's not truly parallel!).
GPFS maintains cache coherence (among its many other parallel tasks), making parallel access to a common file safe, provided the application takes the usual concurrency precautions such as using locks or semaphores.
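As one illustration (a minimal sketch, not from the original slides), the critical section in the previous example could be protected with a POSIX fcntl() byte-range lock, which GPFS supports; the helper name lock_range is made up for this example:

#include <fcntl.h>
#include <unistd.h>

/* Acquire (F_WRLCK) or release (F_UNLCK) an advisory lock on [soff, soff+len). */
static int lock_range(int fd, off_t soff, off_t len, short type)
{
    struct flock fl;
    fl.l_type   = type;
    fl.l_whence = SEEK_SET;
    fl.l_start  = soff;
    fl.l_len    = len;
    return fcntl(fd, F_SETLKW, &fl);    /* F_SETLKW blocks until the lock is granted */
}

/* usage around the critical section:
      lock_range(fd, soff, bsz, F_WRLCK);
      ... read / modify / write the record ...
      lock_range(fd, soff, bsz, F_UNLCK);
*/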
Comments on NFS
NFS V3 has cleaned much of this up. By using the -noac mount option and opening the file with the O_SYNC flag, parallel writes can be done more safely, though this contributes to Amdahl inefficiency by sequentializing parallel writes. However, this is not foolproof: some customer codes fail under NFS using these options where they run safely without error under GPFS.
NFS V4 holds more promise, but the verdict is still out. And parallel NFS (pNFS) is on the horizon...
Well there are 2 things to worry about: 1. Normal precautions against RAW, WAR, WAW errors 2. Performance issues
overlapping records sequentialize file access and contribute to "Amdahl inefficiency"
RAW = Read After Write WAR = Write After Read WAW = Write After Write
The next several sections survey basic architectural, organizational and topological features of GPFS. This provides a conceptual understanding for GPFS.
This helps
applications and systems programmers to more effectively utilize GPFS
system administrators and architects to more effectively design and maintain a GPFS infrastructure
4. GPFS Architecture
1. Client vs. Server 2. LAN Model 3. SAN Model 4. Mixed SAN/LAN Model
Software Architecture Perspective: No. There is no single-server bottleneck and no protocol manager for data transfer. The mmfsd daemon runs symmetrically on all nodes. All nodes can and do access the file system via virtual disks (i.e., NSDs). All nodes can, if disks are physically attached to them, provide physical disk access for the corresponding virtual disks.
NSD
SW layer in GPFS providing a "virtual" view of a disk; the virtual disks correspond to LUNs on the NSD servers with a bijective (one-to-one) mapping
[Diagram: Client #1 through Client #6, each seeing the same virtual disks nsd1 through nsd12 via the GPFS NSD layer.]
LUN
Logical Unit: an abstraction of a disk
AIX: hdisk; Linux: SCSI device
[Diagram: Server #1 through Server #4, each serving the virtual disks nsd1 through nsd12.]
Redundancy
Each LUN can have up to 8 servers; if a server fails, the next one in the list takes over. (In earlier GPFS versions there are 2 servers per NSD, a primary and a backup server.)
Redundancy
Each server has 2 connections to the disk controller providing redundancy
Zoning
- Zoning is the process by which RAID sets are assigned to controller ports and HBAs.
- GPFS achieves its best performance by mapping each RAID array to a single LUN in the host.
Twin Tailing
- For redundancy, each RAID array is zoned to appear as a LUN on 2 or more hosts.
[Diagram: SAN Client #1 through SAN Client #6, each seeing NSDs nsd1 through nsd12 mapped onto LUNs L1 through L12 presented by the RAID controller over the SAN.]
CAUTION: A SAN configuration is not recommended for larger clusters (e.g., >= 64 nodes) since the queue depth must be set small (e.g., 1).
SAN Topology
User data and metadata only traverse the SAN; only overhead data traverses the LAN
Disks attach to all nodes in the cluster
Applications run on all nodes in the cluster
Works well for small clusters
- too expensive to scale out to large clusters (e.g., the largest production SAN cluster is 250+ nodes)
- ideal for a "herd of elephants" configuration (i.e., a small number of large systems)
It is necessary to declare a subset (e.g., 2 nodes) of the SAN clients to be primary/backup NSD servers. Alternatively, dedicated NSD servers can be attached to the SAN fabric.
COMMENTS:
Nodes 1 - 4 (i.e., SAN clients): GPFS operates in SAN mode; user data and metadata traverse the SAN; tokens and heartbeat traverse the LAN
Nodes 5 - 8 (i.e., LAN clients): GPFS operates in LAN mode; user data, metadata, tokens and heartbeat traverse the LAN
COMMON EXAMPLE
Nodes 1 - 4: P6p575 or P6p595 Nodes 5 - 8: iDataPlex or blades
Symmetric Clusters
[Diagram: eight combined client/server nodes on a LAN fabric, each directly attached to disk drawers.]
COMMENTS
Requires special bid pricing under the new licensing model
No distinction between NSD clients and NSD servers
Not well suited for synchronous applications
Provides excellent scaling and performance
Not common today given the cost associated with disk controllers
Use "twin tailed disk" to avoid single point of failure risks
- New products may make this popular again.
- Does not necessarily work with any disk drawer; do a validation test first (example: DS3200 - yes, EXP3000 - no)
Can be done using internal SCSI
- Problem: exposed to single point of failure risk
- Solution: use GPFS mirroring
It's application/customer dependent! Each configuration has its limitations and its strong points. And each one is commonly used. The following pages illustrate specific GPFS configurations.
5. Performance Features
Six related performance features in GPFS 1. Multithreading 2. Striping 3. File caching 4. Byte range locking 5. Blocks and sub-blocks 6. Access pattern optimization
Multithreaded Architecture
GPFS can spawn up to
512 threads/node for 32 bit kernels
1024 threads/node for 64 bit kernels
there is one thread per block (i.e., each block is an IOP)
large records may require multiple threads
The key to GPFS performance is "deep prefetch", which is due to its multithreaded architecture and is facilitated by
- the GPFS pagepool
- striping (which allows multiple disks to spin simultaneously)
- access pattern optimizations for sequential and strided access, or the explicit use of hints
Data Striping
GPFS stripes successive blocks of each file across successive disks
Disk I/O for sequential reads and writes is done in parallel (prefetch, write behind)
Make no assumptions about the striping pattern
- the block size is configured when the file system is configured and is not programmable
- transparent to the programmer
[Diagram: successive blocks (Block 1 ... Block 6) of one file striped across the disk pool via the server nodes as the file offset increases.]
Definitions
working set: a subset of the data that is actively being used
spatial locality: successive accesses are clustered in space (e.g., seek offset); this is used for predictable access patterns (e.g., sequential, strided)
temporal locality: successive accesses to the same record are clustered in time
To effectively exploit locality it is necessary to have a cache large enough to hold the working set.
good spatial locality generally requires a smaller working set; ideally, adjacent records are accessed once and not needed again
good temporal locality often requires a larger working set; the longer a block stays in cache, the more times it can be accessed without swapping
GPFS locality
GPFS caching is optimized for spatial locality, but can accommodate temporal locality
HPC applications more commonly demonstrate spatial locality
The pagepool is used by GPFS for file data, indirect blocks and "system metadata blocks"
Pagepool size
Set by mmchconfig pagepool= {value}
default: 64M; min value: 4M; max value: 256G for a 64 bit OS, 2G for a 32 bit OS
- the default and minimum values are generally too small, especially for large blocks (e.g., 4M)
BUT GPFS will not allocate
- more than the pagepool parameter setting
- more than 75% of physical memory (this can be changed to values between 10% and 90%; e.g., mmchconfig pagepoolMaxPhysMemPct=90)
- more memory than the OS will allow
assume blocksize <= 1 MB and number of LUNs <= 12: then let sizeof(pagepool) <= 256 MB
assume blocksize >= 2 MB and number of LUNs >= 24: then let sizeof(pagepool) >= 512 MB
Optimizing streaming access requires a smaller pagepool (e.g., up to 1 GB) Optimizing irregular access requires a larger pagepool (e.g., > 1 GB, enough to hold working set)
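For example (illustrative values only; the node class names nsdNodes and computeNodes are hypothetical), mmchconfig pagepool=1G -N nsdNodes and mmchconfig pagepool=256M -N computeNodes give the NSD servers a larger pagepool than the compute nodes; the new values take effect the next time GPFS is started on those nodes.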
Pagepool Semantics
GPFS provides a client side caching model with cache coherency
The pagepool can be viewed as a single entity rather than separate caches on each node
spatial locality
write back (write to cache only)
write allocate (allocate the cache block before writing)
Irregular (i.e., random) access patterns use LRU caching
Prefetch threads
performs write behind for writers performs prefetch for readers
temporal locality
Miscellaneous observations
The pagepool creates implicit asynchronous operation
It is an open question as to whether POSIX AIO provides additional benefit under GPFS.
AFS = min(v4, DFS)
v4 = sizeof(pagepool) * prefetchPct / number of file systems / blocksize(FS)
TB = AFS / number of streams for this file system
Sequential means "strictly sequential", but these algorithms can be adapted to other regular (i.e., "predictable") access patterns.
determine the target number of buffers per active sequential stream per file system
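A worked example (illustrative numbers; assuming prefetchPct is left at its usual 20% default and that the v4 term is the limiting one): with a 1 GB pagepool, one mounted file system and a 1 MB blocksize, v4 = 1024 MB * 0.20 / 1 / 1 MB ~= 204 buffers; with 8 active sequential streams on that file system, TB ~= 204 / 8 ~= 25 buffers per stream.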
If a stream (i.e., a sequential user) becomes random or is inactive for 5 seconds, then its buffers are disowned and given LRU status, where they "age out"
GPFS recognizes the following regular access patterns and sets appropriate prefetching strategies.
Regular Access Patterns
Sequential: This is a strictly sequential pattern.
Fuzzy Sequential: The nfsPrefetchStrategy parameter defines a window of 3 to 12 (the default value is 2) contiguous blocks that can be accessed out of order, but cached using write-behind/prefetch semantics (except that write-behind buffers are returned to the LRU pool). While intended to handle out of order NFS accesses (due to thread scheduling of nfsd workers), this algorithm will work with any access pattern demonstrating similar locality.
Strided: This applies to records of the same size with a consistent offset (forward or backward, including backward sequential) from the previous record. Prefetch threads only access the sectors encapsulating the record. (A sketch of this pattern follows this list.)
Mmap Strided: Applies where a small set of contiguous pages is accessed that is roughly the same length before each "gap". The prefetch algorithm tries to predict how many pages will be needed for the next stride, but only works within a single GPFS block at a time.
Multi-block Random: Applies when 3 or more blocks are accessed in one request that is not sequential. The prefetch algorithm will be applied to the blocks after the first block up to the end of the request.
Remember that despite its name, the prefetch algorithm applies to both write-behind and read-prefetch.
example: if blocksize = 256K and recordsize = 1024K, then access = multi-block random
User Defined: Apply the prefetch algorithm to records allocated via the GPFS multiple access hint (discussed later).
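A minimal sketch (not from the original slides) of the strided pattern described above: fixed-size records separated by a constant stride, which GPFS can recognize and prefetch along. The record size, stride and task layout are illustrative:

#include <fcntl.h>
#include <unistd.h>

/* Task tid of ntask tasks reads every ntask-th 16 KB record of the file. */
int read_strided(const char *path, int tid, int ntask)
{
    char buf[16384];
    off_t recsz  = sizeof(buf);
    off_t stride = recsz * ntask;                /* constant forward stride        */
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    for (off_t soff = (off_t)tid * recsz; ; soff += stride)
    {
        ssize_t n = pread(fd, buf, recsz, soff); /* same size, constant offset delta */
        if (n <= 0) break;                       /* EOF or error                    */
        /* ... process buf ... */
    }
    close(fd);
    return 0;
}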
Miscellaneous observations
If you change both the maxblocksize and pagepool parameters at the same time
specify pagepool first if you increase the values specify maxblocksize first if you decrease the values
pagepool_size = largest blocksize * NSDThreads
- this is the largest blocksize from any GPFS file system on the cluster
NSDThreads = min(A1, max(A2, A3)), where
- A1 = nsdMaxWorkerThreads
- A2 = nsdMinWorkerThreads
- A3 = K * nsdThreadsPerDisk, K = number of LUNs per NSD server
Determine these parameters as follows: mmfsadm dump config | grep -i nsd
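For illustration only (these parameter values are assumptions, not documented defaults): if nsdMinWorkerThreads = 16, nsdMaxWorkerThreads = 64, nsdThreadsPerDisk = 3 and the server presents K = 12 LUNs, then NSDThreads = min(64, max(16, 3*12)) = 36; with a largest blocksize of 1 MB, the NSD server pagepool should be at least 36 * 1 MB = 36 MB, well within the 64 to 128 MB heuristic below.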
Heuristic: Don't worry about it! Pick a value that is not too large (e.g., 64 to 128 MB)
COMMENT: This NSD server issue is most important for application environments where some subset of the application nodes have larger pagepools. Since the pagepool is easy to change, empirical methods can also be used to determine an optimum setting.
set large enough to accommodate the number of concurrently open files plus caching for recently used files
the default is 1000, but a value as small as 200 is adequate for traditional HPC applications
larger values (e.g., 1000) may improve performance on systems with many small files
larger values (e.g., 1000) are needed for a GPFS node used as a login node or as an NFS server for large clusters
stat cache
part of the shared segment
size = 176 B * maxStatCache
best practice: maxStatCache <= 100,000
According to the GPFS documentation, this value can be set as large as 10,000,000 (n.b., 1.7 GB), but such a large value will exceed the shared segment size.
default = 4 * maxFilesToCache
larger values are needed when a GPFS node is used as a login node or an NFS server (e.g., 50,000)
mmfsd will only allocate as much space as it thinks is safe; if an excessive request is made, it will request at most 4 * maxFilesToCache. This is at best only a heuristic algorithm.
Avoid setting this value unnecessarily large. Remember that it is only helpful where temporal locality of stat operations (e.g., ls -l) can be exploited.
If maxFilesToCache or maxStatCache are set too large, mmfsd will not start.
Traditionally file systems have allowed safe concurrent access to a single file from multiple tasks, but only with one task at a time. This was inefficient. GPFS provides a finer grained approach to this allowing multiple tasks to read and write to a file at the same time. GPFS does this using a feature called "byte range locking" which is facilitated by tokens.
[Diagram: an 8 GB file; node 1 locks offsets 0-2 GB, node 2 locks 2-4 GB, node 3 locks 4-6.5 GB, node 4 locks 6-8 GB; the overlap between nodes 3 and 4 is a write conflict.]
byte range locks preserve data integrity
byte range locks are transparent to the user
byte range lock patterns can be much more intricate
Token Management
Byte range locking facilitated by tokens
A task can access a byte range within a file (e.g., read or write) iff it holds a token for that byte range.
Token management is distributed between 2 components.
Token Server
- There can be 1 or more nodes acting as token servers
- distributedTokenServer = yes by default (see mmchconfig); designate multiple manager nodes (using mmcrcluster or mmchnode)
- token load is uniformly distributed over the manager nodes
- 1 token manager can process >= 500,000 "tokens" using default settings
- total tokens = number of nodes * (maxFilesToCache + maxStatCache) + all currently open files
- EXAMPLE: using default settings, a cluster with 256 nodes will have more than 1,200,000 tokens
- tokenMemLimit controls the number of tokens per token manager; the default is 512M. As a rule of thumb allow for ~= 600 bytes of token per file per node; in this context, each token is a set of tokens adding up to 600 bytes.
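For example, using the defaults given earlier (maxFilesToCache = 1000 and maxStatCache = 4 * maxFilesToCache = 4000), a 256 node cluster needs 256 * (1000 + 4000) = 1,280,000 tokens plus the currently open files (the "more than 1,200,000" figure above), which at >= 500,000 tokens per token manager implies spreading the load over roughly three manager nodes.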
Token Client
There is one token client per node per file system running on behalf of all application tasks running on that node It requests/holds/releases tokens on behalf of the task accessing the file
Token Management
The Process
Offload as much work as possible from the token manager
Token semantics: tokens allow either read or write access within a byte range
Token manager responsibility
- "coordinates" access to files
- distributes tokens, or a list of nodes holding conflicting tokens, to requesting token clients
Reality is far more complex. Tokens are associated with lock objects. Tokens support 12 modes of access and there are 7 lock object types. As a "rule of thumb" allow for about 600 bytes of token (e.g., typically 3 tokens) per file per node.
COMMENTS
Accessing overlapping byte ranges where a file is being modified will sequentialize file operations (n.b., this contributes to Amdahl inefficiency)
GPFS write operations are atomic
There are 9 classes of tokens in GPFS, but an open file on any node will generally have only 3 classes of tokens associated with it, for ~= 600 bytes per file per node
File operations are suspended until the new token server is ready. The new token server re-creates its token set by collecting the token state from each node in the cluster. If multiple manager nodes are running token servers, a simple algorithm using token IDs sorts out which tokens belong to which server.
What happens if a node or task fails while holding byte range locks? A log corresponding to the failed node is replayed:
metadata is restored to consistent state locks are released
How long does token server recovery take? There are many variables to consider:
- complexity of the token state
- network design and robustness
- example: 10's of minutes in extreme cases (e.g., a cluster with 4000 nodes)
Access Patterns
An application's I/O access pattern describes its I/O transaction sizes and the order in which they are accessed. It is determined by both the application and the file system. Sequentially accessed large application records based on large file system blocks provide the best performance for GPFS (as well as any other file system), but applications can not always do I/O this way. Let's examine GPFS features, tuning and best practices that can determine and compensate (to varying degrees) for access pattern variations.
GPFS Blocks
What is a block?
The largest "chunk" of contiguous data in a GPFS file system The largest "transfer" unit in a GPFS file system If sizeof(record) >= sizeof(block), then GPFS will simultaneously access multiple blocks for that transaction
exampe: if sizeof(record) = 4 MB and sizeof(block) = 1 MB, then this transaction will result in 4 simultaneous GPFS IOPs
Nodes (4): P6p520, RAM = 8G, 2 x FC8, 2 x TbE, 2 x HCA (12xDDR)
DCS9900: SATA, 64 tiers, cache size = GPFS blocksize, cache writeback = ON, cache prefetch = 0, NCQ = OFF
GPFS: blocksize = DCS9900 cache size, block allocation = scatter, pagepool = 4 GB, maxMBpS = 4000
Application: 16 tasks, record size = GPFS blocksize, file size = 256 GB
GPFS Sub-blocks
GPFS blocks can be divided into 32 sub-blocks
A sub-block is the smallest "chunk" of contiguous data in a GPFS file system
- a file smaller than a sub-block will occupy the entire sub-block
- large files begin on a block boundary
- files smaller than a block can be stored in fragments of 1 or more sub-blocks
- files larger than a sub-block have very little internal fragmentation
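For example, with a 1 MB blocksize the sub-block is 1 MB / 32 = 32 KB: a 10 KB file occupies one 32 KB sub-block rather than a full block, and a 100 KB file occupies four sub-blocks (128 KB), so at most one sub-block is partially wasted.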
Each segment contains bits representing blocks on all disks
Each segment is a separately lockable unit
- minimizes conflicts between multiple writers
The allocation manager provides hints as to which segments to try
sizeof(segment) < blocksize
Allocation Regions
The block allocation map is divided into k regions where k > 32 * number of nodes
the value of k is based on the number of nodes estimated from the mmcrfs -n parameter
there are at least 32 allocation regions per node
there are one or more allocation map segments per allocation region
Guarantees there are 1 or more allocation regions per node if file system < 97% capacity
if mmcrfs -n is set too small, nodes run out of allocation regions prematurely
- nodes start sharing allocation regions, which hurts performance
WARNING: it is not easy to change the mmcrfs -n setting... get it right the first time!
Block Allocation
Block Allocation Map Type
File data distribution
GPFS distributes file blocks to a file system's LUNs in a round-robin pattern
file blocks are then distributed across each LUN according to the block allocation map type
Type: scatter
randomly distribute file blocks over the LUN (i.e., scattered over the disk) guarantees uniform performance of multitask jobs accessing a common file
compensates for Poisson arrivals
[Scatter] won't get you the best possible performance out of the disk subsystem, but it also avoids getting the worst of it. Yuri Volobuev
Type: cluster
write file data in clusters of contiguous disk blocks (i.e., clustered together on the disk) yields better performance in restricted circumstances
"small" clusters and/or "small" file systems "small" transactions (e.g., 4K)
COMMENT:
There is no guarantee that contiguous file blocks on disk will be accessed in the same order in which they are mapped to disk. Factors contributing to the randomness of "arrivals" include
- a larger number of tasks and/or nodes simultaneously accessing a file
- a larger number of files simultaneously being accessed
- the stochastic nature of queueing systems
WARNING: This parameter can only be changed with a destructive rebuild.
Given the variabilities of clustered block allocation, validation testing is recommended before adopting it.
Example
[Chart: write rate vs. number of nodes for different blocksize and block allocation settings.]
Hartner, et al. Sequential I/O Performance of GPFS on HS20 Blades and DS4800 Storage Server. Technical Report, IBM. 22 x 4+P RAID 5 arrays, 14 x HS20 blades using GPFS as a SAN over FC4
Irregular access patterns with non-well formed IO will often require extra IOPs
Example (2^n vs. 10^n): If record_size = 1000000 and blocksize = 1048576, then each record will generally span 2 blocks, requiring 2 IOPs to read 1 record that would fit in 1 block. If the application reads a full block (i.e., 1048576 bytes), it will see significantly improved performance even if it does not use all of the data it reads.
COMMENT
seek_offset = 0 is well formed in GPFS
[Chart: MB/s (10 to 100) for seq write, seq read, random write and random read, comparing record sizes of 1000000 vs 1048576 bytes.]
App: 4 tasks, 2 nodes, record size = variable, file size = 8 GB
GPFS: version 3.2, blocksize = 1 MB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk: SAN attached SSA (16 disks @ 10 Krpm), JBOD
Direct I/O
Open the file with the O_DIRECT flag
- this flag is considered advisory, not mandatory
- the FS can ignore it, but GPFS accepts it (n.b., "buyer beware!")
COMMENT: Use Direct I/O when the GPFS caching mechanism can not compensate for the access pattern. This is not trivial!
The I/O buffer must be memory page aligned: for most systems, 4 KB alignment (i.e., buffer_address % 4096 == 0)
The seek offset must be sector aligned: for GPFS, 512 B alignment (i.e., seek_offset % 512 == 0)
Direct I/O bypasses the FS cache mechanism; therefore, the programmer must compensate by manually doing the aligning.
EXAMPLE:
int bsz;              /* size of record                              */
off_t soff = 0;       /* seek offset                                 */
char *buf;            /* 4 KB aligned buffer                         */
void *b1;             /* needed for pointer arithmetic               */
uintptr_t b2;         /* integer view of the pointer (<stdint.h>)    */
. . . . . .
b1 = malloc(bsz + 4096);
b2 = (uintptr_t)b1;
b2 = b2 & ~(uintptr_t)0xFFF;      /* round down to a 4 KB boundary...            */
b2 = b2 + (uintptr_t)0x1000;      /* ...then up to the next one                  */
buf = (char*)b2;                  /* (posix_memalign() is a simpler alternative) */
if (bsz % 512 != 0)
    printf("ERROR: record size is not sector aligned\n");
else
    soff += (off_t)bsz;           /* keeps each seek offset a multiple of 512 B  */
IOP Processing
small transactions (i.e., less than the FS block size)
small records irregularly distributed over the seek offset space
small files
poor spatial locality and often poor temporal locality
performance is measured in operation rates (e.g., IOP/s); operation counts are high compared to BW
common examples: bio-informatics, EDA, rendering, home directories
Transaction Processing
small transactions (i.e., files or records less than the blocksize), but often displaying good temporal locality
access efficiency can often be improved by database technology
performance is measured in operation rates (e.g., IOP/s); operation counts are high compared to BW
common examples: commercial applications
If the pattern is recognized, then the relevant records can be asynchronously pre-loaded into cache. If the access pattern is not recognized by GPFS, then hints can be provided informing GPFS which records can be pre-loaded into cache.
COMMENT: These optimizations assume spatial locality
[Chart: READ RATE in MB/s vs. record size in KB, with minimal caching.]
App: 4 tasks, 2 nodes, record size = variable, file size = 2 GB, well formed I/O
GPFS: version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk: SAN attached SSA (16 disks @ 10 Krpm), JBOD
When GPFS detects a strided order, it prefetches along the stride thus improving performance.
8 tasks @ 1 per node, blocksize = 256 KB, record size = 16 KB, file size = 5 GB WH2 with 14 clients and 2 VSD servers, using 36 GB, 10 Krpm, SSA drives
          strides under GPFS 1.2    strides under GPFS 1.3
Write     less than 1 MB/s*         17 MB/s
Read      11.1 MB/s*                58 MB/s
* The strided rate under GPFS 1.2 is the same as the random (without hints) rate under GPFS 1.3.
But notice that increasing the record size from 16KB to 1024 KB, the rates increase.
8 tasks @ 1 per node, blocksize = 256 KB, file size >= 5 GB WH2 with 14 clients and 2 VSD servers, using 36 GB, 10 Krpm, SSA drives
          record size = 1024 KB    record size = 16 KB
Write     172 MB/s                 17 MB/s
Read      211 MB/s                 58 MB/s
An irregular access pattern does not allow GPFS to anticipate the seek pattern. Therefore it can not prefetch records for reading or preallocate cache blocks for writing.
POSIX I/O is a simple standard covering the basics. Early versions of GPFS stuck closely to this standard. But because of its shortcomings in many environments, IBM has added API extensions to GPFS that go beyond the POSIX I/O API. These extensions are a mixed blessing. While they improve performance and facilitate important semantics not part of POSIX I/O, they are generally not portable.
The GPFS multiple access hint allows the programmer to post future accesses and then prefetches them asynchronously. Reads are improved substantially, writes not as much.
              without hints    using hints*
write rate    33.1 MB/s        38.8 MB/s
read rate     18.5 MB/s        63.7 MB/s
The impact of using hints is more significant given a larger number of nodes.
App: 8 tasks, 2 nodes, record size = 128 KB, file size = 2 GB, well formed I/O
GPFS: version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk: SAN attached SSA (16 disks @ 10 Krpm), JBOD
The multiple access hint interface is tedious to use, but a simple to use interface can be crafted.
A simple GPFS multiple access hint interface can be designed by the user (hiding the low level tedium) making it easier for high level applications to use hints.
For example...
public:
   int pio_init_hint(struct pio *p, int maxbsz, int maxhint);
   int pio_post_hint(struct pio *p, offset_t soff, int nbytes, int nth, int isWrite);
   int pio_declare_1st_hint(struct pio *p);
   int pio_xfer(struct pio *p, char *buf, int nth);
private:
   int pio_gen_blk(struct pio *p, int nth, int isWrite);
   int pio_issue_hint(struct pio *p, int nth);
   int pio_cancel_hint(int fd);
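A hypothetical usage sketch of the wrapper interface declared above (the pio_* routines and struct pio are user-designed, not GPFS calls; soff[], buf, bsz and nhint are assumed to be set up by the application):

struct pio p;
int k;
pio_init_hint(&p, bsz, nhint);                 /* allocate buffers and the hint table */
for (k = 0; k < nhint; k++)
    pio_post_hint(&p, soff[k], bsz, k, 0);     /* post future reads (isWrite = 0)     */
pio_declare_1st_hint(&p);                      /* issue the first batch of hints      */
for (k = 0; k < nhint; k++)
    pio_xfer(&p, buf, k);                      /* transfer records as they arrive     */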
App: 4 tasks, 2 nodes, record size = variable, file size = 2 GB, well formed I/O
GPFS: version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk: SAN attached SSA (16 disks @ 10 Krpm), JBOD
[Chart: rate vs. record size in KB]
Small Files
Increasingly common in clusters today
Small blocks work best when the average file size is small (e.g., less than 256K)
But do not make GPFS blocks too small; select the blocksize so that the sub-block size ~= average file size
- reduces internal fragmentation
- produces optimum small file performance
- the block is still large enough to support larger files (not every file will be small)
Small file optimization allocate small files "close together" by filling one full block on one disk before moving to the next
flushes small files to disk as individual small IOPs (i.e., in units of sub-blocks)
use controller cache to block the small IOPs into larger transactions
produces a 7.2% improvement in DS4800 benchmarks
Will this be better on a DCS9900?
prefetchThreads
You need roughly twice as many prefetchThreads as LUNs. Suppose you have K LUNs, then prefetchThreads = 2 * K
pagepool
Set the pagepool large enough so that 20% can hold the buffers for the prefetchThreads prefetchThreads * blocksize < 0.2 * pagepool
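Worked example (illustrative values): with K = 24 LUNs, prefetchThreads = 2 * 24 = 48; at a 4 MB blocksize the prefetch buffers need 48 * 4 MB = 192 MB, so the pagepool should exceed 192 MB / 0.2 = 960 MB, i.e., roughly 1 GB.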
Direct I/O
Bypasses the FS cache mechanism Since GPFS is optimized for large records, can this reduce overhead for small records? Use knowledge of application to exploit locality not detected by the file system
COMMENT: This slide needs further refinement, doing tests with a more genuinely random pattern for small files.
Streaming Workload
ntasks: 16, tree depth: 1, files per directory: 1, total directories: 1, total files: 1, record size: 1M, total data: 65536 MB
write: job time 60.7 sec, average time 59.0 sec, total time 943.6 sec, # IOPs 65540, % directory ops 0%, % open/close 0%, % write 99.9%, IOP rate 1111 IOP/s, data rate 1111.2 MB/s
GPFS Config: version 3.2, blocksize: 1024 KB, pagepool: 256 MB
DS4800 Config: 64 x 15Krpm disks, 4+P RAID 5, segment size: 256 KB, cache page size: 16 KB, read cache: ON, read ahead: ON, write cache: OFF
                          write        read
job time (sec)            163.9        90.7
average time (sec)        112.1        86.4
total time (sec)          1793.3       1383.1
# IOPs                    2,620,800    2,555,280
% directory ops           10.0%        7.7%
% open/close              40.0%        41.0%
% write / % read          50%          51.3%
IOP rate (IOP/s) (2)      23384        29561
Data rate (MB/s)          6.0          29.6
(These values may benefit unnaturally from cache. (3))
COMMENTS: 1. For the most part these are non-cached IOP rates. Moreover, the IOP rates quoted here are based on application transactions. The non-cached IOP rates quoted for storage controllers are based on consistent 4K transactions measured by the controller and not the application. 2. The write IOP rate is based on the harmonic aggregate; however, this measure is slightly compromised by the large job time variance of 24.1%. By comparison, the natural aggregate is 15988. The "true" IOP rate (i.e., when all 16 tasks were active) would be closer to 20000. The variance for the read rate was only 4.0%. 3. The tree traversal algorithm used by this benchmark may lend itself to an unnaturally cache friendly situation not typical for many small file access patterns.
In general, GPFS is designed to perform the same functions on each node, and the functions performed on behalf of an application are executed on the node where the application runs. However, there are specialized management and overhead operations, performed globally, which affect the operation of the other nodes in the cluster.
Token Managers
Quorum Nodes
These functions are generally overlapped with other dedicated nodes (e.g., NSD servers, login nodes), though in very large clusters (e.g., over 2000 nodes) this must be done carefully so that their function is not impacted by network congestion. They do not require a server license.
Metanodes
Problem
Can't afford an exclusive inode lock to update file size and mtime
Can't afford locking whole indirect blocks
Solution
Metanode (one per file) collects file size, mtime/atime, and indirect block updates from other nodes
How it works
The metanode is elected dynamically and can move dynamically
Only the metanode reads & writes the inode and indirect blocks
Merges inode updates by keeping the largest file size and latest mtime
Synchronization
- A shared write lock allows concurrent updates to file size and mtime.
- Operations that require the exact file size/mtime (e.g., stat) conflict with the shared write locks.
- Operations that may decrease file size or mtime
Comments
This is not a metadata server concentrating metadata operations on 1 or a small number of dedicated nodes. Rather, it's a distributed algorithm processing metadata transactions across clients in the cluster. There is minimal overhead per metanode.
Configuration Manager
There is a primary and backup configuration manager per GPFS cluster
Specified when the cluster is created using mmcrcluster. Common practice: assign to manager nodes and/or NSD servers.
Function
Maintains the GPFS configuration file /var/mmfs/gen/mmsdrfs on all nodes in the GPFS cluster. This configuration file can not be updated unless both the primary and backup configuration managers are functioning.
Minimal overhead
Cluster Manager
There is one cluster manager per GPFS cluster Selected by election from the set of quorum nodes
can be changed using mmchmgr
Functions
Monitors disk leases (i.e., "heartbeat")
Detects failures and directs recovery within a GPFS cluster
Determines whether quorum exists
This guarantees that a consistent token management domain exists; if communication were lost between nodes without this rule, the cluster would become partitioned and the partition without a token manager would launch another token management domain (i.e., "split brain")
Manages communications with remote clusters
- distributes certain configuration changes to remote clusters
- handles GID/UID mapping requests from remote clusters
Selects the file system manager node
- by default, it is chosen from the set of designated manager nodes
- the choice can be overridden using the mmchmgr or mmchconfig commands
Network Considerations
GbE is adequate!
Heartbeat network traffic is light and packets are small
- default heartbeat rate = 1 disk lease per node per 30 sec
- the cluster manager for a 4000 node cluster receives 133 disk leases per second
But network congestion must not be allowed to interfere with the heartbeat
- by default, a disk lease lasts 35 sec, and a node has the last 5 sec in which to renew its lease
- best practice: assign to a lightly used or dedicated node in clusters over 1000 nodes
Manager Nodes
Designating manager nodes
They are specified when a cluster is created (using mmcrcluster)
can be changed using mmchnode
Function
File system managers Token managers
Best practices
smaller clusters (less than 1000 nodes): commonly overlapped with NSD servers and/or quorum nodes; some customers overlap quorum and manager nodes
larger clusters (more than 1000 nodes): assign to lightly used or dedicated nodes; be careful overlapping them with login nodes
do not overlap with NSD servers
3. quota management
enforces quotas if quota management has been enabled (see the mmcrfs and mmchfs commands)
allocates disk blocks to nodes writing to the file system
- generally more disk blocks are allocated than requested, to reduce the need for frequent requests
4. security services
see the manual for details; some differences appear to exist between AIX and Linux based systems
Low overhead
Token Managers
Token managers run on manager nodes
GPFS selects some number of manager nodes to run token managers
GPFS will only use manager nodes for token managers
the number of manager nodes selected is based on the number of GPFS client nodes
Token state for each file system is uniformly distributed over the selected manager nodes
there is 1 token manager per mounted file system on each selected manager node
1 manager node can process >= 500,000 "tokens" using default settings
In this context, a token is a set of several tokens, ~= 600 bytes on average.
total number of tokens = number of nodes * (maxFilesToCache + maxStatCache) + all currently open files
If the selected manager nodes cannot hold all of the tokens, GPFS will revoke unused tokens; if that does not work, the token manager will generate an ENOMEM error. This usually happens when not enough manager nodes were designated.
Function
Maintain token state (see earlier slides)
Overhead
CPU usage is light Memory usage is light to moderate (e.g., at most 512 MB by default)
can be changed using mmchconfig tokenMemLimit=<value>
Message traffic is variable, but not excessive. It is characterized by many small packets
If network congestion impedes token traffic, performance will be compromised, but it will not cause instability. If NSD servers and GPFS clients are also used for token management, large block transfers (e.g., >= 512 KB) may impede token messages. Even if these issues are impeding token response, chances are good that users will never notice.
Quorum
Problem
If a key resource fails (e.g., cluster manager or token manager) GPFS will spawn a new one to take over. But if the other one is not truly dead (e.g., network failure), this could create 2 independent resources and corrupt the file system.
Solution
Quorum must be maintained to recover failing nodes. There are 2 options:
- Node quorum (default): must have at least 3 quorum nodes
- Node quorum with tiebreaker disks: used in 1 or 2 node clusters
[Diagram: frame 1 and frame 2, each behind its own LAN switch, with the inter-switch link failed (X); one side holds the cluster mgr and the other the token mgr.]
If an ISL fails, GPFS must not allow
- frame 1 to spawn an independent cluster manager
- frame 2 to spawn an independent token manager
Node Quorum
How it Works
Node quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster.
There are no default quorum nodes.
The smallest node quorum is 3 nodes.
Selecting quorum nodes: best practices
Use caution in a "five 9's" environment
Select nodes most apt to remain active
Select nodes that rely on different failure points
- example: select nodes in different racks or on different power panels
In smaller clusters (e.g., < 1000 nodes) select administrative nodes
- common examples: NSD servers, login nodes
In large clusters, either select dedicated nodes or overlap with manager nodes
- do not overlap with NSD servers
Select an odd number of nodes (e.g., 3, 5, or 7 nodes)
- more than 7 nodes is not necessary; it increases failure recovery time without increasing availability
[Diagram: a LAN fabric with five quorum nodes (QN) distributed across four frames of compute nodes, plus two NSD servers.]
[Diagrams: "No Quorum Rule" - after a failed link (X, "Ouch!"), frame 1 and frame 2, each behind its own LAN switch, end up running their own cluster mgr / token mgr; "Maintaining Quorum" - a quorum node determines which of frame 1 or frame 2 keeps the cluster mgr / token mgr.]
Node quorum with tiebreaker disks runs with as little as one quorum node available so long as there is access to a majority of the quorum disks.
there can be a maximum of only 2 quorum nodes
the number of non-quorum nodes is unlimited (and can be as small as zero)
there can be 1 to 3 tiebreaker disks (n.b., an odd number of disks is best)
tiebreaker disks must be directly accessible from the quorum nodes, but do not have to belong to any particular file system
tiebreaker disks must have a cluster-wide NSD name as defined through the mmcrnsd command
tiebreaker disks must be SAN attached (FC or IP) or VSDs
the same rules apply in selecting quorum nodes for both quorum options (see previous page)
select quorum nodes with the mmcrcluster or mmchconfig commands
select tiebreaker disks with the mmchconfig command
[Diagram: two quorum nodes (Node #1, Node #2) on a LAN, attached through a SAN switch to disks d0-d5, three of which are tiebreaker (TB) disks.]
EXAMPLE: GPFS remains active with the minimum of a single available quorum node and two available tiebreaker disks.
Nodes designated QN are quorum nodes; disks designated TB are tiebreaker disks.
Several of the GPFS management functions we have just considered are designed for node failure recovery and maintaining data integrity. In this section, let's take a closer look at how this all fits together.
Disk Lease
Disk Lease (AKA, "heartbeat")
Disk leasing is the mechanism facilitating failure recovery in GPFS; it is a GPFS-specific fencing mechanism.
A node can only access the file system if it has a disk lease. If a node fails or cannot access the LAN, it cannot renew its lease.
Recovery begins after a lease expires:
- it gives time for I/O "in flight" to complete
- it guarantees consistent file system metadata logs
- it reduces the risk of data corruption during failure recovery processing
- failure recovery will have little or no effect on other nodes in the cluster
leaseRecoveryWait (default = 35 sec)
minMissedPingTimeout (default = 3 sec)
maxMissedPingTimeout (default = 60 sec)
Best practice: do not alter these defaults without guidance from IBM
There's a reason they are not documented!
The next few slides take a careful look at node failure recovery under several scenarios. There are more cases than this, but the illustrated scenarios cover the basic concepts.
deadman timer
The deadman timer duration = 2/3 * leaseRecoveryWait; with the default leaseRecoveryWait of 35 sec this is roughly 23 sec. It is necessary to set this parameter long enough to be certain that data in flight is gone. If not, data in flight may arrive after the log replays and corrupt the file system. mmfsck can generally fix this.
Problem If an IOP in flight arrives after the log has been replayed, it would arrive out of order and corrupt the file system. This can only happen on nodes which are writing and have direct access to disk. Solution If this node is not completely dead, it starts a "deadman timer" thread once its lease duration expires at time c. If there is an IOP in flight (e.g., hung IOP) when the deadman timer expires, it will panic the kernel to prevent it from completing.
leaseRecoveryWait
failureDetectionTime
initiate recovery process for each file system mounted on a failed node
ensure that the failed node no longer has access to the file system disks
use logs to rebuild metadata that was being modified at the time of failure to a consistent state
locks held by the failed node are released
mmfsck recovers blocks that have been allocated but not yet assigned to a file during recovery
logging (and rigid sequencing of operations) preserve atomicity of on-disk structures data blocks are written to disk before control structures referencing them
prevents the contents of a previously used data block from being accessed in a new file
metadata blocks are written/logged so that there will never be a pointer to a block marked unallocated that is not recoverable from the log; log recovery is run as part of recovery from node failure, affecting locked objects
8. Miscellaneous
Token Heap
Memory used for processing tokens which is accounted as belonging to GPFS (i.e., mmfsd) Size is negligible except for token manager nodes Similar for both AIX and Linux
Shared Segment
Chunk of common memory available to all tasks; used by GPFS for the inode/stat caches Since it is available to all tasks, the portion used by GPFS is not accounted as belonging to GPFS AIX: unpinned, allocated via shmat (32 bit) or kernel call (64 bit) Linux: pinned, allocated via mmap
Pagepool
AIX: pinned, allocated via shmat (32 bit) or kernel call (64 bit), but not accounted as belonging to GPFS Linux (32/64 bit): pinned, allocated via mmap and is accounted as belonging to GPFS
COMMENT: When using general tools to measure memory usage (e.g., top), the difference in memory allocation mechanisms for a particular GPFS/OS combination leads to different memory accounting. In particular, GPFS memory usage appears larger under Linux than AIX since the pagepool is attributed to GPFS under Linux, but not under AIX.
Components:
Kernel extension / kernel modules AIX: single kernel extension Linux: three kernel modules
tracing, portability/GPL layer, I/O
Daemon (i.e., mmfsd) Commands and scripts ts-command: just a stub that sends params to daemon mm-scripts:
Call ts-commands Some are just wrappers around ts-commands with additional error checking Some do more: manage cluster configuration (update mmsdrfs), gpfs startup/shutdown, etc.
GPFS Subnets
Public
Ethernet Switch
(1 GbE network) subnet 3: 30.30.30.x
Private LAN*
subnet 1: 10.0.10.x
COMMENTS: Build the GPFS cluster using the existing public Ethernet network (i.e., subnet 3)
Private LAN*
subnet 2: 10.0.20.x
a01 a02 a03 a04 a05 a06 a07 a08 a09 a10 a11 a12
Use GPFS subnets to prioritize which subnet a node will use for GPFS transactions
mmchconfig subnets="10.0.10.0" -N nodelst.a*
mmchconfig subnets="10.0.20.0" -N nodelst.b*
This will cause the nodes to use their high speed network first if they can find the file they need on it. By default, subnet 30.30.30.0 is the lowest priority subnet and is used as needed; e.g.,
node a05 accesses /f1/xxx via 10.0.10.0
node a05 accesses /f2/zzz via 30.30.30.0
Nodes accessing files over Ethernet may be BW constrained compared to the private LANs, which are assumed to be high speed networks. See the mmchconfig command; a quick verification step is sketched below.
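A minimal way to confirm the setting (a sketch; the subnets attribute is generally read when the GPFS daemon starts, so plan a restart of the affected nodes):

mmlsconfig | grep subnets
# restart GPFS on the changed nodes so the new subnet priority takes effect
mmshutdown -N nodelst.a ; mmstartup -N nodelst.a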
San Switch
DS5000-01
San Switch
DS5000-02
* The private LAN is generally a high speed switch; e.g., IB, Myrinet, Federation Assume nodelst.a contains the nodes a01-a12 and nodelst.b contains nodes b01-b12.
Problem: nodes outside the cluster need access to GPFS files Solution: allow nodes outside the cluster to natively (i.e., no NFS) mount the file system
Home cluster responsible for admin, managing locking, recovery, etc. Separately administered remote nodes have limited status
Can request locks and other metadata operations
Can do I/O to file system disks over the global SAN
Are trusted to enforce access control and map user IDs
Cluster 1 Nodes
Cluster 2 Nodes
Site 1 SAN
Site 2 SAN
Uses:
High-speed data ingestion, postprocessing (e.g. visualization) Sharing data among clusters Separate data and compute sites (Grid) Forming multiple clusters into a supercluster for grand challenge problems
Site 3 SAN
Visualization System
GPFS Multi-Cluster
Example
IP Switch Fabric
Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node NSD A1 NSD A2 NSD A3 NSD A4
SAN
UID/GIDA mmname2uid
Cluster_B mounts /fsA locally as /fsAonB. OpenSSL (Secure Sockets Layer) provides secure access between clusters for "daemon to daemon" communication using the TCP/IP based GPFS protocol. However, nodes in the remote cluster do not require ssh contact to nodes in the home cluster, or vice versa.
UID MAPPING EXAMPLE (i.e., Credential Mapping) 1. pass Cluster_B UID/GID(s) from the I/O thread node to mmuid2name 2. map the UID to GUN(s) (Globally Unique Name) 3. send the GUN(s) to mmname2uid on a node in Cluster_A 4. generate the corresponding Cluster_A UID/GID(s) 5. send the Cluster_A UID/GIDs back to the Cluster_B node running the I/O thread (for the duration of the I/O request)
COMMENTS: mmuid2name and mmname2uid are user-written scripts made available to all users in /var/mmfs/etc; these scripts are called ID remapping helper functions (IRHF) and implement access policies. Simple strategies (e.g., a text-based file with UID <-> GUN mappings) or 3rd party packages (e.g., Globus Security Infrastructure from TeraGrid) can be used to implement the remapping procedures; a conceptual sketch follows.
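For illustration only, the text-file strategy mentioned above might look roughly like the fragment below. The map-file format, file name and script interface shown here are hypothetical; the actual argument and output conventions expected of mmuid2name and mmname2uid are defined in the GPFS Advanced Administration Guide and must be followed exactly.

# /var/mmfs/etc/mmuid2name -- conceptual sketch only, not the official interface
# Look up the Globally Unique Name (GUN) for a numeric UID in a flat map file
# whose lines have the hypothetical form  uid:gun  (e.g., 1001:jsmith@example.org)
MAPFILE=/var/mmfs/etc/uid2gun.map        # hypothetical map file
uid=$1
awk -F: -v u="$uid" '$1 == u { print $2; exit }' "$MAPFILE"

A matching mmname2uid on the home cluster would perform the reverse lookup (GUN to local UID/GID) against its own copy of the map.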
IP Switch Fabric
GUN mmuid2name UID/GIDB
Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node NSD B1 NSD B2
SAN
GPFS Multi-Cluster
Example
Mount a GPFS file system from Cluster_A onto Cluster_B
On Cluster_A
1. Generate public/private key pair
mmauth genkey new COMMENTS key pair is placed in /var/mmfs/ssl
On Cluster_B
4. Generate public/private key pair
mmauth genkey COMMENTS key pair is placed in /var/mmfs/ssl public key default file name id_rsa.pub
2. Enable authorization
5. Enable authorization
mmauth update . -l AUTHONLY
9. Define cluster name, contact nodes and public key for cluster_A
mmremotecluster add cluster_A -n nsd_A1,nsd_A2 -k Cluster_A.pub
Contact Nodes
The contact nodes are used only when a remote cluster first tries to access the home cluster; one of them sends configuration information to the remote cluster after which there is no further communication. It is recommended that the primary and backup cluster manager be used as the contact nodes.
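Pulling the numbered steps together, an end-to-end sequence for mounting /fsA from Cluster_A on Cluster_B looks roughly like the following. This is a sketch: the key files must be copied between the clusters out of band (e.g., with scp), some mmauth operations may require GPFS to be stopped, and the device and cluster names follow the example above; verify the details against the Advanced Administration Guide for your release.

On Cluster_A (owns /dev/fsA):
mmauth genkey new
mmauth update . -l AUTHONLY
# copy Cluster_B's public key here as cluster_B.pub, then:
mmauth add cluster_B -k cluster_B.pub
mmauth grant cluster_B -f /dev/fsA

On Cluster_B:
mmauth genkey new
mmauth update . -l AUTHONLY
# copy Cluster_A's public key here as Cluster_A.pub, then:
mmremotecluster add cluster_A -n nsd_A1,nsd_A2 -k Cluster_A.pub
mmremotefs add fsAonB -f fsA -C cluster_A -T /fsAonB
mmmount fsAonB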
Cluster C2 x336-01 x336-02 x336-03 x336-04 x336-05 x336-06 x336-07 x336-08 x336-09 x336-10 x336-11 x336-12 x346-13
nsd server
A1 A2 A3
Ethernet Switch
COMMENTS: common Ethernet 2 high speed subnets multiple NSD servers
Cluster C3 x3550-15 x3550-16 x3550-17 x3550-18 x3550-19 x3550-20 x3550-21 x3550-22 x3550-23 x3550-24 x3550-25
Myrinet Switch
Infiniband Switch
Cluster C1
A4 A5 A6 A7 A8 A9 A10 A11 A12
x3550-26 x3650-27
nsd server
x346-14
nsd server
x3650-28
nsd server
Legacy System
New System
WARNING: The LUNs on the disk controller can only be accessed by NSD servers running the same OS (i.e., the NSD servers must either all be AIX or all be Linux).
Mixed OS Clusters
A single GPFS cluster or GPFS multi-cluster can have nodes running under AIX, Linux and/or Windows at the same time! Restriction
All LUNs for a particular file system must be served under the same OS. Corollary: the NSD servers for a given file system must run under the same OS. A special request (i.e., RPQ) is required to use Windows for an NSD server.
9. GPFS Environment
GPFS does not exist in isolation; it must be integrated with other components when designing an overall solution. The following pages look at selected hardware and software components (disk controllers, disks and storage servers) commonly used with GPFS file systems.
Linux tested
3500+ nodes in a multi-cluster (including one cluster with 2560 nodes) 4000+ nodes in a single cluster
AIX
DCS9900, DCS9550 DS3000, DS4000, DS5000 series systems ESS and DS8000 series (i.e., shark) SAN Volume Controller (V1.1 and V1.2, V2.1) 7133 Serial Disk System (i.e., SSA) EMC Symmetrix DMX (FC attach only) Hitachi Lightning 9900 (HDLM required)
COMMENT: GPFS does not rely on the SCSI persistent reserve for failover. This reduces the risk associated with using non-tested storage controllers with GPFS. However, starting with GPFS 3.2, SCSI persistent reserve is available as an option on NSD servers under AIX (see mmchconfig).
Linux DCS9900, DCS9550 DS3000, DS4000, DS5000 series systems EMC Symmetrix DMX 1000 with PowerPath v3.06 or v3.07
See http://publib.boulder.ibm.com/clresctr/library/gpfsclustersfaq.html for the complete list of tested disk systems. This is NOT the only disk that will work with GPFS. In general, any reasonable block device will work with GPFS. According to the FAQ page, the "GPFS support team will help customers who are using devices outside of this list of tested devices, to solve problems directly related to GPFS, but not problems deemed to be issues with the underlying device's behavior including any performance issues exhibited on untested hardware." Before adopting such devices for use with GPFS, the customer is urged to first run proof-of-concept tests; a simple raw streaming check is sketched below.
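One simple proof-of-concept check is a raw streaming write and read against a candidate LUN before any file system is built on it. This is a sketch: the device name and transfer count are assumptions, the write pass destroys any data on that LUN, and the direct-I/O flags require a reasonably current GNU dd.

# DESTRUCTIVE: only run against a LUN that holds no data yet
dd if=/dev/zero of=/dev/sdx bs=1024k count=16384 oflag=direct    # ~16 GB streaming write
dd if=/dev/sdx of=/dev/null bs=1024k count=16384 iflag=direct    # ~16 GB streaming read

If the per-LUN rates fall far short of the vendor's expectations, resolve that before layering GPFS on top.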
JBOD: Just a Bunch Of Disks. RAID 0: striping without redundancy, e.g., 4+0P. RAID 1: mirroring. RAID 10: striping across mirrored groups, m1m1 - m2m2 - m3m3 - m4m4.
RAID 5, 4+P
* Since a RAID 6 group has 2 redundant disks, it is common to make the RAID group larger (e.g., 8+2P)
RAID Sets
Different OEM vendors use different names for the grouping of disks that I am calling a "RAID Set". I am using this term as a generic alternative. DDN (DCS9000): "Tier" IBM (DS8000): "Array"
"array sites" and "ranks" are closely related terms
LUNs
A LUN (logical unit) is an entry in /dev for Unix-based OSs; examples:
Linux: /dev/sdb AIX: /dev/hdisk2
A rose by any other name has just as many thorns. ;->
All examples in this presentation are configured with 1 LUN per RAID set.
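For example, with one LUN per RAID set, the disk descriptor file passed to mmcrnsd might look roughly like the following. This is a sketch using the colon-separated descriptor format of the GPFS 3.x commands; the device names, server names, failure groups and NSD names are assumptions, and the exact field order should be checked against the mmcrnsd man page for your release.

# DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup:DesiredName
/dev/sdb:nsd01:nsd02:dataAndMetadata:1:lun01_nsd
/dev/sdc:nsd02:nsd01:dataAndMetadata:2:lun02_nsd

mmcrnsd -F /tmp/disk.desc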
Disk Technology
FC, SAS, SCSI (Enterprise Class)
different protocols, same mechanical standards
90% duty cycle; MTBF = 2.0 MHour4
Single drive IOP performance, no caching1: 420 IOP/s
Streaming, cache enabled: write = 154.6 MB/s, read = 123.6 MB/s

SATA/2 (Cost Optimized)
Rotational speed: 7200 rpm
Common drive sizes: 750 GB, 1 TB, 2 TB
Duty cycle is generally ignored now; MTBF < 1.6 MHour with 50% duty cycle4, MTBF < 0.9 MHour with 90% duty cycle4
Single drive IOP performance, no caching1: with command tag queueing 120 IOP/s, without command tag queueing 60 to 70 IOP/s
Streaming, cache enabled: write = 30.3 MB/s, read = 74.9 MB/s
Footnotes: 1. IOP rates assume 4K records 2. DS4800
dd buffer size = 1024K cache block size = 16K segment size = 256K
3. DS4700
dd buffer size = 1024K cache block size = 16K segment size = 64K
Storage Controllers
HPC systems requiring high bandwidth and/or large capacities use disk controllers to manage external disk. Common IBM choices for HPC include DS3000
low cost of entry
DS4000
The DS4800 has been replaced by the DS5300; the DS4700 is a higher end product cf. the DS3000, yet with a low cost of entry
DS5000
DS5300: balanced streaming and IOP performance; DS5100: lower performance with a lower cost of entry
DS8000
DS8300 provides very high reliability with good IOP performance
DCS9000
designed specifically for HPC optimizing streaming BW and capacity
DS3000 Series
DS3200
3-Gbps SAS connect to host
Direct-attach
For System x
2U, 12 disks
Dual power supplies
Support for SAS or SATA disks
Expansion via EXP3000
Starting under $4,500 US

DS3400
4-Gbps Fibre connect to host
Direct-attach or SAN
For System x & BladeCenters
2U, 12 disks
Dual power supplies
Support for SAS or SATA disks
Expansion via EXP3000
Starting under $6,500 US
COMMENT: The DS3300 provides an iSCSI interface for the same basic hardware. It's not commonly used with GPFS.
DS3400
Example Configuration
GbE connections to client nodes
Ethernet Switch
NSD Server-01 x3650 M2 8 cores, 6 DIMMs
GbE
GbE
2 x FC4
GbE
GbE
TbE
streaming: write < 700 MB/s, read < 900 MB/s IOP rate: write < 4500 IOP/s, read < 21,000 IOP/s
up to 48 disks Example: 15Krpm SAS disks @ 450 GB/disk 4 x 4+P RAID 5 + 2 hot spares (optimize streaming performance)
raw ~= 10 TB, usable ~= 7 TB
DS3400-01
12 disks (SAS or SATA)
ESM-A
EXP3000-01
12 disks (SAS or SATA)
ESM-A
ESM-B
EXP3000-01
12 disks (SAS or SATA)
ESM-A
ESM-B
EXP3000-01
12 disks (SAS or SATA)
WARNING: The DS3400, while relatively fast and inexpensive, is not well suited for large configurations, especially when using SATA drives. RAID array rebuilds are common in large configurations (e.g., 10 x DS3400s with 480 drives) especially for SATA. But a DS3400's performance is significantly compromised during a rebuild. Therefore at any given time in file systems aggregated across many DS3400s, a RAID array rebuild will be in progress and the expected value of file system performance will be significantly less than the maximum possible sustained rates.
DS3400
Benchmark Results
GPFS Parameters
blocksize = 256K or 1024K pagepool = 1G maxMBpS = 2000
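Expressed as commands, the parameter set above might be applied roughly as follows. This is a sketch: the node list, descriptor file, device and mount point names are assumptions, and mmcrfs option order has changed across GPFS releases, so confirm the syntax against the man pages for your level.

mmchconfig pagepool=1G,maxMBpS=2000 -N nsd01,nsd02
mmcrfs /gpfs1 fs1 -F disk.desc -B 1024K     # or -B 256K for the IOP-oriented runs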
Bandwidth Scaling
BW per 4+P RAID 5 array using 15 Krpm disks
read cache: ON with default read ahead; write cache: ON

RAID sets    write (MB/s)    read (MB/s)
1            239             294
2            471             615
4            655             860
8            647             875
DS3400 Parameters
RAID 5 array = 4+P segment size = 64K or 256K cache page size = 16K read ahead = default write cache = enabled write cache mirroring = disabled
Streaming Job*
record size = 1024K file size = 4G number of tasks = 8 access pattern = seq
8 RAID sets
IOP Job*
record size = 2K total data accessed = 1G number of tasks = 16 access pattern = small file
stream* IOP*
* Configuration was optimized differently for each test. Benchmark tool: ibm.v4a. Theoretical max IOP rates for the DS3400: cached < 96,000 IOP/s (512 B transactions); uncached: write < 4,200 IOP/s, read < 19,000 IOP/s (4 KB transactions)
DS5000 Series
Controller Support Modules (Fans, power supplies)
DS5300
model 1818-53A
4u
EXP5000
model 1818-D1A
3u
Drives FC/SATA
DS5300
Controller/Enclosure Overview
Dual, redundant RAID controllers Dual, redundant power, battery backup and fans Internal busses (theoretical)
PCI-E x8 simplex rate = 2 GB/s
Controller A
Controller B
16 host-side connections FC4 < 380 MB/s FC8 < 760 MB/s Active/passive architecture
Supported enclosures
EXP5000 (FC switched, 4 Gb/s) EXP810 (FC switched, 2 Gb/s) Maximums 28 enclosures, 16 disks per enclosure 448 disks
Disk Technology
15 Krpm FC disk (300, 450 GB); SATA (750, 1000 GB)
Peak sustained rates (theoretical)
Streaming (to media), requires 192 x 15Krpm drives: write < 6 GB/s, read < 6 GB/s
IOP rate (to media), requires 448 x 15Krpm drives: write < 45,000 IOP/s, read < 172,000 IOP/s
A loop pair is a set of redundant drive side cables as shown in this diagram. A stack is a set of enclosures along a loop pair. A DS5300 supports at most 4 x EXP5000 enclosures per stack, but never more than 28 x EXP5000 total. Be sure to balance the number of enclosures across the stacks.
DS5300
Rearview
Controller A
Controller B
Power connection
Serial connection
EXP5000
16 drives in 3U enclosure 4 Gbps FC interfaces / ESMs
High-speed, low-latency interconnect from controllers to drives
FC - in FC 1B
Switched architecture
Higher performance, lower latency Drive isolation, better diagnostics
*
logical layout
FC 1B FC -out
FOOTNOTES: ESM A is the primary path for the odd drives; ESM B is the primary path for the even drives.
If an ESM fails, the other ESM can access all of the drives.
1B
1A
ESM A
ESM B
1A 1B
DS5300
Cabling and Disk to Array Mapping
Careful attention must be given to cabling and disk-to-array mapping on the DS5300 in order to guarantee optimum streaming performance. This issue is less significant for IOP performance.
WARNINGS:
Default array mappings (e.g., created by SMclient) are not guaranteed to be optimum! Rules and best practices for the DS4800 do not always apply to the DS5300.
DS5300
Drive Side Cabling - 8 Enclosures
Balance*: Best streaming performance is achieved using a multiple of 8 x EXP5000 drawers with the same number of drawers per stack. Optimum performance is achieved using 8, 16 or 24 stacks.
If ignored, performance penalty ~= 25%; it does not affect IOP rates.
ESM A
ESM A
2
ESM B
ID: 11
ESM B
ID: 25
ESM A
controller A
ESM A
4
ESM B
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
ID: 31
ESM B
ID: 45
DS5300
ESM A
GbE GbE
1 2 3 4
ESM B
ESM A
6
ESM B
1 2 3 4 5 6 7 8
controller B
ID: 65
ID: 51
ESM A
ESM A
8
ESM B
ID: 71
ESM B
ID: 85
Stacks
When attaching enclosures, drive loops are configured as redundant pairs (i.e., loop pairs) utilizing one port from each controller; the enclosures along a loop pair are called a stack.
1 2 3 4 5 6 7 8
Tray ID
Tray ID is assigned during system configuration. The values are not arbitrary. Best practice: 10's digit: stack number 1's digit: ordinal number within a stack
DS5300
Drive Side Cabling - 16 Enclosures
1
Stack # Tray ID #
ID: 11
ESM B
ESM A
Balance*: Best streaming performance is achieved using a multiple of 8 x EXP5000 drawers with the same number of drawers per stack. Optimum performance is achieved using 8, 16 or 24 stacks.
If ignored, performance penalty ~= 25%; it does not affect IOP rates.
ESM A
2
ESM B
ID: 25
ESM A
ESM A
ID: 12
ESM B
ID: 26
ESM B
ESM A
ESM A
4
ESM B
ID: 31
ESM B
ID: 45
ESM A
ESM A
controller A
ID: 32
ESM B
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
ID: 46
ESM B
DS5300
ESM A
GbE GbE
1 2 3 4
ESM B
ESM A
6
ESM B
1 2 3 4 5 6 7 8
controller B
ID: 65
ESM A
ID: 51
ESM A
ID: 52
ESM B
ID: 66
ESM B
ESM A
ID: 71
ESM B
ESM A
Another Tray ID Best Practice: Start the 1's digit in the odd numbered stacks at 1 and in the even numbered stacks at 5. We do this because there can be up to 4 drawers in a stack.
ESM A
8
ESM B
ID: 85
ESM A
ID: 72
ESM B
ID: 86
ESM B
DS5300
Drive Side Cabling and Disk to Array Mapping
controller A controller B
XOR ASIC
loop switches*
4xFC4
FC4 | FC8
FC4 | FC8
FC4 | FC8
FC4 | FC8
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports In this example there is 1 drawer per stack. You can have at most 4 drawers per stack, but not more than 28 drawers total. Stack numbers highlighted by the yellow box.
4 1 2 3
host ports
Array Assignments: An array is a set of disks belonging to a RAID group. Arrays are assigned to (i.e., owned by) a single controller. Optimum performance requires careful attention being given to assigning disks to arrays and assigning arrays to the controllers.
ESM A   ESM B
5 6 7 8
Optimum Both vertical paths or both diagonal paths can be active at same time.
x x
Remember: by default, ESM-A accesses the odd disks and ESM-B accesses the even disks. This is independent of controller preference.
Sub-Optimum A vertical path and a diagonal path can not both be active at the same time.
Loop Switches
DS5300
Data Flow Example #1A
controller A prefers odd slots in stacks 1, 3, 5, 7 even slots in stacks 2, 4, 6, 8 controller B prefers even slots in stacks 1, 3, 5, 7 odd slots in stacks 2, 4, 6, 8 4xFC4
XOR ASIC
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 ABABABABABABABAB
host ports
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8
11 1 25 2 31 3 45 4 51 p 65 71 85
In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
ESM A   ESM B
Mapping Disks to Array Rule: Assign disks to arrays diagonally with 1 per tray as shown. Array Ownership Rule: Assign array to controller accessing the first disk in the array.
Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller B Tray protected ("barber pole"), optimum performance
DS5300
Data Flow Example #1B
controller A prefers odd slots in stacks 1, 3, 5, 7 even slots in stacks 2, 4, 6, 8 controller B prefers even slots in stacks 1, 3, 5, 7 odd slots in stacks 2, 4, 6, 8 XOR ASIC
3 4
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 ABABABABABABABAB
host ports
2 BABABABABABABABA 3 ABABABABABABABAB
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8
11 1 25 2 31 3 45 4 51 p 65 71 85
1 2 3 4 p
7 ABABABABABABABAB 8 BABABABABABABABA
In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
ESM A   ESM B
Mapping Disks to Array Rule: Assign disks to arrays diagonally with 1 per tray as shown. Array Ownership Rule: Assign array to controller accessing the first disk in the array.
Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller B Tray protected, optimum performance
DS5300
Sample Complete Disk to Array Mappings for Example #1
Best Practice: Adopt tray protection using the following configurations. 8 trays using 4+P RAID 5 or 4+2P RAID 6 16 trays using 4+P or 8+P RAID 5, or 8+2P RAID 6
DS5300
Data Flow Example #2A
controller A prefers all slots in stacks 1, 3, 5, 7 controller B prefers all slots in stacks 2, 4, 6, 8
4xFC4
XOR ASIC
loop switches
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 AAAAAAAAAAAAAAAA
host ports While this configuration may not lead to optimum streaming performance, it is generally good enough for many application environments and is easy to configure. The adoption of an FC4 switched drive-side network has significantly reduced this configuration's performance penalty compared to the DS4500 and DS4800.
ESM A   ESM B
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8
11 1 2 3 4 p 1 2 3 4 p 25 1 2 3 4 p 1 2 3 4 p 31 45 51 65 71 85
Mapping Disks to Array Rule: Horizontally and contiguously assign disks to the same array. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller A); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller B)
Array X, Y: 4+P RAID 5, owned by controller A Array A, B: 4+P RAID 5, owned by controller B Horizontal volume: performance is "good enough"
In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
DS5300
Data Flow Example #2B
controller A prefers all slots in stacks 1, 3, 5, 7 controller B prefers all slots in stacks 2, 4, 6, 8
XOR ASIC
loop switches
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 AAAAAAAAAAAAAAAA
host ports While this configuration may not lead to optimum streaming performance, it is generally good enough for many application environments and is easy to configure. The adoption of an FC4 switched drive-side network has significantly reduced this configuration's performance penalty compared to the DS4500 and DS4800.
ESM A   ESM B
2 BBBBBBBBBBBBBBBB 3 AAAAAAAAAAAAAAAA
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8
11 1 2 3 4 p 1 2 3 4 p 1 2 3 4 p 25 1 2 3 4 p 1 2 3 4 p 1 2 3 4 p 31 45 51 65 71 85
Mapping Disks to Array Rule: Horizontally and contiguously assign disks to the same array. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller A); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller B)
Array X, Y: 4+P RAID 5, owned by controller A Array A, B: 4+P RAID 5, owned by controller B Horizontal volume: performance is "good enough"
In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
DS5300
Data Flow Example #2C
controller A prefers all slots in stacks 2, 4, 6, 8 controller B prefers all slots in stacks 1, 3, 5, 7
XOR ASIC
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 BBBBBBBBBBBBBBBB
host ports
2 AAAAAAAAAAAAAAAA 3 BBBBBBBBBBBBBBBB
ESM A
ESM B
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8
11 25 31 45 51 65 71 85
1 5 p 3 7 1 5 p 3 7 2 6 q 4 8 2 6 q 4 8 3 7 1 5 p 3 7 1 5 p 4 8 2 6 q 4 8 2 6 q
In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
Mapping Disks to Array Rule: Distribute the disks uniformly, horizontally and contiguously across stacks 1, 3, 5, 7 xor 2, 4, 6, 8. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller B); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller A). (I swapped the controllers around to make a point.)
Array W, X: 8+P+Q RAID 6, owned by controller B Array Y, Z: 8+P+Q RAID 6, owned by controller A Stack oriented: performance is "good enough"
DS5300
Sample Complete Disk to Array Mappings for Example #2C
COMMENT: Hot spares vs. RAID 6. There is little need for hot spares with RAID 6, but in an 8 tray configuration there is not room for another 8+2P RAID 6 array; therefore, configure the other 8 disks as a 4+4 RAID 10 array. In the 16 tray configuration there is room for 1 more 8+2P RAID 6 array, but this would create an imbalance that will hurt GPFS performance; therefore, configure the other 16 disks as 2 x 4+4 RAID 10 arrays.
DS5300
Data Flow Example #3
controller A prefers odd slots in all stacks controller B prefers even slots in all stacks
4xFC4
XOR ASIC
1 2
3 4
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1 ABABABABABABABAB
host ports
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
Optimum Both vertical paths or both diagonal paths can be active at same time.
1 2 3 4 5 6 7 8
11 25 31 45 51 65 71 85
1 2 3 4 p
x x
Array X: 4+P RAID 5, owned by controller A Vertical volume (contention on loop switches)
Disk to Array Mapping Mistake By vertically assigning all disks to array X, contention is created.
Sub-Optimum A vertical path and a diagonal path can not both be active at the same time.
Loop Switches
DS5300
Data Flow Example #4
controller A controller B
4xFC4
XOR ASIC
3 4
8 7 6 5
4 3 2 1
8 7
6 5
4 3
2 1
drive ports
1 2
3 4
5 6
7 8
1 2 3 4
5 6 7 8
host ports
1
host ports
2 3
Stack # Tray ID
1 2 3 4 5 6
Slot #
7 8 9 10 11 12 13 14 15 16
4 5 6
1 2 3 4 5 6 7 8
11 1 25 2 31 3 45 4 51 p 65 71 85
1 2 3 4 p
7 8
Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller A
DS5300
Example Configuration
Ethernet Switch:
TbE: GPFS, GbE: Administration NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
TbE FC8
Performance Analysis
DS5300 streaming data rate
256 x SATA or 128 x 15Krpm disks: write < 4.5 GB/s, read < 5.5 GB/s
GbE
GbE
GbE
GbE
TbE
FC8
GbE
GbE
NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs
TbE
FC8
8 x FC8
controller A
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
DS5300
TbE FC8
GbE GbE
1 2 3 4
SAN switch not required
GbE
GbE
1 2 3 4 5 6 7 8
controller B
GbE
GbE
NSD Server-05 x3650 M2 8 cores, 6 DIMMs NSD Server-06 x3650 M2 8 cores, 6 DIMMs
TbE
FC8
Disk Drawers
option #1: 128 x 15Krpm FC disk option #2: 256 x SATA disks
TbE
FC8
GbE
GbE
GbE
GbE
NSD Server-07 x3650 M2 8 cores, 6 DIMMs NSD Server-08 x3650 M2 8 cores, 6 DIMMs
TbE
FC8
TbE
FC8
COMMENT: This is a "safe" configuration in the sense that meeting projected performance rates can reasonably be expected (n.b., there are more than enough servers, FC8 and TbE ports to do the job). If HBA failover is required, then 8 dual port HBAs may be adopted (thereby requiring a SAN switch). If 2xFC8 adapters are adopted, then peak performance can be maintained during failure conditions.
GbE
GbE
DS5300
Benchmark Results
To Be Completed
DS5020
The DS5020 is an upgrade from the DS4700. Its performance profile is roughly equivalent to the earlier generation DS4800.
peak streaming rates < 1500 MB/s
It supports a maximum of 112 disks It uses the EXP5000 disk drawers with FC and/or SATA disk Compared with the DS5300
RAID 6 overhead comprises a greater percentage of processing time (e.g., ~= 25%) cf. the DS5300 (e.g., 10%) Write cache mirroring is not as effective
DS5020 112 16
DS5100 256 0 FC @ 4/8 Gb/s 8 ports 4 Gb/s 16 ports 8 GB 700,000 75,000 20,000 3200 3200 2,500 0, 1, 3, 5, 6, 10
146, 300, 450 @ 4Gb/s
DS5300 448 0 FC @ 4/8 Gb/s 16 ports 4 Gb/s 16 ports 16 GB 700,000 172,000 45,000 6,400 6,400 5,300 0, 1, 3, 5, 6, 10
146, 300, 450 @ 4Gb/s 750, 1000 GB @ 4Gb/s
EXP5000, EXP810
2 GB
1. Not intended for use in large capacity storage systems. Best practices suggest not using more than 4 units under the control of a single file system. 2. Data rates are reported as peak theoretical values and are not feasible in a production environment; they are intended for comparison purposes only. 3. This refers specifically to the 1814-72A.
DCS9900
4u 45u
Couplet
"dual RAID controller"
Disk Enclosure
2u
DCS9900
Couplet Dual RAID controller design Active/active design 5 GB of cache RAID level: 8+2P RAID 6 only 8 host side ports
FC8 or IB 4x DDR2
Disk Trays Up to 60 disks per tray Up to 20 trays (1200 disks) per couplet Supports SAS and SATA
SAS: 450 GB SATA: 1 TB, 2 TB* Peak Performance (theoretical+) write < 4.5 GB/s (4M/transaction) read < 5.9 GB/s (4M/transaction)
IOP Rates
40,000 IOP/s (4K/transaction)
DCS9900
Physical View
Illustrations shown using 60-bay enclosures (model 3S1). IBM also supports a 16-bay enclosure though it is seldom used.
COMMENT: To maximize performance per capacity, peak performance can be achieved using as few as 160 x 15 Krpm SAS drives or with 300 SATA drives. To minimize cost per capacity, the number of drives can be increased up to 1200.
DCS9900
Controller Overview
DCS9900 RAID configuration
8+2P RAID 6 Data accessed using a "byte striping algorithm"
DCS9900
controller C1
COMMENTS
Recommend creating only 1 LUN per tier for GPFS Parity is computed for each write I/O operation Parity is checked for each read I/O operation
Tier 1
A A A
B B B
C C C
D D D
E E E
F F F
G G G
H H H
P P P
P P P
Tier 2
Tier 3
8+2P RAID 6
DCS9900
Configuration and Parameter Explanation and Guidelines
DCS9900 Cache Organization
There is 2.5 GB of cache per RAID controller, for a total of 5 GB. The cache page size is a configurable parameter; set it using the command "cache size=<int>"
valid choices are 64, 128, 256, 512, 1024, 2048, 4096 (units are in KB)
Linux: set the max_sectors_kb parameter to the DCS9900 cache page size. It is located in /sys/block/<SCSI device name>/queue/max_sectors_kb; typical SCSI device names are sdb, sdc, sdd, sde, ... This change is not persistent, therefore it must be reset after every reboot (a sketch follows).
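A minimal sketch of applying this setting (the device names are assumptions; putting the loop in a boot-time script such as rc.local is one way to reapply it after every reboot):

# match max_sectors_kb to a 1024 KB DCS9900 cache page size on each DCS9900 LUN
for dev in sdb sdc sdd sde ; do
    echo 1024 > /sys/block/$dev/queue/max_sectors_kb
done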
DCS9900
Configuration and Parameter Explanation and Guidelines
Write Caching: Write Back vs. Write Thru
Enabling write back caching instructs the DCS9900 to write data blocks to cache and return control to the OS; data in the cache is actually written to disk later. If write thru caching is enabled, all data is written both to cache and disk before control is returned to the OS. Enable write back caching using the command "cache writeback=on" warning: if a controller fails, all data in its cache will be lost, possibly before it is written to disk
This can corrupt the file system since metadata can be lost. Adopt proper risk management procedures if write back caching is enabled.
Enable write thru by setting "cache writeback=off". Best Practice: disable write back caching
even though this will significantly degrade write performance
Cache coherence
The DCS9900 has the concept of "LUN ownership" or "LUN affinity": the controller in the couplet that created a LUN owns that LUN. Both controllers in a DCS9900 couplet can see a given LUN (even though only one of them created it) iff cache coherence is enabled. Cache coherence generally causes only minimal performance degradation. Best Practice: Enable cache coherence using the command "dual coherency=on"
DCS9900
Configuration and Parameter Explanation and Guidelines
set MF bit using the command "cache mf=on" disable prefetching using the command "cache prefetch=0"
since file blocks are randomly distributed, prefetching hurts performance
DCS9900
Configuration and Parameter Explanation and Guidelines
User Data vs. Meta Data
Best Practice: Segregate User Data and Meta Data. The DCS9900 does streaming well, but handles randomly distributed small transactions less well. Since meta data transactions are small, segregating user data and meta data can improve performance for meta data intensive operations. Caveats and warnings: Most beneficial in environments with significant meta data processing. Must have enough dedicated metadataOnly LUNs on controllers with IOP rates good enough to keep pace with the DCS9900 LUNs (a descriptor sketch follows).
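One way to express this segregation is through the NSD usage field in the disk descriptors: DCS9900 LUNs carry user data only, while LUNs on a controller with better small-transaction behavior carry metadata only. A sketch (device names, server names, failure groups and the choice of metadata controller are assumptions):

# DCS9900 LUNs: user data only
/dev/sdb:nsd01:nsd02:dataOnly:1
/dev/sdc:nsd02:nsd01:dataOnly:2
# LUNs on a faster small-I/O controller: metadata only
/dev/sdx:nsd01:nsd02:metadataOnly:3
/dev/sdy:nsd02:nsd01:metadataOnly:4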
FC Drivers
Linux: GPFS uses the qla2400 driver for FC access to the DCS9900
this driver is AVT, not RDAC
it supports the DCS9900 active:active access model
AIX: GPFS uses the MPIO driver in failover mode; it only supports an active:passive access model
Linux Multipathing
While not officially supported, customers familiar with Linux multipathing report being able to get it to work with GPFS and the DCS9900.
DCS9900
Configuration and Parameter Explanation and Guidelines
Summary of Selected DCS9900 Best Practice Settings
1 LUN per tier
LUN block size = 512 (set interactively when the LUN is created)
dual coherency=on
cache size=1024 (assumes OS transfer size = 1MB)
cache writeback=off
cache mf=on
cache prefetch=0
ncq disabled
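Gathered in one place, and using the same commands quoted earlier in this section, the cache-related settings above would be entered on the DCS9900 CLI roughly as follows (confirm the exact syntax against the DCS9900 firmware documentation for your level; the LUN block size of 512 and the ncq setting are chosen interactively when the LUN is created):

cache size=1024
cache writeback=off
cache mf=on
cache prefetch=0
dual coherency=on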
DCS9900
Logical Configuration
The following page illustrates an optimum scheme for LUN to port mapping (i.e., zoning), selecting primary and backup servers for each LUN, and cabling. These criteria assume an NSD configuration defined later in these slides. These schema are based on the following design criteria:
1. Guarantee that all controller ports and all HBA ports are uniformly active.
n.b., the DCS9900 supports an active:active protocol
2. If a NSD server fails, its backup server can access its LUNs. 3. Consider LUNs associated with a given HBA or controller port. If an HBA or controller port fails, GPFS failover can access the associated LUNS over alternative paths from a backup NSD server for a given LUN in a balanced manner (i.e., do not access all of the affected LUNs from a single NSD server).
This balance condition applies to performance under degraded conditions. It results in a slightly more complex logical configuration.
4. If one of the controllers in a couplet fails, the file system remains viable using backup NSD servers to access the LUNs of the failed controller over the other controller.
DCS9900
A Proper Logical Configuration
Logical IB Connections*
NSD Servers
primary 01 12 29 backup 64 02 13 30 69 03 14 35 70 04 15 36 73 05 16 41 74 06 17 42 79 07 18 51 80 08 19 52 85 09 20 57 86 10 21 58 91 11 22 63 92 P1 P2
DCS9900 Couplet
Controller C1 - Zoning
ports External LUN lables p1 01 03 05 07 09 11 13 15 17 19 21 -p2 25 27 29 31 33 35 37 39 41 43 45 -p3 49 51 53 55 57 59 61 63 65 67 69 -p4 73 75 77 79 81 83 85 87 89 91 93 -30 64 02 66 04 38 06 40 36 70 08 -10 44 12 46 42 74 14 76 16 78 18 50 -80 20 82 22 84 -56 52 86 54 88 26 90 28 62 58 92 60 94 32 -34 68
IB LAN
25 primary 36 01 backup 60
26 37 02 65
27 38 07 66
28 39 08 75
29 40 13 76
30 41 14 81
31 42 19 82
32 43 20 87
33 44 53 88
34 45 54 93
35 46 59 94
P1 P2
primary
49 60 03 backup 32
50 61 04 37
51 62 09 38
52 63 10 43
53 64 15 44
54 65 16 77
55 66 21 78
56 67 22 83
57 68 25 84
58 69 26 89
59 70 31 90
P1 P2
Controller C2 - Zoning
ports External p1 02 04 06 14 16 18 p2 26 28 30 38 40 42 p3 50 52 54 62 64 66 p4 74 76 78 86 88 90 LUN lables 08 10 12 20 22 -32 34 36 44 46 -56 58 60 68 70 -80 82 84 92 94 -29 63 01 65 03 37 05 39 35 69 07 -09 43 11 45 41 73 13 75 15 77 17 49 -79 19 81 21 83 -55 51 85 53 87 25 89 27 61 57 91 59 93 31 -33 67
primary
73 84 05 backup 40
74 85 06 45
75 86 11 46
76 87 12 49
77 88 17 50
78 89 18 55
79 90 27 56
80 91 28 61
81 92 33 62
82 93 34 67
83 94 39 68
P1 P2
COMMENTS: Couplet 9900A: 89 Tiers; External LUN labels: 001..022, 025..046, 049..070, 073..094, 097. Couplet 9900B: 90 Tiers; External LUN labels: 001..022, 025..046, 049..070, 073..094, 097, 098. In order to improve manageability, skip external LUN labels 23, 24, 47, 48, 71, 72, 95, 96. Controller C1 owns the odd LUNs, controller C2 owns the even LUNs. In order to allow controller failover it is necessary to enable cache coherence. Command: dual coherency = ON
DCS9900
The Actual Logical Configuration
DCS9900 Couplet
01 03 05 ... 85 87 02 04 06 ... 86 88 hdisks 2..89 LUNs 01..88 P1 P2
Controller C1 - Zoning
ports Tiers p1 01 03 05 02 04 06 p2 01 03 05 02 04 06 p3 01 03 05 02 04 06 p4 01 03 05 02 04 06 ... ... ... ... ... ... ... ... 85 86 85 86 85 86 85 86 87 88 87 88 87 88 87 88
IB LAN
P1 P2
P1 P2
Controller C2 - Zoning
ports Tiers p1 01 03 05 02 04 06 p2 01 03 05 02 04 06 p3 01 03 05 02 04 06 p4 01 03 05 02 04 06 ... ... ... ... ... ... ... ... 85 86 85 86 85 86 85 86 87 88 87 88 87 88 87 88
P1 P2
COMMENTS:
GPFS is configured in SAN mode. Since dual coherency=OFF for these tests, P1 sees only the LUNs owned by controller 1 and P2 sees only the LUNs owned by controller 2, so there is no HBA failover. This is simply a configuration error; it does not affect performance.
Controller 1 "owns" LUNs 01, 03, 05, ..., 87 Controller 2 "owns" LUNs 02, 04, 06, ..., 88 LUNs 89, 90 were not used. dual coherency = OFF
DCS9900
Example Configuration
Ethernet Switch (Administration)
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
DCS9900 (2U) RAID Controller C1
8 x FC8 host connections
IB 4xDDR
2xFC8
1 2
host ports drive ports
3 4
host ports
GbE GbE
GbE
GbE
GbE
GbE
IB 4xDDR
1 2
host ports drive ports
3 4
host ports
GbE GbE
GbE
GbE
NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs
IB 4xDDR
2xFC8
IB 4xDDR
2xFC8
Disk trays
Minimum required to saturate DCS9900 BW
GbE
GbE
o o o o
IB Switch (GPFS)
The 2xFC8 HBAs can be replaced by dual port 4xDDR IB HCAs using SRP. The IB host ports can either be directly attached to the servers or connected to a dedicated IB SAN switch. It is also possible to use an IB switch for a combined LAN and SAN, but this has been discouraged in the past. As a best practice, it is not recommended to use an IB SAN for more than 32 ports.
* These are consistent, well-formed 4K transactions. A typical GPFS small-transaction workload has mixed transaction sizes resulting from metadata transactions.
COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.
Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s* 4xDDR IB HCA (Host Channel Adapter) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1400 MB/s 2xFC8 (dual port 8 Gbit/s Fibre Channel) Potential peak data rate per 2xFC8 < 1500 MB/s Required peak data rate per 2xFC8 < 1400 MB/s
DCS9900
Benchmark Results
GPFS Parameters
blocksize(streaming) = 4096K blocksize(IOP) = 256K pagepool = 1G maxMBpS = 4000
COMMENT: The disparity between read and write performance observed below is much less pronounced when using 15Krpm SAS drives. For example, using 160 SAS drives... write ~= 5700 MB/s, read ~= 4400 MB/s. This disparity can be removed using cluster block allocation for SATA disk, but this is not recommended.
DCS9900 Parameters
8+2P RAID 6 SATA cache size = 1024K cache prefetch = 0 cache writeback = ON
Streaming Job: record size = 4M, file size = 32G, number of tasks = 1 to 16, access pattern = seq
IOP Job: record size = 4K, total data accessed = 10G, number of tasks = 32, access pattern = small file (4K to 16K)

Tiers    Streaming write (MB/s)    Streaming read (MB/s)    IOP* write (IOP/s)    IOP* read (IOP/s)
1        270                       220                      7,500                 3,800
4        790                       710                      13,500                5,900
8        1400                      1200                     30,000                27,300
16       2700                      1600                     30,400                27,300
32       4800                      2900                     41,000                33,500
64       5400                      3600                     -                     -
DS8000 Series
P6-p595 FC Ports
8 x 4 Gb/s FC ports per system ~= 3 GB/s per system
Fan Sence RPC
Drive Set
PPS
Drive Set
Drive Set
SAN SWITCH
PPS
Controller 4-way
Controller 4-way
Battery
4 4 4 4
I/O Adapters Enclosure I/O Adapters Enclosure
Battery
I/O Adapters Enclosure I/O Adapters Enclosure
DS8300 ANALYSIS
Peak BW read < 3500 MB/s write < 1900 MB/s duplex < 2300 MB/s FC Ports 16 @ 4 Gb/s Disks 128
max per base unit
Battery
COMMENT: The DS8000 RAID architecture is RAID 5 organized in a combination of 6+P and 7+P RAID sets. While this makes configuration easier, it hurts GPFS streaming performance.
Which storage system is the best? What is the best number of disks? What is the best size of disks?
There is no unequivocal answer to these questions. The next page presents a feature comparison with a heuristic evaluation of these feature's value to HPC applications.
Feature Comparison
Feature                        DS3400           DS5300           DS8300           DCS9900
streaming BW                   good             best             acceptable       best
IOP rate                       acceptable       best             best             good
performance:capacity(1)        best             good             acceptable       good
fast RAID rebuild              no               no               no               yes
RAID 6 support                 yes              yes              yes              yes
parity check on read           yes              yes              yes              yes
controller organization(2)     active/passive   active/passive   active/active    active/active
disk technology                SAS, SATA        FC, SATA         FC, FATA         SAS, SATA
max number of disks            48               448              1024             1200
floor space utilization        acceptable       best             acceptable       best
remote mirroring               no               yes              yes              no
RAID N+P where N = 2^k (3)     yes              yes              no               yes
Footnotes: 1. The performance:capacity ratio assessment is based on the minimum number of disks commonly deployed in order to achieve peak streaming BW. Increasing capacity behind a controller will decrease this ratio. See the analysis on following pages. 2. Most storage controllers are based on a "dual RAID controller" design in order to avoid single point of failure risks. The RAID controllers are generally associated with the RAID sets in either an active/passive or active/active organization. 3. RAID architecture is described using the expression N+P, where N is the number of data disks and P is the number of parity disks in a RAID set. For optimum GPFS performance, N = 2^k. This category declares whether N = 2^k.
Storage Servers
There are many options for storage servers (i.e., NSD servers) with GPFS clusters. The following pages provide examples illustrating some of the more common choices.
P6-p520
System Architecture
Nova
P6 DCM
4.2 GHz
P6 DCM
4.2 GHz
Nova
Burst simplex < 5600 MB/s duplex < 11200 MB/s Sustained simplex < 4400 MB/s duplex < 6800 MB/s
8:1 GX
p5-IOC-2
I/O Bridge
Obsidian
Burst simplex < 4200 MB/s duplex < 8400 MB/s Sustained simplex < 3400 MB/s duplex < 5000 MB/s
G X
G X
PCI-E 8X   PCI-E 8X   PCI-E 8X   PCI-X 2.0   PCI-X 2.0
"Direct" GX Slot
"DIRECT" GX Slot IB cards are only supported in this slot. Card options:
dual port, IB 12xSDR @ 6:1 ratio (GX+) dual port, IB 12xDDR @ 3:1 ratio (GX++) RIO2 card @ 8:1 (GX+)
"Pass Thru" GX Slot The pass thru GX slot occupies the same physical space as the 1st PCI-E slot. Therefore you can not use both of these slots. Supports the RIO2 card @ 8:1 (GX+). It does not support IB card. Single PCI Adapter Data Rates PCI-E 8x: Simplex: Burst < 2000 MB/s, Sustained < 1400 MB/s Duplex: Burst < 4000 MB/s, Sustained < 2100 MB/s PCI-X 2.0
Burst < 2000 MB/s, Sustained < 1400 MB/s (this is not a duplex protocol)
Overview: The P6-p520 is a cost-effective storage server for GPFS in most pSeries clusters using Ethernet. This diagram illustrates those features most useful to its function as a storage server. Alternative Solution: The P6-p550 can be used in place of the P6-p520. It provides the same number of I/O slots and the same bandwidth, but it also has more CPUs; GPFS does not need these extra CPUs, therefore the P6-p520 is recommended.
P6-p520
RIO Architecture
COMMENT:
This data rate analysis is based on the assumption that the G30 connects to Secondary GX Bus.
6:1 GX Bus on 4.2 GHz system Burst simplex < 2800 MB/s duplex < 5600 MB/s Sustained simplex < 2200 MB/s duplex < 3400 MB/s
Physical Dimensions
Height: 4U Width: 9.5 inches
2 G30s fit side by side in a 19 inch rack
IB PCI-X2 12x PCI-X2 IB PCI-X2 12x PCI-X2
2 x IB HCAs
12X IB Ports
Single IB 12X link* Burst simplex < 3000 MB/s duplex < 6000 MB/s Sustained simplex < 2400 MB/s duplex < 3600 MB/s
7314-G30
IB 12X to PCI-X 2.0 Bridge 0 1 2 3 IB 12X to PCI-X 2.0 Bridge 0 1 2 3
Single PCI-X 2.0 Adapter Burst < 2000 MB/s Sustained < 1400 MB/s
PCI-X 2.0   PCI-X 2.0   PCI-X 2.0   PCI-X 2.0   PCI-X 2.0   PCI-X 2.0
P6-p520
Example Configuration
The P6-p520 offers only 12xDDR, while 4xDDR is more common, so cables supporting 12xDDR -> 4xDDR conversion are available.
Performance Analysis
DS5300 streaming data rate
448 x SATA or 192 x 15Krpm disks: write < 4.5 GB/s, read < 5.0 GB/s
IB LAN*
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
PCI-X2 #4 #5
448 x SATA disks: write < 7,000 IOP/s, read < 24,000 IOP/s 192 x 15Krpm disks: write < 16,000 IOP/s, read < 64,000 IOP/s
GX direct
12X DDR
GX pass-thru
PCI-E #2 #3 PCI-X2 #4 #5
8 x FC8 SAN switch not required
750 MB/s per FC8 is possible, but only 625 MB/s is required
#1
2x F C 8
GX direct
12X DDR
GX pass-thru
controller A
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
FC8
DS5300
1 2 3 4
GbE GbE
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
PCI-E #1 #2 #3
2x F C 8
PCI-X2 #4 #5
1 2 3 4 5 6 7 8
controller B
GX direct
12X DDR
GX pass-thru
Disk Drawers
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
#1
2x F C 8
PCI-E #2 #3
PCI-X2 #4 #5
option #1: 192 x 15Krpm FC disks (12 drawers) option #2: 448 x SATA disks (28 drawers)
FOOTNOTES: The peak of 1250 MB/s per 12xDDR IB connection using IPoIB(sp) does not provide an adequate margin of error to harvest the 5 GB/s potential from the DS5300; therefore this solution as shown may provide an aggregate data rate slightly less than 5 GB/s. However, a TbE connection can be added to each node and accessed via GPFS subnets or NFS to more fully utilize the BW potential of the DS5300.
GX direct
12X DDR
GX pass-thru
P6-p575
System Architecture
The P6-p575 is used as a storage server in HPC oriented pSeries clusters using Infiniband. This diagram illustrates those features most useful to its function as a storage server.
Only logical connections are illustrated to reduce diagram complexity. There are actually 16 physical connections between the quad groups.
32 P6 Cores
P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
2:1 GX 2:1 GX
P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
4:1 GX
P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
2:1 GX
4:1 GX Bus Data Rates Burst simplex < 4700 MB/s duplex < 9400 MB/s Sustained simplex < 3700 MB/s duplex < 5600 MB/s 2:1 GX Bus Data Rates Burst simplex < 9400 MB/s duplex < 18,800 MB/s Sustained simplex < 7500 MB/s duplex < 11,300 MB/s Technical Notes The GX bus is 32 bits wide Rules of thumb: - Sustained simplex rates ~= 80% of simplex burst rate - Sustained duplex rates ~= 60% of duplex burst rate
4:1 GX 2:1 GX
p5-IOC-2 (A)
"monk"
PCI-E 16x PXI-E 8x P C I E
Top
P C I E or X Bottom
p5-IOC-2 (B)
I/O Bridge
8:1 GX PCI-X
I/O Bridge
PCI-X2
"monk"
PCI-E 8x P C I E
P C I E or X Bottom
PCI-E 16x
PCI-X2
Galaxy-2
Galaxy-2
Galaxy-2
Galaxy-1
T b E T b E
IB 12x IB 12x
4X 4X IB IB
4X 4X IB IB
4X 4X IB IB
PCI Riser
PCI Riser
8:1 GX Bus for the RIO ports: Burst simplex < 2400 MB/s, duplex < 4800 MB/s; Sustained simplex < 1900 MB/s, duplex < 2900 MB/s. IB Performance Comments: 4X DDR IB port BW: simplex < 1500 MB/s, duplex < 2600 MB/s. Protocol limitations: AIX supports IPoIB(sp), which is a high performance version of IPoIB: simplex < 1250 MB/s, duplex < 2150 MB/s.
COMMENTS: The 8:1 bus servicing the RIO ports severely restricts the data rate possible using 12x SDR IB. While this server has limited I/O connectivity, its I/O BW is outstanding. The monk IB ports in particular provide the greatest potential for high speed I/O.
Galaxy-2
4X 4X IB IB
Obsidian SAS
P6-p575
Physical View
"the whole enchilada"
COMMENT: The monk 4X DDR IB HCAs are not shown. If they were, the bottom PCI-E slot would not be available.
P6-p575
RIO Architecture
GX++ Bus @ 5.0 GHz Burst simplex < 10.0 GB/s duplex < 20.0 GB/s Sustained simplex < 8.0 GB/s duplex < 12.0 GB/s 2 x 12xDDR
Single IB 12X link Burst simplex < 6.0 GB/s duplex < 12.0 GB/s Sustained simplex < 4.0 GB/s duplex < 6.0 GB/s
12X HUB
COMMENTS: A P6-p595 provides 4 GX card slots per node. With 8 nodes per CEC, there is a max of 32 GX cards. Maximum Bandwidth Configuration: attach up to 16 PCI-E drawers in a dual loop configuration (as shown). Maximum Capacity Configuration: attach up to 32 PCI-E drawers in a single loop configuration.
12X HUB
Per Planar (HUB and bridge limited) Burst simplex < 10.0 GB/s duplex < 20.0 GB/s Sustained simplex: write < 5.0 GB/s, read < 6.0 GB/s duplex < 9.0 GB/s
Model - 5802
Planar 2
Planar 1
PCI-E X8 slots (20 shown across Planar 1 and Planar 2)
PCI-E X8
Burst rate simplex < 2.0 GB/s duplex < 4.0 GB/s
P6-p575
Example Configuration
Server benchmark test needed.
Performance Analysis
Peak sustained DCS9900 performance
streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s
IB 4xDDR
IB 4xDDR IB 4xDDR
IB 4xDDR IB 4xDDR
Required peak data rate per port < 1400 MB/s The peak of 1250 MB/s per IB port comes close, but is insufficient to harvest to full BW potential of the couplet. Additional IB ports are needed to fully utilize the BW potential of the couplet.
1 2
host ports drive ports
3 4
host ports
GbE GbE
1 2
host ports drive ports
3 4
host ports
GbE GbE
Disk trays
Minimum required to saturate DCS9900 BW
o o o o
x3650 M2
System Architecture
The x3650 M2 is a common and cost-effective storage server for GPFS in System x environments. This diagram illustrates those features most useful to its function as a storage server.
x3650 M2 (2u)
3 DIMMs 3 DIMMs 2 DIMMs
Xeon 5500
Nehalem quad core
PCIe/x16 (8 GB/s1) PCIe/x16 (8 GB/s1)
Xeon 5500
Nehalem quad core
Riser
PCIe x8 PCIe/x8
Riser
Riser options:
1. single PCIe x16 adapter 2. two PCIe x8 adapters 3. two PCI-X 133 MHz adapters
Memory DIMMs: Best performance is achieved using multiples of 6 DIMMs. Fewer DIMMs implies greater BW per DIMM. DIMM sizes: 1, 2, 4, or 8 GB. GPFS does not require a large memory capacity for the NSD servers; 6 GB of RAM is adequate if the x3650 M2 is only used as an NSD server.
PCIe x8 PCIe/x8
GbE GbE GbE GbE
I/O Bridge
South Bridge
PCIe/x4 (2 GB/s1)
SAS Controller
SAS Backplane
supports up to 12 x 2.5" SAS disks or SSDs
1. Listed bus rates are theoretical duplex rates assuming 512 MB/s per link. Production data rates will be less. 2. Peak duplex rates for PCIe x8 adapters: Gen 1 adapters < 3.2 GB/s, Gen 2 adapters < 6.4 GB/s. These are the data rates as they would be measured from an application perspective; actual data rates with overhead are much greater. 3. Aggregate I/O rate over 4 x PCIe x8 adapters < 10 GB/s
* See http://en.wikipedia.org/wiki/PCIe for details on the PCI Express standard
x3650 M2
Example Configuration
Ethernet Switch: GbE - System Administration
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
DCS9900 (2U) RAID Controller C1
8 x IB DDR connections IB SAN switch not recommended
IB 4xDDR
IB 4xDDR
1 2
host ports drive ports
3 4
host ports
GbE GbE
GbE
GbE
GbE
GbE
IB 4xDDR
IB 4xDDR
1 2
host ports
drive ports
3 4
host ports
GbE GbE
GbE
GbE
NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs
IB 4xDDR
IB 4xDDR
IB 4xDDR
IB 4xDDR
GbE
GbE
Disk trays
Minimum required to saturate DCS9900 BW
IB Switch (GPFS)
o o o o
COMMENTS - DCS9900 Host Connections: Dual port IB 4xDDR HCAs are necessary since the DCS9900 host side ports can deliver at most 760 MB/s. Sharing the LAN based IB switch is not recommended, especially if there are more than 32 NSD servers. The host ports can either be directly attached to the servers or a separate IB switch can be used. While IB 4xDDR (RDMA) can deliver rates up to 1500 MB/s over a LAN, in practice IB 4xDDR (SRP) delivers closer to 1300 MB/s over a SAN. The peak data rate for this solution may therefore be closer to 5.2 GB/s.
COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.
Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s LAN: 4xDDR IB HCA (RDMA) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1400 MB/s Host connections: 4xDDR IB HCA (SRP) Potential peak data rate per IB connection < 760 MB/s
x3550 M2
System Architecture
The x3550 M2 may be a cost-effective storage server for GPFS in some cases. Its main limitation is a lack of PCIe slots. This diagram illustrates those features most useful to its function as a storage server.
x3550 M2 (1u)
3 DIMMs 3 DIMMs 2 DIMMs
Xeon 5500
Nehalem quad core
PCIe/x16 (8 GB/s1) PCIe/x16 (8 GB/s1)
Xeon 5500
Nehalem quad core
Riser
PCIe x16
Riser options:
Riser 1. single PCIe x16 adapter 2. two PCI-X 133 MHz adapters
Memory DIMMs: Best performance is achieved using multiples of 6 DIMMs. Fewer DIMMs implies greater BW per DIMM. DIMM sizes: 1, 2, 4, or 8 GB. GPFS does not require a large memory capacity for the NSD servers; 6 GB of RAM is adequate if the x3550 M2 is only used as an NSD server.
PCIe x16
I/O Bridge
South Bridge
PCIe/x4 (2 GB/s1)
SAS Controller
SAS Backplane
supports up to 6 x 2.5" SAS disks or SSDs
1. Listed bus rates are theoretical duplex rates assuming 512 MB/s per link. Production data rates will be less. 2. Peak duplex rates for PCIe x8 adapters: Gen 1 adapters < 3.2 GB/s, Gen 2 adapters < 6.4 GB/s. These are the data rates as they would be measured from an application perspective; actual data rates with overhead are much greater. 3. Aggregate I/O rate over 4 x PCIe x8 adapters < 10 GB/s
* See http://en.wikipedia.org/wiki/PCIe for details on the PCI Express standard
x3550 M2
Example Configuration
Ethernet Switch (Administration)
NSD Server-01
GbE GbE
TbE
FC8
1 2
host ports drive ports
3 4
host ports
GbE GbE
NSD Server-02
GbE GbE
TbE
FC8
NSD Server-03
GbE GbE
1 2
host ports drive ports
3 4
host ports
GbE GbE
NSD Server-04
GbE GbE
NSD Server-05
GbE GbE
Disk trays
TbE FC8
o o o o
NSD Server-06
GbE GbE
COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.
NSD Server-07
GbE GbE
TbE
FC8
NSD Server-08
GbE GbE
TbE
FC8
COMMENT: Do not underestimate the I/O capability of the x3550 M2. It has the same buses and motherboard as the x3650 M2, just fewer I/O ports. For example, an NSD server configuration similar to the previous x3650 M2 example could be used effectively instead of the one illustrated in this example.
Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s TbE (10 Gbit Ethernet Adapter) Potential peak data rate per TbE < 725 MB/s Required peak data rate per TbE < 700 MB/s FC8 (single port 8 Gbit/s Fibre Channel) Potential peak data rate per FC8 < 760 MB/s Required peak data rate per FC8 < 700 MB/s
Management efficiency
Nodes, cabling, switches, and management modules form integrated package Management modules provide a common interface for managing all BC components
Midplane
Space efficiency
A single 9U chassis supports up to 14 NSD servers (plus associated infrastructure!)
Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade   Blade
Power efficiency
BC power modules are as much as 50% more efficient than the smaller power supplies used in rack-mounted servers.
Midplane 6 x FC8 6 x FC8
Blade
Cores RAM
Price
By comparison, it requires a large initial investment if only a small storage server infrastructure is needed. But the incremental costs of scaling out are small.
I/O Ports
TbE module: in: 14 x TbE out: 6 x TbE Mixed Ethernet module: in: 14 x GbE out: 6 x GbE, 3 x TbE FC module: in: 14 x FC4 out: 6 x FC8
HS21 or HS21XM
4 cores RAM: 4 to 8 GB is adequate up to 32 GB is possible
I/O Ports
PCI-E
2 x TbE
Sustained I/O Rates PCI-E (8x) < 1400 MB/s TbE < 725 MB/s GbE < 80 MB/s PCI-X < 700 MB/s FC4 < 380 MB/s
BladeCenter Configuration
Using External Nodes as Storage Servers
The recommended best practice for using GPFS with blades is to use external nodes as the NSD servers.
GbE GbE
GbE
GbE
TbE
2 x FC4 2 x FC4
disk: 15Krpm SAS 48 disks 4+P RAID 5 8 arrays + 8 hot spares usable capacity < 14 TB
Controller-A
Controller-B
DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)
Aggregate BW
write < 1300 MB/s read < 1450 MB/s
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
Average BW
write < 20 MB/s per blade read < 25 MB/s per blade
Controller-A
Controller-B
DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
BladeCenter Configuration
Using Blades as NSD Servers
While not as effective as external servers, blades can be used as NSD servers.
Aggregate BW
write < 1300 MB/s read < 1450 MB/s
Average BW
write < 20 MB/s per blade read < 25 MB/s per blade
NSD Server   NSD Server   NSD Server   NSD Server
Controller-A
Controller-B
DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)
NSD Server   NSD Server   NSD Server   NSD Server
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
Controller-A
Controller-B
COMMENT: Given a GbE LAN, it is necessary to use 20 nodes as NSD servers. Requires a SAN switch and 1 FC4 per NSD server. NSD servers can also be used to run applications as GPFS clients. Blades have less utility as storage servers due to more limited I/O capabilities.
NSD Server   NSD Server   NSD Server   NSD Server
DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
Blade Configuration
Using Blades as NSD Servers
Blades can use GPFS as a SAN file system, but since blade clusters tend to be large, a larger SAN and special SAN tuning is necessary.
LAN (GbE) SAN (FC4) - 64 ports
Aggregate BW
write < 1300 MB/s read < 1450 MB/s
Average BW
write < 20 MB/s per blade read < 25 MB/s per blade
Controller-A
Controller-B
DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
Controller-A
Controller-B
COMMENT: Requires larger SAN (1 FC4 per blade) along with a SAN switch. Set queue depth to 1 or 2. There are no hard rules saying that GPFS can not be used for a large SAN, nor are there rules regarding the size of a GPFS SAN, but generally SANs spanning more than 32 nodes are less common for GPFS.
DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
This section contains numerous examples of actual and proposed GPFS configurations illustrating the versatility of GPFS.
Tinkertoy Computer
On display at Museum of Science, Boston
Disclaimers
The configurations shown in this section are only examples illustrating and suggesting possible GPFS configurations. In some cases, they merely illustrate how systems have been configured, not necessarily how they should be configured. These slides are not intended to be "wiring diagrams"; rather, they are to illustrate basic concepts when integrating various components into an overall solution. Unless stated otherwise, "feeds and speeds" are based on realistic upper bound estimates as measured by the application, but under ideal benchmarking conditions. Performance will vary "according to actual driving conditions".
Balance
The I/O Subsystem Design Goal
Ideally, an I/O subsystem should be balanced. There is no point in making one component of an I/O subsystem fast while another is slow. Moreover, overtaxing some components of the I/O subsystem (e.g., HBAs) may disproportionately degrade performance.
However, this goal cannot always be perfectly achieved. A common imbalance arises when capacity is more important than bandwidth; the aggregate bandwidth based on the number of disks may then exceed the aggregate bandwidth supported by the electronics of the controllers and/or the number of HBAs and storage servers.
"Performance is inversely proportional to capacity." -- Todd Virnoche
A convenient design strategy for GPFS solutions is to define a "storage building block", which is the "smallest" increment of storage and servers by which a storage system can grow. A storage solution therefore consists of 1 or more storage building blocks. This allows customers to conveniently expand their storage solution in increments of storage building blocks (i.e., a "build as you grow" strategy). This approach is feasible because GPFS scales linearly in the number of disks, storage controllers, NSD servers, GPFS clients, and so forth.
FC and SAS disks: 450 GB/disk SATA Disks: 2 TB/disk Storage Controllers: 0.5 PB to 1.0 PB Storage Servers: several GB/s This presents a challenge for smaller storage systems.
GbE
GbE
GbE
GbE
TbE
2 x FC4 2 x FC4
DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
Performance
streaming: write < 1300 MB/s, read < 1600 MB/s IOP: write < 18,000 IOP/s, read < 22,000 IOP/s
DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)
COMMENT:
Usable capacity could be increased to 72 TB using 4+P RAID5 arrays, but this is not a best practice.
TbE
2 x FC8
DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)
GbE
GbE
ESM-A
ESM-B
GbE
GbE
NSD Server-02 x3650 M2 8 cores, 6 DIMMs NSD Server-03 x3650 M2 8 cores, 6 DIMMs
TbE
2 x FC8
EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)
Controller-A
Controller-B
GbE
GbE
GbE
GbE
COMMENT: Using 2xFC8 per NSD server instead of 4xFC4 per NSD server with a SAN switch simplifies cabling.
DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)
Ethernet Switch
FC4
Controller-A
Controller-B
DS3400-03
12 x 15 Krpm SAS disks (450 GB/disk)
ESM-A
ESM-B
EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)
Controller-A
Controller-B
DS3400-04
12 x 15 Krpm SAS disks (450 GB/disk)
Performance
streaming: write < 2500 MB/s, read < 3000 MB/s IOP: write < 35,000 IOP/s, read < 40,000 IOP/s
ESM-A ESM-B
EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)
WARNING: Scaling beyond 2 building blocks (i.e., 4 x DS3400) is not recommended when performance is critical because RAID rebuilds over multiple DS3400s significantly impede performance. If scaling beyond this is required, then deploy multiple GPFS file systems or storage pools to limit the impact of RAID rebuilds.
Ethernet Switch
NSD Server-01* x3650 M2 8 cores, 6 DIMMs
Requires tie-breaker disks for quorum if only one building block is deployed.
Controller-A
Controller-B
DS3400-04
12 x SATA disks (1 TB/disk)
IB 4xDDR
2 x FC8
ESM-A ESM-B
GbE
GbE
2 x FC8
EXP3000
12 x SATA disks (1 TB/disk)
GbE GbE
IB 4xDDR
2 x FC8 2 x FC8
ESM-A
ESM-B
EXP3000
10 x SATA disks (1 TB/disk)
ESM-A ESM-B
FC8 FC4
EXP3000
10 x SATA disks (1 TB/disk)
FC4
Controller-B
FC4
Controller-A Controller-B
DS3400-04
12 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A
DS3400-04
12 x SATA disks (1 TB/disk)
ESM-B ESM-A
DS3400-04
12 x SATA disks (1 TB/disk)
ESM-B
EXP3000
12 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A
EXP3000
12 x SATA disks (1 TB/disk)
ESM-B ESM-A
EXP3000
12 x SATA disks (1 TB/disk)
ESM-B
EXP3000
10 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A
EXP3000
10 x SATA disks (1 TB/disk)
ESM-B ESM-A
EXP3000
10 x SATA disks (1 TB/disk)
ESM-B
EXP3000
10 x SATA disks (1 TB/disk)
EXP3000
10 x SATA disks (1 TB/disk)
EXP3000
10 x SATA disks (1 TB/disk)
Capacity Optimized
Use 4 drawers of 1 TB SATA disk per DS3400 Capacity per DS3400
42 disks @ 1 TB/disk in 8+2P RAID 6 configuration raw = 42 TB, usable = 32 TB includes 2 hot spares
Aggregate Capacity
raw = 168 TB, usable = 128 TB includes 8 hot spares
IB HCA (4xDDR2)
at most 1500 MB/s per HCA
Aggregate Performance
streaming rate < 3 GB/s
SATA @ 1 TB/disk
Capacity raw < 168 TB usable < 128 TB Performance streaming rate write < 2500 MB/s read < 3000 MB/s IOP rate* write < 15,000 IOP/s read < 20,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 19 MB/s / TB read < 23 MB/s / TB IOP rate write < 117 IOP/s / TB read < 156 IOP/s / TB Floor Space+
Racks (42u x 19"): 1 Usable Capacity per rack: 128 TB/rack
FOOTNOTES: SATA IOP rates need validation testing (n.b., they are a SWAG ;->) This ratio is misleading in this case since a rack is not fully utilized for this solution.
controller A
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
GbE
GbE
DS5300
1 2 3 4
GbE GbE
GbE
GbE
IB 4xDDR
2xFC8
1 2 3 4 5 6 7 8
controller B
GbE
GbE
NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs
IB 4xDDR
2xFC8
Disk Drawers
option #1: 128 x 15Krpm FC disks (8 x EXP5000) option #2: 480 x SATA disks (8 x EXP5060)
IB 4xDDR
2xFC8
Performance Analysis
DS5300 streaming data rate
128 x 15Krpm disks: write < 4.3 GB/s, read < 5.4 GB/s 480 x SATA disks: write < 4.2 GB/s, read < 5.3 GB/s
GbE
GbE
Capacity Analysis
15Krpm FC Disk
128 disks @ 450 GB/disk 24 x 4+P RAID 5 arrays + 8 hot spares raw capacity < 56 TB, usable capacity < 42 TB
SATA disk
480 disks @ 2 TB/disk 48 x 8+2P RAID 6 arrays raw capacity < 960 TB, usable capacity < 768 TB
Performance Analysis
DS5300 streaming data rate
128 x 15Krpm disks: write < 4.3 GB/s, read < 5.4 GB/s 480 x SATA disks: write < 4.2 GB/s, read < 5.3 GB/s
IB LAN*
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
PCI-E #1 #2 #3
2x F C 8
PCI-X2 #4 #5
GX direct
12X DDR
GX pass-thru
PCI-X2 #4 #5
8 x FC8 SAN switch not required
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
PCI-E #1 #2 #3
2x F C 8
8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8
GbE GbE
GX direct
12X DDR
GX pass-thru
FC8
DS5300
1 2 3 4
GbE GbE
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
1 2 3 4 5 6 7 8
controller B
PCI-E #1 #2 #3
2x F C 8
PCI-X2 #4 #5
FOOTNOTE: The 15Krpm IOP rates assume good locality. Assuming poor locality, these rates could be: write < 9,000 IOP/s, read < 15,000 IOP/s.
GX direct
12X DDR
GX pass-thru
Disk Drawers
option #1: 128 x 15Krpm FC disks (8 x EXP5000) option #2: 480 x SATA disks (8 x EXP5060)
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE
#1
2x F C 8
PCI-E #2 #3
PCI-X2 #4 #5
Capacity Analysis
15Krpm FC Disk
128 disks @ 450 GB/disk 24 x 4+P RAID 5 arrays + 8 hot spares raw capacity < 56 TB, usable capacity < 42 TB
GX direct
12X DDR
GX pass-thru
480 disks @ 2 TB/disk 48 x 8+2P RAID 6 arrays raw capacity < 960 TB, usable capacity < 768 TB
NSD Server-01
DS5300-01 Disk Enclosures
FC Disk 8 x EXP5000 128 disks - or SATA Disk 8 x EXP5060 480 disks
o
RACK #4 RACK #3 RACK #2
DS5300-02
NSD Server-02
Disk Enclosures
FC Disk 8 x EXP5000 128 disks
NSD Server-03 NSD Server-04 NSD Server-05 NSD Server-06 NSD Server-07 NSD Server-08 NSD Server-09
I B L A N
RACK #1: client nodes
NSD Server-10 NSD Server-11 NSD Server-12 NSD Server-13 NSD Server-14 NSD Server-15 NSD Server-16
client nodes
SATA @ 1 TB/disk
Capacity raw < 3840 TB usable < 3072 TB Performance streaming rate write < 16 GB/s read < 20 GB/s IOP rate write < 28,000 IOP/s read < 96,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 5.3 MB/s / TB read < 6.7 MB/s / TB IOP rate write < 9.1 IOP/s / TB read < 31 IOP/s / TB Racks
Storage Racks (42u x 19"): 5 Server Racks (42u x 19"): 5
IB 4xDDR IB 4xDDR
TbE
TbE
GbE
GbE
IB 4xDDR IB 4xDDR
3 4
IB 4xDDR IB 4xDDR
GbE
DCS9900 Performance Streaming data rate write < 5.7 GB/s read < 4.4 GB/s Noncached IOP rate (4K transactions) write < 40,000 IOP/s read < 65,000 IOP/s LAN: 4xDDR IB HCA (RDMA)+ Potential peak data rate per port < 1250 MB/s
Limited by the IPoIB protocol.
1 2
host ports drive ports
GbE GbE
host ports
1 2
host ports drive ports
3 4
host ports
GbE GbE
SAN: 4xDDR IB HCA (SRP) Potential peak data rate per host connection < 780 MB/s
Limited by the busses in the DCS9900.
Required peak data rate per host connection < 715 MB/s
Capacity Analysis
15Krpm FC Disk 160 disks @ 450 GB/disk 16 x 8+2P RAID 6 tiers raw capacity < 70 TB usable capacity < 56 TB
FOOTNOTES: 4 IB LAN ports per NSD server is overkill, but 2 IB LAN ports are not quite enough. Since peak performance is the objective of this design, the "extra" IB LAN ports are recommended. If you need more than 300 SAS disks to meet capacity requirements, a SATA solution may be sufficient; n.b., data is secure on SATA given the DCS9900 RAID 6 architecture.
5 Disk Trays*
Min required to saturate couplet performance
Analysis
Capacity
IB LAN Switch
DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB
IB 4X DDR
DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB
DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB
P6p575-07 P6p575-08
NSD Servers
DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB
raw < 280 TB usable < 224 TB Performance streaming rate write < 20 GB/s read < 16 GB/s IOP rate write < 160,000 IOP/s read < 260,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 91 MB/s / TB read < 77 MB/s / TB IOP rate write < 714 IOP/s / TB read < 1160 IOP/s / TB Racks
Storage Racks (45u x 19"): 2 Server Racks: 1
IB 4xDDR
2xFC8
1 2
host ports drive ports
3 4
host ports
GbE GbE
GbE
GbE
GbE
GbE
IB 4xDDR
1 2
host ports drive ports
3 4
host ports
GbE GbE
GbE
GbE
NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs
IB 4xDDR
2xFC8
IB 4xDDR
2xFC8
GbE
GbE
5 Disk trays
IB Switch (GPFS)
Performance Analysis
DCS9900 Performance Streaming data rate write < 4.8 GB/s read < 3.1 GB/s Noncached IOP rate write < 47,000 IOP/s read < 33,000 IOP/s LAN: 4xDDR IB HCA (RDMA) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1200 MB/s SAN: 2xFC8 (dual port 8 Gbit/s Fibre Channel) Potential peak data rate per 2xFC8 < 1500 MB/s Required peak data rate per 2xFC8 < 1200 MB/s
Capacity Analysis
SATA 300 disks @ 1 TB/disk 30 x 8+2P RAID 6 tiers raw capacity < 300 TB usable capacity < 240 TB
Analysis
Capacity raw < 1200 TB usable < 960 TB Performance streaming rate write < 18 GB/s read < 12 GB/s IOP rate write < 180,000 IOP/s read < 130,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 19 MB/s / TB read < 12 MB/s / TB IOP rate write < 187 IOP/s / TB read < 135 IOP/s / TB Racks
Storage Racks (45u x 19"): 2 Server Racks (42u x 19"): 1
DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB
IB LAN Switch
FC8 IB 4X DDR
x3650 M2 #01 x3650 M2 #02 x3650 M2 #03 x3650 M2 #04 x3650 M2 #05 x3650 M2 #06 x3650 M2 #07 x3650 M2 #08 x3650 M2 #09 x3650 M2 #10 x3650 M2 #11 x3650 M2 #12 x3650 M2 #13 x3650 M2 #14
DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB
DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB
DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB
NSD Servers
x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs
TbE
FC8
1 2
host ports drive ports
3 4
host ports
GbE GbE
NSD Server-02
GbE GbE
TbE
FC8
NSD Server-03
GbE GbE
TbE
FC8
1 2
host ports drive ports
3 4
host ports
GbE GbE
NSD Server-04
GbE GbE
TbE
FC8
NSD Server-05
GbE GbE
TbE
FC8
NSD Server-06
GbE GbE
TbE
FC8
NSD Server-07
GbE GbE
TbE
FC8
NSD Server-08
GbE GbE
TbE
FC8
Performance Analysis
DCS9900 Performance Streaming data rate write < 5.4 GB/s read < 3.5 GB/s Noncached IOP rate* write < 52,000 IOP/s read < 33,000 IOP/s
FC8 (single port 8 Gbit/s Fibre Channel) Potential peak data rate per FC8 < 760 MB/s Required peak data rate per FC8 < 700 MB/s
Capacity Analysis
SATA 1200 disks @ 2 TB/disk 120 x 8+2P RAID 6 tiers
FOOTNOTES: Validation testing needed
TbE (10 Gbit Ethernet Adapter) Potential peak data rate per TbE < 725 MB/s Required peak data rate per TbE < 700 MB/s
Multi-tiered Storage
Example: Building Blocks 2A and 3B
GbE administrative network not shown
DS5300 12 disk trays (192 disks) 450 GB / disk @ 15Krpm usable capacity = 63 TB
IB LAN Switch
FC8 IB 4X DDR
x3650 M2 #01 x3650 M2 #02 x3650 M2 #03 x3650 M2 #04 x3650 M2 #05 x3650 M2 #06 x3650 M2 #07 x3650 M2 #08 x3650 M2 #09 x3650 M2 #10 x3650 M2 #11 x3650 M2 #12
DCS9900 Couplet 10 disk trays (600 disks) 1 TB / SATA disk usable capacity = 480 TB
DCS9900 Couplet 10 disk trays (600 disks) 1 TB / SATA disk usable capacity = 480 TB
NSD Servers
SAN Configurations
The concept of integrating storage servers and controllers into building blocks does not generalize as well for SAN file systems. The following pages illustrate how GPFS can be deployed using a SAN configuration.
COMMENT: If the configuration is small enough, a SAN switch (e.g., Brocade or McData) is not needed.
SAN #1
Linux/Blades
LAN (GbE) SAN*
FC4: 56 ports, FC8: 8 ports
1 2
1 2
FC4 (4 Gbit/sec)
up to 380 MB/s per blade
5 x 60-disk Drawer
SATA Disk 300 x disks (1 TB) 30 x 8+2P RAID 6 capacity < 240 TB*
Since there are greater than 32 hosts attached to the SAN, reduce the queue depth setting to a value <= 4.
Storage DCS9900
data rate write < 5 GB/s, read < 3 GB/s disk: SATA 300 disks 8+2P RAID 6 raw capacity < 300 TB usable capacity < 240 TB
SAN #2A
AIX/System P - Optimize IOP Performance
P6-p595 FC Ports
FC8 = 8 Gbit/s;
usable BW < 760 MB/s
S A N S W I T C H
FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*
16 x FC8 ports per system, configured to support an aggregate data rate of at most 12 GB/s
DS5300 ANALYSIS
Assume 4+P RAID 5 Data Rate per DS5300 write < 4.5 GB/s read < 5.0 GB/s aggregate write < 9.0 GB/s read < 10.0 GB/s IOP Rate per DS5300 write < 30 Kiop/s read < 150 Kiop/s aggregate write < 60 Kiop/s read < 300 Kiop/s Capacity per DS5300 raw < 130 TB usable < 103 TB aggregate raw < 260 TB usable < 206 TB
Quoted data rates are a conservative estimate, especially for the read rates. Validation is required.
Quoted IOP rates are derived from benchmark tests using different configurations and are provided for informational purposes only. Validation is required.
COMMENTS Since the objective is to optimize the IOP rate per DS5300, faster (15Krpm) but smaller (300 GB/disk) FC disks were chosen. Max IOP performance requires using all of the disks supported by a single DS5300 (i.e., 448) and specialized tuning (e.g., "short stroking"); this tuning will decrease the usable capacity. The number of FC8 ports are configured to also support peak streaming BW. Best practice: Configure with at least 4 partitions for use with GPFS.
SAN #2B-1
AIX/System P - Optimize IOP Performance
P6-p595 FC Ports
16 x FC8 ports per system at most 12 GB/s per system
S A N S W I T C H
FC8
DS5300 ANALYSIS
Aggregate Data Rate write < 18 GB/s read < 20 GB/s Aggregate IOP Rate write < 120 Kiop/s read < 600 Kiop/s
COMMENT Since the objective is to optimize the IOP rate per DS5300, faster (15Krpm) but smaller (300 GB/disk) FC disks were chosen. Max IOP performance requires using all of the disks supported by a single DS5300 (i.e., 448) and specialized tuning (e.g., "short stroking"); this tuning will decrease the usable capacity. The number of FC8 ports are configured to also support peak streaming BW.
Capacity is given as usable capacity FC8 = 8 Gbit/s; usable BW < 760 MB/s
Will fewer FC8 ports still allow peak IOP rate if streaming rate is not important?
SAN #2B-2
AIX/System P - Multi-tiered Solution
P6-p595 FC Ports
24 x FC8 ports per system at most 18 GB/s per system
S A N S W I T C H
FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*
FC8
3 4
3 4
3 4
3 4
12
DS5300 ANALYSIS
Aggregate Data Rate write < 28 GB/s read < 26 GB/s Aggregate IOP Rate write < 200 Kiop/s read < 660 Kiop/s Aggregate Capacity raw < 1.7 PB, usable < 1.4 PB Comment DS5300 optimizes IOP rate DCS9900 optimizes capacity
10 x 60-disk Drawer
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*
10 x 60-disk Drawer
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*
1 2
1 2
3 4
3 4
x3650 M2
x3650 M2
5 x 60-disk Drawer
SAS Disk 160 x disks 16 x 8+2P RAID 6
3 4
3 4
x3650 M2
1 2
1 2
Analysis
Average performance per node is the same. write < 96 MB/s read < 78 MB/s peak performance for any one node SAN: at most 380 MB/s LAN: at most 1500 MB/s Network considerations SAN file system 2 networks smaller queue depth LAN file system simpler network larger queue depth
5 x 60-disk Drawer
SAS Disk 160 x disks (1 TB) 16 x 8+2P RAID 6
The FC8 host connections could be replaced with IB host connections. In that case, the DCS9900 could even be attached to the IB LAN, but that only increases the IB switch port count with little added benefit.
The following pages contain other examples of GPFS configurations further illustrating the versatility of GPFS.
IB Switch
AIX Rack #1 AIX Rack #2
P6-p575
Nodes 19-32
Linux Rack #3
x3550-M2
nodes 33-64
client nodes
Linux Rack #4
x3550-M2
nodes 65-96
client nodes
FC8
P6-p575
Nodes 5-18
client nodes
DCS9900
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*
client nodes
Capacity
raw: 600 TB usable: 480 TB
IB Switch
AIX Rack #1
P6-p575
Nodes 1-14
AIX Rack #2
P6-p575
Nodes 14-28
3 GPFS Subnets 1. p575 client nodes 2. x3550-M2 client nodes 3. all NSD nodes
Linux Rack #4
x3550-M2
nodes 57..88
TbE TbE TbE TbE
client nodes
client nodes
NSD node   NSD node   client nodes
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*
FC8 Switch
DCS9900
IB 12xDDR
2 x FC8
GbE
IB 4xDDR
2 x FC8
As designed, each set of nodes can use up to half of the potential DCS9900 BW, but neither set of nodes can use more than half of it.
Ethernet Switch
Linux Rack #1 Linux Rack #2
x3550-M2
nodes 33-64
client nodes
P6-p595
128 Cores 256 GB RAM
P6-p595
128 Cores 256 GB RAM
x3550-M2
nodes 1-32
client nodes
DS5300
T b E
2x FC 8
RIO
5802
T b E
2x FC 8
T b E
2x FC 8
RIO
5802
T b E
2x FC 8
FC Disk (15krpm) 12 drawers 192 x disks (450 GB) 36 x 4+P RAID 5 12 x hot spares capacity < 63 TB
Capacity
raw: 84 TB usable: 63 TB
SAN configuration at most 2.5 GB/s per p595 if x3550-M2s are idle at most 1.1 GB/s per p595 if x3550-M2s are busy
COMMENT: By design, under load this system is load balanced between both classes of nodes. The TbE network can provide up to half of the potential DS5300 BW to the x3550-M2 nodes leaving the other half for use locally on the p595 nodes.
Full Rack
32 Node Cards
Node Card 32 compute cards - 4x4x2 torus 2 I/O cards 13.9 TF/s 2 TB 8 to 64 x TbE Compute Card 1 chip per card 20 DRAMs Chip 4 cores @ 850 MHz 13.6 GF/s 2.0 GB DDR2 13.6 GF/s 8 MB EDRAM
2 compute cards fit back to back
1 PF/s 144 TB
Formula: 0.85 GHz * 4 cores * 4 pipes per core * 1 FLOP/pipe = 13.6 GF/s
3 Networks in BG/P
1. 3D Torus for point-to-point communications 2. Global tree for reduction, all-to-one communications and file I/O between the I/O and compute nodes 3. 10 Gbit/sec Ethernet (TbE) for file I/O, host interface, control and monitoring
An I/O card is similar to a compute card except that it has a single TbE port. Each node card can be configured with 1, 2, or no I/O cards. Each rack can be configured with 8 to 64 I/O cards - default = 16 I/O cards. I/O cards connect to compute cards over the tree network. Each I/O card acts as a storage client; external nodes act as storage servers.
Node Card
I/O Cards
2 x TbE ports
I/O Node
GPFS Client
NSD Servers E t h e r n e t S w i t c h
Application
records POSIX calls
CIOD
records
libc
mmfsd
Server #1
CNK
Compute Node Kernel tree packets
Linux Kernel
tree packets
BG/P ASC
BG/P ASC
Server #2
Disk
TbE BW per Server 2 TbE per server data rate per server < 1500 MB/s FC BW per Server 2 dual port HBAs per server data rate per server < 1560 MB/s Use RIO drawers to avoid overloading the common GX bus for TbE and PCI-E slots.
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
TbE TbE
#1
PCI-E #2 #3
PCI-X #4 #5
GX direct
GX pass-thru
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
TbE TbE
#1
PCI-E #2 #3
PCI-X #4 #5
GX direct
GX pass-thru
PCI-X2
2x FC 4
IB 12x
PCI-X2
2x FC 4
PCI-X2
2x FC 4
IB 12x
PCI-X2
2x FC 4
aggregate capacity
raw: 37.5 TB usable: 28.0 TB
controller A
4 3 2 1
4 3 2 1
controller A
4 3 2 1
4 3 2 1
DS4800
1 2 3 4 1 2 3 4
controller B
DS4800
1 2 3 4 1 2 3 4
controller B
EXP810-1
4 disk drawers cabling not shown
o o o
EXP810-1
4 disk drawers cabling not shown
o o o
EXP810-4
EXP810-4
GbE Switch
GPFS Parameters
page pool = 4 GB maxMBpS >= 2000 MB/s block size = 1024 KB
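As a sketch of how these parameters might be applied (the file system name fs1 and descriptor file disk.lst are illustrative; pagepool changes normally take effect when mmfsd is restarted):
    mmchconfig pagepool=4G,maxMBpS=2000
    mmcrfs /fs1 fs1 -F disk.lst -A yes -B 1024K -v no     # block size is fixed at file system creation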
DS4800 Parameters
RAID config = 4+P read ahead multiplier = 0 write caching = off write mirroring = off read caching = on segment size = 256 KB cache block size = 16 KB
DS4800-01 RAID arrays 1..12 4 x EXP810 DS4800-02 RAID arrays 13..24 4 x EXP810 DS4800-03 RAID arrays 25..36 4 x EXP810 DS4800-04 RAID arrays 37..48 4 x EXP810
Bandwidth
Aggregate
write < 4.0 GB/s read < 5.6 GB/s
Alternative:
This is an "I/O poor" design using only 8 I/O nodes (the minimum requirement). We could make this an "I/O rich" design by adding all 32 I/O nodes (the maximum allowed), but to be useful we would need to increase the number of building blocks to 8.
The following pages contain some successful legacy designs that are still relevant today.
The previous building blocks all assume the existence of a high BW switch fabric (i.e., TbE or IB). However, many users have existing networks based on GbE only (n.b., no TbE switch ports). This leads to a building block with different granularity.
COMMENT
If a tape backup system is added to the storage cluster, use 2 x3650s with the following configuration for each x3650: dual core, dual socket 16 GB RAM 2 internal GbE ports dual port GbE dual port FC HBA for disk 4 Gb/s dual port FC HBA for tape 4 Gb/s
SATA/2 @ 500 GB/disk 10 Krpm FC @ 146 or 300 GB/disk 15 Krpm FC @ 73 or 146 GB/disk
Passive
x3550-01
G b E G b E
PCI-Ex Slots
P1 P2
x3550-02
G b E G b E
Active
4 3 2 1
PCI-Ex Slots
P1 P2
Passive
Active
COMMENTS
The 2 Ether ports dedicated to the NFS and sysadm networks form a 2-way channel-bond, but it has 2 IP addresses.
sustained peak BW < 240 MB/s
controller A
4 3 2 1
ANALYSIS
Each x3550
"dual core, dual socket" (4 CPUs) 16 GB RAM 2 PCI-Express slots per node
1 dual port HBA @ 4 Gb/s 1 dual port GbE adapter
DS4800
1 2 3 4 1 2 3 4
controller B
EXP810
2 built-in GbE ports at most 380 MB/s per HBA 2 x 2-way "Ether channels"
at most 150 MB/s / Ether channel
The 2 ports dedicated to GPFS are not a channel-bond; using ethernet protocols, they are configured as an active/passive bond under the same IP address.
sustained peak BW < 80 MB/s the GPFS network is only used for GPFS overhead traffic (e.g., tokens, heartbeat, etc.) and thus minimal BW is used
EXP810
Disk Enclosures
2 EXP810 enclosures at most 16 disks per enclosure
6 x 4+P RAID 5 arrays
9.4 TB (raw)
4.5 TB (raw)
16 TB (raw)
GPFS Network
Passive Connections
ANALYSIS
2 Building Blocks
at most 480 MB/s NFS BW limited by the GPFS GbE adapters
G b E G b E
Disk Enclosures
4 EXP810 enclosures at most 16 disks per enclosure
12 x 4+P RAID 5 arrays
controller A 4 3 2 1 4 3 2 1
x3550-01
NSD server
P1 P2
P1 P2
DS4800
1 2 3 4 1 2 3 4 controller B
G b E
G b E
x3550-02
NSD server
P1 P2
P1 P2
EXP810
G b E G b E
x3550-03
NSD server
P1 P2
P1 P2
EXP810
9 TB (raw)
G b E
G b E
x3550-04
NSD server
P1 P2
P1 P2
EXP810
EXP810
96 TB (raw)
GPFS Network NFS and System Administration
56 TB (raw)
G b E
G b E
x3550-01
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-04
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-07
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-10
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-02
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-05
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-08
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-11
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-03
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-06
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-09
NSD server
P1 P2
P1 P2
G b E
G b E
x3550-12
NSD server
P1 P2
P1 P2
controller A
4321
4 321
DS4800
1234 1234
controller B
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
EXP810
GbE Switch
(at least 32 ports)
x335-01 x335-02 x335-03 x335-04 x335-05 x335-06 x335-07 x335-08 x335-09 x335-10 x335-11 x335-12 x335-13 x335-14 x335-15 x335-16
scsi (internal SCSI disks; 2 x GbE ports per node)
x335-17 x335-18 x335-19 x335-20 x335-21 x335-22 x335-23 x335-24 x335-25 x335-26 x335-27 x335-28 x335-29 x335-30 x335-31 x335-32
This was a POC test done by a customer. Each node is both a compute client and an NSD node. GPFS was built on the 2nd internal SCSI disk. They now use it in production on clusters of 128 nodes. GPFS 2.3, RH 9.1. Feeds and Speeds
internal SCSI disk ~= 30 MB/s aggregate ~= 1 GB/s
Risks
many single points of failure: if a node crashes, the GPFS file system is unavailable until the node is on-line; if a disk fails, the file system will be corrupted and data will be lost. NOTE: GPFS robustness design requires "twin tailed disk"
Advantage
very inexpensive excellent performance scaling
COMMENTS: This configuration is not recommended since it presents a single point of failure risk. While less than optimal, this risk can be eliminated using GPFS replication.
Blade Specs
IBM HS20 dual Xeon @ 2.8 GHz 4 GB RAM 2 IDE drives (40 GB at 5400 RPM)
2 x IDE drives per blade (14 blades in the diagram)
GbE Ports
Benchmark Results
FC Ports
bonnie (see http://linux.maruhn.com/sec/bonnie.html) single task read rate = 80 MB/s aggregate (1 task per blade) read rate = 560 MB/s (i.e., 40 MB/s per blade) baseline test (read from a single local disk using ext2) read rate = 30 MB/s
COMMENT: the primary application is blast
COMMENTS: This configuration is not recommended since it presents a single point of failure risk. While less than optimal, this risk can be eliminated using GPFS replication.
Heterogenous Cluster
System
1536-node, 100 TF pSeries cluster 2 PB GPFS file system (one mount point) 500 RAID controller pairs, 11000 disk drives 126 GB/s parallel I/O measured to a single file (134 GB/s to multiple files)
3 Switches 2 Gb FC
. . .
Frame-15
Frame-01
4 x FAStT600 8 x EXP700
FAStT600-01 EXP700
switch 02
to Frame-15 8 connections
. . .
to Frame-01 8 connections
switch 03
EXP700
Written Exercise #1
Suppose you have been asked to design the storage subsystem. The cluster will be running Linux with 256 4-way compute nodes (i386 or x86-64). The application job mix will be varied. The message passing traffic will vary from light to moderate (bursts of large messages that are latency tolerant) to heavy (numerous small packet messages that are latency intolerant for the duration of the job, or jobs that will have large packet transfers upon startup and close to termination that are latency tolerant); the customer believes that node message passing BW will be at most 50 to 80 MB/s. The storage I/O will also be variable. Some jobs will require lots of BW at the beginning and end of the jobs using a streaming access pattern (large records, sequential access), others will require sustained, moderate access over the life of the job using a streaming pattern, and 1 job will require sustained, but light access over the life of the job, but the access pattern will be small records, irregularly distributed over the seek offset space. The jobs on the cluster are parallel. Finally, there are about 200 users with Windows or Linux based PCs in their office that must access this cluster's file system. Typically, the aggregate file system BW for the cluster will be in the neighborhood of 1 GB/s, though aggregate burst rates could be as high as 2 GB/s. Individual nodes must be able to sustain storage I/O BW up to 60+ MB/s, though more typical node rates are less than 5 MB/s. The file system must start at 50 TB, but be expandable to 100 TB in the future.
Design the cluster by specifying - network (GbE, Myrinet, IB or mixed) and its topology - storage nodes - disk and controllers What additional information do you need in order to make a better specification?
The dog ate my homework!
Written Exercise #2
Suppose you have been asked to design a new storage subsystem to be shared by 2 clusters. The first cluster is a new one that consists of 32 P6-p575 nodes (32 cores per node with 64 GB of RAM) using IB (4x DDR) for the LAN. The other cluster is the legacy system that was designed for written exercise #1. The new storage subsystem needs to be accessible by both clusters though it will be primarily used by the new pSeries cluster (85% usage) as a scratch file system; the legacy cluster accesses on the new file system will mostly be reads. This new pSeries cluster must also be able to access the storage subsystem in the legacy cluster, though it will only account for 15% of the storage work load on that storage subsystem with most of the accesses being reads. The job mix is varied on the new pSeries cluster; it will be used for parallel jobs with heavy message passing requirements, but the storage I/O access pattern will be largely streaming oriented (large records sequentially accessed in large files), but there will be one job flow that must access many small files (2K to 256K with the average being 8K). The small file workload will account for 25% of the overall workload on the pSeries cluster. The new storage system must be able to support high data rates (8 GB/s) for the streaming work load and high IOP rates for the small file work load (80,000 files per second). The capacity of the new storage system must be 250 TB.
Design the cluster by specifying - storage network (GbE, Myrinet, IB or mixed) and its topology - storage nodes - disk and controllers What additional information do you need in order to make a better specification?
Home Work
Ether
Let's take a look at some selected sysadm details. Much of this information is also relevant to programmers. This discussion is not intended to be exhaustive; rather it is intended to provide general guidance and examples (i.e., "give me an example and I will figure out how to do it"). The emphasis is more on concept than syntax. Nor are all options explained. See the manuals for further details.
COMMENT: Unless stated otherwise, specific examples are based on GPFS 3.1. For the most part, there is little or no change between GPFS versions 2.3, 3.1 and 3.2 for these commands.
/gpfsdocs... pdf and html versions of basic GPFS documents /include... include files for GPFS specific APIs, etc. /lib... GPFS libraries (e.g., libgpfs.a, libdmapi.a) /samples... sample scripts, benchmark codes, etc.
/var/adm/ras
error logs
files... mmfs.log.<time stamp>.<hostname> (new log every time GPFS restarted) links... mmfs.log.latest, mmfs.log.previous
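For example, the current log can be watched with a command such as (path as listed above):
    tail -f /var/adm/ras/mmfs.log.latest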
/tmp/mmfs
used for GPFS dumps sysadm must create this directory see mmchconfig
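A minimal sketch (dataStructureDump is the mmchconfig attribute that points GPFS at this directory; see the mmchconfig discussion later in this section):
    mkdir -p /tmp/mmfs
    mmchconfig dataStructureDump=/tmp/mmfs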
/var/mmfs
GPFS configuration files
GPFS upgrades
https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/home.html Once the base version of GPFS is installed, upgrades can be freely downloaded and installed for AIX and Linux
GPFS provides a number of commands to list parameter settings, configuration components and other things. I call these the "mmls" or "mm list" commands.
COMMENT: By default, nearly all of the mm commands require root authority to execute. However, many sysadm's reset the permissions on mmls commands to allow programmers and others to execute them as they are very useful for the purposes of problem determination and debugging.
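One site-specific approach, shown only as a sketch (assumes the default install location /usr/lpp/mmfs/bin; some mmls commands may still require root for certain operations, and whether this is appropriate depends on local security policy):
    chmod o+rx /usr/lpp/mmfs/bin/mmls*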
COMMENTS: Lists configuration parameters applying to the cluster. Generally only lists configuration parameters that have been changed.
GPFS cluster configuration servers:
-----------------------------------
  Primary server:    gpfs_node1
  Secondary server:  gpfs_node2

 Node  Daemon node name  IP address      Admin node name  Designation
----------------------------------------------------------------------
   1   gpfs_node1        192.168.42.101  gpfs_node1       quorum
   2   gpfs_node2        192.168.42.102  gpfs_node2       quorum
   3   gpfs_node3        192.168.42.103  gpfs_node3       quorum
   4   gpfs_node4        192.168.42.104  gpfs_node4       quorum
status   availability   storage pool
-------  -------------  -------------
ready    up             system
ready    up             system
ready    up             system
ready    up             system
status
suspended: indicates that data is to be migrated off this disk
being emptied: transitional status in effect while a disk deletion is pending
replacing: transitional status in effect for the old disk while a replacement is pending
replacement: transitional status in effect for the new disk while a replacement is pending
availability
up: disk is available to GPFS for normal read and write operations
down: no read and write operations can be performed on this disk
recovering: an intermediate state for disks coming up; during this state GPFS verifies and corrects data; read operations can be performed, but write operations cannot
unrecovered: the disk was not successfully brought up
mmlsnsd
display current NSD information in the GPFS cluster -X cool new option for GPFS 3.2
Maps the NSD name to its disk device name in /dev on the local node and, if applicable, on the NSD server nodes. Using the -X option is a slow operation and is recommended only for problem determination.
[root]# mmlsnsd -X -d "hd3n97;sdfnsd;hd5n98"

 Disk name  NSD volume ID     Device       Devtype  Node name                Remarks
--------------------------------------------------------------------------------------------
 hd3n97     0972846145C8E927  /dev/hdisk3  hdisk    c5n97g.ppd.pok.ibm.com   server node,pr=no
 hd3n97     0972846145C8E927  /dev/hdisk3  hdisk    c5n98g.ppd.pok.ibm.com   server node,pr=no    (AIX)
 hd5n98     0972846245EB501C  /dev/hdisk5  hdisk    c5n97g.ppd.pok.ibm.com   server node,pr=no
 hd5n98     0972846245EB501C  /dev/hdisk5  hdisk    c5n98g.ppd.pok.ibm.com   server node,pr=no
 sdfnsd     0972845E45F02E81  /dev/sdf     generic  c5n94g.ppd.pok.ibm.com   server node          (Linux)
 sdfnsd     0972845E45F02E81  /dev/sdm     generic  c5n96g.ppd.pok.ibm.com   server node
-p primary GPFS cluster configuration server node -s secondary GPFS cluster configuration server node -R specify remote file copy command (e.g., rcp or scp) -r specify remote shell command (e.g., rsh or ssh) The remote copy and remote shell commands must adhere to the same syntax format as the rcp and rsh commands, but may implement an alternate authentication mechanism.
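A minimal sketch of creating a cluster with these options (GPFS 3.1+ syntax; the node names and node.lst contents are illustrative):
    # node.lst, one NodeName:NodeDesignations entry per line, e.g.
    #   gpfs_node1:quorum
    #   gpfs_node2:quorum
    #   gpfs_node3:quorum
    #   gpfs_node4:client
    mmcrcluster -N node.lst -p gpfs_node1 -s gpfs_node2 -r /usr/bin/ssh -R /usr/bin/scp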
mmshutdown -a
unmount all GPFS file systems and shut down the mmfsd daemons on all nodes; always do this before rebooting nodes if possible
If you do not need to do this on all nodes, the -W or -w parameters allow you to specify which nodes in the cluster to start up/shut down mmfsd.
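For example (illustrative only; later releases also accept -N to name a subset of nodes):
    mmshutdown -a      # unmount GPFS file systems and stop mmfsd everywhere
    mmstartup -a       # start mmfsd on all nodes in the cluster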
dsk.lst is modified for use as the input file to the mmcrfs command
-v no
verify the disk is not already formatted as an NSD; a value of no means do NOT verify
DiskName: the disk name as it appears in /dev.
ServerList: a comma-separated list of NSD server nodes. You may specify up to eight NSD servers in this list. The defined NSD will preferentially use the first server on the list; if the first server is not available, the NSD will use the next available server on the list.
DiskUsage: dataAndMetadata (default), dataOnly, or metadataOnly.
FailureGroup: GPFS uses this information during data and metadata placement to assure that no two replicas of the same block are written in such a way as to become unavailable due to a single failure. All disks that are attached to the same adapter or NSD server should be placed in the same failure group. Applies only to GPFS in non-SAN mode.
DesiredName: the name you desire for the NSD to be created. Default format: gpfs<integer>nsd.
StoragePool: the name of the storage pool that the NSD is assigned to; this parameter is used by the mmcrfs command.
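A sketch of a descriptor file and the corresponding command (GPFS 3.2/3.3 descriptor format as described above; the device, server, NSD, and pool names are illustrative):
    # dsk.lst: DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
    sdb:nsd01,nsd02::dataAndMetadata:1:ds_lun01nsd:system
    sdc:nsd02,nsd01::dataAndMetadata:1:ds_lun02nsd:system

    mmcrnsd -F dsk.lst -v no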
dsk.lst is modified for use as the input file to the mmcrfs command
-v no
verify the disk is not already formatted as an NSD; a value of no means do NOT verify
NOTE: This is a Linux example. Under AIX disk names are generally of the form hdisk<x>
The mmcrvsd output disk descriptor file can no longer be used as input to the mmcrfs command to build the file system. It is necessary to create NSDs (via the mmcrnsd command) using the output disk descriptor file from the mmcrvsd command after creating the VSDs.
mmcrlv
no longer required and no longer exists If you do create LVs manually using crlv, GPFS will not configure properly!
-A: yes -> mount after starting mmfsd, no -> manually mount, automount -> mount at first use (default is yes)
-B: block size (16K, 64K, 256K (default), 512K, 1024K, 2048K, 4096K); if you choose a block size larger than 256 KB, you must run mmchconfig to change the value of maxblocksize to a value at least as large as BlockSize
-E: specifies whether or not to report exact mtime values
-m: default number of copies (1 or 2) of i-nodes and indirect blocks for a file
-M: default max number of copies of inodes, directories, indirect blocks for a file
-n: estimated number of nodes that will mount the file system
-N: max number of files in the file system (default = sizeof(file system)/1M)
-Q: activate quotas when the file system is mounted (default = NO)
-r: default number of copies of each data block for a file
-R: default maximum number of copies of data blocks for a file
-S: suppress the periodic updating of the value of atime
-v: verify that specified disks do not belong to an existing file system
-z: enable or disable DMAPI on the file system (default = no)
-D: specify nfs4 to allow "deny-write open lock" to block writes for NFS V4 exported file systems (default = posix)
-k: specify the authorization protocol; the options are <posix | nfs4 | all>
Typical example
mmcrfs /fs fs -F disk.lst -A yes -B 1024k -v no
GPFS provides a number of commands to change configuration and file system parameters after being initially set.
I call these the "mmch" or "mm change" commands. There are some GPFS parameters which are initially set only by default; the only way to modify their value is using the appropriate mmch command. n.b., There are restrictions regarding changes that can be made to many of these parameters; be sure to consult the Concepts, Planning and Installation Guide for tables outlining what parameters can be changed and under which conditions they can be changed. See the Administration and Programming Reference manual for further parameter details.
-i: changes take effect immediately and are permanent
-I: changes take effect immediately but do not persist after GPFS is restarted
These parameters do not apply to all attributes; carefully review the Administration and Programming Guide for details.
Attributes (a selection of the more common or nettlesome ones):
autoload: Start mmfsd automatically when nodes are rebooted. Valid values are yes or no.
dataStructureDump: the default is /tmp/mmfs
do not use a GPFS directory (it may not be available) warning: files can be large (200 MB or more)... be sure to delete them when done
designation: explicitly designate client, manager, quorum, or nonquorum nodes
maxblocksize: default is 1024K; n.b., the mmcrfs block size (-B) cannot exceed this
maxMBpS: a data rate estimate (MB/s) of how much data can be transferred in or out of one node
This value is used in calculating the amount of I/O that can be done to effectively prefetch data for readers and write-behind data from writers. By lowering this value, you can artificially limit how much I/O one node can put on all of the disk servers. This is useful in environments in which a large number of nodes can overrun a few storage servers. The default is 150 MB/s, which can severely limit performance on HPS ("federation") based systems.
tiebreakerDisks: to use this feature, provide a list of disk names (i.e., their NSD names)
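A few illustrative mmchconfig invocations using the attributes above (NSD names are placeholders; changing tiebreakerDisks normally requires GPFS to be stopped):
    mmchconfig autoload=yes
    mmchconfig maxMBpS=2000
    mmchconfig tiebreakerDisks="nsd01;nsd02;nsd03"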
Options available under mmcrfs, but not available under mmchfs: -B, -M, -n, -N, -R, -v; changing these parameters requires rebuilding the FS. Carefully review the following documents for more details: Concepts, Planning and Installation Guide; Administration and Programming Reference.
GPFS provides dynamic means to add and/or remove many components. This gives the sysadm a convenient means to grow the current infrastructure or to re-allocate resources to other places by deleting them from an existing system.
GPFS provides the means to dynamically add and remove disks from a GPFS cluster.
mmadddisk <device> -F disk.lst -r -v {yes | no} mmdeldisk <device> -F disk.lst -r -c
<device> is the GPFS device in /dev disk.lst entries in the form DiskName:::DiskUsage:FailureGroup
see documentation for details
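A sketch using the syntax above (the file system device, NSD name, and failure group are illustrative):
    # newdisk.lst: DiskName:::DiskUsage:FailureGroup
    #   gpfs10nsd:::dataOnly:2
    mmadddisk /dev/fs1 -F newdisk.lst -r      # -r rebalances data across the disks
    mmdeldisk /dev/fs1 gpfs10nsd -r           # data is migrated off the disk before it is removed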
GPFS provides the means to dynamically add and remove nodes from a GPFS cluster.
mmaddnode -n node.lst
add nodes to an existing cluster and create mount points and device entries on the new nodes
under some circumstances (e.g., re-installing a node), it may be necessary to copy the mmsdrfs file to the new nodes (n.b., get a copy from the primary cluster server... see mmlscluster)
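A minimal sketch following the syntax above (node names are illustrative; later releases use -N rather than -n; the configuration file is normally /var/mmfs/gen/mmsdrfs):
    # node.lst
    #   gpfs_node5:client
    #   gpfs_node6:client
    mmaddnode -n node.lst
    # if required, copy the configuration file from the primary cluster configuration server:
    scp gpfs_node1:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs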
mmdelnode -n node.lst
remove nodes from an existing cluster notes and caveats
primary/secondary cluster servers, primary/secondary NSD servers; must first unmount the GPFS file system; use caution when deleting quorum nodes
There are a number of ways to measure file system performance. There are some very simple techniques that provide useful insight while there are other more elaborate alternatives. Some are common and some are unique to GPFS.
Measuring Performance
Benchmarking
good ones
IOR, xdd bonnie, iozone
See Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, 1996 (pp. 20-21)
Kernels
nsdperf
very few of these exist for measuring file system performance.
Toy benchmarks
dd, home grown varieties
Measurement Tools
System tools (iostat, nmon) GPFS commands (mmpmon, nsdperf) Controller tools
Measuring Bandwidth
iostat
Measuring I/O time within the application using timing functions like rtc() or gettimeofday() is useful from a job perspective, but does not accurately measure actual I/O rates (e.g., such measurements can overlook locking delays, include PVM message passing overhead, ignore variance).
The iostat command shows actual disk activity (this is the AIX version).
    iostat <time interval> <number of samples>
Use dsh to collect from multiple nodes:
    export WCOLL=/wcoll
    dsh -a
    > iostat 10 360 > `hostname -s`.iostat
rsh can also be used
See man pages for more options. See also the vmstat command for CPU oriented measures.
Measuring Bandwidth
iostat
flash008> iostat 10 360

tty:  tin   tout     avg-cpu:  % user  % sys  % idle  % iowait
      0.0   0.0                49.8    5.2    44.9    0.1

Disks:   % tm_act   Kb_read       Kb_wrtn
hdisk1   0.0        0             0
hdisk0   0.2        5537695       4424062
hdisk3   16.5       2119811596    709528896
hdisk5   0.0        5262          0
hdisk2   0.0        5262          0
hdisk4   16.3       39842779      710476158

hdisk0, hdisk1: local JFS directory
hdisk3, hdisk4: mounted locally on this VSD server node
hdisk2, hdisk5: mounted locally on this VSD server node only in failover mode

(Subsequent intervals show a nearly idle system: avg-cpu % user 0.3, % sys 8.1, % idle 91.6, % iowait 0.0, then % user 0.1, % sys 3.4, with Kb_read/Kb_wrtn of 0 on all disks.)
Measuring Bandwidth
iostat
Meaning of iostat columns:
%usr - percent application CPU time
%sys - percent of kernel CPU time
%idle - percent of CPU idle time during which there were no outstanding disk I/O requests
%iowait - percent of CPU idle time during which there were outstanding disk I/O requests
%tm_act - percent of time that the hdisk was active (i.e., bandwidth disk utilization)
Kbps - volume of data read and/or written to the hdisk in kilobytes per second
tps - transfers (i.e., I/O requests) per second to the hdisk
Kb_read - total data read from the given hdisk over the last time interval in KB
Kb_wrtn - total data written to the given hdisk over the last time interval in KB
Miscellaneous
Starting with version 3.1, it can collect statistics from multiple nodes. It requires root access. Up to 5 instances of mmpmon can be run on one node at one time.
Measuring Latency
Multi-node mmpmon Usage
Multi-node rhist example: apply "rhist on" to nodes n01 n02 n03 n04, apply "rhist nr 1;10;30;100" to nodes n01..n04 to set new histogram ranges, then apply "rhist off" to nodes n01..n04.
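A sketch of driving mmpmon from an input file (node names follow the slide; fs_io_s, rhist on/off, and nlist add are standard mmpmon requests, but the exact set to run depends on what is being measured):
    # cmds.in contains:
    #   nlist add n01 n02 n03 n04
    #   rhist on
    #   fs_io_s
    #   rhist off
    mmpmon -i cmds.in -r 1 -d 1000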
Measuring Bandwidth
nsdperf
Measuring Bandwidth
Code Instrumentation
Other Topics
Waiters (i.e., waiting threads)
see p. 48, 89, of Problem Determination Guide
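One common way to look at them (mmfsadm is a diagnostic, service-oriented command, so treat this as a sketch):
    mmfsadm dump waiters      # lists mmfsd threads that are waiting and how long they have waited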
mmfsadm
see p. 9 of Problem Determination Guide; mmfsadm dump config; mmfsadm dump all
WARNING: Creates a file up to 100's of MB in size
mmfsadm cleanup
alternative to mmshutdown; designed to recycle mmfsd on a node without hanging
gpfs.snap
see pp. 6-8, of Problem Determination Guide
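A minimal sketch of collecting the data (run as root; gpfs.snap gathers logs and configuration into an archive for support and reports where it wrote the archive, typically under /tmp):
    /usr/lpp/mmfs/bin/gpfs.snap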
Launch mmtrace on all nodes in the cluster. When a node loses quorum, GPFS executes the following script, which recycles the tracing (thereby generating a trace report). The trace file is called lxtrace.trc.<hostname>.
> cat mmQuorumLossExit
echo `hostname` LOST GPFS QUORUM
echo RECYCLING mmtrace
date
/usr/lpp/mmfs/bin/mmtrace
COMMENT: A GPFS trace is fixed size (default is 16M... set environment variable TRCFILESIZE to change the size). Trace data wraps around once it hits the end of the file. The time duration represented by the trace file is proportional to its size.
GPFS Security
Security vs. performance and convenience... a classic example of being caught between a rock and a hard place
GPFS Security
Defining Administrative Domain
GPFS Security
Administrative Access
It is necessary to properly configure security in order to administer GPFS; this includes the following...
Provide standard root access to designated system administrators (most GPFS commands require root authority).
Establish an authentication method between nodes in the GPFS cluster.
Designate a remote communication program for remote shell and remote file copy commands.
A subset of nodes must allow root-level communication without the use of a password and without any extraneous messages; common choices are ssh/scp and rsh/rcp.
designated using the mmcrcluster and mmchcluster commands the selected option must use the rsh/rcp CLI
GPFS uses remote shell and remote file copy commands to do things like...
GPFS commands executed by a system administrator on a given node propagate configuration information to, and perform administrative tasks on, other nodes in the cluster. GPFS automatically communicates changes of system state across the nodes of a cluster.
TRUSTED
mmchconfig adminMode=allToAll
The domain of trust can be extended over a WAN via GPFS multi-cluster. Use OpenSSL for access security. (n.b., root squash option)
C a m p u s W i d e N e t w o r k
NSD Server - 01 NSD Server - 02 NSD Server - 03 NSD Server - 04 NSD Server - 05 NSD Server - 06 NSD Server - 07 NSD Server - 08
COMMENTS: Only the nodes within trusted network have direct access to GPFS file system User accounts do NOT exist on nodes in the trusted network User access is indirect via job schedulers, login nodes, etc. FOOTNOTE * Prior to version 3.3, this was the only option for a GPFS cluster.
TRUSTED
C a m p u s W i d e N e t w o r k
mmchconfig adminMode=central
NO root access on the client frames (Frame #1 ... Frame #8)
Each frame: Ethernet Switch, client - 01 through client - 32
NSD Server - 01 NSD Server - 02 NSD Server - 03 NSD Server - 04 NSD Server - 05 NSD Server - 06 NSD Server - 07 NSD Server - 08
COMMENTS: Only the nodes within trusted network have direct access to GPFS file system User accounts do NOT exist on nodes in the trusted network User access is indirect via job schedulers, login nodes, etc. FOOTNOTE * mmchconfig adminMode is a new feature in version 3.3.
GPFS Security
Example Configuring Passwordless ssh/scp Authentication
[root@nsd1 ~]# cd .ssh                              (Create the /root/.ssh directory if it does not exist.)

Generate the public/private key pair (the other option is dsa); leave the passphrase responses blank to avoid passwords:
[root@nsd1 .ssh]# ssh-keygen -t rsa -f id_rsa
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
dc:74:17:46:0e:ea:ad:96:50:df:d3:bf:99:86:d6:c8 root@nsd1

Append the public key file to the authorized_keys file:
[root@nsd1 .ssh]# cat id_rsa.pub >> authorized_keys

Be sure you can ssh to yourself without a password; all nodes must be able to do this:
[root@nsd1 .ssh]# ssh nsd1
The authenticity of host 'nsd1 (172.31.1.78)' can't be established.
RSA key fingerprint is d8:4a:cd:96:45:25:34:19:34:fa:23:98:36:c0:ed:7e.
Are you sure you want to continue connecting (yes/no)? yes          (This is normal the first time you ssh to a node.)
Warning: Permanently added 'nsd1,172.31.1.78' (RSA) to the list of known hosts.
Last login: Thu Oct 9 17:01:06 2008 from nsd1
[root@nsd1 .ssh]# exit
Connection to nsd1 closed.

[root@nsd1 .ssh]# dir
total 20
-rw------- 1 root root  391 Oct 9 17:06 authorized_keys    (Be sure the permissions are 600 and the owner/group is root.)
-rw------- 1 root root 1675 Oct 9 17:05 id_rsa             (Be sure the permissions are 600 and the owner/group is root.)
-rw-r--r-- 1 root root  391 Oct 9 17:05 id_rsa.pub         (Be sure the permissions are 644 and the owner/group is root.)
-rw-r--r-- 1 root root  398 Oct 9 17:06 known_hosts
[root@nsd1 .ssh]#
The known_hosts file is generated "automagically" when a remote node first logs into the local node via ssh. In this example, it was created when we "sshed" to ourselves and answered "yes".
GPFS Security
Example Configuring Passwordless ssh/scp Authentication
It is necessary for all nodes in the GPFS cluster to have ssh keys. It is common practice to generate the keys on one node and copy them to all other nodes in the GPFS cluster.

[root@nsd1 .ssh]# for i in 2 3 4
> do
> scp authorized_keys id_rsa id_rsa.pub known_hosts nsd$i:.ssh
> done
The authenticity of host 'nsd2 (172.31.1.79)' can't be established.
RSA key fingerprint is 48:db:31:71:76:4f:25:f0:37:b1:62:29:d6:87:5e:4e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'nsd2,172.31.1.79' (RSA) to the list of known hosts.
root@nsd2's password: ********
authorized_keys    100%  391   0.4KB/s   00:00
id_rsa             100% 1675   1.6KB/s   00:00
id_rsa.pub         100%  391   0.4KB/s   00:00
known_hosts        100%  796   0.8KB/s   00:00
The authenticity of host 'nsd3 (172.31.1.80)' can't be established.
RSA key fingerprint is e9:96:bc:31:a6:7f:e5:29:92:06:f3:ac:3d:5a:2b:3c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'nsd3,172.31.1.80' (RSA) to the list of known hosts.
root@nsd3's password: ********
authorized_keys    100%  391   0.4KB/s   00:00
id_rsa             100% 1675   1.6KB/s   00:00
id_rsa.pub         100%  391   0.4KB/s   00:00
known_hosts        100% 1194   1.2KB/s   00:00
The authenticity of host 'nsd4 (172.31.1.81)' can't be established.
RSA key fingerprint is e2:3d:1b:3f:ef:6f:b8:bd:5e:0a:ab:e0:56:1b:83:39.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'nsd4,172.31.1.81' (RSA) to the list of known hosts.
root@nsd4's password: ********
authorized_keys    100%  391   0.4KB/s   00:00
id_rsa             100% 1675   1.6KB/s   00:00
id_rsa.pub         100%  391   0.4KB/s   00:00
known_hosts        100% 1592   1.6KB/s   00:00
[root@nsd1 .ssh]#

Annotations:
Answering "yes" to the authenticity request causes ssh to "automagically" append the remote host's public key to the local known_hosts file. Subsequent logins to any of these remote nodes will no longer encounter this request.
The first ssh/scp access to a remote node requires a password. After the keys have been properly copied to these other nodes, a password challenge will no longer happen.
GPFS Security
Example Configuring Passwordless ssh/scp Authentication
[root@nsd1 .ssh]# ssh nsd1 ssh nsd2 ssh nsd3 ssh nsd4 ssh nsd1 date
Host key verification failed.
This is a simple test to be sure ssh is configured properly. It failed because the known_hosts file was incomplete on the other nodes in the GPFS cluster.

TRICK: Since the known_hosts file on the local node is now complete after copying the keys to all of the other nodes in the GPFS cluster, simply copy it again to all of the other nodes.
[root@nsd1 .ssh]# for i in 2 3 4
> do
> scp known_hosts nsd$i:.ssh
> done
known_hosts 100% 1592 1.6KB/s 00:00
known_hosts 100% 1592 1.6KB/s 00:00
known_hosts 100% 1592 1.6KB/s 00:00
[root@nsd1 .ssh]# ssh nsd1 ssh nsd2 ssh nsd3 ssh nsd4 ssh nsd1 date
Thu Oct 9 17:14:45 EDT 2008
The test completed properly this time.
This test is not foolproof, however.
[root@nsd1 .ssh]# ls -al
drwx------  2 root root ... .     <- Be sure the permissions are 700 and the owner/group is root.
drwxr-x--- 12 root root ... ..    <- Be sure the permissions are 750 and the owner/group is root.
[root@nsd1 ~]#
WARNING: Some implementations of ssh/scp may not allow passwordless access if the permissions are not set properly.
COMMENT: This is a tedious process! For large clusters, automated tools are used to do this task.
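As a hedged illustration of that automation, the loop below pushes the key files from the node where they were generated to every other node named in a file; the file /root/nodes.lst is a hypothetical node list (one hostname per line), and production clusters typically use tools such as pdsh/pdcp or a configuration manager instead.

[root@nsd1 .ssh]# for host in $(grep -v "^$(hostname)$" /root/nodes.lst)
> do
> scp authorized_keys id_rsa id_rsa.pub known_hosts ${host}:.ssh/
> done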
The following pages are a potpourri of practical sysadm and tuning experience (often learned late at night under duress :->).
Read/Modify/Write Penalty
Choosing N in N+P in RAID 5 Configurations
A RAID 5 "stripe" is N * segment_size where segment_size is the size of the block of data written to 1 physical disk in the RAID 5 array
If segment_size = 256K with 4+P RAID 5 array, then the stripe_size = 1024K
The GPFS block_size should equal the RAID 5 stripe size for best performance. Since the GPFS block_size is not arbitrary (it must be a power of 2, e.g., 256K, 512K, 1024K), this implicitly restricts the choices for N and the segment_size if optimum performance is to be achieved. For example...
On a DS4000 system, N = { 4 | 8 } If GPFS block_size=1024K, then
if N == 4, then segment_size=256K if N == 8, then segment_size=128K
If N+P and the GPFS block_size are consistent, then block_size == N * segment_size and the full stripe is overwritten in a single operation; this yields the best performance and is sometimes called a "full stride write". If they are not consistent, then it is necessary to read the RAID 5 stripe, update it, and write it back; this significantly reduces performance. (See the configuration sketch below.)
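As a rough configuration sketch (not from the original slides): for a 4+P RAID 5 array with a 256K segment size the full stripe is 4 * 256K = 1024K, so a matching file system would be created with -B 1024K. The device name gpfs1 and descriptor file disk.lst are illustrative assumptions.

# 4+P RAID 5 with segment_size = 256K  =>  stripe = 4 * 256K = 1024K
# Create the file system with a matching GPFS block size (names are illustrative).
mmcrfs /gpfs1 gpfs1 -F disk.lst -B 1024K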
The first ls does not update atime; atime is updated when a file is actually accessed (ls accesses the directory). I do not have data on the cost of atime updates; it really depends on the workload. The reason for recommending -E no (for mtime) is that on some systems we have observed an impact of exact mtime tracking on shared file updates (and variability in performance). My guess would be that atime is a lesser issue than mtime (what is the workload that makes them concerned about the performance impact of atime updates?). -E no means no exact mtime. For suppressing atime updates you would use -S yes. Both require remounting the file system.
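A hedged example of applying these settings to an existing file system (the device name gpfs1 is an assumption); both options take effect only after the file system is remounted.

mmchfs gpfs1 -S yes     # suppress atime updates
mmchfs gpfs1 -E no      # do not maintain exact mtime values
mmumount gpfs1 -a       # unmount on all nodes ...
mmmount gpfs1 -a        # ... and remount so the changes take effect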
What is PHOENIX?
It is the "high availability" layer in GPFS today. Replaces the RSCT service used by GPFS in its AIX days
A slow or improperly configured LUN in a GPFS file system can slow down performance for the entire file system. Use dd to isolate the performance of a given LUN; for example, read a SCSI device in Linux:
time dd if=/dev/sdc of=/dev/null bs=1024K count=2048
Use caution when writing to a SCSI device... it will "clobber" any file system on it.
See the GPFS: Concepts, Planning, and Installation Guide for further information.
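A hedged sketch for comparing several LUNs (the device names are illustrative); a LUN whose read rate is far below that of its peers deserves closer inspection.

# Read 2 GB from each candidate LUN and compare the elapsed times.
for dev in /dev/sdc /dev/sdd /dev/sde /dev/sdf
do
    echo "=== $dev ==="
    time dd if=$dev of=/dev/null bs=1024K count=2048
done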
Suspending GPFS
Fine Print:
Use the mmfsctl command to issue control requests to a particular GPFS file system. The command is used to temporarily suspend the processing of all application I/O requests, and later resume them, as well as to synchronize the file system's configuration state between peer clusters in disaster recovery environments.
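A hedged usage sketch (the device name gpfs1 is an assumption): suspend application I/O, perform the storage-level action, then resume.

mmfsctl gpfs1 suspend    # flush dirty data and suspend new application I/O
# ... perform the controller-level operation (e.g., a hardware snapshot) here ...
mmfsctl gpfs1 resume     # resume normal I/O processing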
SAN Congestion
An Example
This analysis is based on the Brocade SilkWorm 48000 with 4 Gb/s FC fabric
2 SAN switches with 2 x 32 port blades 2 ASICs per blade with 16 ports per ASIC total ports available = 128 total ports used = 56
Empirical tests show that the effective ASIC BW < 1100 MB/s
test code: dd to raw disks, reading 4096 records with sizeof(record) = 1M
effective BW is the BW measured by the application
total BW through the ASIC is 2200 MB/s (accounting for the data-in and data-out streams)
Properly distribute host and controller connections across all ASICs to avoid ASIC saturation
see the cabling example on the next page; it requires using all DS4800 host-side ports in this example
BEST PRACTICE: deploy cabling to avoid all "inter-ASIC" traffic
For completeness, each ASIC connects to a control processor enabling 32 Gb/s simplex or 64 Gb/s duplex inter-ASIC communication; however, the electronics of the ASIC do not appear to be able to handle that much aggregate BW.
This SAN cabling issue does not impact the standard GPFS NSD configuration.
In the standard NSD configuration, a SAN is not necessary. Moreover, each host port typically will be accessed by only 1 HBA; i.e., there is a 1:1 HBA to host port ratio. In this multi-cluster VSD configuration, there is a 3:1 HBA to host port ratio.
SAN Congestion
The View from the Perspective of fcs0
Legend: DS0<1|2|3|4>, Array ID {1...12}
Zoning
Figure: zoning from the perspective of fcs0. Nodes node1 through node6 each have four HBAs (fcs0, fcs1, fcs2, fcs3) cabled through the two switch ASICs to host ports A1-A4 and B1-B4 on DS01 through DS04; each DS4800 is split into partition #1 (arrays 1, 2, 3 and 7, 8, 9) and partition #2 (arrays 4, 5, 6 and 10, 11, 12) across controllers A and B.
Notice that there is no data transmission between the 2 ASICs when accessing RAID arrays in partition #1.
SAN Congestion
The Complete Cabling View
Legend: DS0<1|2|3|4>, Array ID {1...12}
Zoning
Figure: the complete cabling view. All host ports (A1-A4, B1-B4) on DS01 through DS04 are cabled, with the HBAs of node1 through node6 distributed across the two switch ASICs as on the previous page; the partitioning of each DS4800 is unchanged.
Notice that all host ports are being used. This does not increase BW (cf using only 4 host ports per DS4800), but this makes it possible to avoid inter-ASIC traffic.
The dump is mostly binary, but the following text record can be seen
NSD descriptor for /dev/sdba created by GPFS Wed May 11 17:56:17 2005
Under Linux, this discrepancy can be seen by comparing the output between
mmlsnsd -f <device> -m
<-- output omitted due to excessive length
and
[root@gpfs01 gpfslpp]# ps -ef | grep mmfsd | grep -v grep
root 19625 19510 0 May11 ? 00:00:40 /usr/lpp/mmfs/bin//mmfsd
[root@gpfs01 gpfslpp]# lsof -p 19625
<-- output omitted due to excessive length
precreate all files (empty) on one node before all of the other nodes open/access their files
have each node create its file in its own private directory
When using large pSeries clusters with the HPS (i.e., "federation") or SP (i.e., "colony") switch, VSD provides a more efficient switch protocol than TCP/IP. But be sure to set the following AIX tuning parameters appropriately for GPFS. Set the LTG size to be >= the GPFS blocksize; e.g., if the blocksize is 1024K, then increase the LTG size to 1024K.
requires AIX 5.2 or later
early AIX 5.2 releases may require patch(?)
default = 128K
Set the buddy buffer size to be >= the GPFS blocksize
e.g., if the blocksize is 1024K, then set the buddy buffer size to 1024K
modify via smitty
default = 256K
Consistent GID/UID
Queue Depth
What is the queue depth?
Storage controllers can process up to a maximum number of concurrent I/O operation requests, sometimes called the maximum command queue depth (MQD) DS5300: MQD = 4096 requests (plus a small number of active requests)
up to 2048 requests per RAID controller
up to 2048 requests per port
up to the maximum allowed on a RAID controller
When the MQD has been reached, the controller will respond with a "queue full" status until some number of active requests have been processed and there is room for new requests. Storage adapter (e.g., HBA) drivers provide a device queue depth (DQD) parameter controlling the number of I/O requests submitted to a disk device on a given host. DQD sets the number of I/O requests per device (e.g., sd<char> or hdisk<int>).
How it works
Parameters:
  NN = Number of nodes submitting IOPs to a storage controller
  NLUN = Number of LUNs per node
For reliable operation, set DQD such that MQD > NN * NLUN * DQD
This formula ignores the fact that Linux may break a GPFS "packet" into several transactions. This is especially true for larger block sizes. This only makes the problem worse!
Scratch Paper: Consider a cluster with 128 nodes and a DS5300 (MQD = 4096).
  Set DQD = 1 and let NLUN = 24: NN * NLUN * DQD = 128 * 24 * 1 = 3072 < MQD = 4096: OK!
  Set DQD = 1 and let NLUN = 32: NN * NLUN * DQD = 128 * 32 * 1 = 4096, which is not < MQD = 4096: Ouch!
(See the small calculation sketch below.)
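A small shell sketch of the same arithmetic (values are the ones from the scratch paper above); it computes the largest whole-number DQD that keeps the outstanding-request total under the controller's MQD.

MQD=4096; NN=128; NLUN=24
DQD_MAX=$(( MQD / (NN * NLUN) ))
echo "max safe DQD = $DQD_MAX; outstanding requests = $(( NN * NLUN * DQD_MAX )) of $MQD"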
The following pages provide an actual example of installing and configuring GPFS under Linux using 4 x3650 NSD servers and a DCS9550 storage controller and disk enclosures. The steps for doing this under AIX are very similar; differences are explained in the annotations. This example can be used as a hands on guide for a lab exercise. Note the following:
red arial font is used for annotations blue courier font is used to highlight commands and parameters black courier font is used for screen text COMMENT: This example is based on GPFS 3.1, but the steps for GPFS 3.2 are nearly identical. Key differences are highlighted in context.
Lab Exercise
Install GPFS from media
if using Linux, build portability layer
experiment with "mmls" and "mmch" commands
examine /var/adm/ras/mmfs.log.<extension>
examine /var/mmfs; look at the /var/mmfs/gen/mmsdrfs file
run dd or other benchmark tests
monitor performance using iostat, vmstat, SMclient
  iostat and vmstat are not installed by default in Linux
  you must have a Windows or AIX client to run SMclient
GPFS v3.3
Figure: x3650-M2 nodes 01..16, all SAN clients with GbE, attached to the DDN couplet and its 60-disk trays.
Linux
  Distribution: CentOS 5.4*    Kernel: 2.6.18-164.11.1.el5
Storage Configuration
  DDN couplet: host connections 16 x FC8, drive connections 40 x SAS
  20 x SSD: 4 x 4+P RAID 5, 60 GB per SSD
  224 x SAS: 450 GB/disk, 15 Krpm, 56 x 4+P RAID 5
FOOTNOTE: * Since this is CentOS and not RHEL, it's necessary to create a configuration file as follows so that the portability layer build procedures will work. [root@node-01]# echo "Red Hat Enterprise Linux Server release 5.4 (Tikanga)" > /etc/redhat-release
GPFS v3.3
a. Create cluster b. Declare client and server licenses c. Change global GPFS parameters and start the GPFS daemon d. Create the NSDs e. Create and Mount the file system
GPFS v3.3
2. Copy base version RPMs to the installation directory and extract the RPMs on all nodes in the GPFS cluster.
Sample RPM names: gpfs.base-3.3.0-0.x86_64.rpm gpfs.docs-3.3.0-0.noarch.rpm gpfs.docs... contains the man pages. gpfs.gpl-3.3.0-0.noarch.rpm It is not necessary to install the GUI. gpfs.gui-3.3.0-0.x86_64.rpm gpfs.msg.en_US-3.3.0-0.noarch.rpm
3. Download the latest update package. This comes as a tar/gzip file from
https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/home.html gpfs-3.3.0-4.x86_64.update.tar.gz This file contains a different version of the same RPMs as the base version.
4. Copy this file to the installation directory, then gunzip/untar this file, and extract the RPMs on all nodes in the GPFS cluster.
Example installation directory: /gpfs_install/gpfs_3.3.0.4
The steps for doing this are more thoroughly documented in chapter 5 of the GPFS Concepts, Planning and Installation Guide The steps for installing the GPFS code under AIX and Windows are also documented in this guide. It can be found at
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfs_com_faq.html
This link may not take you directly to the current GPFS FAQ, but by drilling down, you can get there.
GPFS v3.3
6. Copy the portability rpm to all nodes and extract it. 7. Warnings and Caveats
If a cluster has mixed architectures and/or kernel levels, it is necessary to build a portability rpm for each instance and copy it to like nodes. Required Linux patches for GPFS can be found at: http://www.ibm.com/developerworks/opensource/
GPFS v3.3
Administrative domain spans all nodes. (i.e., the traditional security model)
GPFS v3.3
This is a common and routine message. GPFS uses ssh and scp to copy configuration information to all of the nodes asynchronously; the node on which the command is executed may complete before all of the other nodes are ready.
mmcrcluster parameters
  -n: list of nodes to be included in the cluster
  -p: primary GPFS cluster configuration server node
  -s: secondary GPFS cluster configuration server node
  -R: remote copy command (e.g., rcp or scp)
  -r: remote shell command (e.g., rsh or ssh)
(See the example below.)
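A hedged example using the parameters listed above; the node file name nodes.lst and its contents are assumptions, with each line naming a node and, optionally, its quorum/manager designations.

# nodes.lst (illustrative):
#   node-01:quorum-manager
#   node-02:quorum-manager
#   node-03:quorum-manager
#   node-04
#   ...
mmcrcluster -n nodes.lst -p node-01 -s node-02 -r /usr/bin/ssh -R /usr/bin/scp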
GPFS v3.3
GPFS cluster configuration servers:
-----------------------------------
  Primary server:    node-01
  Secondary server:  node-02

 Node  Daemon node name   IP address     Admin node name   Designation
-------------------------------------------------------------------------
   1   node-01            172.31.1.200   node-01           quorum-manager
   2   node-02            172.31.1.201   node-02           quorum-manager
   3   node-03            172.31.1.202   node-03           quorum-manager
   4   node-04            172.31.1.203   node-04
   5   node-05            172.31.1.204   node-05
   6   node-06            172.31.1.205   node-06
   7   node-07            172.31.1.206   node-07
   8   node-08            172.31.1.207   node-08
   9   node-09            172.31.1.210   node-09
  10   node-10            172.31.1.211   node-10
  11   node-11            172.31.1.212   node-11
  12   node-12            172.31.1.213   node-12
  13   node-13            172.31.1.214   node-13
  14   node-14            172.31.1.215   node-14
  15   node-15            172.31.1.216   node-15
  16   node-16            172.31.1.217   node-16
GPFS v3.3
[root@node-01 GPFS_install]# mmchlicense client --accept -N license_client.lst

The following nodes will be designated as possessing GPFS client licenses:
    node-04
    node-05
    node-06
    node-07
    node-08
    node-09
    node-10
    node-11
    node-12
    node-13
    node-14
    node-15
    node-16
mmchlicense: Command successfully completed
mmchlicense: Propagating the cluster configuration data to all affected nodes.
This is an asynchronous process.

mmchlicense parameters
  server: server license type
  client: client license type
  --accept: suppress the license prompt (implies you accept the license terms)
  -N: list of nodes for a given license type
It is necessary to explicitly declare both license types.
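For completeness, a hedged sketch of the matching server-license step for the three quorum-manager nodes shown in the cluster listing; a node list file could be used instead of the comma-separated names.

mmchlicense server --accept -N node-01,node-02,node-03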
GPFS v3.3
Summary information
---------------------
Number of nodes defined in the cluster:                      16
Number of nodes with server license designation:              3
Number of nodes with client license designation:             13
Number of nodes still requiring server license designation:   0
Number of nodes still requiring client license designation:   0
mmlslicense parameters -L: Displays the license type for each node, using an * to designate nodes with licenses out of compliance.
[root@node-01 GPFS_install]# mmlslicense -L
 Node name   Required license   Designated license
-----------------------------------------------------
 node-01     server             server
 node-02     server             server
 node-03     server             client   *
 node-04     client             client
 node-05     client             none     *

 Summary information
---------------------
Number of nodes defined in the cluster:                       5
Number of nodes with server license designation:              2
Number of nodes with client license designation:              2
Number of nodes still requiring server license designation:   1
Number of nodes still requiring client license designation:   1
GPFS v3.3
[root@node-01 GPFS_install]# mmcrnsd -F disk.lst -v no
mmcrnsd: Processing disk dm-3
mmcrnsd: Processing disk dm-4
mmcrnsd: Processing disk dm-5
mmcrnsd: Processing disk dm-6
mmcrnsd: Processing disk dm-7
mmcrnsd: Processing disk dm-8
mmcrnsd: Processing disk dm-9
mmcrnsd: Processing disk dm-10
mmcrnsd: Processing disk dm-11
mmcrnsd: Processing disk dm-12
mmcrnsd: Processing disk dm-13
mmcrnsd: Processing disk dm-14
mmcrnsd: Propagating the cluster configuration data to all affected nodes.
This is an asynchronous process.

mmcrnsd parameters
  -F: name of the NSD specification file (n.b., the file is changed by this command; keep a backup!)
  -v: check whether the disk is part of an existing GPFS file system or ever had a GPFS file system on it
      (n.b., if it does/did and the parameter is yes, then mmcrnsd will not create it as a new NSD)
GPFS v3.3
# dm-7:::dataOnly::sas4
sas4:::dataOnly:-1::
# dm-8:::dataOnly::sas5
sas5:::dataOnly:-1::
# dm-9:::dataOnly::sas6
sas6:::dataOnly:-1::
# dm-10:::dataOnly::sas7
sas7:::dataOnly:-1::
# dm-11:::dataOnly::sas8
sas8:::dataOnly:-1::
# dm-12:::dataOnly::sas9
sas9:::dataOnly:-1::
# dm-13:::dataOnly::sas10
sas10:::dataOnly:-1::
# dm-14:::dataOnly::sas11
sas11:::dataOnly:-1::
GPFS v3.3
 Disk name   NSD volume ID      Device       Devtype   Node name   Remarks
-----------------------------------------------------------------------------
 sas10       AC1F01C84B6C5755   /dev/dm-13   dmm       node-01
 sas11       AC1F01C84B6C5756   /dev/dm-14   dmm       node-01
 sas4        AC1F01C84B6C574F   /dev/dm-7    dmm       node-01
 sas5        AC1F01C84B6C5750   /dev/dm-8    dmm       node-01
 sas6        AC1F01C84B6C5751   /dev/dm-9    dmm       node-01
 sas7        AC1F01C84B6C5752   /dev/dm-10   dmm       node-01
 sas8        AC1F01C84B6C5753   /dev/dm-11   dmm       node-01
 sas9        AC1F01C84B6C5754   /dev/dm-12   dmm       node-01
 ssd0        AC1F01C84B6C574B   /dev/dm-3    dmm       node-01
 ssd1        AC1F01C84B6C574C   /dev/dm-4    dmm       node-01
 ssd2        AC1F01C84B6C574D   /dev/dm-5    dmm       node-01
 ssd3        AC1F01C84B6C574E   /dev/dm-6    dmm       node-01

mmlsnsd parameters
  -X: list extended NSD information
Since GPFS is configured with a SAN topology in this example, the node names are not unique. In a LAN configuration, there is one line for each LUN and each node where it is mounted.
 File system   Disk name   NSD servers
---------------------------------------------------------------------------
 gpfs1         sas10       (directly attached)
 gpfs1         sas11       (directly attached)
 gpfs1         sas4        (directly attached)
 gpfs1         sas5        (directly attached)
 gpfs1         sas6        (directly attached)
 gpfs1         sas7        (directly attached)
 gpfs1         sas8        (directly attached)
 gpfs1         sas9        (directly attached)
 gpfs1         ssd0        (directly attached)
 gpfs1         ssd1        (directly attached)
 gpfs1         ssd2        (directly attached)
 gpfs1         ssd3        (directly attached)

By omitting the -X parameter, a different view is presented. For example, if a file system exists, it shows the LUN to file system mapping. If GPFS used a LAN topology, it would show the primary and backup NSD servers.
GPFS v3.3
COMMENT: Do not forget to set the -n parameter. Since it provides an estimate of the number of nodes that will mount the file system, try to estimate future growth without wildly overestimating. While it can be off quite a bit with minimal impact, after it crosses a certain threshold performance can be severely impacted (e.g., when it is off by an order of magnitude and the file system is over 70% capacity), and this parameter cannot be easily changed. If you configure GPFS with a SAN topology on a cluster that you anticipate will exceed 32 nodes, seek technical assistance from IBM. (A hedged example follows.)
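A hedged example of a file system creation command for this cluster; the mount point, block size, and disk file are illustrative, and -n 32 leaves room for the 16-node cluster to double.

mmcrfs /gpfs1 gpfs1 -F disk.lst -B 1024K -n 32 -A yes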
GPFS v3.3
[root@node-01 GPFS_install]# dir /gpfs1
total 4194304
-rw-r--r-- 1 root root 4294967296 Feb 5 13:01 buggs_bunny
-rw-r--r-- 1 root root          0 Feb 5 12:51 test_file
[root@node-01 GPFS_install]# cat /etc/fstab
/dev/VolGroup00/LogVol00   /          ext3     defaults                              1 1
LABEL=/boot                /boot      ext3     defaults                              1 2
tmpfs                      /dev/shm   tmpfs    defaults                              0 0
devpts                     /dev/pts   devpts   gid=5,mode=620                        0 0
sysfs                      /sys       sysfs    defaults                              0 0
proc                       /proc      proc     defaults                              0 0
/dev/VolGroup00/LogVol01   swap       swap     defaults                              0 0
/dev/gpfs1                 /gpfs1     gpfs     rw,mtime,atime,dev=gpfs1,autostart    0 0
GPFS v3.3
Don't jump!
It's easier the second time!
Properly deleting the file system ensures that the file system descriptors are deleted from the disks so that they will not create issues upon a subsequent file system creation attempt. Properly deleting the NSDs ensures that the NSD descriptors are deleted so that they will not create issues upon a subsequent NSD creation attempt.
mmshutdown -a
mmdelnode -a
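A hedged sketch of the full teardown order implied above, using the device and NSD names from this example: unmount and delete the file system, delete the NSDs, then shut down GPFS and delete the cluster.

mmumount gpfs1 -a          # unmount the file system on all nodes
mmdelfs gpfs1              # delete the file system (removes the file system descriptors)
mmdelnsd "ssd0;ssd1;ssd2;ssd3;sas4;sas5;sas6;sas7;sas8;sas9;sas10;sas11"   # delete the NSDs
mmshutdown -a              # stop the GPFS daemons on all nodes
mmdelnode -a               # delete the cluster definition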
GPFS v3.3
SAS Drives
.... partial listing ..... LUN1 (360001ff08000a0000000001b8ba70001) dm-4 DDN,SFA 10000 [size=184G][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 1:0:0:2 sdp 8:240 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 2:0:0:2 sdd 8:48 [active][ready] LUN0 (360001ff08000a0000000001a8ba60000) dm-3 DDN,SFA 10000 [size=184G][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 2:0:0:1 sdc 8:32 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 1:0:0:1 sdo 8:224 [active][ready]
SSD Drives
GPFS v3.3
The following pages examine newer features that are discussed in the Advanced Administration Guide.
Beginning with version 3.1 (and moving forward), GPFS is exploiting existing features and creating new features that make it more than an HPC file system... GPFS is becoming a general purpose clustered file system where HPC is a key and pervasive feature.
The following pages examine some of the new (or newly exploited) GPFS features making it more suitable as a general purpose file system. Today, these features include ILM with Integrated HSM Robust NFS/CIFS support Scale-out File System (SoFS) - TBD Storage Virtualization Disaster Recovery Snapshots - TBD GPFS SNMP Support - TBD
Tier-1
Performance Optimized Disk e.g., FC, SAS disk Scratch Space
Tier-2
Capacity Optimized e.g., SATA Infrequently used files
Tier-3
Local tape libraries
Tier-4
Remote tape libraries
System
Storage Network
Tape
Comments
One global name space across pools of independent Storage Files in the same directory can be in different pools
Gold
Silver
Bronze
Storage Pools
ILM Manages sets of storage called "storage pools" What is a storage pool?
A named subset of disks (and, with appropriate additional software such as HPSS, tape) within the context of GPFS. Each file is assigned to a storage pool based upon policy rules: placement policies (where to place files upon creation), migration policies (moving files from one pool to another), and deletion policies (removing files from the storage system).
GPFS Filesets
Side effects:
Unlinked filesets can confuse programs that scan the file system (e.g., incremental backup programs).
Moving and linking between filesets is not allowed, in keeping with their being like little file systems.
A lab with insufficient tape BW was forced to use DHL to move 200 TB of data on disk!
Never underestimate the BW in a pickup load of magnetic tape! ... or a cargo plane for that matter.
Participants
HPSS Collaboration member NERSC/Lawrence Berkeley Lab IBM Research, Almaden Lab IBM GPFS Product Development in Poughkeepsie NY IBM HPSS Development and Support in Houston TX
What is HPSS?
HPSS
High Performance Storage System
Figure: HPSS core server with LAN-attached metadata disks, client computers, disk arrays, movers, and tape libraries.
HPSS is a disk and tape hierarchical storage system with a cluster architecture similar in many ways to the GPFS architecture. HPSS can be used alone as a cluster hierarchical storage system or as the tape component of GPFS. Versatile native HPSS interfaces:
Traditional HPSS APIs
Linux file system interface
New GridFTP interface available
A rugged DB2 metadata engine assures reliability and quick recovery.
Like GPFS, HPSS supports horizontal scaling by adding disks, tape libraries, movers, and core servers to:
  10s of petabytes
  100s of millions of files
  gigabytes per second
GPFS/HPSS is software that connects GPFS and HPSS together under the GPFS ILM policy framework.
GPFS/HPSS agents (processes and daemons) run on the GPFS Session Node and I/O Manager Nodes.
GPFS/HPSS uses DB2 to contain a reference table that maps between GPFS file system objects and HPSS storage objects.
GPFS/HPSS is distributed with and supported by HPSS.
GPFS NSD Nodes and HPSS Movers can share the same physical nodes
Functionally, HPSS will use the ILM policy lists from GPFS in order to move data between disk and tape.
Initial Placement
RULE 'SlowDBase' SET STGPOOL 'sata' FOR FILESET('dbase') WHERE NAME LIKE '%.data'
RULE 'SlowScratch' SET STGPOOL 'sata' FOR FILESET('scratch') WHERE NAME LIKE '%.mpg'
RULE 'default' SET STGPOOL 'system'
Rule name
Qualifiers
Movement by Age
RULE 'MigData' MIGRATE FROM POOL 'system' THRESHOLD(80,78)
     WEIGHT(TIME_SINCE_LAST_ACCESS) TO POOL 'sata' FOR FILESET('data')
RULE 'HsmData' MIGRATE FROM POOL 'sata' THRESHOLD(95,80)              <- rule to move files to HPSS
     WEIGHT(TIME_SINCE_LAST_ACCESS) TO POOL 'hsm' FOR FILESET('data')
RULE 'Mig2System' MIGRATE FROM POOL 'sata' WEIGHT(ACCESS_TIME) TO POOL 'system' LIMIT(85)
     FOR FILESET('user','root') WHERE DAYS_SINCE_LAST_ACCESS_IS_LESS_THAN( 2 )
Lock in place
RULE 'ExcDBase' EXCLUDE FOR FILESET('dbase')
Life expiration
RULE 'DelScratch' DELETE FROM POOL 'sata' FOR FILESET('scratch') WHERE DAYS_SINCE_LAST_ACCESS_IS_MORE_THAN( 90 )
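A hedged sketch of putting such rules into effect (the policy file name policy.rules is an assumption): placement rules are installed with mmchpolicy, and migration/deletion rules are typically evaluated by mmapplypolicy.

mmchpolicy gpfs1 policy.rules -I test    # validate the rules without installing them
mmchpolicy gpfs1 policy.rules            # install the policy for the file system
mmapplypolicy gpfs1 -P policy.rules      # evaluate migration/deletion rules now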
Figure: four x3650 NSD servers (2u, 4-way, 4 GB RAM, each with a dual-port FC4 HBA, dual-port TbE, and GbE; HBA peak < 780 MB/s, sustained < 700 MB/s) attach to a DCS9550 couplet (host and drive ports) with 10 disk trays (16-bay 3U chassis); TS1120 tape drives TS1120-01 through TS1120-07 hang off the HPSS mover path (peak < 200 MB/s, sustained < 100 MB/s).
In terms of sustained rates, assuming a well-designed I/O application in a production environment, applications will be able to draw up to 450 MB/s per server.
ANALYSIS
Servers: 4 x x3650 (2 dual-core sockets, at least 4 GB RAM, 2 dual-port 4 Gb/s HBAs); 380 MB/s per port
Disk: 35 LUNs, 8+P RAID sets
Sustained BW: write < 2.6 GB/s, read < 2.2 GB/s
Network: 1 GbE, Ethernet switch
Disk capacity: raw 522 TB, usable 422 TB
Tape capacity: 2 PB
Aggregate FS BW: 6.6 GB/s
Application I/O BW: 5.2 GB/s
Tape BW: 1.2 GB/s (requires 1.2 GB/s of FS BW)
Figure: x3650-13 and x3650-14 with a DS3200 (12 disks, RAID 10), connected via 10 GbE and 4 Gb/s FC networks to tape drives TS1120-06 through TS1120-24.
Tape cartridges: 700 GB per cartridge; minimum needed: 2860; maximum available: 2997
WARNING: If any more tape drives are added to this configuration without increasing the number of servers, it will be necessary to add a SAN switch for the tape drives.
Tivoli Storage Manager (TSM) is a comprehensive software suite that manages storage. It provides
Backup / restore
Archive / retrieve
Disaster recovery
Database & application protection
Space management (HSM)
Bare machine recovery
Continuous data protection
Content management
It is a client/server design with separate server products and client products implementing this list of functions.
TSM Architecture
Figure: the administration user interface, TSM clients, the TSM server, and the TSM storage pools.
COMMENT It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.
TSM Archive
Figure: the TSM client archives data to, and retrieves data from, the TSM server, whose storage pools include disk, copy disk, on-site tape, DVD/CD, optical, tape copied offsite, and other media.
Archive Features
It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.
Long-term storage
Point-in-time copy
Retention period
Policy managed
Index archives with descriptive metadata to expedite locating historical information
Allows focus to be placed on active data
  Recover only active data
  Reduce backup time by focusing on active files only
TSM Backup
Figure: the TSM client backs up data to, and restores data from, the TSM server, whose storage pools include disk, copy disk, on-site tape, DVD/CD, optical, tape copied offsite, and other media.
Backup Features
Progressive incremental backup
Backup only new/changed files, avoiding wasteful full backups
Data tracked at the file level
Accurately restores files to a point in time
It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.
Adaptive subfile differencing
Volume level
Multiple versions kept
Policy managed
System-assisted restore
Automated scheduling
COMMENT: For TSM, a recommended best practice is to explicitly backup archived data.
Figure: the policy daemon (1) monitors the policy file and (2) starts the policy engine, which (3) scans the file system and (4) uses the policy rules to (5) generate a candidate list and (6) start dsmmigrate, which (8) migrates the data to the TSM server (HT/LT = high/low threshold).
The Process
GPFS policy daemon monitors HT/LT based on enabled policy. Policy daemon starts policy engine. Policy engine scans file system and generates candidate list based on enabled migration policy. dsmmigrate is called and migrates all files in candidate list to the TSM server.
Figure: NSD servers (x3650 M2, 8 cores, 6 DIMMs; NSD Server-03 and NSD Server-04 shown), each with GbE, a 4xDDR IB HCA, and 2xFC8, connect to the IB LAN switch and, via FC8, to the storage controller host ports (5 disk trays behind the drive ports); an x3650-05 node with GbE, IB 4xDDR, and 2xFC8 connects through an FC switch to a TS3500-L53/D53 tape library with TS1040-01 through TS1040-10 (LTO-4) drives attached via FC4.
4xDDR IB HCA (Host Channel Adapter): peak data rate per HCA < 1500 MB/s (requires RDMA)
FC8 (single-port 8 Gbit/s Fibre Channel): peak data rate with 2 FC8 HBAs < 1500 MB/s
DCS9900 performance: streaming data rate < 5000 MB/s; noncached IOP rate < 40,000+ IOP/s
In a production system there would normally be 2 TSM client/server systems; the active one and a passive one for redundancy.
TS3500-L53/D53
  5 frames: 1 x L53, 4 x D53
  800 GB per cartridge uncompressed; 2000 cartridges
  10 x TS1040 drives, at most 120 MB/s uncompressed each
  aggregate capacity < 1.6 PB; aggregate data rate < 1.0 GB/s
Figure: a 42u frame containing an IB switch, NSD Server-01 through NSD Server-08, Active TSM-01 and Passive TSM-02 nodes, EXP5000 #1 and EXP5000 #2 drawers, and FC8 connections.
COMMENTS
Tier 1: scratch storage used for application processing.
Tier 2: archive storage indirectly accessed by applications.
Tier 3: archive/backup storage indirectly accessed by applications.
Footnotes
The passive TSM client/server is a "hot spare" backup for the active TSM client/server.
It is assumed that the NSD server and TSM client/server nodes are x3650 M2. Alternatively, the P6-p520 could be used instead. Likewise, the IB LAN could be replaced by TbE, where each server has a channel-bonded 2xTbE.
There is up to 2.5 GB/s of unused bandwidth in the tier 2 storage. If applications directly access this storage to create data, then additional tape bandwidth is needed to archive and/or back up this data. This will require more TSM client/server nodes, which means creating a 2nd TSM repository, or selecting a more powerful node that can handle the increased bandwidth.
Generally, the archive rate = the data creation rate, which is assumed to be 80% of the write rate for this example. (n.b., not all written data is retained)
Tier 3 - archive/backup: capacity 1.6 PB; ILM bandwidth: receive < 0.8 GB/s, restore < 0.2 GB/s
Tier 1 - scratch: capacity < 58 TB (128 x 450 GB, 15 Krpm FC drives); application bandwidth: write < 1.0 GB/s, read < 2.0 GB/s; ILM bandwidth: transfer < 1.0 GB/s, restore < 0.5 GB/s
Tier 2 - archive: capacity < 600 TB (600 x 1 TB SATA drives); ILM bandwidth: receive from tier 1 < 1.0 GB/s, transfer to tier 3 < 1.0 GB/s, restore to tier 1 < 0.5 GB/s; unused bandwidth < 2.5 GB/s
Capacity
HPSS is sized and priced for systems with over 1 PB of storage. TSM has an upper limit of 1 PB per TSM instance.
Backup
HPSS integrates backup into the archive function. TSM requires a separate backup procedure in addition to the archive function.
Parallelism
HPSS is designed as a parallel archive tool; it supports multiple tape servers (i.e., "tape movers") per HPSS instance. TSM is not parallel; to scale TSM beyond a single server requires multiple TSM instances.
Metadata management
HPSS requires a separate metadata subsystem (e.g., 2 "core servers" plus external disk storage for its metadata database). TSM integrates the metadata operations into its server operations.
Market segment
HPSS was designed for the high end HPC market by a consortium of HPC labs. TSM was designed for commercial applications, but is commonly adapted to scientific and technical environments.
NFS/GPFS Integration
Yesterday
NFS was a
Ouch!
Today...
Improved performance, robustness, and server farm features for NFS
Clustered NFS (CNFS)
Provides high-availability NFS server functionality using GPFS; available only under Linux
  RHEL and SuSE, including Linux on pSeries (LoP)
Serves almost any NFS client (examples: AIX, Linux, Solaris, etc.)
CNFS
"A Picture is Worth a Thousand Words"
Figure: NFS Client 1 through NFS Client N are served by CNFS nodes within the GPFS cluster (mmfsd node 2 through node 12), each running nfsd and the monitor utility, with SAN-attached disk/tape servers behind them.
CNFS
"A Picture is Worth a Thousand Words"
Figure: Cluster #1 (legacy blades and rack-optimized nodes with an internal Myrinet LAN and an external Ethernet LAN) and Cluster #2 mount both the legacy NFS storage (storage frames NFS #1 and NFS #2) and the new GPFS storage.
This legacy system retains its original Myrinet network for message passing, but this cluster is now part of the GPFS cluster and natively mounts the GPFS file system over Ethernet. (It can still access the legacy NFS file system.)
Ethernet Fabric (with DNS Round Robin Load Balancing Applied to NFS clients)
Figure: the GPFS cluster's CNFS nodes (mmfsd node 2 through node 12, each running nfsd and the monitor utility, SAN-attached to the disk/tape servers) serve the NFS clients, while Cluster #3 (HS-21 Xeon blades) consists of GPFS clients that natively mount the GPFS file system.
TBD
The core of GPFS continues to operate on Unix UID/GID values. Windows GPFS nodes perform the task of mapping to Windows SIDs: explicit Unix-Windows ID maps are defined in Active Directory; implicit (default) maps for Windows SIDs are created from a reserved range of UID/GID values; and unmapped Unix IDs are cast into a foreign domain for Windows. Explicit maps persist only in the Active Directory. Implicit maps persist in the file system. (So how did we do on that explainable principle?)
CNFS Details
Monitoring
Every node in the CNFS cluster runs an NFS utility that monitors GPFS, NFS, and networking components on the node. Upon failure detection and based on your configuration, the monitoring utility might invoke a failover.
Failover
As part of GPFS recovery, the CNFS cluster failover mechanism is invoked. It transfers the NFS serving load that was served by the failing node to another node in the CNFS cluster. Failover is done using recovery groups to help choose the preferred node for takeover. The failover mechanism is based on IP address failover. In addition, it guarantees NFS lock (NLM) recovery.
Load balancing
CNFS supports failover of all of a node's load together (all of its NFS IP addresses) as one unit to another node. However, if no locks are outstanding, individual IP addresses can be moved to other nodes for load balancing purposes. CNFS relies on round-robin DNS for load balancing of NFS clients among the NFS cluster nodes.
Storage Virtualization
An Abstract Example with GPFS Using Best Practices
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance
1. Use GPFS ILM, placing each disk system into its own storage pool
2. Different disk systems are segregated into separate file systems
Segregation is not required, but is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.
Figure: six GPFS clients, each seeing NSDs nsd1 through nsd36, access the storage through six NSD servers (the GPFS NSD layer); in this example each disk system is served by its own set of NSD servers.
Storage Virtualization
Another Abstract Example with GPFS Using Best Practices
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance
1. Use GPFS ILM, placing each disk system into its own storage pool
2. Different disk systems are segregated into separate file systems
Segregation is not required, but is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.
Figure: the same six GPFS clients (each seeing NSDs nsd1 through nsd36) access the storage through only two NSD servers. This example shows that it is not necessary to segregate storage systems between servers.
Storage Virtualization
GPFS Can Provide Storage Virtualization to Non-GPFS Clusters
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance
1. Use GPFS ILM, placing each disk system into its own storage pool
2. Different disk systems are segregated into separate file systems
Segregation is not required, but is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.
Figure: NFS clients access the storage through GPFS NSD nodes that export nsd1 through nsd36; the LUNs are split into two groups (L1..L18 and L19..L36), with each NSD node primarily serving one group and backing up the other.
General Concept
Redundant storage technology is deployed that enables operational continuity in the event of a disaster or other unrecoverable error. This is achieved by maintaining duplicate copies of a data set at two different locations, each with a redundant storage system, and enabling the "other" storage system to take over responsibility. This is commonly called "disaster recovery" in the literature.
Maintaining Quorum
node-z1 NSD Server
Site C guarantees quorum under degraded operation; it does not participate in regular operations. Its disk contains only the file system descriptor information (descOnly) needed to maintain quorum. Site C is not required, but it improves your chances of surviving an outage automatically by 50%.
Q designates quorum nodes
Requirement
This infrastructure is deployed as a single GPFS cluster (n.b., can not use GPFS multi-cluster feature)
Figure: Site Z acts as the tiebreaker site (failure group 3), connected to the other sites over an Ethernet/IP network.
GPFS uses mirroring to synchronously copy user data and meta data to both sites (e.g., the write() call blocks until the data is copied to both sites).
minimizes the risk of permanent data loss
must provide sufficient BW between sites so regular I/O operations are not impeded
must have extra BW to allow quick recovery (n.b., restriping) following a failure
does not require SAN switches or SAN connectivity between sites
Figure: Site X (DS4800-X with EXP810 drawers, failure group 1, active) and Site Y (DS4800-Y with EXP810 drawers, failure group 2, active), both connected over the Ethernet/IP network.
metro mirroring: full synchronous copy requiring a response from the secondary storage device to continue; distance limited to metropolitan areas
global mirror: asynchronous copy with guaranteed in-order delivery and the ability to create a consistency group of LUNs that will be mirrored together; distance limited (but not as much as metro mirroring)
global copy: asynchronous copy with no guaranteed in-order delivery and no consistency groups; very long distances theoretically possible
Failover can be automated with scripts; failback procedures vary according to the cause and magnitude of the failure.
Deployed in an active/passive configuration: LUNs at the secondary site cannot be written to.
Requires inter-SAN connectivity and local SAN switches. This is a premium feature that must be licensed.
Maintaining Quorum
node-z1 NSD Server
Site C guarantees quorum under degraded operation; it does not participate in regular operations. Its disk contains only the file system descriptor information (descOnly) needed to maintain quorum. Site C is not required, but it improves your chances of surviving an outage automatically by 50%.
Requirement
This infrastructure is deployed as a single GPFS cluster (n.b., can not use GPFS multi-cluster feature)
Figure: the primary site (client nodes node-x1 and node-x2), the secondary site, and Site Z as the tiebreaker, connected over an Ethernet/IP network; Q designates quorum nodes.
Some Terms:
  primary/secondary storage system
  mirroring storage controller pair
  mirror FC connection
  primary/secondary logical drive
  mirrored logical drive pair
  mirror role
  role reversal
  write consistency group
  full synchronization
Figure: Site X (DS4800-X with EXP810 drawers, active) and Site Y (DS4800-Y with EXP810 drawers, passive), connected over the Ethernet/IP network.
Operational Procedures
Each logical drive in a mirrored pair presents itself to the local host(s) as a SCSI device (e.g., /dev/sdb). Write requests can be received only by the primary logical drive in a mirrored pair. Read requests can be received by both the primary and the secondary logical drive; reading secondary logical drives is primarily intended for administrative purposes. Two file systems are created for the mirrored storage controller pair, a primary and a secondary file system. During normal operation, applications access the primary file system. If operation of the primary file system is lost, the secondary site must go through a role reversal, and the secondary file system must be unmounted/mounted. Full synchronization is needed after an outage.
Snapshots
TBD
TBD
While these are simple, common sense things, they are easily overlooked, especially when you are working with legacy codes developed under different conditions and assumptions.
1. large record sequential order 2. large record strided order or small record sequential order 3. large records in random order or small records in strided order 4. issue hints when reading in small records in random order 5. small record random order without hints
NOTE: large records are >= GPFS block size small records are < GPFS block size (e.g., 2K to 16K)
Example illustrating how code can be rewritten to eliminate small record accesses. Suppose you are directly sorting a set of small, randomly or semi-randomly distributed records. Because the records are small, GPFS will perform poorly. Rewrite the sort as follows (a shell sketch follows the list):
divide the file into N subsets and assign each subset to a node
choose the subset size so that it can fit entirely within RAM
sort each subset (depending on the file size, a node may need to sort several subsets)
merge all of the sorted subsets together
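A minimal shell sketch of this approach, assuming newline-delimited records in a hypothetical file /gpfs1/input.dat and GNU coreutils split; each chunk is read and written with large sequential I/O, and the final merge is a single, mostly sequential pass.

mkdir -p /gpfs1/sort.$$/chunks /gpfs1/sort.$$/sorted
# Split into ~1 GB chunks on record (line) boundaries so each fits in RAM.
split -C 1024m /gpfs1/input.dat /gpfs1/sort.$$/chunks/part.
# Each node (or process) sorts its own subset of chunks.
for f in /gpfs1/sort.$$/chunks/part.*
do
    sort "$f" -o /gpfs1/sort.$$/sorted/$(basename "$f")
done
# Merge the pre-sorted chunks into the final output.
sort -m /gpfs1/sort.$$/sorted/part.* -o /gpfs1/output.dat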
General rule of thumb in CS textbooks read: 90% write: 10% But this generalization is more typical of commercial applications than technical HPC applications. For example, the ratio for many scientific applications is read: 60% to 70% write: 40% to 30% Therefore, plan accordingly.
Rule of thumb
Configure a file system to handle peak performance up to 3 or 4 standard deviations above the mean to avoid "gold plating". (John Watts, IBM) Programmers worried about performance will often over architect a system
Best Practice
Avoid mixing home and scratch directories under GPFS
Where NFS works well Where NFS is a challenge NFS vs. GPFS
One of GPFS's salient features is that it has a million knobs... One of GPFS's problems is that it has a million knobs...
Figure: a 45u rack with a DCS9900 couplet and DCS trays #01 through #10, and a 42u rack with an Ethernet switch (TbE/GbE) and servers #01 through #08.
3. Create the GPFS cluster
4. Start up the GPFS daemons
5. Create the logical disks (i.e., NSDs)
6. Create and mount the file system
An experienced sysadm can do this in as little as 5 to 10 minutes!
GPFS is a best-of-class product with good features, but it is not a "silver bullet". Without careful design, I/O can seriously degrade parallel efficiency (e.g., Amdahl's law). Good I/O performance requires hard work, careful design, and the intelligent use of GPFS. I/O is not the entire picture; improving I/O performance will uncover other bottlenecks.