
GPFS Best Practices

Programming, Configuration, Environment and Performance Perspectives Tutorial for GPFS versions 3.3 and earlier

Raymond L. Paden, Ph.D.


HPC Technical Architect IBM Deep Computing [email protected] 877-669-1853 "A supercomputer is a device for turning compute-bound problems into I/O-bound problems." Ken Batcher Version 17.2.0 13 Apr 10

Special Notices from IBM Legal


This presentation was produced in the United States. IBM may not offer the products, programs, services or features discussed herein in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the products, programs, services, and features available in your area.

Any reference to an IBM product, program, service or feature is not intended to state or imply that only IBM's product, program, service or feature may be used. Any functionally equivalent product, program, service or feature that does not infringe on any of IBM's intellectual property rights may be used instead of the IBM product, program, service or feature.

Information in this presentation concerning non-IBM products was obtained from the suppliers of these products, published announcement material or other publicly available sources. Sources for non-IBM list prices and performance numbers are taken from publicly available information including D.H. Brown, vendor announcements, vendor WWW Home Pages, SPEC Home Page, GPC (Graphics Processing Council) Home Page and TPC (Transaction Processing Performance Council) Home Page. IBM has not tested these products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this presentation. The furnishing of this presentation does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. Contact your local IBM office or IBM authorized reseller for the full text of a specific Statement of General Direction.

The information contained in this presentation has not been submitted to any formal IBM test and is distributed "AS IS". While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. The use of this information or the implementation of any techniques described herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. Customers attempting to adapt these techniques to their own environments do so at their own risk. IBM is not responsible for printing errors in this presentation that result in pricing or information inaccuracies.

The information contained in this presentation represents the current views of IBM on the issues discussed as of the date of publication. IBM cannot guarantee the accuracy of any information presented after the date of publication. IBM products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

Any performance data contained in this presentation was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements quoted in this presentation may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this presentation may have been estimated through extrapolation. Actual results may vary. Users of this presentation should verify the applicable data for their specific environment.

Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States and/or other countries. UNIX is a registered trademark in the United States and other countries licensed exclusively through The Open Group. LINUX is a registered trademark of Linus Torvalds. Intel and Pentium are registered trademarks and MMX, Itanium, Pentium II Xeon and Pentium III Xeon are trademarks of Intel Corporation in the United States and/or other countries. Other company, product and service names may be trademarks or service marks of others.

Author, Revisions and TBDs


Author: Raymond L. Paden
Date: 6 June 2008
Version: v15.1

TBDs:
Add SoFS example under NAS in Taxonomy
Add SoFS to section on NFS and CNFS
Add snapshots
GPFS SNMP support
Guidelines on where to use NFS
Pros and cons for using GPFS for home directories

OS Commands see pp.79-80 in Concepts, Planning, Installation Guide. OS Calls see pp.80-83 in Concepts, Planning, Installation Guide. GPFS Command Processing see pp. 83-84 in Concepts, Planning, Installation Guide. GPFS Port Usage see pp. 122-124 of Advanced Admin Guide

Abstract and Biographical Sketch


Abstract GPFS (General Parallel File System) is IBM's clustered/parallel file system commonly used for HPC and cluster applications. It has been generally available since 1998 giving it both maturity and market presence. This 2-day seminar is divided into 4 sessions, which are both flexible and dynamic. They examine GPFS's features, semantics, programming considerations, configuration procedures, tuning and optimization guidelines, best practices and environment. If supported hardware is available, it includes a "hands on" lab exercise or a live demonstration. A planning and design session can also be included to help customers deploy GPFS for their specific circumstances. Specific topics are emphasized on the basis of attendee interest. This seminar is delivered in a comfortable environment encouraging question and answer dialogue.

Biographical Sketch Dr. Ray Paden is currently an HPC Technical Architect with world wide scope in IBM's Deep Computing organization, a position he has held since June, 2000. His particular areas of focus include HPC storage systems, performance optimization and cluster design. Before joining IBM, Dr. Paden worked as software engineer doing systems programming and performance optimization for 6 years in the oil industry. He also served in the Computer Science Department at Andrews University for 13 years, including 4 years as department chair. He has a Ph.D. from the Illinois Institute of Technology in Computer Science. He has done research and published papers in the areas of parallel algorithms and combinatorial optimization, performance tuning, file systems, and computer education. He has served in various capacities on the planning committee for the Supercomputing conference since 2000. He is currently a member of ACM, IEEE and Sigma Xi. As a professor, he has won awards for excellence in both teaching and research. He has also received the Outstanding Innovation Award from IBM.

Sample 2 Day Agenda


Day 1
Session 1 (8:30 AM - 10:00 AM, 10:30 AM - NOON)
  Introduction: Parallel I/O, Clustered File Systems and a Cluster Storage Taxonomy
  Overview of GPFS and Various Design Motivations
  GPFS Architecture
Session 2 (1:30 PM - 3:00 PM, 3:30 PM - 5:00 PM)
  GPFS Organization and Topology
  GPFS Environment: servers, disk controllers, disk technology, networks
  Example GPFS Configurations
  GPFS Design Exercise: optional "paper and pencil" exercise

Day 2
Session 1 (8:30 AM - 10:00 AM, 10:30 AM - NOON)
  Review written exercise from previous day
  GPFS System Administration
  GPFS Configuration Example
  Optional GPFS lab exercise: install and configure GPFS
Session 2 (1:30 PM - 3:00 PM, 3:30 PM - 5:00 PM)
  Specific topics selected based on attendee interests; topics include
1. GPFS planning and design (intended for customers who have purchased GPFS)
2. Information Life Cycle Management (ILM) and HSM Product Integration
3. Clustered NFS (CNFS)
4. SoNAS and SoFS
5. Snapshots
6. Disaster Recovery
7. SNMP Support
8. Miscellaneous Best Practices
9. GPFS Roadmap (requires NDA)

COMMENT: The material in this slide set is detailed and comprehensive; it requires 3 full days to cover in its entirety (including the hands-on lab). However, this tutorial is generally covered in 2 days at customer sites by including only the material relevant to the customer.

Sample Test Configurations


Example #1 - Simple
[Diagram: two NSD servers (NSD Server-01 and NSD Server-02, each an x3650 M2 with 8 cores and 6 DIMMs) on a TbE Ethernet switch, each attached via 2 x FC4 to a DS3400 (Controller-A/Controller-B, 12 SAS or SATA disks) with an EXP3000 expansion drawer (ESM-A/ESM-B, 12 SAS or SATA disks).]

Example #2 - Elaborate
[Diagram: 16 GbE-attached clients (x3550-01 through x3550-16) and four NSD servers (NSD Server-01 through NSD Server-04, each with GbE, IB 4xDDR and 2xFC8) on an IB switch; the NSD servers drive five 60-disk drawers (SAS or SATA) through a dual-controller storage unit.]

COMMENT:
These diagrams are intended to illustrate the range of possibilities for configurations that could be used for a test system to do the lab exercise. There are many other possibilities, including the use of "internal" SCSI or SAS drives.

Ideally, this config needs 160 x SAS drives or 300 x SATA drives.
NOTE: Administrative GbE network not shown.


Tutorial Objectives
Conceptual understanding of GPFS
With a conceptual understanding and a man page, a sysadm can do anything!
theory

Practical understanding of how to use GPFS


GPFS integration with other products servers, disk controllers, networks, OSs "Hands on" introduction assumes appropriate HW resources are available
practical

Targeted Audience
system administrators systems and application programmers system architects computer center managers

Requirements
cluster experience in keeping with one of the previous backgrounds

An educated customer is a good customer!

1. Introduction

GPFS is a shared disk, parallel clustered file system.


Shared disk
all user and meta data are accessible from any disk to any node

Parallel Clustered

user data and metadata flows between all nodes and all disks in parallel 1 to 1000's of nodes under common rubric

[Diagram: compute nodes connected through a switching fabric (LAN, SAN, WAN) to a set of disks.]

Parallel I/O in a Cluster

User data and metadata flows between all nodes and all disks in parallel
Multiple tasks distributed over multiple nodes simultaneously access file data Multi-task applications access common files in parallel Files span multiple disks File system overhead operations are distributed and done in parallel Provides a consistent global name space across all nodes of the cluster

[Diagram: compute nodes connected through a switching fabric (LAN, SAN, WAN) to a set of disks.]

Parallel I/O in a Cluster

The promise of parallel I/O is increased performance and robustness in a cluster and it naturally maps to the architecture of a cluster.

The challenge of parallel I/O is that it is a more complex model of I/O to use and manage.

[Diagram: compute nodes connected through a switching fabric (LAN, SAN, WAN) to a set of disks.]

Textbook examples are great. But in practical terms, what is GPFS?

What is GPFS?

Something Old ... Something New

General Parallel File System


All of GPFS's rivals do some of these things, none of them do all of them!

General: supports a wide range of applications and configurations
Cluster: from large (4000+ nodes in a multi-cluster) to small (only 1 node) clusters
Parallel: user data and metadata flow between all nodes and all disks in parallel
HPC: supports high performance applications
Flexible: tuning parameters allow GPFS to be adapted to many environments
Capacity: from high (4+ PB) to low capacity (only 1 disk)
Global: works across multiple nodes, clusters and labs (i.e., LAN, SAN, WAN)
Heterogeneous:
  Native GPFS on AIX, Linux, Windows, as well as NFS and CIFS
  Works with almost any block storage device

Shared disk: all user and meta data are accessible from any disk to any node
RAS: reliability, accessibility, serviceability
Ease of use: GPFS is not a black box, yet it is relatively easy to use and manage
Basic file system features: POSIX API, journaling, both parallel and non-parallel access
Advanced features: ILM, integrated with tape, disaster recovery, SNMP, snapshots, robust NFS support, hints

What is GPFS?
Typical Example
Aggregate Performance and Capacity
Data rate: streaming rate < 5 GB/s, 4 KB transaction rate < 40,000 IOP/s
Usable capacity < 240 TB

[Diagram: 64 GbE-attached clients (x3550-01 through x3550-64) on an IB 4xDDR LAN, served by four NSD servers (NSD Server-01 through NSD Server-04, each with IB 4xDDR and 2xFC8) driving five 60-disk drawers through a dual-controller storage unit.]
LAN Configuration
Performance scales linearly in the number of storage servers
Add capacity without increasing the number of servers
Add performance by adding more servers and/or storage
Inexpensively scale out the number of clients

Though not shown, a cluster like this will generally include an administrative GbE network.

What is GPFS?
Another Typical Example
Aggregate Performance and Capacity
Data rate: streaming rate < 5 GB/s, 4 KB transaction rate < 40,000 IOP/s
Usable capacity < 240 TB

[Diagram: two P6-p595 servers (128 cores and 256 GB RAM each, with RIO IB 12xDDR drawer attachment) sharing an FC8 SAN that drives five 60-disk drawers through a dual-controller storage unit.]
SAN Configuration
Performance scales linearly in the number of servers
Add capacity without increasing the number of servers
Add performance by adding more servers and/or storage

What GPFS is Not

GPFS is not a client/server file system like NFS, CIFS (Samba) or AFS/DFS with a single file server.
GPFS nodes can be an NFS or CIFS server, but GPFS treats them like any other application.

client

client

client

client

LAN
file server
(e.g., NFS or Samba)

LAN metadata server

GPFS is not a SAN file system with dedicated metadata server.


GPFS can run in a SAN file system like mode, but it does not have a dedicated metadata server.

client

client

client SAN

client

GPFS avoids the bottlenecks introduced by centralized file and/or metadata servers.

What GPFS is Not

GPFS is not a niche file system for IBM system P products


Yesterday
GPFS was a parallel file system for IBM SP systems

Today

GPFS is a general purpose clustered parallel file system tunable for many workloads on many configurations.
Winterhawk

BlueGene/P iDataPlex

P6 p595

BladeCenter/H

Where GPFS Is Used Today


GPFS is a mature product with established market presence. It has been generally available since 1998 with research development starting in 1991. Applications include...
Aerospace and Automotive
Banking and Finance
Bio-informatics and Life Sciences
Defence
Digital Media
EDA (Electronic Design Automation)
General Business
National Labs
Petroleum
SMB (Small and Medium sized Business)
Universities
Weather Modeling

Where GPFS Is Used Today

LARGE Clusters
Smaller Number of Big Nodes ("Herd of Elephants"): e.g., 155 p575 nodes, 105 P6 p575 nodes
Larger Number of Small Nodes ("Army of Ants"): e.g., 3780 iDataPlex nodes

Where GPFS Is Used Today

small Clusters ... and even smaller Clusters

2. Cluster Storage Taxonomy


The following pages examine a taxonomy of file systems commonly used with clusters. They may or may not be a clustered file system and they support varying degrees of parallelism. They do not represent mutually exclusive choices.

Conventional I/O Asynchronous I/O Networked File Systems Network Attached Storage (NAS) Basic Clustered File Systems SAN File Systems Multi-component Clustered File Systems High Level Parallel I/O

Conventional I/O
Used generally for "local file systems"
the basic, "no frills, out of the box" file system

Supports POSIX I/O model Generally supports limited forms of parallelism


intra-node process parallelism disk level parallelism possible via striping not truly a parallel file system

Journal, extent based semantics


journaling (AKA logging): logging information about operations performed on the file system metadata as atomic transactions. In the event of a system failure, a file system is restored to a consistent state by replaying the log and applying log records for the appropriate transactions.
extent: a sequence of contiguous blocks allocated to a file as a unit, described by a triple consisting of <logical offset, length, physical> (see the sketch at the end of this slide)

If they are a native FS, they are integrated into the OS (e.g., caching done via VMM) Examples: ext3, JFS, NTFS, ReiserFS, XFS
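
To make the extent triple above concrete, here is a hypothetical C structure (an illustration only, not the on-disk layout of any particular file system; the field names are invented):

#include <stdint.h>

/* Hypothetical extent descriptor: one contiguous run of blocks owned by a file,
 * i.e., the <logical offset, length, physical> triple described above. */
struct extent {
    uint64_t logical_offset;   /* offset of the run within the file, in blocks  */
    uint64_t length;           /* number of contiguous blocks in the run        */
    uint64_t physical_start;   /* starting block address on the underlying disk */
};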

Asynchronous I/O
Abstractions allowing multiple threads/tasks to safely and simultaneously access a common file
Parallelism is available if it is supported in the base file system
Included in the POSIX 4 standard
  not necessarily supported on all Unix operating systems
  non-blocking I/O built on top of a base file system

Examples:
commonly available under real time operating systems
supported today on various "flavors" of standard Unix: AIX, Solaris, Linux (starting with 2.6)
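
As a minimal sketch of the POSIX (real-time) asynchronous I/O interface mentioned above, the fragment below issues a non-blocking write and later waits for it to complete; the file name and record size are arbitrary and error handling is trimmed (link with -lrt on Linux).

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[16384];
    struct aiocb cb;
    int fd = open("aio_demo.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;            /* target file descriptor         */
    cb.aio_buf    = buf;           /* user buffer                    */
    cb.aio_nbytes = sizeof(buf);   /* record size                    */
    cb.aio_offset = 0;             /* explicit offset, no seek used  */

    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* Overlap computation here, then wait for the write to finish. */
    const struct aiocb *const list[1] = { &cb };
    aio_suspend(list, 1, NULL);

    if (aio_error(&cb) != 0 || aio_return(&cb) != (ssize_t)sizeof(buf))
        fprintf(stderr, "asynchronous write failed or was short\n");

    close(fd);
    return 0;
}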

Networked File Systems


Disk access from remote nodes via network access
generally based on TCP/IP over Ethernet Useful for on-line interactive access (e.g., home directories)
Also classified as distributed file systems

NFS is ubiquitous in Unix/Linux environments


does not provide a genuinely parallel model of I/O
it is not cache coherent (will future versions like pNFS correct this?) parallel write requires O_SYNC and -noac options to be safe

poorer performance for HPC jobs, especially parallel I/O


write: only 90 MB/s on system capable of 400 MB/s (4 tasks) read: only 381 MB/s on system capable of 740 MB/s (16 tasks)

uses the POSIX I/O API, but not its semantics
traditional NFS configurations are limited by a "single server" bottleneck
while NFS is not designed for parallel file access, by placing restrictions on an application's file access and/or doing non-parallel I/O, it may be possible to get "good enough" performance
NFS clients are available for Windows, but the POSIX to NTFS mapping is awkward
GPFS provides a high availability version of NFS called Clustered NFS

CIFS is ubiquitous in Windows environments


Samba is a CIFS server available under Unix/Linux that maps a POSIX based file system to the Windows/NTFS model.

Networked File Systems


[Diagram: file clients on a LAN accessing a single file & metadata server, which reaches the storage controller (arrays A1 - A8) over a SAN fabric. Also classified as distributed file systems.]

COMMENT: Traditionally, a single NFS/CIFS file server manages both user data and metadata operations which "gates" performance/scaling and presents a single point of failure risk. Products (e.g., CNFS) are available that provide multiple server designs to avoid this issue.

Network Attached Storage (NAS)


Appliance Concept
Traditionally focused on the CIFS and/or NFS protocols Integrated HW/SW storage product
integrates servers, storage controllers, disks, networks, file system, protocol, etc. all into single product main advantage: "black box" design (i.e., ease of use at the expense of flexibility) not intended for high performance storage

Provides an NFS server and/or CIFS/Samba solution


these are server based products; they do not improve client access or operation may support other protocols (e.g., iSCSI, http)

Generally based on Ethernet LANs Is this just a subclass of the networked file systems level?

Examples
Netapps (also rebranded as IBM nSeries)
Provides excellent performance for IOPS and transaction processing workloads with favorable temporal locality.

Scale-out File System (SoFS, SoNAS)


An IBM product supporting CIFS, http, iSCSI, NFS, NSD (i.e., GPFS) protocols

Basic Clustered File Systems


Satisfies the definition of a clustered file system File access is parallel
supports POSIX API, but provides safe parallel file access semantics

File system overhead operations


n.b., no metadata servers

guarantees portability to other POSIX based file systems

file system overhead operations is distributed and done in parallel there are no single server bottlenecks

Common component architecture


commonly configured using separate file clients and file servers
this is common for reasons of economy; for many storage systems, it costs too much to have a separate storage controller for every node

some FS's allow a single component architecture where file clients and file servers are combined
yields very good scaling for asynchronous applications

file clients access file data through file servers via the LAN Example: GPFS (IBM), GFS (Sistina/Redhat), IBRIX Fusion

Basic Clustered File Systems


[Diagram: file clients on a LAN accessing two file servers, which reach the storage controller (arrays A1 - A8) over a SAN fabric.]

A SAN switch is optional.

File system overhead operations are distributed across the entire cluster and done in parallel; they are not concentrated in any one place. There is no single server bottleneck. User data and metadata flow between all nodes and all disks via the file servers.

SAN File Systems


File access is parallel
supports POSIX API, but provides safe parallel file access semantics
guarantees portability to other POSIX based file systems

File system overhead operations


it is NOT done in parallel single metadata server with a backup metadata server
metadata server is accessed via the LAN metadata server is a potential bottleneck, but it is not considered a limitation since these FS's are generally used for smaller clusters

Dual component architecture


file client/server and metadata server

All disks connected to all file client/server nodes via the SAN
file data accessed via the SAN, not the LAN
removes need for expensive LAN where high BW is required (e.g., IB, Myrinet)

inhibits scaling due to cost of FC Switch Tree (i.e., SAN)

Example: CXFS (SGI), SNFS (Quantum, formerly ADIC), QFS (Sun)


ideal for smaller numbers of nodes
SNFS scales to 50+ nodes CXFS scales up to 64+ nodes (appropriate for many-processor Altix systems)

SAN File Systems

[Diagram: SAN clients attached both to a LAN (for the metadata servers) and to a SAN that connects them directly to the storage controller (arrays A1 - A8).]

File system protocol processing is concentrated in the metadata server and is not done in parallel; all file client/server nodes must coordinate file access via the metadata server. There are generally no client-only nodes in this type of cluster, so large-scale growth is not required.

Multi-component Clustered File Systems


Satisfies the definition of a clustered file system File access is parallel
supports POSIX API, but provides safe parallel file access semantics
guarantees portability to other POSIX based file systems

File system overhead operations


Lustre: 1 metadata server per file system (with backup) accessed via LAN
potential bottleneck (deploy multiple file systems to work around it)
Will they improve this in the future?

Panasas: the "director blades" manages protocol


each "shelf" contains a director blade and 10 disks accessible via Ethernet this provides multiple metadata servers reducing contention

Multi-component architecture
Lustre: file clients, file servers, metadata server Panasas: file clients, director blade
director blade encapsulates file service, metadata service, storage controller operations

file clients access file data through file servers or director blades via the LAN Examples: Lustre, Panasas
Lustre: Linux only, Panasas: Linux and Windows. object oriented disks
Lustre emulates object oriented disks Panasas uses actual OO disks; user can only use Panasas disks
Do OO disks really add value to the FS? Other FS's efficiently accomplish the same thing at a higher level.

Multi-component Clustered File Systems


Lustre
[Diagram: file clients on a LAN accessing file servers, which reach the storage controller (arrays A1 - A8) over a SAN fabric; separate metadata servers on the LAN provide concentrated protocol management.]

Panasas
[Diagram: file clients on a LAN accessing director blades; each director blade encapsulates the metadata server, file server and storage controller functions for its own disks.]

While different in many ways, Lustre and Panasas are similar in that they both have concentrated file system overhead operations (i.e., protocol management). The Panasas design, however, scales the number of protocol managers proportionally to the number of disks and is less of a bottleneck than for Lustre.

Higher Level Parallel I/O


High level abstraction layer providing a parallel I/O model Built on top of a base file system (conventional or parallel) MPI-I/O is the ubiquitous model
parallel disk I/O extension to MPI in the MPI-2 standard, with a semantically richer API

can do things that POSIX I/O was never designed to do

applications using MPI-I/O are portable

Requires significant source code modification for use in legacy codes, but it has the advantage of being a standard (e.g., syntactic portability)
Examples: IBM MPI, MPICH, OpenMPI, Scali MPI
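
A minimal MPI-I/O sketch of the model described above: each task writes its own record into a shared file at a rank-determined offset. The file name and record size are arbitrary; compile with an MPI wrapper (e.g., mpicc).

#include <mpi.h>
#include <string.h>

#define RECSIZE 16384

int main(int argc, char **argv)
{
    int rank, ntasks;
    char buf[RECSIZE];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    memset(buf, 'a' + (rank % 26), RECSIZE);   /* fill one record per task */

    /* All tasks open the same file; MPI-I/O coordinates the parallel access. */
    MPI_File_open(MPI_COMM_WORLD, "mpiio_demo.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: task k writes record k at offset k * RECSIZE. */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * RECSIZE, buf, RECSIZE,
                          MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}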

Which File System Architecture is Best?


There is no concise answer to this question. It is application/customer specific. All of them serve specific needs. All of them work well if properly deployed and used according to their design specs. Issues to consider are
application requirements
often requires compromise between competing needs

how the product implements a specific architecture

3. GPFS Design Motivation

Consider miscellaneous observations to motivate the design of GPFS.

Efficiency Is Critical in a Cluster

Clusters are intended to provide cost effective performance scaling. Thus it is imperative that I/O and computational performance keep pace with each other.
A cluster designed to perform TFLOP calculations must be able to access up to 100's of GB of data per second. A cluster is no faster than its slowest component (large or small!)

Anecdote: 200 years ago, when a tree fell across the road and your ox wasn't big enough to move it out of the way, you didn't go grow a bigger ox; you got more oxen.
Rear Admiral Grace Murray Hopper Computer Pioneer 1906-1992

Efficiency Is Critical in a Cluster


Amdahl's Law Applied to a Cluster
Speedup = 1 / (f + F/n) where F = fraction time that can utilize parallelism f = fraction of time that can NOT utilize parallelism (n.b., f = 1 - F) n = number of nodes (also called ideal speedup) Parallel Efficiency is then Efficiency = 100 * Speedup / n

I/O can, but need not be a large contributor to f in clusters. I call the inefficiency represented by the term f in Amdahl's law "Amdahl inefficiency" or "Amdahl overhead". Consider a job on a 32 node/64 CPU Linux Cluster (LC). This job, when executed on a single node accessing a local scratch disk, devotes 10% of its job time writing to a file. By contrast, the LC writes via NFS to a single file server preventing parallel I/O operation. Assume the following...
the file server is the same "out of the box" Linux system used for the sequential test the Ethernet connection rate used for NFS exceeds the sequential job's write rate number crunching and file reading phases of the job runs perfectly parallel (i.e., are small enough to be ignored)

In other words, the writes are sequentialized. What are the speedup and efficiency values for this job?

Number of Tasks    Speedup    Efficiency
      8             4.71        58.9%
     16             6.40        40.0%
     32             7.85        24.5%
     64             9.86        15.4%
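
For reference, these rows follow from the Amdahl formula above with f = 0.1 (the 10% of job time spent writing); the small C check below, written under that single assumption, reproduces the 8- and 16-task rows and comes close on the 32-task row. The 64-task figure in the slide is higher than the plain formula gives, so the slide presumably folds in an additional assumption (perhaps about how 64 tasks map onto 32 nodes) that is not spelled out here.

#include <stdio.h>

int main(void)
{
    const double f = 0.10;        /* serialized fraction (the NFS writes) */
    const double F = 1.0 - f;     /* parallelizable fraction              */
    const int n[] = { 8, 16, 32, 64 };

    for (int i = 0; i < 4; i++) {
        double speedup    = 1.0 / (f + F / n[i]);   /* Amdahl's law */
        double efficiency = 100.0 * speedup / n[i];
        printf("%2d tasks: speedup %.2f, efficiency %.1f%%\n",
               n[i], speedup, efficiency);
    }
    return 0;
}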

Let's examine a simple disk I/O program and modify it to do parallel disk I/O so that we can better appreciate the tasks that a parallel file system must do and that GPFS does to allow a programmer to do parallel I/O safely.

Simple I/O Program


#include <sys/types.h>   /* offset_t: AIX 64-bit seek offset type            */
#include <fcntl.h>
#include <unistd.h>      /* llseek() on AIX; Linux would use lseek64()/off64_t */

int main()
{
    int fd, k, nrec = 1024, bsz = 16384;
    char *fid_out = "myfile", buf[bsz];
    offset_t soff;   /* 64 bit seek offset */

    fd = open(fid_out, O_WRONLY | O_CREAT | O_TRUNC, 0777);
    for (k = 0; k < nrec; k++)
    {
        do_something(buf, bsz);   /* application routine that fills the record */
        soff = (offset_t)k * (offset_t)bsz;
        llseek(fd, soff, SEEK_SET);
        write(fd, buf, bsz);
    }
    close(fd);
    return 0;
}

What do we need to do to parallelize disk I/O?

1. mapping function (i.e., locating the proper data across multiple disks over different nodes)
2. message passing (i.e., shipping data between a client node task and a disk located on a remote server)
3. caching system (e.g., coherence, aging, swapping, "data-shipping", etc.)
4. parallel programming model (e.g., data striping, data decomposition, node to disk access patterns)
5. critical section programming
6. performance tuning
7. maintain state information
8. provide an API
That's a lot of work!
GPFS has 100's of KLOCs

GPFS Design Goal

Provide a parallel I/O system conforming to the POSIX API standard. This allows you to write an application code to access one file without worrying too much about what the other tasks are doing.
You can't be blind, but you can focus on application needs without worrying too much about system issues. If the code is sequential, you can get the full benefits of a parallel file system without worrying at all about it!!

A Simple Parallel I/O Program


int main()
{
    int fd, k, nrec = 1024, bsz = 16384, ntask = 2, tid;
    char *fid_out = "myfile", buf[bsz];
    offset_t soff;   /* 64 bit seek offset */

    tid = spawn_task(ntask);
    fd = open(fid_out, O_WRONLY | O_CREAT, 0777);
    for (k = tid; k < nrec; k += ntask)
    {
        do_something(buf, bsz);
        soff = (offset_t)k * (offset_t)bsz;
        llseek(fd, soff, SEEK_SET);
        write(fd, buf, bsz);
    }
    close(fd);
    return 0;
}

NOTE: The proper way to open the file is to set a barrier and use O_TRUNC; however, this seems to work OK for this simple example. If O_TRUNC is used without a barrier or some equivalent timing primitive, records written to the file before subsequent tasks open the file will be "clobbered".

In reality you will need more than this, but it will be application oriented ONLY!

You Do Not Have to Worry About...

Which disk/file to write to (there is only one file seen by all tasks) If some other task/job has opened the file If somebody else is writing to the file right now Cache coherence If its portable ... its POSIX compliant!

Now Consider This:


What Can Go Wrong if the FS is not Parallel?

int main()
{
    int fd, k, nrec = 1024, bsz = 16384, ntask = 2, tid;
    char *fid_out = "myfile", buf[bsz];
    offset_t soff;   /* 64 bit seek offset */

    tid = spawn_task(ntask);
    fd = open(fid_out, O_RDWR | O_CREAT, 0777);
    while ((soff = find_record()))   /* assume soff%bsz == 0 */
    {
        /* critical section begin */
        llseek(fd, soff, SEEK_SET);
        read(fd, buf, bsz);
        for (k = tid; k < bsz; k += ntask)
            buf[k] = do_something(...);
        llseek(fd, soff, SEEK_SET);
        write(fd, buf, bsz);
        /* critical section end */
    }
    close(fd);
    return 0;
}

Now Consider This:


What Can Go Wrong if the FS is not Parallel?

1. Task 0, node J acquires lock
2. Task 0, node J reads record N from disk
3. Task 0, node J modifies buf[] at indices 0, 2, 4, 6, 8, ...
4. Task 0, node J writes record N to local cache
5. Task 0, node J releases lock
6. Task 1, node K acquires lock
7. Task 1, node K reads record N from disk (it does not know the record is in node J's cache)
8. Task 1, node K modifies buf[] at indices 1, 3, 5, 7, 9, ...
9. Node J flushes cache
10. Task 1, node K writes record N to local cache
11. Task 1, node K releases lock
12. Node K flushes cache, clobbering Task 0's modifications!

Now Consider This:


What Can Go Wrong if the FS is not Parallel?

This scenario is quite possible under NFS, for example, since it is not cache coherent (after all, it is not truly parallel!).

GPFS maintains cache coherence (along with its many other parallel file system duties), making parallel access to a common file safe, provided the usual concurrency precautions (such as locks or semaphores) are taken.

Comments on NFS
NFS V3 has cleaned much of this up. By using the -noac mount option and opening the file with the O_SYNC flag, parallel writes can be done more safely, though this contributes to Amdahl inefficiency by sequentializing parallel writes. However, this is not foolproof: some customer codes fail under NFS using these options where they run without error under GPFS. NFS V4 holds more promise, but the verdict is still out. And parallel NFS (pNFS) is on the horizon...
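
As a concrete illustration of the precaution just described (a hedged sketch, not a recommendation for every workload), the open call below requests synchronous writes so NFS data reaches the server rather than lingering in the client cache; the corresponding client mount would also use the noac option.

#include <fcntl.h>

/* Hypothetical helper: open a file on an NFS mount for "safer" parallel writes.
 * O_SYNC forces each write() to reach the server before returning, trading
 * performance (Amdahl inefficiency) for consistency. */
int open_for_nfs_parallel_write(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
}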

Parallel Access from Multiple Nodes

Earlier I said that GPFS...


provides a parallel I/O system conforming to the POSIX standard; therefore, you can write an application code to access one file without worrying too much about what the other tasks are doing.

Well there are 2 things to worry about: 1. Normal precautions against RAW, WAR, WAW errors 2. Performance issues
overlapping records sequentializes file access and contributes to "Amdahl inefficiency"
RAW = Read After Write WAR = Write After Read WAW = Write After Write

Simplicity vs. Flexibility

GPFS is simple to use, but it is not a black box!


Design Philosophy:
Unlike a black box, GPFS provides many tuning parameters so that it can be adapted to many and changing environments. Over the years, GPFS has become simpler to use/administer. Its complexity comes from sitting on top of a complex stack.

GPFS is the simplest complex product you will ever use!

GPFS salient feature - million knobs GPFS problem - million knobs

The next several sections survey basic architectural, organizational and topological features of GPFS. This provides a conceptual understanding for GPFS.

This helps
applications and systems programmers to more effectively utilize GPFS system administrators and architects to more effectively design and maintain a GPFS infrastructure

4. GPFS Architecture

1. Client vs. Server 2. LAN Model 3. SAN Model 4. Mixed SAN/LAN Model

Is GPFS a Client/Server Design?


chameleon

Software Architecture Perspective: No There is no single-server bottleneck, no protocol manager for data transfer. The mmfsd daemon runs symmetrically on all nodes. All nodes can and do access the file system via virtual disks (i.e., NSDs). All nodes can, if disks are physically attached to them, provide physical disk access for corresponding virtual disks.

Is GPFS a Client/Server Design?


chameleon

Practical Perspective: Yes


1. GPFS is commonly deployed having dedicated storage servers ("NSD servers") and distinct compute clients ("NSD clients") running applications that access virtual disks (i.e., "NSD devices" or "NSDs") via the file system.
this is based on economics (its generally too expensive to have 1 storage controller for every 2 nodes)

2. Nodes are designated as clients or servers for licensing.


client nodes only consume data server nodes produce data for other nodes or provide GPFS management functions
producers: NSD servers, application servers (e.g., CIFS, NFS, FTP, HTTP) management function: quorum nodes, manager nodes, cluster manager, configuration manager

server functions are commonly overlapped

This reduces cost, but use caution!

example: use NSD servers as quorum and manager nodes


The new licensing model is much cheaper! client licenses cost less than server licenses server nodes can perform client actions, but client nodes can not perform server actions

Local Area Network (LAN) Topology


Clients Access Disks Through the Servers via the LAN

NSD
SW layer in GPFS providing a "virtual" view of a disk virtual disks which correspond to LUNs in the NSD servers with a bijective mapping

[Diagram: six GPFS clients (Client #1 - Client #6), each running the GPFS NSD software layer and seeing the same set of virtual disks, nsd1 - nsd12.]

LAN Fabric (e.g., Ethernet, IB)


user data, metadata, tokens, heartbeat, etc.

LUN
Logical Unit Abstraction of a disk
AIX - hdisk Linux - SCSI device

[Diagram: four NSD servers (Server #1 - Server #4), each running the GPFS NSD layer and serving the virtual disks nsd1 - nsd12.]

Redundancy
Each LUN can have up to 8 servers. If a server fails, the next one in the list takes over.
There are 2 servers per NSD, a primary and a backup server.

LUNs map to RAID arrays in a disk controller or "physical disks" in a server


Redundancy
Each server has 2 connections to the disk controller providing redundancy

[Diagram: LUN assignments with primary/backup pairs, e.g., Server #1 primary for L1 - L3 and backup for L4 - L6, Server #2 primary for L4 - L6 and backup for L1 - L3, Server #3 primary for L7 - L9 and backup for L10 - L12, Server #4 primary for L10 - L12 and backup for L7 - L9.]

SAN switch can be added if desired.


[Diagram: storage controller with dual RAID controllers serving RAID arrays A1 - A12.]

No single points of failure


primary/backup servers for each LUN controller/host connection fail over Dual RAID controllers


Zoning: the process by which RAID sets are assigned to controller ports and HBAs. GPFS achieves its best performance by mapping each RAID array to a single LUN in the host.
Twin Tailing: for redundancy, each RAID array is zoned to appear as a LUN on 2 or more hosts.

Storage Area Network (SAN) Topology


Client/Servers Access Disk via the SAN
LAN Fabric (e.g., Ethernet, IB)
tokens, heartbeat, etc.
1 Gb/s connections are sufficient.

All nodes act both as client and server.

[Diagram: six SAN clients (SAN Client #1 - SAN Client #6), each running the GPFS NSD layer, with all LUNs (L1 - L12) mounted on every node as nsd1 - nsd12.]

All LUNs are mounted on all nodes.


GPFS is not a SAN file system; it merely can run in a SAN centric mode.


SAN Fabric (FC or IB)


user data, metadata

Multiple HBAs increase redundancy and cost.

[Diagram: storage controller with dual RAID controllers serving RAID arrays A1 - A12.]

Zoning maps all RAID sets (LUNs) to all nodes.


LICENSING CONSIDERATION: These nodes effectively function as a client/server, but not all of them require a server license.

No single points of failure


All LUNs mounted on all nodes SAN connection (FC or IB) fail over Dual RAID controllers
The largest SAN topologies in production today are 256 nodes, but they require special tuning.

CAUTION: A SAN configuration is not recommended for larger clusters (e.g., >= 64 nodes) since the queue depth must be set small (e.g., 1).

Comparing LAN and SAN Topologies


LAN Topology
All GPFS traffic (user data, metadata, overhead) traverses LAN fabric Disks attach only to servers (also called NSD servers) Applications generally run only on the clients (also called GPFS clients); however, applications can also run on servers

cycle stealing on the server can adversely affect synchronous applications

Economically scales out to large clusters


ideal for an "army of ants" configuration (i.e., large number of small systems)

Potential bottleneck: LAN adapters


e.g., GbE adapter limits peak BW per node to 80 MB/s; "channel aggregation" improves BW

SAN Topology
User data and metadata only traverse SAN; only overhead data traverses the LAN Disks attach to all nodes in the cluster Applications run on all nodes in the cluster Works well for small clusters
too expensive to scale out to large clusters (e.g., the largest production SAN cluster is 250+ nodes)
ideal for a "herd of elephants" configuration (i.e., small number of large systems)

Potential bottleneck: HBA (Host Bus Adapters)


e.g., assume 180 MB/s effective BW per 4 Gb/s HBA; multiple HBAs improve BW

Mixed LAN/SAN Topology


[Diagram: a LAN fabric connecting four SAN clients (nodes 1 - 4, also attached to the SAN fabric and storage controller arrays) and four LAN clients (nodes 5 - 8).]

It is necessary to declare a subset (e.g., 2 nodes) of the SAN clients to be primary/backup NSD servers. Alternatively, dedicated NSD servers can be attached to the SAN fabric.


COMMENTS:
Nodes 1 - 4 (i.e., SAN clients)
  GPFS operates in SAN mode
  User and meta data traverse the SAN
  Tokens and heartbeat traverse the LAN
Nodes 5 - 8 (i.e., LAN clients)
  GPFS operates in LAN mode
  User data, meta data, tokens, heartbeat traverse the LAN

COMMON EXAMPLE
Nodes 1 - 4: P6p575 or P6p595 Nodes 5 - 8: iDataPlex or blades

Symmetric Clusters

[Diagram: a LAN fabric connecting eight nodes that each act as both client and server, with disk drawers attached directly to the nodes.]

COMMENTS
Requires special bid pricing under the new licensing model
No distinction between NSD clients and NSD servers
  not well suited for synchronous applications
Provides excellent scaling and performance
Not common today given the cost associated with disk controllers
Use "twin tailed disk" to avoid single point of failure risks
  New products may make this popular again.
  does not necessarily work with any disk drawer; do a validation test first (example: DS3200 - yes, EXP3000 - no)
Can be done using internal SCSI
  Problem: exposed to single point of failure risk
  Solution: use GPFS mirroring

Which Organization is Best?

It's application/customer dependent! Each configuration has its limitations and its strong points. And each one is commonly used. The following pages illustrate specific GPFS configurations.

5. Performance Features

Six related performance features in GPFS 1. Multithreading 2. Striping 3. File caching 4. Byte range locking 5. Blocks and sub-blocks 6. Access pattern optimization

Multithreaded Architecture
GPFS can spawn up to
512 threads/node for 32 bit kernels 1024 threads/node for 64 bit kernels there is one thread per block (i.e., each block is an IOP)
large records may require multiple threads

The key to GPFS performance is "deep prefetch", which is due to its multithreaded architecture and is facilitated by
GPFS pagepool striping (which allows multiple disks to spin simultaneously) access pattern optimizations for sequential and strided access or the explicit use of hints

Data Striping
GPFS stripes successive blocks of each file across successive disks Disk I/O for sequential reads and writes is done in parallel (prefetch, write behind) Make no assumptions about the striping pattern Block size is configured when file system is configured, and is not programmable transparent to programmer

[Diagram: successive blocks (Block 1, Block 2, ...) of one file striped across the disk pool behind the server nodes; with 3 I/Os executed in parallel, a job on the application nodes reads at 120 MB/s while each disk reads at 40 MB/s.]

GPFS File Caching


Spatial vs Temporal Locality

Definitions
working set: a subset of the data that is actively being used spatial locality: successive accesses are clustered in space (e.g., seek offset) this is used for predictable access patterns (e.g., sequential, strided) temporal locality: successive accesses to the same record are clustered in time

To effectively exploit locality it is necessary to have a cache large enough to hold the working set.
good spatial locality generally requires a smaller working set
  ideally, adjacent records are accessed once and not needed again
good temporal locality often requires a larger working set
  the longer a block stays in cache, the more times it can be accessed without swapping

GPFS locality
GPFS caching is optimized for spatial locality, but can accommodate temporal locality
HPC applications more commonly demonstrate spatial locality

VM based caching systems are used in the "generic file systems"


favors temporal locality, but can accommodate spatial locality
tuned using vmtune on Unix/Linux OSs
temporal locality is common in commercial applications
VM based caches can be as large as all free main memory
examples: ext3, JFS, ReiserFS, XFS

GPFS File Caching


GPFS Pagepool

What is the pagepool?


It is a pinned memory cache used exclusively by GPFS
The pagepool is independent of the VMM subsystem
vmtune has no direct impact on the pagepool

GPFS uses mmap, shmat or kernel calls to do the pinning operation

It is used by GPFS for file data, indirect blocks and "system metadata blocks"

Pagepool size
Set by mmchconfig pagepool= {value}

(system metadata blocks include map blocks, inodes in transit, recovery log buffers and emergency buffers)

default: 64M, min value: 4M (these values are generally too small, especially for large blocks, e.g., 4M)
max value: 256G for a 64 bit OS, 2G for a 32 bit OS
BUT GPFS will not:
  allocate more than the pagepool parameter setting
  allocate more than 75% of physical memory (this can be changed to values between 10% and 90%, e.g., mmchconfig pagepoolMaxPhysMemPct=90)
  request more memory than the OS will allow

Optimum size of the pagepool


Best determined empirically
Optimum pagepool sizing is partially workload dependent File systems with a large blocksize and/or a larger number of LUNs requires a larger pagepool
This is not an exact measurement; it is only an example.

assume blocksize <= 1 MB and number of LUNs <= 12, then let sizeof(pagepool) <= 256 MB assume blocksize >= 2 MB and number of LUNs >= 24, then let sizeof(pagepool) >= 512 MB

Optimizing streaming access requires a smaller pagepool (e.g., up to 1 GB) Optimizing irregular access requires a larger pagepool (e.g., > 1 GB, enough to hold working set)

This requires temporal locality.

Once max performance is achieved, larger pagepools yield diminishing returns

GPFS File Caching


GPFS Pagepool

Pagepool Semantics
GPFS provides a client side caching model with cache coherency
Pagepool can be viewed as a single entity rather than seperate caches for each node

Regular access patterns use write-behind or prefetch caching (spatial locality)
  Applies to sequential or other predictable access patterns
  Write-behind: write cache policy
    write back (write to cache only)
    write allocate (allocate cache block before writing)
  Prefetch: read cache policy

Irregular (i.e., random) access patterns use LRU caching (temporal locality)

Prefetch threads
  GPFS maintains a pool of threads to be dispatched for active transactions
  perform write-behind for writers and prefetch for readers
  Despite the name, "prefetch" threads perform both write and read tasks.
  Set by mmchconfig prefetchThreads={value}

Miscellaneous observations
Pagepool creates implicit asynchronous operation
It is an open question as to whether POSIX AIO provides additional benefit under GPFS.

GPFS write operations are atomic


Given this and atomic writes, you can avoid accessing partially written records or corrupting records by having 2 tasks write to it at the same time.

GPFS File Caching


GPFS Pagepool

Determine the targeted portion of cache space used for streaming access
  mmchconfig prefetchPct={ integer <= 70 }
  default = 20% (e.g., 200 MB if the pagepool is 1 GB)
  Do not make this too large; remember, metadata access is random!

Allocate buffers for each file system with active sequential streams
  adaptively determine if a stream is sequential
    allocate 1 buffer for the first access
    allocate 2 buffers if the 2nd access is sequential
    allocate more buffers, up to the target, for continued sequential access
  uniformly distribute cache over active sequential streams

Adaptively determine the target number of buffers per active stream
  sizeof(buffer) = GPFS blocksize
  DFS = desired number of buffers per file system = min(v1, v2) + v3
    v1 = 2 * LUNs * factorPerBlockSize
      (factorPerBlockSize is a scaling factor from a lookup table that compensates for non-linear scaling associated with blocksizes)
    v2 = maxMBpS * tio / blocksize, where tio = avg time over the last 16 sequential transactions
      (to fully utilize a NIC, set maxMBpS = 2 * NIC_speed)
    v3 = number of active sequential streams

Determine the allowed buffer count per file system (AFS)
  AFS = min(v4, DFS)
    v4 = sizeof(pagepool) * prefetchPct / number of file systems / blocksize(FS)

Determine the target number of buffers per active sequential stream per file system
  TB = AFS / number of streams for this file system

NOTE: "Sequential" means "strictly sequential", but these algorithms can be adapted to other regular (i.e., "predictable") access patterns.

Active streams receive 1 to AFS buffers as they demonstrate sequential access


a reader with AFS buffers can not receive a new buffer until it has consumed an old buffer
a writer with AFS buffers can not dirty a new buffer until it has completed a write-behind on an old buffer
once a buffer transaction is done, the buffer is placed on the "done list" where it is recycled

If a stream (i.e., sequential user) becomes random or inactive after 5 seconds, then its buffers are disowned and given a LRU status where they "age out"
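
To make the buffer-count arithmetic above concrete, here is a small C sketch that plugs assumed values into the stated formulas (12 LUNs, a 1 MB blocksize, factorPerBlockSize taken as 1.0 since the internal lookup table is not published here, maxMBpS = 2000, a 1 GB pagepool, one file system and 4 active streams); it only mirrors the formulas as written, not GPFS's actual internal code.

#include <stdio.h>

static double dmin(double a, double b) { return a < b ? a : b; }

int main(void)
{
    /* Assumed inputs -- substitute your own cluster's values. */
    double luns = 12, factorPerBlockSize = 1.0;   /* lookup table value unknown */
    double maxMBpS = 2000, tio = 0.01;            /* 10 ms avg sequential I/O   */
    double blocksizeMB = 1.0, pagepoolMB = 1024.0;
    double prefetchPct = 0.20, nFileSystems = 1, nStreams = 4;

    double v1  = 2 * luns * factorPerBlockSize;
    double v2  = maxMBpS * tio / blocksizeMB;
    double v3  = nStreams;
    double DFS = dmin(v1, v2) + v3;                      /* desired buffers */

    double v4  = pagepoolMB * prefetchPct / nFileSystems / blocksizeMB;
    double AFS = dmin(v4, DFS);                          /* allowed buffers */
    double TB  = AFS / nStreams;                         /* target/stream   */

    printf("DFS=%.1f  AFS=%.1f  target buffers per stream=%.1f\n", DFS, AFS, TB);
    return 0;
}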

GPFS File Caching


GPFS Pagepool

GPFS recognizes the following regular access patterns and sets appropriate prefetching strategies.
Regular Access Patterns
Sequential: a strictly sequential pattern.

Fuzzy Sequential: the nfsPrefetchStrategy parameter defines a window of 3 to 12 (the default value is 2) contiguous blocks that can be accessed out of order, but cached using write-behind/prefetch semantics (except that write-behind buffers are returned to the LRU pool). While intended to handle out-of-order NFS accesses (due to thread scheduling of nfsd workers), this algorithm will work with any access pattern demonstrating similar locality.

Strided: applies to records of the same size with a consistent offset (forward or backward, including backward sequential) from the previous record. Prefetch threads only access the sectors encapsulating the record. (See the sketch below.)

Mmap Strided: applies where a small set of contiguous pages are accessed that are roughly the same length before each "gap". The prefetch algorithm tries to predict how many pages will be needed for the next stride, but only works within a single GPFS block at a time.

Multi-block Random: applies when 3 or more blocks are accessed in one request that is not sequential. The prefetch algorithm is applied to the blocks after the first block, up to the end of the request.

Remember that despite its name, the prefetch algorithm applies to both write-behind and read-prefetch.

example: if blocksize = 256K and recordsize = 1024K, then access = multi-block random

User Defined: Apply the prefetch algorithm to records allocated via the GPFS multiple access hint (discussed later).

Irregular (i.e., random)


Apply LRU (Least Recently Used) caching to access patterns not included in the previous list. Access only the sectors encapsulating the record.
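
As an illustration of the strided pattern described above (a hedged sketch with arbitrary record and stride sizes), the loop below reads fixed-size records at a constant forward offset; GPFS's prefetch logic can recognize this kind of regularity, whereas a truly random offset sequence falls back to LRU caching.

#include <fcntl.h>
#include <unistd.h>

/* Read every 8th 64 KB record from a file: same record size, constant stride.
 * The values are illustrative only. */
static void read_strided(const char *path)
{
    enum { RECSIZE = 65536, STRIDE = 8 };
    char buf[RECSIZE];
    int  fd = open(path, O_RDONLY);
    if (fd < 0) return;

    for (off_t rec = 0; rec < 1024; rec += STRIDE) {
        if (pread(fd, buf, RECSIZE, rec * (off_t)RECSIZE) <= 0)
            break;                      /* EOF or error ends the scan */
    }
    close(fd);
}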

GPFS File Caching


GPFS Pagepool

Miscellaneous observations
If you change both the maxblocksize and pagepool parameters at the same time
specify pagepool first if you increase the values specify maxblocksize first if you decrease the values

Large pagepools are most helpful when


writes can overlap computation heavy reuse of file records (n.b., good temporal locality) semi-random access patterns with acceptable temporal locality a GPFS node is used as a login node or an NFS server for large clusters

Pagepool size on NSD servers


general principle: NSD servers do not cache data; they use the pagepool for transient buffers

Formula
  pagepool_size = largest_blocksize * NSDThreads
    largest_blocksize = the largest blocksize of any GPFS file system in the cluster
      (not necessarily the same as the maxblocksize parameter)
    NSDThreads = min(A1, max(A2, A3))
      A1 = nsdMaxWorkerThreads
      A2 = nsdMinWorkerThreads
      A3 = K * nsdThreadsPerDisk, where K = number of LUNs per NSD server
  Determine these parameters as follows: mmfsadm dump config | grep -i nsd

Heuristic: Don't worry about it! Pick a value that is not too large (e.g., 64 to 128 MB)
COMMENT: This NSD server issue is most important for application environments where some subset of the application nodes have larger pagepools. Since the pagepool is easy to change, empirical methods can also be used to determine an optimum setting.
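
Here is the same NSD server formula written out as a small C calculation with assumed inputs (a 4 MB largest blocksize, 24 LUNs per server and placeholder thread parameters; read the real values with mmfsadm dump config as noted above). It is only a restatement of the rule, not authoritative sizing advice.

#include <stdio.h>

static long lmin(long a, long b) { return a < b ? a : b; }
static long lmax(long a, long b) { return a > b ? a : b; }

int main(void)
{
    /* Assumed inputs -- read the real ones from mmfsadm dump config. */
    long largest_blocksize   = 4L * 1024 * 1024;  /* 4 MB              */
    long nsdMaxWorkerThreads = 64;                /* placeholder value */
    long nsdMinWorkerThreads = 16;                /* placeholder value */
    long nsdThreadsPerDisk   = 3;                 /* placeholder value */
    long luns_per_server     = 24;                /* K                 */

    long nsdThreads = lmin(nsdMaxWorkerThreads,
                           lmax(nsdMinWorkerThreads,
                                luns_per_server * nsdThreadsPerDisk));
    long pagepool   = largest_blocksize * nsdThreads;

    printf("suggested NSD server pagepool ~= %ld MB\n", pagepool / (1024 * 1024));
    return 0;
}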

GPFS File Caching


i-node and Stat Cache
i-node cache
part of the shared segment
size = maxFilesToCache * 2.5 KB
1 <= maxFilesToCache <= 100,000
These parameters are easy to change. Empirical evaluation is the most effective means for determining optimum settings.

set large enough to accommodate the number of concurrently open files plus caching for recently used files
the default is 1000, but a value as small as 200 is adequate for traditional HPC applications
larger values (e.g., 1000) may improve performance on systems with many small files
larger values (e.g., 1000) are needed for a GPFS node used as a login node or an NFS server for large clusters

stat cache
part of the shared segment
size = 176 B * maxStatCache
best practice: maxStatCache <= 100,000
(According to the GPFS documentation, this value can be set as large as 10,000,000 (n.b., 1.7 GB), but such a large value will exceed the shared segment size.)

default = 4 * maxFilesToCache
larger values (e.g., 50,000) are needed when a GPFS node is used as a login node or an NFS server
mmfsd will only allocate as much space as it thinks is safe; if an excessive request is made, it will request at most 4 * maxFilesToCache. This is at best only a heuristic algorithm.

Avoid setting this value unnecessarily large. Remember that it is only helpful where temporal locality of stat operations (e.g., ls -l) can be exploited.
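
A quick back-of-the-envelope calculation of the shared-segment space these caches consume, using the per-entry sizes quoted above and the default maxFilesToCache of 1000 (so maxStatCache defaults to 4000); the C below is just arithmetic, not a GPFS interface.

#include <stdio.h>

int main(void)
{
    long maxFilesToCache = 1000;                 /* default            */
    long maxStatCache    = 4 * maxFilesToCache;  /* default = 4 * mFtC */

    double inode_cache_kb = maxFilesToCache * 2.5;        /* 2.5 KB per entry */
    double stat_cache_kb  = maxStatCache * 176.0 / 1024;  /* 176 B per entry  */

    printf("i-node cache ~= %.1f MB, stat cache ~= %.1f MB\n",
           inode_cache_kb / 1024, stat_cache_kb / 1024);
    return 0;
}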

The Shared Segment


memory shared by the GPFS daemon and the OS kernel AIX: a 256M segment of unpinned memory Linux: vmalloc space (n.b., set by a boot parameter) which is pinned memory GPFS uses at most 80% of the shared segment
use the mmfsadm dump fs command for the calculation of how much will fit in the shared segment

If maxFilesToCache or maxStatCache are set too large, mmfsd will not start.

Supporting Parallel File Access from Multiple Nodes

Traditionally file systems have allowed safe concurrent access to a single file from multiple tasks, but only with one task at a time. This was inefficient. GPFS provides a finer grained approach to this allowing multiple tasks to read and write to a file at the same time. GPFS does this using a feature called "byte range locking" which is facilitated by tokens.

Byte Range Locking


GPFS allows parallel applications on multiple nodes to access non-overlapping ranges of a file without conflict or performance loss, but byte range locks serialize access to overlapping ranges of a file

8 GB File
node 1 locks offsets 0-2 GB
node 2 locks offsets 2-4 GB
node 3 locks offsets 4-6.5 GB
node 4 locks offsets 6-8 GB    <-- write conflict: nodes 3 and 4 overlap at 6-6.5 GB

byte range locks preserve data integrity byte range locks are transparent to the user byte range lock patterns can be much more intricate
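
A minimal sketch of the non-overlapping access pattern in the figure above: each task writes only its own byte range of the shared file, so the byte range locks never conflict. The task-ID and range parameters are hypothetical; on GPFS this needs no extra application-level locking.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Task 'tid' of 'ntasks' fills its own contiguous slice of a shared file.
 * The ranges do not overlap, so each node acquires its byte range token once
 * and the writes proceed in parallel. (Assumes chunk is a multiple of BSZ.) */
static int write_my_range(const char *path, int tid, int ntasks, off_t filesize)
{
    enum { BSZ = 1 << 20 };                    /* 1 MB application record */
    static char buf[BSZ];
    off_t chunk = filesize / ntasks;
    off_t start = (off_t)tid * chunk;
    int   fd    = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;

    memset(buf, 'A' + tid, BSZ);
    for (off_t off = start; off < start + chunk; off += BSZ)
        if (pwrite(fd, buf, BSZ, off) != BSZ) { close(fd); return -1; }

    close(fd);
    return 0;
}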

Token Management
Byte range locking facilitated by tokens
A task can access a byte range within a file (e.g., read or write) iff it holds a token for that byte range.

Token management is distributed between 2 components.

Token Server
There can be 1 or more nodes acting as token servers
distributedTokenServer = yes by default (see mmchconfig)
designate multiple manager nodes (using mmcrcluster or mmchnode); token load is uniformly distributed over the manager nodes
1 token manager can process >= 500,000 "tokens" using default settings
total tokens = number of nodes * (maxFilesToCache + maxStatCache) + all currently open files
EXAMPLE: Using default settings (maxFilesToCache = 1000, maxStatCache = 4000), a cluster with 256 nodes will have more than 1,200,000 tokens (256 * (1000 + 4000) = 1,280,000, plus currently open files).
NOTE: tokenMemLimit controls the number of tokens per token manager; the default is 512M. As a rule of thumb, allow for ~= 600 bytes of token per file per node. In this context, each token is a set of tokens adding up to 600 bytes.

Manages tokens on behalf of a particular file system


distributes tokens to requesting token clients distributes lists of nodes to token clients requesting conflicting tokens

Tokens are processed via the kernel
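As a rough worked example of the formula above (using the defaults quoted earlier, maxFilesToCache = 1000 and maxStatCache = 4 * 1000 = 4000): a 256 node cluster holds on the order of 256 * (1000 + 4000) = 1,280,000 tokens, which matches the "more than 1,200,000" figure above; at ~= 600 bytes per token that is roughly 768 MB of token state, more than a single token server's default 512M tokenMemLimit, which is why multiple manager nodes are designated.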

Token Client
There is one token client per node per file system running on behalf of all application tasks running on that node It requests/holds/releases tokens on behalf of the task accessing the file

Token Management
The Process
The design offloads as much work as possible from the token manager.

Token semantics
tokens allow either read or write access within a byte range
NOTE: Reality is far more complex. Tokens are associated with lock objects. Tokens support 12 modes of access and there are 7 lock object types. As a "rule of thumb", allow for about 600 bytes of token (e.g., typically 3 tokens) per file per node.

token manager responsibility
"coordinates" access to files
distributes tokens, or a list of nodes holding conflicting tokens, to requesting token clients

token client responsibility


token clients act on behalf of their tasks (n.b., token operations are invisible to the application programmer)
once a task has received a token, it can access the given byte range without further interaction with the token manager until another task attempts to access the same byte range
a task requests that other tasks holding conflicting tokens release their tokens
a task releases a token to the token manager at the request of another task, but it will not do this until it has released the byte range lock for the file (this may include waiting for an I/O operation to complete)

COMMENTS
Accessing overlapping byte ranges where a file is being modified will serialize file operations (n.b., this contributes to Amdahl inefficiency)
GPFS write operations are atomic
There are 9 classes of tokens in GPFS, but an open file on any node will generally have only 3 classes of tokens associated with it, for ~= 600 bytes per file per node

Token Management: FAQs


FAQs
What happens if a manager node fails that is running a token server? A new token server will be automatically spawned on a manager node.
selected from the list of manager nodes specified by mmcrcluster, mmchnode if there are no free manager nodes, GPFS will redistribute the tokens across the existing manager nodes

File operations are suspended until the new token server is ready. The new token server re-creates its token set by collecting the token state from each node in the cluster. If multiple manager nodes are running token servers, a simple algorithm using token IDs sorts out which tokens belong to which server.
What happens if a node or task fails that holds byte range locks? A log corresponding to the failed node is re-played
metadata is restored to consistent state locks are released

How long does token server recovery take? Many variables to consider.
complexity of token state network design and robustness example: 10's of minutes in extreme cases (e.g., cluster with 4000 nodes)

Access Patterns

An application's I/O access pattern describes its I/O transaction sizes and the order in which they are accessed. It is determined by both the application and the file system. Sequentially accessed large application records based on large file system blocks provide the best performance for GPFS (as well as any other file system), but applications can not always do I/O this way. Let's examine GPFS features, tuning and best practices that can determine and compensate (to varying degrees) for access pattern variations.

GPFS Blocks
What is a block?
The largest "chunk" of contiguous data in a GPFS file system The largest "transfer" unit in a GPFS file system If sizeof(record) >= sizeof(block), then GPFS will simultaneously access multiple blocks for that transaction

exampe: if sizeof(record) = 4 MB and sizeof(block) = 1 MB, then this transaction will result in 4 simultaneous GPFS IOPs

Supported block sizes


4MB, 2MB, 1MB, 512 KB, 256 KB, 128 KB, 64 KB, 16 KB A large blocksize optimizes performance when large record accesses are common (by reducing the number of IOPs)
Blocksize (KB)    write (MB/s)    read (MB/s)
 128                611.0            502.3
 256               1226.1           1041.4
 512               2344.2           1988.7
1024               4137.9           2994.7
2048               5482.5           3418.1
4096               5477.6           3630.6

Nodes(4): P6p520, RAM = 8G, 2 x FC8, 2 x TbE, 2 x HCA (12xDDR) DCS9900: SATA, 64 tiers, cache size = GPFS blocksize, cache writeback = ON, cache prefetch = 0, NCQ = OFF GPFS: blocksize = DCS9900 cache size, block allocation = scatter, pagepool = 4 GB, maxMBpS = 4000 Application: 16 tasks, record size = GPFS blocksize, file size = 256 GB

GPFS Sub-blocks
GPFS blocks can be divided into 32 sub-blocks
A sub-block is the smallest "chunk" of contiguous data in a GPFS file system a file smaller than a sub-block will occupy the entire sub-block large files begin on block boundary files smaller than a block can be stored in fragments of 1 or more sub-blocks files larger than a sub-block have very little internal fragmentation

Sub-blocks vs. sectors


A sector (512 bytes) is the smallest "transfer" unit e.g., a 1-byte read request will result in a 512 byte transfer If the access pattern is irregular, the record sizes are smaller than a block, and the data is not in the pagepool, then GPFS will access only the sectors that enclose the record If a file smaller than a block is accessed in a single transaction, then GPFS will access only the sectors that enclose the file

Caching small irregular transactions


Suppose sizeof(record) < sizeof(block) and the record is in the pagepool. If a second record is accessed within the same block, then
if "time" <= 32, GPFS will access the entire block
if "time" > 32, GPFS flushes the cache and accesses only the sectors of the new record
"time" is measured by a random access counter differential:
a global counter is bumped every time there is a random access to any file in the file system
a local counter for a given block in the pagepool is initialized to the global value upon the first random access to that block
"time" is the difference between the global and local counters on the next random access to that block

Allocation Map and Allocation Regions


Segmented Block Allocation Map:
map types: scatter (default), cluster
selected using the mmcrfs -j parameter

Each segment contains bits representing blocks on all disks
Each segment is a separately lockable unit, minimizing conflicts between multiple writers
The allocation manager provides hints about which segments to try
sizeof(segment) < blocksize

Allocation Regions
The block allocation map is divided into k regions where k > 32 * number of nodes
the value of k is based on the number of nodes estimated by the mmcrfs -n parameter
there are at least 32 allocation regions per node
there are one or more allocation map segments per allocation region

Guarantees there are 1 or more allocation regions per node if the file system is < 97% capacity
if mmcrfs -n is set too small, nodes run out of allocation regions prematurely and start sharing allocation regions, which hurts performance
WARNING: it is not easy to change the mmcrfs -n setting... get it right the first time!
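As a worked example of the sizing rule above (the node count is hypothetical): a file system created with mmcrfs -n 128 implies at least 32 * 128 = 4096 allocation regions, i.e., roughly 32 regions per node, so each node can keep allocating from its own regions until the file system approaches the 97% capacity point.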

Block Allocation
Block Allocation Map Type
File data distribution
GPFS distributes file blocks to a file system's LUNs in a round-robin pattern file blocks are then distributed across each LUN according to the block allocation map type

Type: scatter
randomly distribute file blocks over the LUN (i.e., scattered over the disk) guarantees uniform performance of multitask jobs accessing a common file
compensates for Poisson arrivals

[Scatter] won't get you the best possible performance out of the disk subsystem, but it also avoids getting the worst of it. Yuri Volobuev

default = scatter if number of nodes > 8 or number of LUNs > 8

Type: cluster
write file data in clusters of contiguous disk blocks (i.e., clustered together on the disk) yields better performance in restricted circumstances
"small" clusters and/or "small" file systems "small" transactions (e.g., 4K)

default = cluster if number of nodes <= 8 and number of LUNs <= 8

COMMENT:
There is no guarantee that contiguous file blocks on disk will be accessed in the same order that they are mapped to disk. Factors contributing to the randomness of "arrivals" include
a larger number of tasks and/or nodes simultaneously accessing a file a larger number of files simultaneously being accessed the stochastic nature of queueing systems
WARNING: This parameter can only be changed with a destructive rebuild.

Given the variabilities of clustered block allocation, validation testing is recommended before adopting it.

Example
Hartner, et al., Sequential I/O Performance of GPFS on HS20 Blades and DS4800 Storage Server, Technical Report, IBM.
22 x 4+P RAID 5 arrays, 14 x HS20 blades using GPFS as a SAN over FC4

write (MB/s)
                     blocksize:   256 KB     256 KB     1024 KB    1024 KB
                     allocation:  scatter    cluster    scatter    cluster
number of nodes =  1              635        799        661        661
number of nodes =  2              818        1321       1321       1347
number of nodes =  4              1026       1510       1527       1561
number of nodes =  8              1180       1485       1535       1561
number of nodes = 14              1205       1470       1541       1551

read (MB/s)
                     blocksize:   256 KB     256 KB     1024 KB    1024 KB
                     allocation:  scatter    cluster    scatter    cluster
number of nodes =  1              554        554        624        624
number of nodes =  2              780        1074       1145       1145
number of nodes =  4              799        1117       1145       1145
number of nodes =  8              799        1091       1145       1145
number of nodes = 14              808        1078       1142       1145

DCS9900, 64 SATA tiers, 4 x P6p520 nodes, GPFS SAN over FC8
4 nodes / 16 tasks
                                      scatter    cluster
write, 4 KB transactions (IOP/s)1     51931      21664
read,  4 KB transactions (IOP/s)1     48113      37001
write, 4096 KB transactions (MB/s)2   5576       3474
read,  4096 KB transactions (MB/s)2   5394       4147

1. blocksize = 256 KB
2. blocksize = 4096 KB

Well Formed I/O


I/O transactions are well formed if the application record is aligned with GPFS block or sub-block boundaries.
Records >= blocksize
seek_offset % blocksize = 0 record_size / blocksize = k, where k is an integer > 0

Records < blocksize


seek_offset % sub-blocksize = 0 record_size / sub-blocksize = k, where k is an integer > 0

Caveats and Warnings


Sequential access compensates for non-well formed I/O
full blocks will be preallocated in cache upon write and prefetched upon read

Irregular access patterns with non-well formed IO will often require extra IOPs
Example (2^20 vs. 10^6 byte records): If record_size = 1000000 and blocksize = 1048576, then each record will generally span 2 blocks, requiring 2 IOPs to read 1 record that would otherwise fit in 1 block. If the application instead reads a full block (i.e., 1048576 bytes), it will see significantly improved performance even if it does not use all of the data it reads.

COMMENT
seek_offset = 0 is well formed in GPFS
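The rules above reduce to a simple alignment test. The following minimal sketch (not part of GPFS; the 32 sub-blocks per block and the sample sizes follow the earlier slides, and the function name is hypothetical) checks whether a seek offset / record size pair is well formed:

/* well_formed.c: apply the alignment rules from the slide above              */
#include <stdio.h>

static int well_formed(long long seek_offset, long long record_size,
                       long long blocksize)
{
    long long subblock = blocksize / 32;     /* a block has 32 sub-blocks      */
    /* records >= blocksize must align to blocks, smaller ones to sub-blocks   */
    long long unit = (record_size >= blocksize) ? blocksize : subblock;
    return (seek_offset % unit == 0) && (record_size % unit == 0);
}

int main(void)
{
    long long bs = 1048576;                             /* 1 MB blocksize      */
    printf("%d\n", well_formed(0,       1048576, bs));  /* 1: well formed      */
    printf("%d\n", well_formed(1000000, 1000000, bs));  /* 0: not well formed  */
    return 0;
}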

Well Formed I/O


Benchmark Example

[Bar chart: MB/s for sequential write, sequential read, random write and random read, comparing a well formed record size (1048576 bytes) with a non well formed record size (1000000 bytes); minimal impact for the sequential pattern, significant impact for the random pattern.]

COMMENT
Well formed I/O has a minimal impact on a sequential access pattern.
Well formed I/O has a significant impact on a random access pattern, since GPFS must do nearly 2X more IOPs when its transactions are not well formed.

App:   4 tasks, 2 nodes, record size = variable, file size = 8 GB
GPFS:  version 3.2, blocksize = 1 MB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk:  SAN attached SSA (16 disks @ 10 Krpm), JBOD

Direct I/O
Direct I/O
Open the file with the O_DIRECT flag
this flag is considered advisory, not mandatory
the FS can ignore it, but GPFS accepts it (n.b., "buyer beware!")
COMMENT: Use Direct I/O when the GPFS caching mechanism can not compensate for the access pattern. This is not trivial!

The I/O buffer must be memory page aligned
for most systems, 4 KB alignment (i.e., buffer_address % 4096 = 0)
The seek offset must be sector aligned for GPFS
512 B alignment (i.e., seek_offset % 512 = 0)
n.b., most file systems require 4 KB alignment

Direct I/O bypasses the FS cache mechanism; therefore, the programmer must compensate by doing the alignment manually.

EXAMPLE (uintptr_t from <stdint.h> is used so the pointer arithmetic is safe on 64-bit systems):

int bsz;             /* size of record (must be a multiple of 512 B)       */
off_t soff = 0;      /* seek offset (kept sector aligned)                  */
char *buf;           /* 4 KB aligned buffer                                */
void *b1;            /* raw pointer returned by malloc                     */
uintptr_t b2;        /* integer copy of the pointer for alignment          */
. . . . . .
b1 = malloc(bsz + 4096);              /* over-allocate by one page         */
b2 = (uintptr_t)b1;
b2 = b2 & ~(uintptr_t)0xFFF;          /* round down to a 4 KB boundary     */
b2 = b2 + 4096;                       /* move up one page, still inside b1 */
buf = (char*)b2;                      /* buf % 4096 == 0                   */
if (bsz % 512 != 0)
    printf("ERROR: record size is not sector aligned\n");
else
    soff += (off_t)bsz;               /* next seek offset stays 512 B aligned */
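A tidier alternative (a minimal sketch, not from the tutorial; the file name and record size are hypothetical) is to let posix_memalign() produce the page aligned buffer and to advance the seek offset in multiples of the record size so it stays sector aligned:

/* direct_read.c: Direct I/O with posix_memalign() instead of manual swizzling */
#define _GNU_SOURCE                     /* exposes O_DIRECT on Linux            */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t bsz = 512 * 1024;            /* record size: a multiple of 512 B     */
    void  *buf;
    int    fd;

    if (posix_memalign(&buf, 4096, bsz) != 0) {      /* 4 KB page alignment     */
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    fd = open("/gpfs/fs1/rawdata", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* seek offsets advance by bsz, so they remain sector (512 B) aligned       */
    if (pread(fd, buf, bsz, 0) < 0) perror("pread");

    close(fd);
    free(buf);
    return 0;
}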

Common Application Access Patterns


Streaming
records are accessed once and not needed again generally the file size is quite large (e.g., GB or more) good spatial locality occurs if records are adjacent performance is measured by BW (e.g., MB/s, GB/s) operation counts are low compared to BW most common in digital media, HPC, scientific/technical applications
COMMENT: These access patterns are from the application's perspective. The actual access pattern on the media is also determined by the file system and storage controller architecture and configuration.

IOP Processing
small transactions (i.e., less than FS block size) small records irregularly distributed over the seek offset space small files poor spatial locality and often poor temporal locality performance is measured in operation rates (e.g., IOP/s) operation counts are high compared to BW common examples: bio-informatics, EDA, rendering, home directories

Transaction Processing
small transactions (i.e., files or records less than the blocksize), but often displaying good temporal locality access efficiency can often be improved by database technology performance is measured in operation rates (e.g., IOP/s) operation counts are high compared to BW common examples: commercial applications

Access Pattern Optimizations

GPFS uses cache to improve the performance of various access patterns:


sequential (write and read, forward and backward) strided (write and read, forward and backward) small file (write only)

If the pattern is recognized, then the relevant records can be asynchronously pre-loaded into cache. If the access pattern is not recognized by GPFS, then hints can be provided informing GPFS which records can be pre-loaded into cache.
COMMENT: These optimizations assume spatial locality

Sequential Access Pattern


When GPFS detects a forward or backward sequential order, it either preallocates cache blocks on write or prefetches disk blocks on reads generating the peak performance sustainable by the disk controller.
App:   4 tasks, 2 nodes, record size = 1 MB, file size = 8 GB, well formed I/O
GPFS:  version 3.2, blocksize = 1 MB, pagepool = 1 GB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk:  SAN attached SSA (16 disks @ 10 Krpm), JBOD

Access Pattern           Write         Read
Sequential               85.8 MB/s     88.8 MB/s
Backward Sequential      89.4 MB/s     93.0 MB/s
Random                   37.8 MB/s     32.8 MB/s    (minimal caching)

Sequential Access Pattern


GPFS Blocksize vs. Application Record Size*
I/O is well formed.

[Chart: write rate and read rate in MB/s as a function of application record size, from 1 KB to 2048 KB.]

Best Practice: Let sizeof(record) >= sizeof(block). More complex systems will be less forgiving with very small records (e.g., < 16 KB).

COMMENT: Even if sizeof(record) < sizeof(block), GPFS will still access full blocks since the access pattern is sequential.
write: fill the block in cache before flushing
read: prefetch the full block

App:   4 tasks, 2 nodes, record size = variable, file size = 2 GB, well formed I/O
GPFS:  version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk:  SAN attached SSA (16 disks @ 10 Krpm), JBOD

Strided Access Pattern

When GPFS detects a strided order, it prefetches along the stride thus improving performance.

8 tasks @ 1 per node, blocksize = 256 KB, record size = 16 KB, file size = 5 GB
WH2 with 14 clients and 2 VSD servers, using 36 GB, 10 Krpm, SSA drives

          strides under GPFS 1.2    strides under GPFS 1.3
Write     less than 1 MB/s*         17 MB/s
Read      11.1 MB/s*                58 MB/s

* The strided rate under GPFS 1.2 is the same as the random (without hints) rate under GPFS 1.3.

Improving Strided Access Rates

But notice that when the record size is increased from 16 KB to 1024 KB, the rates increase.

8 tasks @ 1 per node, blocksize = 256 KB, file size >= 5 GB
WH2 with 14 clients and 2 VSD servers, using 36 GB, 10 Krpm, SSA drives

          record size = 1024 KB    record size = 16 KB
Write     172 MB/s                 17 MB/s
Read      211 MB/s                 58 MB/s

Irregular Access Patterns Are Slow

An irregular access pattern does not allow GPFS to anticipate the seek pattern. Therefore it can not prefetch records for reading or preallocate cache blocks for writing.

POSIX I/O is a simple standard covering the basics. Early versions of GPFS stuck closely to this standard. But because of its shortcomings in many environments, IBM has added API extensions to GPFS that go beyond the POSIX I/O API. These extensions are a mixed blessing. While they improve performance and facilitate important semantics not part of POSIX I/O, they are generally not portable.

GPFS Multiple Access Hint


Improving Random I/O

Some applications are intrinsically based on small records


sorting jobs dmostack in seismic processing

The GPFS multiple access hint allows the programmer to post future accesses so that GPFS can prefetch them asynchronously. Reads are improved substantially, writes not as much.

              without hints    using hints*
write rate    33.1 MB/s        38.8 MB/s
read rate     18.5 MB/s        63.7 MB/s

The impact of using hints is more significant given a larger number of nodes.

App:   8 tasks, 2 nodes, record size = 128 KB, file size = 2 GB, well formed I/O
GPFS:  version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk:  SAN attached SSA (16 disks @ 10 Krpm), JBOD

The multiple access hint interface is tedious to use, but a simple to use interface can be crafted.

GPFS Multiple Access Hint:


An Example of a Simple Interface

A simple GPFS multiple access hint interface can be designed by the user (hiding the low level tedium) making it easier for high level applications to use hints.

For example...
public:
    int pio_init_hint(struct pio *p, int maxbsz, int maxhint);
    int pio_post_hint(struct pio *p, offset_t soff, int nbytes, int nth, int isWrite);
    int pio_declare_1st_hint(struct pio *p);
    int pio_xfer(struct pio *p, char *buf, int nth);
private:
    int pio_gen_blk(struct pio *p, int nth, int isWrite);
    int pio_issue_hint(struct pio *p, int nth);
    int pio_cancel_hint(int fd);
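The tutorial does not show struct pio, offset_t (an AIX type from <sys/types.h>) or the wrapper bodies, so the fragment below is only a hedged guess at how an application might drive such an interface (a calling-sequence sketch, not the actual contract; NREC and RECSZ are hypothetical):

/* sketch: post NREC future reads, then consume them in the posted order       */
#include <sys/types.h>          /* offset_t on AIX                              */
#include <stddef.h>

struct pio;                                        /* opaque to the caller      */
int pio_init_hint(struct pio *p, int maxbsz, int maxhint);
int pio_post_hint(struct pio *p, offset_t soff, int nbytes, int nth, int isWrite);
int pio_declare_1st_hint(struct pio *p);
int pio_xfer(struct pio *p, char *buf, int nth);

#define NREC  64                 /* number of records to prefetch               */
#define RECSZ (128 * 1024)       /* record size                                 */

void read_with_hints(struct pio *p, const offset_t *offsets, char *buf)
{
    int n;

    pio_init_hint(p, RECSZ, NREC);                /* size the hint table        */
    for (n = 0; n < NREC; n++)                    /* post the future reads      */
        pio_post_hint(p, offsets[n], RECSZ, n, 0 /* isWrite = 0 */);
    pio_declare_1st_hint(p);                      /* start asynchronous prefetch */

    for (n = 0; n < NREC; n++)                    /* consume each record        */
        pio_xfer(p, buf + (size_t)n * RECSZ, n);
}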

PUTTING IT ALL TOGETHER:


Block Size vs. Record Size vs. Caching

[Chart: I/O rate in MB/s vs. record size (1 KB to 4096 KB) for sequential write, sequential read, random write and random read.]

Best Practice: Be sure that sizeof(record) >= sizeof(block). This has a very noticeable effect for the random access pattern, which approaches sequential performance for very large records.

App:   4 tasks, 2 nodes, record size = variable, file size = 2 GB, well formed I/O
GPFS:  version 3.2, blocksize = 256 KB, pagepool = 256 MB
Nodes: P4p615, 2 cores, 4 GB RAM
Disk:  SAN attached SSA (16 disks @ 10 Krpm), JBOD

Small Files
Small Files
Increasingly common in clusters today Small blocks work best when the average file size is small (e.g., less than 256K) But do not make GPFS blocks too small select blocksize so that the sub-block size ~= average file size

reduces internal fragmentation produces optimum small file performance block still large enough to support larger files (not every file will be small)

Small file optimization allocate small files "close together" by filling one full block on one disk before moving to the next
flushes small files to disk as individual small IOPs (i.e., in units of sub-blocks) use controller cache to block the small IOPs into larger transaction Will this be better on DCS9900? produces 7.2% improvement in DS4800 benchmarks

improves write performance, less likely to improve read performance

Small file access performance is not as good as streaming access


Small file access is an IOP access pattern Should only be considered when all else fails

Tuning GPFS for IOP Processing


When Small Transactions Are Inevitable
worker1Threads
Consider N+P RAID sets. Then set worker1Threads = 2 * N * number_of_LUNs
COMMENT: This slide needs further refinement and some of its guidelines need to be tested more rigorously.

prefetchThreads
You need roughly twice as many prefetchThreads as LUNs. Suppose you have K LUNs, then prefetchThreads = 2 * K

pagepool
Set the pagepool large enough so that 20% of it can hold the buffers for the prefetchThreads: prefetchThreads * blocksize < 0.2 * pagepool
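Pulling the three guidelines together in a worked example (the configuration is hypothetical): with 4+P RAID 5 LUNs (N = 4) and K = 8 LUNs, worker1Threads = 2 * 4 * 8 = 64 and prefetchThreads = 2 * 8 = 16; with a 1 MB blocksize the pagepool should then exceed 16 * 1 MB / 0.2 = 80 MB.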

When there are many small files...


Increasing the maxFilesToCache and the maxStatCache may help example (EDA): maxFilesToCache = 10000, maxStatCache = 40000

When there is good temporal locality


pagepool < 0.5 * sizeof(memory) max pagepool < 256 GB

Direct I/O
Bypasses the FS cache mechanism Since GPFS is optimized for large records, can this reduce overhead for small records? Use knowledge of application to exploit locality not detected by the file system

COMMENT: This slide needs further refinement, doing tests with a more genuinely random pattern for small files.

IOP Processing vs. Streaming


IOP Workload
ntasks = 16, tree depth = 12, files per directory = 8, total directories = 65520,
total files = 524160, record size = 2 KB, total data = 2600 MB

GPFS Config: version 3.2, blocksize = 256 KB, subblocksize = 8 KB, pagepool = 1024 MB
DS4800 Config: 64 x 15Krpm disks, 4+P RAID 5, segment size = 64 KB, cache page size = 16 KB,
read cache = ON, read ahead = ON, write cache = ON, write cache mirroring = OFF

function                  write            read (note 3)
job time (sec)            163.9            90.7
average time (sec)        112.1            86.4
total time (sec)          1793.3           1383.1
# IOPs                    2,620,800        2,555,280
% directory ops           10.0%            7.7%
% open/close              40.0%            41.0%
% write                   50%              -
% read                    -                51.3%
IOP rate (IOP/s)          23384 (note 2)   29561
Data rate (MB/s)          6.0              29.6

Streaming Workload
ntasks = 16, tree depth = 1, files per directory = 1, total directories = 1,
total files = 1, record size = 1 MB, total data = 65536 MB

GPFS Config: version 3.2, blocksize = 1024 KB, pagepool = 256 MB
DS4800 Config: 64 x 15Krpm disks, 4+P RAID 5, segment size = 256 KB, cache page size = 16 KB,
read cache = ON, read ahead = ON, write cache = OFF

function                  write            read
job time (sec)            60.7             47.8
average time (sec)        59.0             46.3
total time (sec)          943.6            740.8
# IOPs                    65540            65540
% directory ops           0%               0%
% open/close              0%               0%
% write                   99.9%            -
% read                    -                99.9%
IOP rate (IOP/s)          1111             1415
Data rate (MB/s)          1111.2           1415.4

COMMENTS:
1. For the most part these are non-cached IOP rates. Moreover, the IOP rates quoted here are based on application transactions. The non-cached IOP rates quoted for storage controllers are based on consistent 4K transactions measured by the controller and not by the application.
2. The write IOP rate is based on the harmonic aggregate; however, this measure is slightly compromised by the large job time variance of 24.1%. By comparison, the natural aggregate is 15988. The "true" IOP rate (i.e., when all 16 tasks were active) would be closer to 20000. The variance for the read rate was only 4.0%.
3. The tree traversal algorithm used by this benchmark may lend itself to an unnaturally cache friendly situation not typical for many small file access patterns; these read values may benefit unnaturally from cache.

The Big Picture

Use large files and streaming access where possible.

Seek Arm Mechanics


The more data transferred for every seek arm movement, the better the performance!

6. GPFS Management and Overhead Functions

In general, GPFS is designed to perform the same functions on each node and the functions performed on behalf of an application are executed on the node where it is generated. However, there are specialized management and overhead operations which are performed globally that affect the operation of the other nodes in the cluster.

GPFS Management Functions


Like any file system, GPFS has several classes of overhead functions, but it does not concentrate them into a dedicated "metadata" server. Rather, it distributes them over several to many nodes reducing the impact of their overhead and risk exposure. Some aspects of these functions are concentrated on specific nodes* where the cost of its distribution outweighs its value.
Metanode+ Configuration Manager
AKA "cluster configuration manager"

Cluster Manager Manager Nodes


File System Managers
File system configuration Manage disk space allocation Quota management Security services

Token Managers

Quorum Nodes
These functions are generally overlapped with other dedicated nodes (e.g., NSD servers, login nodes) though in very large clusters (e.g., over 2000 nodes) this must be done carefully so that their function is not impacted by network congestion. Does not require a server license

Metanodes
Problem
Can't afford an exclusive inode lock to update file size and mtime
Can't afford locking whole indirect blocks

Solution

Metanode (one per file) collects file size, mtime/atime, and indirect block updates from other nodes

How it works
Metanode is elected dynamically and can move dynamically Only the metanode reads & writes inode and indirect blocks Merges inode updates by keeping largest file size and latest mtime Synchronization Shared write lock allows concurrent updates to file size and mtime. Operations that require exact file size/mtime (e.g., stat) conflict with the shared write locks. Operations that may decrease file size or mtime

Comments
This is not a metadata server concentrating metadata operations on 1 or a small number of dedicated nodes. Rather, it's a distributed algorithm processing metadata transactions across clients in the cluster. There is minimal overhead per metanode

Configuration Manager
There is a primary and backup configuration manager per GPFS cluster
Specified when the cluster is created using mmcrcluster Common practice assign to manager nodes and/or NSD servers.

Function
Maintains the GPFS configuration file /var/mmfs/gen/mmsdrfs on all nodes in the GPFS cluster. This configuration file can not be updated unless both the primary and backup configuration managers are functioning.

Minimal overhead

Cluster Manager
There is one cluster manager per GPFS cluster Selected by election from the set of quorum nodes
can be changed using mmchmgr

Functions

Monitors disk leases (i.e., "heartbeat") Detects failures and directs recovery within a GPFS cluster determines whether quorum exists
This guarantees that a consistent token management domain exists; if communication were lost between nodes without this rule, the cluster would become partitioned and the partition without a token manager would launch another token management domain (i.e., "split brain")

Manages communications with remote clusters distributes certain configuration changes to remote clusters handles GID/UID mapping requests from remote clusters Selects the file system manager node by default, it is chosen from the set of designated manager nodes choice can be overridden using mmchmgr or mmchconfig commands

Network Considerations

GbE is adequate!

Heartbeat network traffic is light and the packets are small
default heartbeat rate = 1 disk lease / 30 sec per node
the cluster manager for a 4000 node cluster therefore receives about 4000 / 30 ~= 133 disk leases per second
But network congestion must not be allowed to interfere with the heartbeat
by default, a disk lease lasts 35 sec, but a node has the last 5 sec of the lease to renew it
best practice: assign the cluster manager to a lightly used or dedicated node in clusters over 1000 nodes

Manager Nodes
Designating manager nodes
They are specified when a cluster is created (using mmcrcluster)
can be changed using mmchnode

Can specify up to 128 manager nodes


If not specified, GPFS selects 1 node to be a combined file system and token manager node.

Function
File system managers Token managers

Best practices
smaller clusters (less than 1000 nodes): commonly overlapped with NSD servers and/or quorum nodes; some customers overlap quorum and manager nodes
larger clusters (more than 1000 nodes): assign to lightly used or dedicated nodes; be careful overlapping them with login nodes
do not overlap with NSD servers

File System Managers


There is exactly one file system manager per file system
File system managers are uniformly distributed over manager nodes
A file system manager never spans more than 1 node, but if there are more file systems than manager nodes, there will be multiple file system managers per manager node.

Choice can be overridden by mmchmgr command


any node can be chosen (n.b., it does not have to be a manager node)

There are 4 file system management functions


1. file system configuration
adding disks
changing disk availability
repairing the file system
mount/umount processing (this is also done on the node requesting the operation)

2. disk space allocation management


controls which regions of each disk are allocated to each node (striping management)

3. quota management
enforces quotas if it has been enabled (see mmcrfs and mmchfs commands) allocates disk blocks to nodes writing to the file system generally more disk blocks are allocated than requested to reduce need for frequent requests

4. security services
see manual for details some differences appear to exist between AIX and Linux based systems

Low overhead

Token Managers
Token managers run on manager nodes
GPFS selects some number of manager nodes to run token managers
GPFS will only use manager nodes for token managers the number of manager nodes selected is based on the number of GPFS client nodes

Token state for each file system is uniformly distributed over the selected manager nodes
there is 1 token manager per mounted file system on each selected manager node

1 manager node can process >= 500,000 "tokens" using default settings
In this context, a token is a set of several tokens ~= 600 bytes on average.
total number of tokens = number of nodes * (maxFilesToCache + maxStatCache) + all currently open files
If the selected manager nodes can not hold all of the tokens, GPFS will revoke unused tokens, but if that does not work, the token manager will generate an ENOMEM error. This usually happens when not enough manager nodes were designated.

Function
Maintain token state (see earlier slides)

Overhead
CPU usage is light Memory usage is light to moderate (e.g., at most 512 MB by default)
can be changed using mmchconfig tokenMemLimit=<value>

Message traffic is variable, but not excessive. It is characterized by many small packets
If network congestion impedes token traffic, performance will be compromised, but it will not cause instability. If NSD servers and GPFS clients are also used for token management, large block transfers (e.g., >= 512 KB) may impede token messages. If these issues are impeding token response,
chances are good that users will never notice.

Quorum
Problem
If a key resource fails (e.g., cluster manager or token manager) GPFS will spawn a new one to take over. But if the other one is not truly dead (e.g., network failure), this could create 2 independent resources and corrupt the file system.

Solution

Quorum must be maintained to recover failing nodes. 2 options Node quorum (default): must have at least 3 quorum nodes Node quorum with tiebreaker disks: used in 1 or 2 node clusters
[Diagram: two frames, each behind its own LAN switch, joined by an inter-switch link; frame 1 runs the token manager and frame 2 runs the cluster manager. If the ISL fails, GPFS must not allow frame 1 to spawn an independent cluster manager or frame 2 to spawn an independent token manager.]

Node Quorum
How it Works
Node quorum is defined as one plus half of the explicitly defined quorum nodes in the GPFS cluster. There are no default quorum nodes. The smallest node quorum is 3 nodes.
Selecting quorum nodes: best practices
Use caution in a "five 9's" environment
Select nodes most apt to remain active
Select nodes that rely on different failure points (example: select nodes in different racks or on different power panels)
In smaller clusters (e.g., < 1000 nodes) select administrative nodes (common examples: NSD servers, login nodes)
In large clusters, either select dedicated nodes or overlap with manager nodes; do not overlap with NSD servers
Select an odd number of nodes (e.g., 3, 5, or 7 nodes); more than 7 nodes is not necessary and increases failure recovery time without increasing availability
[Diagram: four frames of compute nodes plus four NSD servers on a common LAN fabric and SAN switch; five of the nodes are designated QN. Nodes designated QN are quorum nodes.]

Node Quorum Example


No Quorum Rule

[Diagram: the inter-switch link between frame 1 (running the token manager) and frame 2 (running the cluster manager) fails; without a quorum rule each frame ends up running its own cluster manager and token manager. Ouch!]

Without the quorum rule:
frame 1 could start an independent cluster manager
frame 2 could start an independent token manager
the file system would be corrupted!
This is sometimes called "split brain".

Maintaining Quorum

[Diagram: the same failure, but frame 1 holds one quorum node and frame 2 holds two quorum nodes, so only frame 2 retains quorum.]

Maintaining quorum semantics:
guarantees frame 1 can not start an independent cluster manager; its token manager remains inactive
allows frame 2 to spawn a new token manager and remain active

Node Quorum with Tiebreaker Disks


How it Works

Node quorum with tiebreaker disks runs with as little as one quorum node available so long as there is access to a majority of the quorum disks.
there can be a maximum of only 2 quorum nodes the number of non-quorum nodes is unlimited (can be as small as zero) there can be 1 to 3 tiebreaker disks (n.b., odd number of disks is best)

tiebreaker disks must be directly accessible from the quorum nodes, but do not have to belong to any particular file system
they must have a cluster-wide NSD name as defined through the mmcrnsd command
tiebreaker disks must be SAN attached (FC or IP) or VSDs

same rules apply in selecting quorum nodes for both quorum options (see previous page) select quorum nodes with mmcrcluster or mmchconfig commands select tiebreaker disks with mmchconfig command
[Diagram: two quorum nodes (QN) on a LAN, SAN attached to disks d0-d5; disks d3, d4 and d5 are designated TB.]

EXAMPLE: GPFS remains active with the minimum of a single available quorum node and two available tiebreaker disks.

Nodes designated QN are quorum nodes; disks designated TB are tiebreaker disks.

7. Node Failure Recovery

Several of the GPFS management functions we have just considered are designed for node failure recovery and maintaining data integrity. In this section, let's take a closer look at how this all fits together.

Disk Lease
Disk Lease (AKA, "heartbeat")
Disk leasing is the mechanism facilitating failure recovery in GPFS. It's a GPFS-specific fencing mechanism. A node can only access the file system if it has a disk lease. If a node fails or can not access the LAN, it can not renew its lease. Recovery begins after a lease expires. It gives time for I/O "in flight" to complete. It guarantees consistent file system metadata logs. It reduces the risk of data corruption during failure recovery processing. Failure recovery will have little or no effect on other nodes in the cluster.

Heartbeat Related Tuning Parameters


mmchconfig parameters failureDetectionTime (default = 35 sec)
starting with GPFS v3.2, leaseDuration is derived from other parameters and should be left at its default value

leaseRecoveryWait (default = 35 sec) minMissedPingTimeout (default = 3 sec) maxMissedPingTimeout (default = 60 sec) Best practice: do not alter these defaults without guidence from IBM
There's a reason they are not documented!

Node Failure Recovery

The next few slides take a careful look at node failure recovery under several scenarios. There are more cases than this, but the illustrated scenarios cover the basic concepts.

Node Failure Recovery


Non-Cluster Manager*
Any node other than the cluster manager.

[Timeline: lease renewal request sent; lease duration = failureDetectionTime; leaseRecoveryWait; node returned to cluster.]

Suppose this node has just failed.
a. Last time this node renewed its lease
b. This node sends a lease renewal request to the cluster manager
c. The cluster manager detects that the lease has expired, and starts pinging this node
d. The cluster manager decides that the node is dead and runs the node failure protocol
e. The file system manager starts log recovery
f. The revived node returns to the cluster; it must validate that it has access to quorum to return to the cluster

This is the process where a failed node is removed from the cluster. Failure processing is completed before recovery processing begins.

Node Failure Recovery: the Deadman Timer


Non-Cluster Manager*
Any node other than the cluster manager.
[Timeline: the same sequence as above, with a deadman timer that starts when this node's lease expires.]

Suppose this node has just failed.

The deadman timer duration = 2/3 * leaseRecoveryWait. It is necessary to set this parameter long enough to be certain that data in flight is gone. If not, data in flight may arrive after the log replays and corrupt the file system. mmfsck can generally fix this.

Problem: If an IOP in flight arrives after the log has been replayed, it would arrive out of order and corrupt the file system. This can only happen on nodes which are writing and have direct access to disk.
Solution: If this node is not completely dead, it starts a "deadman timer" thread once its lease duration expires at time c. If there is an IOP in flight (e.g., a hung IOP) when the deadman timer expires, it will panic the kernel to prevent the IOP from completing.
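For example, with the default leaseRecoveryWait of 35 sec quoted earlier, the deadman timer would fire roughly 2/3 * 35 ~= 23 seconds after the lease expires.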

Node Failure Recovery


Cluster Manager
[Timeline: lease renewal request sent; failureDetectionTime; leaseRecoveryWait; node returned to cluster.]

Suppose the cluster manager has just failed.
a. Last time that the old cluster manager answered a lease renewal request from another quorum node.
b. Last time a quorum node sent a new lease request to the old cluster manager. This is also the last time that the old cluster manager could have renewed its own lease.
c. A quorum node detects that it is unable to renew its lease and starts pinging the old cluster manager.
d. The quorum node decides that the old cluster manager is dead and runs an election to take over as the new cluster manager (assuming quorum can still be maintained).
e. The election completes and the new cluster manager runs the node failure protocol.
f. The file system manager starts log recovery.
g. The revived node returns to the cluster. Without manual intervention, it will come back as a quorum node.

Notes on the Recovery Process


Recovering from a node failure (includes daemon failures)
cluster manager broadcasts failed node status to cluster
uses heartbeat to detect node failures uses join/leave messages to manage a node failure

initiate recovery process for each file system mounted on a failed node
ensure that the failed node no longer has access to FS disks
use logs to rebuild metadata that was being modified at the time of failure to a consistent state
locks held by the failed node are released
mmfsck recovers blocks that were allocated but not assigned to a file during recovery

recover the failed node and add it back to the cluster

Logs (i.e., GPFS Recovery Logs)


created at file system creation time, additional logs created as needed file system manager assigns a log to each node accessing the file system
logs are replicated

logging (and rigid sequencing of operations) preserve atomicity of on-disk structures data blocks are written to disk before control structures referencing them
prevents contents of previous data block being accessed in new file

metadata blocks are written/logged so that there will never be a pointer to a block marked unallocated that is not recoverable from the log log recovery is run as part of recovery to node failure affecting locked objects

8. Miscellaneous

GPFS Memory Usage and Accounting


Heap
Moderate amount (e.g., 40 MB) of general purpose memory (e.g., thread stacks) which is accounted as belonging to GPFS (i.e., mmfsd) Similar for both AIX and Linux

Token Heap

Memory used for processing tokens which is accounted as belonging to GPFS (i.e., mmfsd) Size is negligible except for token manager nodes Similar for both AIX and Linux

Shared Segment
Chunk of common memory available to all tasks; used by GPFS for the inode/stat caches Since it is available to all tasks, the portion used by GPFS is not accounted as belonging to GPFS AIX: unpinned, allocated via shmat (32 bit) or kernel call (64 bit) Linux: pinned, allocated via mmap

Pagepool
AIX: pinned, allocated via shmat (32 bit) or kernel call (64 bit), but not accounted as belonging to GPFS Linux (32/64 bit): pinned, allocated via mmap and is accounted as belonging to GPFS
COMMENT: When using general tools to measure memory usage (e.g., top), the difference in memory allocation mechanisms for a particular GPFS/OS combination leads to different memory accounting. In particular, GPFS memory usage appears larger under Linux than AIX since the pagepool is attributed to GPFS under Linux, but not under AIX.

GPFS Code Structure


All nodes equal in principle (same code installed on all nodes)
Some nodes perform special roles (dynamically elected) Controllable via config options

Components:

Kernel extension / kernel modules AIX: single kernel extension Linux: three kernel modules
tracing, portability/GPL layer, I/O

Daemon (i.e., mmfsd) Commands and scripts ts-command: just a stub that sends params to daemon mm-scripts:
Call ts-commands Some are just wrappers around ts-commands with additional error checking Some do more: manage cluster configuration (update mmsdrfs), gpfs startup/shutdown, etc.

GPFS Subnets
Public

Ethernet Switch
(1 GbE network) subnet 3: 30.30.30.x

Private LAN*
subnet 1: 10.0.10.x

COMMENTS: Build the GPFS cluster using the existing public Ethernet network (i.e., subnet 3)

Private LAN*
subnet 2: 10.0.20.x

Use IP addresses 30.30.30.0 in mmcrcluster


b01 b02 b03 b04 b05 b06 b07 b08 b09 b10 b11 b12

a01 a02 a03 a04 a05 a06 a07 a08 a09 a10 a11 a12

Use GPFS subnets to prioritize which subnet a node will use for GPFS transactions
mmchconfig subnets="10:0:10:0" -N nodelst.a* mmchconfig subnets="10:0:20:0" -N nodelst.b*

This will cause a node to use its high speed network first whenever the file system it needs can be reached over that subnet. By default, subnet 30.30.30.0 is the lowest priority subnet and is used as needed; e.g.,
node a05 accesses /f1/xxx via 10.0.10.0
node a05 accesses /f2/zzz via 30.30.30.0

Nodes accessing files over Ethernet may be BW constrained compared to private LANs which are assumed to be a high speed network. See mmchconfig command

San Switch
DS5000-01

/f1 built on disks from DS4000-01

San Switch
DS5000-02

Assume this storage is under file system /f2

* The private LAN is generally a high speed switch; e.g., IB, Myrinet, Federation Assume nodelst.a contains the nodes a01-a12 and nodelst.b contains nodes b01-b12.

GPFS Multi-Cluster Feature


The Big Picture

Problem: nodes outside the cluster need access to GPFS files Solution: allow nodes outside the cluster to natively (i.e., no NFS) mount the file system
Home cluster responsible for admin, managing locking, recovery, etc. Separately administered remote nodes have limited status
Can request locks and other metadata operations Can do I/O to file system disks over global SAN Are trusted to enforce access control, map user Ids,

Cluster 1 Nodes

Cluster 2 Nodes

Local disk access

Site 1 SAN

Site 2 SAN

Remote disk access

Cluster 1 File System Remote disk access Global SAN Interconnect

Uses:
High-speed data ingestion, postprocessing (e.g. visualization) Sharing data among clusters Separate data and compute sites (Grid) Forming multiple clusters into a supercluster for grand challenge problems

Site 3 SAN

Visualization System

Scaling: max supported GPFS cluster size

GPFS Multi-Cluster
Example
IP Switch Fabric
Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node NSD A1 NSD A2 NSD A3 NSD A4

Inter-Switch Link (at least GbE speed!)


COMMENTS: Cluster_B accesses /fsA from Cluster_A via the NSD nodes
see example on next page

SAN

UID/GIDA mmname2uid

Cluster_A /fsA Home Cluster

Cluster_B mounts /fsA locally as /fsAonB OpenSSL (secure socket layer) provides secure access between clusters for "daemon to daemon" communication using the TCP/IP based GPFS protocol. However, nodes in the remote cluster do not require ssh contact to nodes in the home cluster or vice-verse.

UID MAPPING EXAMPLE (i.e., Credential Mapping)
1. pass Cluster_B UID/GID(s) from the I/O thread node to mmuid2name
2. map the UID to GUN(s) (Globally Unique Name)
3. send the GUN(s) to mmname2uid on a node in Cluster_A
4. generate the corresponding Cluster_A UID/GID(s)
5. send the Cluster_A UID/GIDs back to the Cluster_B node running the I/O thread (for the duration of the I/O request)

COMMENTS: mmuid2name and mmname2uid are user written scripts made available to all users in /var/mmfs/etc; these scripts are called ID remapping helper functions (IRHF) and implement access policies. Simple strategies (e.g., a text based file with UID <-> GUN mappings) or 3rd party packages (e.g., Globus Security Infrastructure from TeraGrid) can be used to implement the remapping procedures.

IP Switch Fabric
GUN mmuid2name UID/GIDB
Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node Compute Node NSD B1 NSD B2

SAN

Cluster_B /fsAonB Remote Cluster

See http://www-1.ibm.com/servers/eserver/clusters/whitepapers/uid_gpfs.html for details.

GPFS Multi-Cluster
Example
Mount a GPFS file system from Cluster_A onto Cluster_B
On Cluster_A
1. Generate public/private key pair
mmauth genkey new COMMENTS key pair is placed in /var/mmfs/ssl

On Cluster_B
4. Generate public/private key pair
mmauth genkey COMMENTS key pair is placed in /var/mmfs/ssl public key default file name id_rsa.pub

public key default file name id_rsa.pub

2. Enable authorization

5. Enable authorization
mmauth update . -l AUTHONLY

mmauth update . -l AUTHONLY

3. Sysadm gives following file to Cluster_B


/var/mmfs/ssl/id_rsa.pub

6. Sysadm gives following file to Cluster_A


/var/mmfs/ssl/id_rsa.pub

COMMENT: rename as cluster_A.pub

COMMENT: rename as cluster_B.pub

7. Authorize Cluster_B to mount file systems owned by Cluster_A


mmauth add cluster_B -k cluster_B.pub

9. Define cluster name, contact nodes and public key for cluster_A
mmremotecluster add cluster_A -n nsd_A1,nsd_A2 -k Cluster_A.pub

8. Authorize Cluster_B to mount a particular FS owned by Cluster_A


mmauth grant cluster_B -f /dev/fsA

10. Identify the FS to be accessed on cluster_A


mmremotefs add /dev/fsAonB -f /dev/fsA -C Cluster_A -T /fsAonB

11. mount FS locally


mmmount /dev/fsAonB

Communication Between Clusters


All nodes in both clusters must have TCP/IP connectivity between each other; this is used only for "daemon to daemon" (n.b., mmfsd) communication via the GPFS protocol. This does not allow remote shell (e.g., ssh or rsh) access between nodes; remote users can not access the home cluster by this mechanism. OpenSSL guarantees secure communications.

Contact Nodes
The contact nodes are used only when a remote cluster first tries to access the home cluster; one of them sends configuration information to the remote cluster after which there is no further communication. It is recommended that the primary and backup cluster manager be used as the contact nodes.

Subnet vs. Multi-Cluster


Combining Subnets with Multi-Clusters to Support Multiple Fabrics
The Myrinet LAN spans clusters C1 and C2 Common Ethernet LAN spans all nodes

Cluster C2 x336-01 x336-02 x336-03 x336-04 x336-05 x336-06 x336-07 x336-08 x336-09 x336-10 x336-11 x336-12 x346-13
nsd server
A1 A2 A3

Ethernet Switch
COMMENTS: common Ethernet 2 high speed subnets multiple NSD servers

Cluster C3 x3550-15 x3550-16 x3550-17 x3550-18 x3550-19 x3550-20 x3550-21 x3550-22 x3550-23 x3550-24 x3550-25

The IB LAN spans clusters C1 and C3

M y r i n e t S w i t c h

Configuration Alternatives Single cluster


subnets

3 clusters (as shown)


subnets multi-cluster

Open question: Which one is best?

I n f i n i b a n d S w i t c h

Cluster C1
A4 A5 A6 A7 A8 A9 A10 A11 A12

x3550-26 x3650-27
nsd server

x346-14
nsd server

RAID RAID Controller Controller FC Disk Controller


FC SAN Switch

x3650-28
nsd server

Legacy System

New System

WARNING: The disk controller can only be mounted by nodes with the same OS (i.e., the NSD servers either must all be AIX or they must all be Linux).

GPFS is Available for AIX, Linux and Windows


GPFS has been designed so that its architecture, commands, programming APIs and other GPFS specific entities are nearly identical for AIX, Linux and Windows. Same source code, but...
Linux requires building something called the "portability layer"; it does not change the kernel. Mapping Unix permissions to Windows The core of GPFS continues to operate on Unix UID/GID values. Windows GPFS nodes perform the task of mapping to Windows SIDs: explicit Unix-Windows ID maps are defined in Active Directory; implicit (default) maps for Windows SIDs are created from a reserved range of UID/GID values; and unmapped Unix IDs are cast into a foreign domain for Windows. Explicit maps persist only in the Active Directory. Implicit maps persist in the file system.

GPFS under Windows can be used only as a client.


GPFS 3.2 supports Windows Server 2003 R2 SP2 x64 GPFS 3.3 supports Windows Server 2008 SP2 x64

Mixed OS Clusters
A single GPFS cluster or GPFS multi-cluster can have nodes running under AIX, Linux and/or Windows at the same time! Restriction
All LUNs for a particular file system must run under the same OS Corollary: the NSD nodes for a given file system must run under same OS Special request (i.e., RPQ) is required to use Windows for an NSD server

9. GPFS Environment

GPFS does not exist in isolation; it must be integrated with other components when designing an overall solution. The following pages look at selected hardware and software components (disk controllers, disks and storage servers) commonly used with GPFS file systems.

Tested Scaling Limits


Largest File System
tested - 4 PB architectural limit - 2^99

Largest Tested Cluster


AIX
tested - 1530 nodes over 128 nodes requires review by IBM

Linux tested
3500+ nodes in a multi-cluster (including one cluster with 2560 nodes) 4000+ nodes in a single cluster

over 512 nodes requires review by IBM

Tested Disk Products


GPFS v2.3 - v3.2

AIX
DCS9900, DCS9550
DS3000, DS4000, DS5000 series systems
ESS and DS8000 series (i.e., shark)
SAN Volume Controller (V1.1, V1.2, V2.1)
7133 Serial Disk System (i.e., SSA)
EMC Symmetrix DMX (FC attach only)
Hitachi Lightning 9900 (HDLM required)

COMMENT: GPFS does not rely on the SCSI persistent reserve for failover. This reduces the risk associated with using non-tested storage controllers with GPFS. However, starting with GPFS 3.2, SCSI persistent reserve is available as an option on NSD servers under AIX (see mmchconfig).

9910, 9960, 9970V, 9980V

Linux DCS9900, DCS9550 DS3000, DS4000, DS5000 series systems EMC Symmetrix DMX 1000 with PowerPath v3.06 or v3.07
See http://publib.boulder.ibm.com/clresctr/library/gpfsclustersfaq.html for the complete list of tested disk. This is NOT the only disk that will work with GPFS. In general, any reasonable block device will work with GPFS. According to the FAQ page, the "GPFS support team will help customers who are using devices outside of this list of tested devices, to solve problems directly related to GPFS, but not problems deemed to be issues with the underlying device's behavior including any performance issues exhibited on untested hardware." Before adopting such devices for use with GPFS, customers are urged to first run proof of concept tests.

Redundant Array of Inexpensive Disk (RAID)


RAID Concept and definitions
RAID Set: a set of disks containing user data and parity information
Parity: information used to reconstruct user data lost when a disk in a RAID group fails

Common RAID Levels:
RAID 3, 4+P: user data and parity information kept on separate disks
RAID 5, 4+P: user data and parity information interleaved across the disks
RAID 6*, 4+2P: an extension of RAID 3 or RAID 5 with a second disk's worth of parity

Other RAID Levels:
JBOD: Just a Bunch Of Disk
RAID 0: striping without redundancy (e.g., 4+0P)
RAID 1: mirroring
RAID 10: striping across mirrored groups (m1m1 - m2m2 - m3m3 - m4m4)

* Since a RAID 6 group has 2 redundant disks, it is common to make the RAID group larger (e.g., 8+2P)
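To illustrate why (the drive size here is hypothetical): with 1 TB drives, an 8+2P RAID 6 group delivers 8 TB of usable capacity from 10 TB raw (80% efficiency), while a 4+2P group delivers only 4 TB from 6 TB raw (about 67%).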

Redundant Array of Inexpensive Disk (RAID)


Comments on Terminology and Best Practices

RAID Sets
Different OEM vendors use different names for the grouping of disks that I am calling a "RAID Set". I am using this term as a generic alternative. DDN (DCS9000): "Tier" IBM (DS8000): "Array"
"array sites" and "ranks" are closely related terms

LSI (DS3000, DS4000, DS5000): "Array"

LUNs
A LUN (logical unit) is an entry in /dev for Unix based OSs examples
Linux: /dev/sdb AIX: /dev/hdisk2

A rose by any other name has just as many thorns. ;->

Best Practice: LUNs vs. RAID sets


There should be 1 LUN per RAID set.
Multiple LUNs per RAID set can lead to "LUN thrashing", where the seek arm "bounces" between LUNs to service uncoordinated requests.
Older versions of Linux and storage controller microcode did not support LUN sizes large enough to accommodate today's large disks. This forced users to configure multiple LUNs per RAID set. This issue applies to the DS3000, DS4000, DS8000 and DCS9000.
Storage controllers from some other vendors are designed to create very large RAID sets presented as multiple LUNs to the OS as a best practice.

All examples in this presentation are configured with 1 LUN per RAID set.

Disk Technology
FC, SAS, SCSI: Enterprise Class
different protocols, same mechanical standards
Rotational speed: 15 Krpm
Common drive sizes: 300 GB, 450 GB, 600 GB
90% duty cycle; MTBF = 2.0 MHour4
Single drive IOP performance, no caching1: 420 IOP/s
Single drive BW2
cache disabled: write = 50.8 MB/s, read = 95.4 MB/s
cache enabled: write = 154.6 MB/s, read = 123.6 MB/s

SATA/2: Cost Optimized ("Cheap and Cheerful")
Rotational speed: 7200 rpm
Common drive sizes: 750 GB, 1 TB, 2 TB
Duty cycle is generally ignored now; MTBF < 1.6 MHour with 50% duty cycle4, MTBF < 0.9 MHour with 90% duty cycle4
Single drive IOP performance, no caching1
with command tag queueing: 120 IOP/s
without command tag queueing: 60 to 70 IOP/s
Single drive BW3
cache disabled: write = 18.5 MB/s, read = 59.2 MB/s
cache enabled: write = 30.3 MB/s, read = 74.9 MB/s

Some Risks to be Considered
Probability of "dual disk failures"
SATA: TBD
enterprise class storage: 1 failure / 73 yrs (calculation based on a binomial distribution with ~= 6200 SSA disks)
Sensitivity to "head wobble", compounded by very high density drives

Footnotes:
1. IOP rates assume 4K records
2. DS4800: dd buffer size = 1024K, cache block size = 16K, segment size = 256K
3. DS4700: dd buffer size = 1024K, cache block size = 16K, segment size = 64K
4. MTBF based on Hitachi disk

Comments on the Proper Use of SATA


Market segment:
"scalable SATA storage for small to midrange businesses at an affordable price" "for data archival, data reference, and near-line storage applications" (e.g., lower tier storage in HSM)

Risk Management Strategies


do not use in large, I/O intensive storage systems or as first tier storage partition storage into several file systems or storage pools
reduces collateral damage, but compromise peak data rate for an application

for GPFS, designate FC disk as metadataOnly and SATA disk as dataOnly


reduces risk of metadata loss and increases access rate to metadata

adopt RAID6 configuration

Storage Controllers

HPC systems requiring high bandwidth and/or large capacities use disk controllers to manage external disk. Common IBM choices for HPC include DS3000
low cost of entry

DS4000
DS4800 replaced by the DS5300 DS4700 is higher end product cf. DS3000, yet with a low cost of entry

DS5000
DS5300 balanced streaming and IOP performance DS5100 lower performance with a lower cost of entry

DS8000
DS8300 provides very high reliability with good IOP performance

DCS9000
designed specifically for HPC optimizing streaming BW and capacity

DS3000 Series

DS3200
3-Gbps SAS connect to host
Direct-attach
For System x
2U, 12 disks
Dual Power Supplies
Support for SAS or SATA disks
Expansion via EXP3000
Starting under $4,500 US

DS3400
4-Gbps Fibre connect to host
Direct-attach or SAN
For System x & BladeCenters
2U, 12 disks
Dual Power Supplies
Support for SAS or SATA disks
Expansion via EXP3000
Starting under $6,500 US

COMMENT: The DS3300 provides an iSCSI interface for the same basic hardware. It's not commonly used with GPFS.

DS3400
Example Configuration
[Diagram: GbE connections to client nodes via an Ethernet switch; NSD Server-01 and NSD Server-02 (x3650 M2, 8 cores, 6 DIMMs) each attach to the Ethernet switch via TbE/GbE and to the DS3400 (Controller-A/Controller-B) via 2 x FC4; the DS3400-01 and three EXP3000-01 drawers (ESM-A/ESM-B) each hold 12 disks (SAS or SATA)]

NSD Server: x3650 M2
- at least 8 cores, at least 6 GB RAM
- 2 dual-port 4 Gb/s FC HBAs (2xFC4): at most 760 MB/s per adapter
- single 10 GbE (TbE) adapter per node: at most 725 MB/s per adapter (recommend Myricom TbE)

Disk Controller: DS3400 with EXP3000
- Peak sustained performance (theoretical)
  - streaming: write < 700 MB/s, read < 900 MB/s
  - IOP rate: write < 4500 IOP/s, read < 21,000 IOP/s
- 12 disks per DS3400 plus 12 disks per EXP3000, up to 48 disks
- Example: 15 Krpm SAS disks @ 450 GB/disk
  - 4 x 4+P RAID 5 + 2 hot spares (optimize streaming performance): raw ~= 10 TB, usable ~= 7 TB
  - 9 x 4+P RAID 5 + 3 hot spares (optimize IOP performance): raw ~= 21 TB, usable ~= 16 TB
- Example: SATA disks @ 1 TB/disk
  - 4 x 8+2P RAID 6 + 2 hot spares (optimize capacity): raw ~= 42 TB, usable ~= 32 TB

WARNING: The DS3400, while relatively fast and inexpensive, is not well suited for large configurations, especially when using SATA drives. RAID array rebuilds are common in large configurations (e.g., 10 x DS3400s with 480 drives), especially with SATA, and a DS3400's performance is significantly compromised during a rebuild. Therefore, in file systems aggregated across many DS3400s, a RAID array rebuild will be in progress at almost any given time, and the expected value of file system performance will be significantly less than the maximum possible sustained rates.

DS3400
Benchmark Results
GPFS Parameters
blocksize = 256K or 1024K pagepool = 1G maxMBpS = 2000
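For reference, pagepool and maxMBpS are cluster configuration parameters, while the blocksize is fixed when the file system is created; a minimal sketch of how values like those above are typically applied and verified (the file system name and descriptor file are hypothetical):

  mmchconfig pagepool=1G
  mmchconfig maxMBpS=2000
  mmcrfs /gpfs fs3400 -F disk.desc -B 256K   # or -B 1024K; the block size cannot be changed after creation
  mmlsconfig                                 # verify the cluster-wide settings
  mmlsfs fs3400 -B                           # verify the file system block size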

Bandwidth Scaling
BW per 4+P RAID 5 array using 15 Krpm disks (read cache: ON with default read ahead; write cache: ON)

RAID sets    write (MB/s)    read (MB/s)
1            239             294
2            471             615
4            655             860
8            647             875

DS3400 Parameters
RAID 5 array = 4+P segment size = 64K or 256K cache page size = 16K read ahead = default write cache = enabled write cache mirroring = disabled

This is not a best practice

The most efficient streaming occurs using only 4 arrays.

Streaming Job*
record size = 1024K file size = 4G number of tasks = 8 access pattern = seq
8 RAID sets

IOP Job*
record size = 2K total data accessed = 1G number of tasks = 16 access pattern = small file

         stream*        IOP*
write    650 MB/s       19,100 IOP/s
read     875 MB/s       23,000 IOP/s


Benchmark System: These results were produced using the configuration from the previous page with 16 client nodes (x3550) connected to a LAN via GbE.

* Configuration was optimized differently for each test. Benchmark tool: ibm.v4a. Theoretical max IOP rates for the DS3400: cached < 96,000 IOP/s (512 B transactions); uncached: write < 4,200 IOP/s, read < 19,000 IOP/s (4 KB transactions).

DS5000 Series
Controller Support Modules (Fans, power supplies)

DS5300
model 1818-53A

Controllers Interconnect Module (batteries, midplane) Power/cooling Controllers (ESMs)

4u

EXP5000
model 1818-D1A

3u

Drives FC/SATA

16 Disks per Disk Enclosure

DS5300
Controller/Enclosure Overview
Dual, redundant RAID controllers Dual, redundant power, battery backup and fans Internal busses (theoretical)
PCI-E x8 simplex rate = 2 GB/s

Controller A

Host Side Connections PCI-E x8

Controller B

Host side connections (measured)

16 host-side connections FC4 < 380 MB/s FC8 < 760 MB/s Active/passive architecture

Drive side connections (measured)


16 drive-side connections FC4 < 380 MB/s

Supported enclosures
EXP5000 (FC switched, 4 Gb/s) EXP810 (FC switched, 2 Gb/s) Maximums 28 enclosures, 16 disks per enclosure 448 disks

Drive Side Connections FC4


stack 1 stack 2 stack 3 stack 4

Disk Technology
15 Krpm FC disk (300, 450 GB); SATA (750, 1000 GB)
Peak sustained rates (theoretical)
- Streaming (to media, requires 192 x 15Krpm drives): write < 6 GB/s, read < 6 GB/s
- IOP rate (to media, requires 448 x 15Krpm drives): write < 45,000 IOP/s, read < 172,000 IOP/s

Loop Pair Enclosure Stacks

stack 5 stack 6 stack 7 stack 8

A loop pair is a set of redundant drive side cables as shown in this diagram. A stack is a set of enclosures along a loop pair. A DS5300 supports at most 4 x EXP5000 enclosures per stack, but never more than 28 x EXP5000 total. Be sure to balance the number of enclosures across the stacks.

DS5300
Rearview

Controller A

Controller B

Annotated photo of controller B

Host interface cards Eight host ports

Power connection

Serial connection

Dual Ethernet connections

Eight drive ports

EXP5000
16 drives in 3U enclosure 4 Gbps FC interfaces / ESMs
High-speed, low-latency interconnect from controllers to drives

FC - in FC 1B

Supports intermixing FC and SATA drives Unique speed-matching technology


3 Gbps SATA II drives effectively run at 4 Gbps speeds
FC -out FC 1A

Switched architecture
Higher performance, lower latency Drive isolation, better diagnostics

RoHS compliant NEBS level 3 certified


FC - in FC 1A

*
logical layout

FC 1B FC -out

FOOTNOTES: ESM A is the primary path for the odd drives; ESM B is the primary path for the even drives.

If an ESM fails, the other ESM can access all of the drives.

1B

1A

ESM A

Only use highlighted ports


(EXP5000 does not support "trunking")

ESM B
1A 1B

DS5300
Cabling and Disk to Array Mapping

Careful attention must be given to cabling and disk to array mapping on the DS5300 in order to guarantee optimum streaming performance. This issue is less significant for IOP performance.

WARNINGS:

Default array mappings (e.g., created by SMclient) are not guaranteed to be optimum! Rules and best practices for the DS4800 do not always apply to the DS5300.

See file /c/my_stuff/storage/DS5000/XBB2_Data_Flow.ppt for more examples.

DS5300
Drive Side Cabling - 8 Enclosures
Balance*: Best streaming performance is achieved using a multiple of 8 x EXP5000 drawers with the same number of drawers per stack. Optimum performance is achieved using 8, 16 or 24 drawers.
If ignored, performance penalty ~= 25%; it does not affect IOP rates.

ESM A

ESM A

2
ESM B

ID: 11

ESM B

ID: 25

ESM A

controller A

ESM A

4
ESM B

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

ID: 31

ESM B

ID: 45

DS5300

ESM A
GbE GbE

1 2 3 4
ESM B

ESM A

6
ESM B

1 2 3 4 5 6 7 8

controller B
ID: 65

ID: 51

ESM A

ESM A

8
ESM B

ID: 71

ESM B

ID: 85

Stacks
When attaching enclosures, drive loops are configured as redundant pairs (i.e., loop pairs) utilizing one port from each controller; the enclosures along a loop pair are called a stack.
1 2 3 4 5 6 7 8

Tray ID
Tray ID is assigned during system configuration. The values are not arbitrary. Best practice: 10's digit: stack number 1's digit: ordinal number within a stack

DS5300
Drive Side Cabling - 16 Enclosures
1
Stack # Tray ID #
ID: 11
ESM B

ESM A

Balance*: Best streaming performance is achieved using a multiple of 8 x EXP5000 drawers with the same number of drawers per stack. Optimum performance is achieved using 8, 16 or 24 drawers.
If ignored, performance penalty ~= 25%; it does not affect IOP rates.

ESM A

2
ESM B

ID: 25

ESM A

ESM A

ID: 12

ESM B

ID: 26

ESM B

ESM A

ESM A

4
ESM B

ID: 31

ESM B

ID: 45

ESM A

ESM A

controller A
ID: 32
ESM B

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

ID: 46

ESM B

DS5300

ESM A
GbE GbE

1 2 3 4
ESM B

ESM A

6
ESM B

1 2 3 4 5 6 7 8

controller B
ID: 65
ESM A

ID: 51
ESM A

ID: 52

ESM B

ID: 66

ESM B

ESM A

ID: 71

ESM B

ESM A

Another Tray ID Best Practice: Start the 1's digit in the odd numbered stacks at 1 and in the even numbered stacks at 5. We do this because there can be up to 4 drawers in a stack.

ESM A

8
ESM B

ID: 85

ESM A

ID: 72

ESM B

ID: 86

ESM B

DS5300
Drive Side Cabling and Disk to Array Mapping
controller A controller B

4xFC4 XOR ASIC 4xFC4 4xFC4

XOR ASIC

loop switches*

4xFC4

FC4 | FC8

FC4 | FC8

FC4 | FC8

FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports In this example there is 1 drawer per stack. You can have at most 4 drawers per stack, but not more than 28 drawers total. Stack numbers highlighted by the yellow box.
4 1 2 3

host ports

Array Assignments: An array is a set of disks belonging to a RAID group. Arrays are assigned to (i.e., owned by) a single controller. Optimum performance requires careful attention being given to assigning disks to arrays and assigning arrays to the controllers.
E S M A E S M B

5 6 7 8

Optimum Both vertical paths or both diagonal paths can be active at same time.

x x

Remember: by default ESM-A accesses odd disks ESM-B accesses even disks This is independent of controller preference.

Sub-Optimum A vertical path and a diagonal path can not both be active at the same time.

Loop Switches

DS5300
Data Flow Example #1A
controller A prefers odd slots in stacks 1, 3, 5, 7 even slots in stacks 2, 4, 6, 8 controller B prefers even slots in stacks 1, 3, 5, 7 odd slots in stacks 2, 4, 6, 8 4xFC4

XOR ASIC data block 1 FC4 | FC8 FC4 | FC8

XOR ASIC

4xFC4 p loop switches 2 4 FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 ABABABABABABABAB

host ports

data path for array X


2 BABABABABABABABA 3 ABABABABABABABAB

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 BABABABABABABABA 5 ABABABABABABABAB 6 BABABABABABABABA 7 ABABABABABABABAB 8 BABABABABABABABA

1 2 3 4 5 6 7 8

11 1 25 2 31 3 45 4 51 p 65 71 85

In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
E S M A E S M B

Mapping Disks to Array Rule: Assign disks to arrays diagonally with 1 per tray as shown. Array Ownership Rule: Assign array to controller accessing the first disk in the array.

Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller B Tray protected ("barber pole"), optimum performance

DS5300
Data Flow Example #1B
controller A prefers odd slots in stacks 1, 3, 5, 7 even slots in stacks 2, 4, 6, 8 controller B prefers even slots in stacks 1, 3, 5, 7 odd slots in stacks 2, 4, 6, 8 XOR ASIC

XOR ASIC data block 1 2 FC4 | FC8 FC4 | FC8

data block p loop switches 2 1 4 3 p FC4 | FC8 FC4 | FC8

3 4

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 ABABABABABABABAB

host ports

data path for array X data path for array Y

2 BABABABABABABABA 3 ABABABABABABABAB

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 BABABABABABABABA 5 ABABABABABABABAB 6 BABABABABABABABA

1 2 3 4 5 6 7 8

11 1 25 2 31 3 45 4 51 p 65 71 85

1 2 3 4 p

7 ABABABABABABABAB 8 BABABABABABABABA

In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.
E S M A E S M B

Mapping Disks to Array Rule: Assign disks to arrays diagonally with 1 per tray as shown. Array Ownership Rule: Assign array to controller accessing the first disk in the array.

Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller B Tray protected, optimum performance

DS5300
Sample Complete Disk to Array Mappings for Example #1

4+P RAID 5, Tray Protected with optimum performance


11 25 31 45 51 65 71 85 A1 B24 A23 B22 A21 B20 A19 B16 B2 A1 B24 A23 B22 A21 B20 A19 A3 B2 A1 B24 HS B22 A21 B20 B4 A3 B2 A1 HS HS B22 A23 A7 B4 A5 B2 A1 HS HS B24 B8 A7 B6 A5 B2 A3 HS HS A9 B8 A7 B6 A5 B4 A3 HS B10 A11 B8 A7 B6 A5 B4 A3 A13 B12 A11 B8 A9 B6 A5 B4 B14 A13 B12 A11 B10 A9 B6 A7 A17 B14 A13 B12 A11 B10 A9 B8 B18 A17 B14 A15 B12 A11 B10 A9 A19 B18 A17 B16 A15 B12 A13 B10 B20 A19 B18 A17 B16 A15 B14 A13 A23 B20 A21 B18 A17 B16 A15 B14 B24 A23 B22 A21 B18 A19 B16 A15

Arrays owned by Controller A Arrays owned by Controller B

8+2P RAID 6, Tray Protected with optimum performance


11 12 25 26 31 32 45 46 51 52 65 66 71 72 85 86 A1 B24 A23 B20 A21 B18 A17 B16 A15 B12 A11 B10 A9 B6 A7 B4 B2 A1 B24 A23 B22 A21 B18 A17 B16 A15 B12 A11 B10 A9 B8 A7 A3 B2 A1 B24 A23 B22 A21 B18 A17 B16 A15 B12 A13 B10 A9 B8 B4 A3 B2 A1 B24 A23 B22 A21 B18 A17 B16 A15 B14 A13 B10 A9 A7 B4 A3 B2 A1 B24 A23 B22 A21 B18 A19 B16 A15 B14 A13 B10 B8 A7 B4 A3 B2 A1 B24 A23 B22 A21 B20 A19 B16 A15 B14 A13 A9 B8 A7 B4 A5 B2 A1 B24 HS B22 A21 B20 A19 B16 A15 B14 B10 A9 B8 A7 B6 A5 B2 A1 HS HS B22 A21 B20 A19 B16 A15 A13 B10 A11 B8 A7 B6 A5 B2 A1 HS HS B22 A21 B20 A19 B16 B14 A13 B12 A11 B8 A7 B6 A5 B2 A1 HS HS B22 A21 B20 A19 A17 B14 A13 B12 A11 B8 A7 B6 A5 B2 A3 HS HS B22 A23 B20 B18 A17 B14 A13 B12 A11 B8 A7 B6 A5 B4 A3 HS HS B24 A23 A19 B18 A17 B14 A13 B12 A11 B8 A9 B6 A5 B4 A3 HS HS B24 B20 A19 B18 A17 B14 A13 B12 A11 B10 A9 B6 A5 B4 A3 HS HS A23 B20 A19 B18 A17 B14 A15 B12 A11 B10 A9 B6 A5 B4 A3 HS B24 A23 B20 A19 B18 A17 B16 A15 B12 A11 B10 A9 B6 A5 B4 A3 There is little need for hot spares under RAID 6. The 16 "extra" disks could be configured as 2 x 4+4 RAID 10 arrays.
It is an open question as to whether these should be used as GPFS metadataOnly disks. While their capacity is more than enough, they may not be able to sustain the required metadata IOP rates for the other 24 arrays under IOP intensive workloads. But they could be used as a cache under GPFS ILM for frequently accessed files.

Best Practice: Adopt tray protection using the following configurations. 8 trays using 4+P RAID 5 or 4+2P RAID 6 16 trays using 4+P or 8+P RAID 5, or 8+2P RAID 6

This is generally called a "barber pole" configuration.

DS5300
Data Flow Example #2A
controller A prefers all slots in stacks 1, 3, 5, 7 controller B prefers all slots in stacks 2, 4, 6, 8

XOR ASIC data block 1 3 p FC4 | FC8 FC4 | FC8

4xFC4

XOR ASIC

loop switches

4xFC4 2 4 FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 AAAAAAAAAAAAAAAA

host ports While this configuration may not lead to optimum streaming performance, it is generally good enough for many application environments and is easy to configure. The adoption of an FC4 switched drive side network has significantly reduced the negative impact of this configuration's performance compared to the DS4500 and DS4800.
E S M A E S M B

data path for array X


2 BBBBBBBBBBBBBBBB 3 AAAAAAAAAAAAAAAA

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 BBBBBBBBBBBBBBBB 5 AAAAAAAAAAAAAAAA 6 BBBBBBBBBBBBBBBB 7 AAAAAAAAAAAAAAAA 8 BBBBBBBBBBBBBBBB

1 2 3 4 5 6 7 8

11 1 2 3 4 p 1 2 3 4 p 25 1 2 3 4 p 1 2 3 4 p 31 45 51 65 71 85

Mapping Disks to Array Rule: Horizontally and contiguously assign disks to the same array. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller A); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller B)

Array X, Y: 4+P RAID 5, owned by controller A Array A, B: 4+P RAID 5, owned by controller B Horizontal volume: performance is "good enough"

In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.

DS5300
Data Flow Example #2B
controller A prefers all slots in stacks 1, 3, 5, 7 controller B prefers all slots in stacks 2, 4, 6, 8

XOR ASIC data block 1 2 3 4 p FC4 | FC8 FC4 | FC8

XOR ASIC

loop switches

data block 2 1 4 3 p FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 AAAAAAAAAAAAAAAA

host ports While this configuration may not lead to optimum streaming performance, it is generally good enough for many application environments and is easy to configure. The adoption of an FC4 switched drive side network has significantly reduced the negative impact of this configuration's performance compared to the DS4500 and DS4800.
E S M A E S M B

data path for array X data path for array B

2 BBBBBBBBBBBBBBBB 3 AAAAAAAAAAAAAAAA

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 BBBBBBBBBBBBBBBB 5 AAAAAAAAAAAAAAAA 6 BBBBBBBBBBBBBBBB 7 AAAAAAAAAAAAAAAA 8 BBBBBBBBBBBBBBBB

1 2 3 4 5 6 7 8

11 1 2 3 4 p 1 2 3 4 p 1 2 3 4 p 25 1 2 3 4 p 1 2 3 4 p 1 2 3 4 p 31 45 51 65 71 85

Mapping Disks to Array Rule: Horizontally and contiguously assign disks to the same array. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller A); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller B)

Array X, Y: 4+P RAID 5, owned by controller A Array A, B: 4+P RAID 5, owned by controller B Horizontal volume: performance is "good enough"

In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.

DS5300
Data Flow Example #2C
controller A prefers all slots in stacks 2, 4, 6, 8 controller B prefers all slots in stacks 1, 3, 5, 7

XOR ASIC data block 7 FC4 | FC8 FC4 | FC8 1 p

XOR ASIC

data block 2 q 1 3 p 2 4 q loop switches 3 5 4 6 5 7 6 8 FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 BBBBBBBBBBBBBBBB

host ports

data path for array W data path for array Z

2 AAAAAAAAAAAAAAAA 3 BBBBBBBBBBBBBBBB

E S M A

E S M B

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 AAAAAAAAAAAAAAAA 5 BBBBBBBBBBBBBBBB 6 AAAAAAAAAAAAAAAA 7 BBBBBBBBBBBBBBBB 8 AAAAAAAAAAAAAAAA

1 2 3 4 5 6 7 8

11 25 31 45 51 65 71 85

1 5 p 3 7 1 5 p 3 7 2 6 q 4 8 2 6 q 4 8 3 7 1 5 p 3 7 1 5 p 4 8 2 6 q 4 8 2 6 q

In analyzing this access pattern, remember the even/odd preference of the drawers. Regardless of controller preference, odd disks are accessed by ESM A and even disks are accessed by ESM B.

Mapping Disks to Array Rule: Distribute the disks uniformly, horizontally and contiguously across stacks 1, 3, 5, 7 xor 2, 4, 6, 8. Array Ownership Rule: Arrays in stacks 1, 3, 5, 7 are assigned to the same controller (e.g., controller B); arrays in stacks 2, 4, 6, 8 are assigned to the same controller (e.g., controller A). (I swapped the controllers around here to make a point.)

Array W, X: 8+P+Q RAID 6, owned by controller B Array Y, Z: 8+P+Q RAID 6, owned by controller A Stack oriented: performance is "good enough"

DS5300
Sample Complete Disk to Array Mappings for Example #2C

8+2P RAID 6, Stack Oriented with "good enough" performance


11 25 31 45 51 65 71 85 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A7 B8 A5 B6 A5 B6 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A11 B12 A11 B12 A9 B10 A9 B10 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 HS HS HS HS HS HS HS HS

Arrays owned by Controller A Arrays owned by Controller B


Alternatively, use the "extra" disks to create 4+4 RAID 10 arrays instead of using them as hot spares. However, be sure to distribute them in a more optimal pattern across the trays, such as a "barber pole" distribution.

8+2P RAID 6, Stack Oriented with "good enough" performance


11 12 25 26 31 32 45 46 51 52 65 66 71 72 85 86 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A1 B2 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A3 B4 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A5 B6 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A7 B8 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A9 B10 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 A11 B12 A9 B10 A9 B10 A13 B14 A13 B14 A13 B14 A13 B14 A11 B12 A11 B12 A11 B12 A11 B12 A15 B16 A15 B16 A13 B14 A13 B14 A13 B14 A13 B14 A13 B14 A13 B14 A15 B16 A15 B16 A15 B16 A15 B16 A15 B16 A15 B16 A15 B16 A15 B16 A17 B18 A17 B18 A17 B18 A17 B18 A17 B18 A17 B18 A17 B18 A17 B18 A17 B18 A19 B20 A19 B20 A19 B20 A19 B20 A19 B20 A19 B20 A17 B18 A21 B22 A19 B20 A19 B20 A21 B22 A21 B22 A19 B20 A19 B20 A21 B22 A23 B24 A21 B22 A21 B22 A21 B22 A21 B22 A21 B22 A21 B22 A23 B24 A23 B24 A23 B24 A23 B24 A23 B24 A23 B24 A23 B24 A23 B24 A23 B24 HS HS HS HS HS HS HS HS HS HS HS HS HS HS HS HS

COMMENT: Hot spares vs. RAID 6. There is little need for hot spares with RAID 6, but in an 8 tray configuration there is not room for another 8+2P RAID 6 array; therefore, configure the other 8 disks as a 4+4 RAID 10 array. In the 16 tray configuration there is room for 1 more 8+2P RAID 6 array, but this would create an imbalance that will hurt GPFS performance; therefore configure the other 16 disks as 2 x 4+4 RAID 10 arrays.

DS5300
Data Flow Example #3
controller A prefers odd slots in all stacks controller B prefers even slots in all stacks

XOR ASIC data block 4xFC4

4xFC4

XOR ASIC

contention FC4 | FC8 FC4 | FC8

1 2

3 4

4xFC4 5 loop switches FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1 ABABABABABABABAB

host ports

data path for array X


2 ABABABABABABABAB 3 ABABABABABABABAB

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 ABABABABABABABAB 5 ABABABABABABABAB 6 ABABABABABABABAB 7 ABABABABABABABAB 8 ABABABABABABABAB

Optimum Both vertical paths or both diagonal paths can be active at same time.

1 2 3 4 5 6 7 8

11 25 31 45 51 65 71 85

1 2 3 4 p

x x

Array X: 4+P RAID 5, owned by controller A Vertical volume (contention on loop switches)

Disk to Array Mapping Mistake By vertically assigning all disks to array X, contention is created.

Sub-Optimum A vertical path and a diagonal path can not both be active at the same time.

Loop Switches

DS5300
Data Flow Example #4
controller A controller B

data block data block contention FC4 | FC8 FC4 | FC8 1 2

4xFC4

XOR ASIC

3 4

4xFC4 p contention loop switches 1 2 3 4 p FC4 | FC8 FC4 | FC8

8 7 6 5

4 3 2 1

8 7

6 5

4 3

2 1

drive ports

1 2

3 4

5 6

7 8

1 2 3 4

5 6 7 8

host ports
1

host ports

data path for array X data path for array Y

2 3

Stack # Tray ID
1 2 3 4 5 6

Slot #
7 8 9 10 11 12 13 14 15 16

4 5 6

1 2 3 4 5 6 7 8

11 1 25 2 31 3 45 4 51 p 65 71 85

1 2 3 4 p

7 8

Array X: 4+P RAID 5, owned by controller A Array Y: 4+P RAID 5, owned by controller A

Array Ownership Mistake By assigning array Y to controller A, contention is created.

DS5300
Example Configuration
Ethernet Switch:
TbE: GPFS, GbE: Administration NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
TbE FC8

Performance Analysis
DS5300 streaming data rate
256 x SATA or 128 x 15Krpm disks: write < 4.5 GB/s, read < 5.5 GB/s

DS5300 IOP rate


256 x SATA disks: write < 3600 IOP/s, read < 12,000 IOP/s 128 x 15Krpm disks: write < 9,000 IOP/s, read < 36,000 IOP/s

GbE

GbE

potential aggregate TbE rate: 8 x TbE < 5.6 GB/s

GbE

GbE

TbE

FC8

725 MB/s per TbE is possible, but 700 MB/s is required

potential aggregate FC8 rate: 8 x FC8 < 6.0 GB/s


780 MB/s per FC8 is possible, but 700 MB/s is required

GbE

GbE

NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs

TbE

FC8

8 x FC8

controller A

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

DS5300
TbE FC8
GbE GbE

The GbE administrative network is not illustrated in this diagram.

1 2 3 4
SAN switch not required

GbE

GbE

1 2 3 4 5 6 7 8

controller B

GbE

GbE

NSD Server-05 x3650 M2 8 cores, 6 DIMMs NSD Server-06 x3650 M2 8 cores, 6 DIMMs

TbE

FC8

Disk Drawers
option #1: 128 x 15Krpm FC disk option #2: 256 x SATA disks

TbE

FC8

GbE

GbE

GbE

GbE

NSD Server-07 x3650 M2 8 cores, 6 DIMMs NSD Server-08 x3650 M2 8 cores, 6 DIMMs

TbE

FC8

TbE

FC8

COMMENT: This is a "safe" configuration in the sense that meeting projected performance rates can reasonably be expected (n.b., there are more than enough servers, FC8 and TbE ports to do the job). If HBA failover is required, then 8 dual port HBAs may be adopted (thereby requiring a SAN switch). If 2xFC8 adapters are adopted, then peak performance can be maintained during failure conditions.

GbE

GbE

DS5300
Benchmark Results

To Be Completed

These and other measurements to be validated using GPFS


Peak Write Rates
Write cache enabled, mirroring disabled : TBD Write cache enabled, mirroring enabled, FSWT enabled: TBD Write cache disabled: TBD

Peak Read Rates


Read cache enabled: TBD

DS5020
The DS5020 is an upgrade from the DS4700 Its performance profile is roughly equivalent to the earlier generation DS4800
peak streaming rates < 1500 MB/s

It supports a maximum of 112 disks It uses the EXP5000 disk drawers with FC and/or SATA disk Compared with the DS5300
RAID 6 overhead comprises a greater percentage of processing time (e.g., ~= 25%) cf. the DS5300 (e.g., 10%) Write cache mirroring is not as effective

Best practice: use increments of 2 drawers

DS3000/DS5000 Series Comparison


DS model Max disks Internal disks Host interfaces Drive interfaces Max cache memory IOPS from cache read2 IOPS from disk read2 IOPS from disk write2 BW from cache read (MB/s)2 BW from disk read (MB/s)2 BW from disk write (MB/s)2 RAID Levels FC 15 Krpm SAS 15 Krpm SATA 7.2 Krpm Disk Enclosures
DS32001 48 12 SAS @ 3 Gb/s 6 ports SAS @ 3 Gb/s 2 ports DS34001 48 12 FC @ 1/2/4 Gb/s 4 ports SAS @ 3 Gb/s 2 ports 2 GB 120,000 21,500 4,600 1630 940 725 0, 1, 3, 5, 6, 10
146, 300, 450 @ 3Gb/s 750, 1000 GB @ 3Gb/s
EXP3000

DS5020 112 16

DS5100 256 0 FC @ 4/8 Gb/s 8 ports 4 Gb/s 16 ports 8 GB 700,000 75,000 20,000 3200 3200 2,500 0, 1, 3, 5, 6, 10
146, 300, 450 @ 4Gb/s

DS5300 448 0 FC @ 4/8 Gb/s 16 ports 4 Gb/s 16 ports 16 GB 700,000 172,000 45,000 6,400 6,400 5,300 0, 1, 3, 5, 6, 10
146, 300, 450 @ 4Gb/s 750, 1000 GB @ 4Gb/s
EXP5000, EXP810

2 GB

96,000 19,000 4,200 1670 895 695 0, 1, 3, 5, 6, 10


146, 300, 450 @ 3Gb/s 750, 1000 GB @ 3Gb/s EXP3000

750, 1000 GB @ 4Gb/s


EXP5000, EXP810

1. Not intended for use in large capacity storage systems. Best practices suggest not using more than 4 units under the control of a single file system. 2. Data rates are reported as peak theoretical values and are not feasible in a production environment; they are intended for comparison purposes only. 3. This refers specifically to the 1814-72A.

DCS9900

4u 45u

Couplet Front View

Couplet
"dual RAID controller"

Disk Enclosure
2u

Rear View of "Half of a Couplet"


A single couplet can support up to 1200 disks (i.e., 2 frames).

10 Disk Enclosures per Frame 60 Disks per Enclosure 4u per Enclosure

DCS9900
Couplet Dual RAID controller design Active/active design 5 GB of cache RAID level: 8+2P RAID 6 only 8 host side ports
FC8 or IB 4x DDR2

20 drive side ports


3 GB/s SAS connections facilitate fast RAID rebuild
Can sustain up to 4 RAID rebuilds at the same time without a noticeable impact on performance.

Disk Trays Up to 60 disks per tray Up to 20 trays (1200 disks) per couplet Supports SAS and SATA
SAS: 450 GB SATA: 1 TB, 2 TB* Peak Performance (theoretical+) write < 4.5 GB/s (4M/transaction) read < 5.9 GB/s (4M/transaction)

2 TB @ 7200 RPM will be available in Q1/10

Streaming (using 300 x SATA disks)


FOOTNOTES: Currently, 2 TB drives are only available at 5400 RPM. They will be available at 7200 RPM in Q1/10. These are upper bound rates based on lab measurements using specialized tuning parameters and workload assumptions. They demonstrate what the DCS9900 can do. Actual performance rates will not exceed these values.

IOP Rates
40,000 IOP/s (4K/transaction)

DCS9900
Physical View
Illustrations shown using 60-bay enclosures (model 3S1). IBM also supports a 16-bay enclosure though it is seldom used.

3 x 60-slot Trays SAS xor SATA 150 disks Capacity


300 TB using 2 TB SATA 66 TB using 450 GB SAS

5 x 60-slot Trays SAS xor SATA 150 to 300 disks Capacity


600 TB using 2 TB SATA 135 TB using 450 GB SAS

10 x 60-slot Trays SAS and/or SATA 150 to 600 disks Capacity


1200 TB using 2 TB SATA 270 TB using 450 GB SAS

20 x 60-slot Trays SAS and/or SATA 150 to 1200 disks Capacity


2400 TB using 2 TB SATA 540 TB using 450 GB SAS

COMMENT: To maximize performance per capacity, peak performance can be achieved using as few as 160 x 15 Krpm SAS drives or with 300 SATA drives. To minimize cost per capacity, the number of drives can be increased up to 1200.

DCS9900
Controller Overview
DCS9900 RAID configuration
8+2P RAID 6 Data accessed using a "byte striping algorithm"

Supported sector sizes are


512, 1024, 2048, 4096 bytes GPFS only supports 512 bytes

DCS9900
controller C1

Hostside Connections 8 x FC8 8 x IB (DDR2)


DCS9900
controller C2

COMMENTS
- Recommend creating only 1 LUN per tier for GPFS
- Parity is computed for each write I/O operation
- Parity is checked for each read I/O operation

Driveside Connections 20 SAS Loops 4 Gb/s

Tier 1

A A A

B B B

C C C

D D D

E E E

F F F

G G G

H H H

P P P

P P P

Tier 2

Tier 3

8+2P RAID 6

DCS9900
Configuration and Parameter Explanation and Guidelines
DCS9900 Cache Organization
There is 2.5 GB of cache per RAID controller, for a total of 5 GB. The cache page size is a configurable parameter; set it using the command "cache size=<int>".
Valid choices are 64, 128, 256, 512, 1024, 2048, 4096 (units are in KB).

Best Practice: set the cache size to 1024, 2048 or 4096


Optimum streaming performance occurs for any of these values; if the GPFS blocksize is 2M or 4M, then cache size = 1024, 2048 or 4096 gives the same performance (n.b., a cache size smaller than the GPFS blocksize is OK!)

Setting the OS transfer size
- Set the cache page size >= the OS transfer size. Best Practice: set them to the same value.
- AIX: chdev -l fcs<int> -a max_xfer_size=<hex value>
  default = 0x100000 (i.e., 1 MB)
- Linux: set the max_sectors_kb parameter to the DCS9900 cache size; it is located in /sys/block/<SCSI device name>/queue/max_sectors_kb (typical SCSI device names are sdb, sdc, sdd, sde, ...). These changes are not persistent, therefore they must be reset after every reboot, as in the sketch below.
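Because the setting is lost at reboot, it is commonly reapplied from a boot-time script; a minimal sketch, assuming a 1024 KB DCS9900 cache page size and hypothetical device names sdb through sde:

  #!/bin/bash
  # reapply max_sectors_kb to the DCS9900 LUNs after every reboot (e.g., from /etc/rc.local)
  for dev in sdb sdc sdd sde; do
      echo 1024 > /sys/block/$dev/queue/max_sectors_kb
  done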

DCS9900
Configuration and Parameter Explanation and Guidelines
Write Caching: Write Back vs. Write Thru
Enabling write back caching instructs the DCS9900 to write data blocks to cache and return control to the OS; data in the cache is actually written to disk later. If write thru caching is enabled, all data is written both to cache and disk before control is returned to the OS. Enable write back caching using the command "cache writeback=on". Warning: if a controller fails, all data in its cache will be lost, possibly before it is written to disk.

This can corrupt the file system since metadata can be lost. Adopt proper risk management procedures if write back caching is enabled.

Enable write thru by setting "cache writeback=off". Best Practice: disable write back caching (n.b., this will significantly degrade write performance, but it avoids the metadata-loss risk described above).

Cache coherence
The DCS9900 has the concept of "LUN ownership" or "LUN affinity": the controller in the couplet that created a LUN owns that LUN. Both controllers in a DCS9900 couplet can see a given LUN (even though only one of them created it) iff cache coherence is enabled. Cache coherence generally causes minimal performance degradation. Best Practice: enable cache coherence using the command "dual coherency=on".

DCS9900
Configuration and Parameter Explanation and Guidelines

Read Caching: Prefetch


DCS9900 read caching uses a prefetch algorithm. The setting of this parameter depends on the GPFS block allocation map setting.
Best practice: set the GPFS block allocation map to scatter (assumes that the number of nodes > 8 or the number of LUNs > 8).
scatter vs. cluster: There are 2 block allocation map types for GPFS, scatter and cluster. scatter randomly distributes file blocks over the LUN; cluster writes file data in clusters of contiguous disk blocks. COMMENT: There is no guarantee that contiguous file blocks on disk will be accessed in the same order that they are mapped to disk. This problem is exacerbated by increasing "disk entropy" through repeated create/delete cycles. Scatter guarantees uniform performance for multitask jobs accessing a common file.
Set the MF bit using the command "cache mf=on" and disable prefetching using the command "cache prefetch=0"; since file blocks are randomly distributed, prefetching hurts performance.

DCS9900
Configuration and Parameter Explanation and Guidelines
User Data vs. Meta Data
Best Practice: Segregate User Data and Meta Data
- The DCS9900 does streaming well, but it handles randomly distributed small transactions less well. Since metadata transactions are small, segregating user data and metadata can improve performance for metadata intensive operations.
- Caveats and warnings: this is most beneficial in environments with significant metadata processing, and there must be enough dedicated metadataOnly LUNs, on controllers with good enough IOP rates, to keep pace with the DCS9900 LUNs.

FC Drivers
- Linux: GPFS uses the qla2400 driver for FC access to the DCS9900 (this driver is AVT, not RDAC); it supports the DCS9900 active:active access model.
- AIX: GPFS uses the MPIO driver in failover mode; it only supports an active:passive access model.

Linux Multipathing
While not officially supported, customers familiar with Linux multipathing report being able to get it to work with GPFS and the DCS9900.

DCS9900
Configuration and Parameter Explanation and Guidelines
Summary of Selected DCS9900 Best Practice Settings
- 1 LUN per tier
- LUN block size = 512 (set interactively when the LUN is created)
- dual coherency=on
- cache size=1024 (assumes OS transfer size = 1 MB)
- cache writeback=off
- cache mf=on
- cache prefetch=0
- ncq disabled
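Collected as a single controller-side session, these settings correspond to the commands quoted earlier in this section; a minimal sketch (the NCQ setting and the interactive LUN creation are listed above, but their exact commands are not reproduced here):

  cache size=1024
  cache writeback=off
  cache mf=on
  cache prefetch=0
  dual coherency=on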

Summary of Selected GPFS Best Practice Settings


- pagepool >= 256M
- maxMBpS ~= 2X to 3X the LAN connection data rate
- maxblocksize = 4096K
- blocksize = 4M (assumes the objective is to optimize streaming access)
- if feasible, segregate user data and meta data
- set GPFS block allocation = scatter (i.e., mmcrfs -j scatter)
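A minimal sketch of how these GPFS settings might be applied, assuming a hypothetical file system gpfs1, a hypothetical descriptor file dcs9900.desc, and a LAN connection of roughly 1.5 GB/s per server (hence maxMBpS=4000):

  mmchconfig pagepool=256M,maxblocksize=4096K
  mmchconfig maxMBpS=4000
  mmcrnsd -F dcs9900.desc
  mmcrfs /gpfs gpfs1 -F dcs9900.desc -B 4M -j scatter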

DCS9900
Logical Configuration

The following page illustrates an optimum scheme for LUN to port mapping (i.e., zoning), for selecting primary and backup servers for each LUN, and for cabling. These criteria assume an NSD configuration, defined later in these slides. The schemes are based on the following design criteria:

1. Guarantee that all controller ports and all HBA ports are uniformly active.
n.b., the DCS9900 supports an active:active protocol

2. If a NSD server fails, its backup server can access its LUNs. 3. Consider LUNs associated with a given HBA or controller port. If an HBA or controller port fails, GPFS failover can access the associated LUNS over alternative paths from a backup NSD server for a given LUN in a balanced manner (i.e., do not access all of the affected LUNs from a single NSD server).
This balance condition applies to performance under degraded conditions. It results in a slightly more complex logical configuration.

4. If one of the controllers in a couplet fails, the file system remains viable using backup NSD servers to access the LUNs of the failed controller over the other controller.
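A hedged sketch of how criteria 2 and 3 show up in the NSD disk descriptors (all device and server names below are hypothetical): the primary servers are rotated across the LUNs, and the backup assignments are staggered so that if one server or path fails, its LUNs are picked up by several different backups rather than by a single node.

  # format: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
  /dev/dcslun01:nsd01:nsd02:dataAndMetadata:1
  /dev/dcslun02:nsd02:nsd03:dataAndMetadata:1
  /dev/dcslun03:nsd03:nsd04:dataAndMetadata:1
  /dev/dcslun04:nsd04:nsd01:dataAndMetadata:1
  # ... remaining LUNs continue the same rotation across all NSD servers and controller ports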

DCS9900
A Proper Logical Configuration

Logical IB Connections*

NSD Servers
primary 01 12 29 backup 64 02 13 30 69 03 14 35 70 04 15 36 73 05 16 41 74 06 17 42 79 07 18 51 80 08 19 52 85 09 20 57 86 10 21 58 91 11 22 63 92 P1 P2

Logical FC8 Connections*

DCS9900 Couplet

Controller C1 - Zoning
ports External LUN lables p1 01 03 05 07 09 11 13 15 17 19 21 -p2 25 27 29 31 33 35 37 39 41 43 45 -p3 49 51 53 55 57 59 61 63 65 67 69 -p4 73 75 77 79 81 83 85 87 89 91 93 -30 64 02 66 04 38 06 40 36 70 08 -10 44 12 46 42 74 14 76 16 78 18 50 -80 20 82 22 84 -56 52 86 54 88 26 90 28 62 58 92 60 94 32 -34 68

I B L A N

25 primary 36 01 backup 60

26 37 02 65

27 38 07 66

28 39 08 75

29 40 13 76

30 41 14 81

31 42 19 82

32 43 20 87

33 44 53 88

34 45 54 93

35 46 59 94

P1 P2

primary

49 60 03 backup 32

50 61 04 37

51 62 09 38

52 63 10 43

53 64 15 44

54 65 16 77

55 66 21 78

56 67 22 83

57 68 25 84

58 69 26 89

59 70 31 90

P1 P2

Controller C2 - Zoning
ports External p1 02 04 06 14 16 18 p2 26 28 30 38 40 42 p3 50 52 54 62 64 66 p4 74 76 78 86 88 90 LUN lables 08 10 12 20 22 -32 34 36 44 46 -56 58 60 68 70 -80 82 84 92 94 -29 63 01 65 03 37 05 39 35 69 07 -09 43 11 45 41 73 13 75 15 77 17 49 -79 19 81 21 83 -55 51 85 53 87 25 89 27 61 57 91 59 93 31 -33 67

primary

73 84 05 backup 40

74 85 06 45

75 86 11 46

76 87 12 49

77 88 17 50

78 89 18 55

79 90 27 56

80 91 28 61

81 92 33 62

82 93 34 67

83 94 39 68

P1 P2

COMMENTS: Couplet 9900A: 89 tiers; external LUN labels: 001..022, 025..046, 049..070, 073..094, 097. Couplet 9900B: 90 tiers; external LUN labels: 001..022, 025..046, 049..070, 073..094, 097, 098. In order to improve manageability, skip external LUN labels 23, 24, 47, 48, 71, 72, 95, 96. Controller C1 owns the odd LUNs, controller C2 owns the even LUNs. In order to allow controller failover it is necessary to enable cache coherence (command: dual coherency=on).

Accessed by primary NSD server

Accessed by backup NSD server

DCS9900
The Actual Logical Configuration
DCS9900 Couplet
01 03 05 ... 85 87 02 04 06 ... 86 88 hdisks 2..89 LUNs 01..88 P1 P2

Controller C1 - Zoning
ports Tiers p1 01 03 05 02 04 06 p2 01 03 05 02 04 06 p3 01 03 05 02 04 06 p4 01 03 05 02 04 06 ... ... ... ... ... ... ... ... 85 86 85 86 85 86 85 86 87 88 87 88 87 88 87 88

I B L A N

01 03 05 ... 85 87 02 04 06 ... 86 88 hdisks 2..89 LUNs 01..88

P1 P2

01 03 05 ... 85 87 02 04 06 ... 86 88 hdisks 2..89 LUNs 01..88

P1 P2

Controller C2 - Zoning
ports Tiers p1 01 03 05 02 04 06 p2 01 03 05 02 04 06 p3 01 03 05 02 04 06 p4 01 03 05 02 04 06 ... ... ... ... ... ... ... ... 85 86 85 86 85 86 85 86 87 88 87 88 87 88 87 88

01 03 05 ... 85 87 02 04 06 ... 86 88 hdisks 2..89 LUNs 01..88

P1 P2

COMMENTS:
GPFS is configured in SAN mode. Since dual coherency=OFF for these tests, P1 sees only the LUNs owned by controller 1 and P2 sees only the LUNs owned by controller 2, so there is no HBA failover. This is simply a configuration error; it does not affect performance.

Controller 1 "owns" LUNs 01, 03, 05, ..., 87 Controller 2 "owns" LUNs 02, 04, 06, ..., 88 LUNs 89, 90 were not used. dual coherency = OFF

DCS9900
Example Configuration
Ethernet Switch (Administration)
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
DCS9900 (2U) RAID Controller C1
8 x FC8 host connections

IB 4xDDR

2xFC8

1 2
host ports drive ports

3 4
host ports

GbE GbE

GbE

GbE

GbE

GbE

IB 4xDDR

2xFC8 SAN switch not required

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

60-Bay Disk Tray (4U)

GbE

GbE

NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

2xFC8

IB 4xDDR

2xFC8

Disk trays
Minimum required to saturate DCS9900 BW

GbE

GbE

o o o o

160 x SAS disks: 72 TB or 300 x SATA disks: 300 TB

IB Switch (GPFS)
The 2xFC8 HBAs can be replaced by dual port 4xDDR IB HCAs using SRP. The IB host ports can either be directly attached to the servers or connected to a dedicated IB SAN switch. It is also possible to use an IB switch for a combined LAN and SAN, but this has been discouraged in the past. As a best practice, it is not recommended to use an IB SAN for more than 32 ports.
* These are consistent, well-formed 4K transactions. A typical GPFS small-transaction workload has mixed transaction sizes resulting from metadata transactions.

60-Bay Disk Tray (4U)

COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.

Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s* 4xDDR IB HCA (Host Channel Adapter) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1400 MB/s 2xFC8 (dual port 8 Gbit/s Fibre Channel) Potential peak data rate per 2xFC8 < 1500 MB/s Required peak data rate per 2xFC8 < 1400 MB/s

DCS9900
Benchmark Results
GPFS Parameters
blocksize(streaming) = 4096K blocksize(IOP) = 256K pagepool = 1G maxMBpS = 4000
COMMENT: The disparity between read and write performance observed below is much less pronounced when using 15Krpm SAS drives. For example, using 160 SAS drives: write ~= 5700 MB/s, read ~= 4400 MB/s. This disparity can be removed by using cluster block allocation for SATA disk, but this is not recommended.

DCS9900 Parameters
8+2P RAID 6 SATA cache size = 1024K cache prefetch = 0 cache writeback = ON

4 NSD Servers, no GPFS clients


P6-p520, 4 cores, 4.2 GHz, 8 GB RAM 2xFC8

Streaming Job: record size = 4M, file size = 32G, number of tasks = 1 to 16, access pattern = seq
IOP Job: record size = 4K, total data accessed = 10G, number of tasks = 32, access pattern = small file (4K to 16K)

Tiers   Streaming write (MB/s)   Streaming read (MB/s)   IOP* write (IOP/s)   IOP* read (IOP/s)
1       270                      220                     7,500                3,800
4       790                      710                     13,500               5,900
8       1400                     1200                    30,000               27,300
16      2700                     1600                    30,400               27,300
32      4800                     2900                    41,000               33,500
64      5400                     3600                    -                    -

DS8000 Series
P6-p595 FC Ports
8 x 4 Gb/s FC ports per system ~= 3 GB/s per system
Fan Sence RPC

DS8300 (Base Unit)


Drive Set

Drive Set

PPS

Drive Set

Drive Set

P6-p595 128 core, 256 GB RAM


8

Management Console Keyboard Display Ethernet Ports

P6-p595 128 core, 256 GB RAM

S A N S W I T C H

PPS
Controller 4-way

Controller 4-way

Battery
4 4 4 4
I/O Adapters Enclosure I/O Adapters Enclosure

Battery
I/O Adapters Enclosure I/O Adapters Enclosure

DS8300 ANALYSIS
Peak BW read < 3500 MB/s write < 1900 MB/s duplex < 2300 MB/s FC Ports 16 @ 4 Gb/s Disks 128
max per base unit

Battery

300 GB/disk @ 15Krpm raw capacity ~= 38 TB

COMMENT: The DS8000 RAID architecture is RAID 5 organized in a combination of 6+P and 7+P RAID sets. While this makes configuration easier, it hurts GPFS streaming performance.

Which Storage System is Best?

Which storage system is the best? What is the best number of disks? What is the best size of disks?

There is no unequivocal answer to these questions. The next page presents a feature comparison with a heuristic evaluation of these feature's value to HPC applications.

Which Storage System is Best for HPC?

Feature Comparison
Feature                        DS3400           DS5300           DS8300           DCS9900
streaming BW                   good             best             acceptable       best
IOP rate                       acceptable       best             best             good
performance:capacity (1)       best             good             acceptable       good
fast RAID rebuild              no               no               no               yes
RAID 6 support                 yes              yes              yes              yes
parity check on read           yes              yes              yes              yes
controller organization (2)    active/passive   active/passive   active/active    active/active
disk technology                SAS, SATA        FC, SATA         FC, FATA         SAS, SATA
max number of disks            48               448              1024             1200
floor space utilization        acceptable       best             acceptable       best
remote mirroring               no               yes              yes              no
RAID N+P where N = 2^k (3)     yes              yes              no               yes
Footnotes: 1. The performance:capacity ratio assessment is based on the minimum number of disks commonly deployed in order to achieve peak streaming BW. Increasing capacity behind a controller will decrease this ratio. See the analysis on the following pages. 2. Most storage controllers are based on a "dual RAID controller" design in order to avoid single point of failure risks. The RAID controllers are generally associated with the RAID sets in either an active/passive or active/active organization. 3. RAID architecture is described using the expression N+P, where N is the number of data disks and P is the number of parity disks in a RAID set. For optimum GPFS performance, N = 2^k. This category declares whether N = 2^k.

Storage Servers

There are many options for storage servers (i.e., NSD servers) with GPFS clusters. The following pages provide examples illustrating some of the more common choices.

P6-p520
System Architecture

V DIMM V DIMM V DIMM V DIMM

Nova

P6 DCM
4.2 GHz

P6 DCM
4.2 GHz

Nova

V DIMM V DIMM V DIMM V DIMM

Secondary GX Bus (3:1)

Burst simplex < 5600 MB/s duplex < 11200 MB/s Sustained simplex < 4400 MB/s duplex < 6800 MB/s

8:1 GX

Primary GX Bus (4:1) Options 2 x TbE 2 x GbE 4 x GbE

p5-IOC-2
I/O Bridge
Obsidian

Burst simplex < 4200 MB/s duplex < 8400 MB/s Sustained simplex < 3400 MB/s duplex < 5000 MB/s

NOTE: Requires 4 core configuration in order to enable the "direct" GX slot.

G X

G X

P C I E 8X

P C I E 8X

P C I E 8X

P C I X

P C I X

SAS/SATA DVD Tape RAID Controller

2.0 2.0

"Direct" GX Slot

"Pass thru" GX Slot (GX+ only)

"DIRECT" GX Slot IB cards are only supported in this slot. Card options:
dual port, IB 12xSDR @ 6:1 ratio (GX+) dual port, IB 12xDDR @ 3:1 ratio (GX++) RIO2 card @ 8:1 (GX+)

12x IB ports 1X and 4X cables


requires special "width changer" cable

"Pass Thru" GX Slot The pass thru GX slot occupies the same physical space as the 1st PCI-E slot. Therefore you can not use both of these slots. Supports the RIO2 card @ 8:1 (GX+). It does not support IB card. Single PCI Adapter Data Rates PCI-E 8x: Simplex: Burst < 2000 MB/s, Sustained < 1400 MB/s Duplex: Burst < 4000 MB/s, Sustained < 2100 MB/s PCI-X 2.0
Burst < 2000 MB/s, Sustained < 1400 MB/s (this is not a duplex protocol)

Overview: The P6-p520 is a cost-effective storage server for GPFS in most pSeries clusters using Ethernet. This diagram illustrates those features most useful to its function as a storage server. Alternative Solution: The P6-p550 can be used in place of the P6-p520. It provides the same number of I/O slots and the same bandwidth, but it also has more CPUs; GPFS does not need these extra CPUs, therefore the P6-p520 is recommended.

GX Bus width: 32 bits Rules of thumb:


Sustained simplex rates < 80% of simplex burst rate Sustained duplex rates < 60% of duplex burst rate single SDR "lane" burst < 250 MB/s, sustained < 185 MB/s single DDR "lane" burst < 500 MB/s, sustained < 375 MB/s

P6-p520
RIO Architecture
COMMENT:
This data rate analysis is based on the assumption that the G30 connects to Secondary GX Bus.

6:1 GX Bus on 4.2 GHz system Burst simplex < 2800 MB/s duplex < 5600 MB/s Sustained simplex < 2200 MB/s duplex < 3400 MB/s

Physical Dimensions
Height: 4U Width: 9.5 inches
2 G30s fit side by side in a 19 inch rack
IB PCI-X2 12x PCI-X2 IB PCI-X2 12x PCI-X2

2 x IB HCAs

12X IB Ports

Single IB 12X link* Burst simplex < 3000 MB/s duplex < 6000 MB/s Sustained simplex < 2400 MB/s duplex < 3600 MB/s

* The IB link performance is constrained by the GX bus.

7314-G30
IB 12X to PCI-X 2.0 Bridge 0 1 2 3 IB 12X to PCI-X 2.0 Bridge 0 1 2 3

Single PCI-X 2.0 Adapter Burst < 2000 MB/s Sustained < 1400 MB/s

P C I X 2.0

P C I X 2.0

P C I X 2.0

PCI-X 2.0 Slots 64 bit x 266 MHz

P C I X 2.0

P C I X 2.0

P C I X 2.0

P6-p520
Example Configuration
The P6-520 offers only 12xDDR, while 4xDDR is more common, so cables supporting 12xDDR -> 4xDDR conversion are available.

Performance Analysis
DS5300 streaming data rate
448 x SATA or 192 x 15Krpm disks: write < 4.5 GB/s, read < 5.0 GB/s

IB LAN*
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

DS5300 IOP rate


PCI-E #1 #2 #3
2x F C 8

PCI-X2 #4 #5

448 x SATA disks: write < 7,000 IOP/s, read < 24,000 IOP/s 192 x 15Krpm disks: write < 16,000 IOP/s, read < 64,000 IOP/s

GX direct
12X DDR

GX pass-thru

PCI-E #2 #3 PCI-X2 #4 #5
8 x FC8 SAN switch not required

potential aggregate IB rate: 4 x 12xDDR < 5.0 GB/s


1250 MB/s per 12xDDR is possible, and 1250 MB/s is required* (limited by the IPoIB(sp) protocol)

potential aggregate FC8 rate: 8 x FC8 < 6.0 GB/s


P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

750 MB/s per FC8 is possible, but only 625 MB/s is required
#1
2x F C 8

GX direct
12X DDR

GX pass-thru

controller A

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

FC8

DS5300
1 2 3 4
GbE GbE

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

PCI-E #1 #2 #3
2x F C 8

PCI-X2 #4 #5

1 2 3 4 5 6 7 8

controller B

GX direct
12X DDR

GX pass-thru

Disk Drawers
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

#1
2x F C 8

PCI-E #2 #3

PCI-X2 #4 #5

option #1: 192 x 15Krpm FC disks (12 drawers) option #2: 448 x SATA disks (28 drawers)
FOOTNOTES: The peak of 1250 MB/s per 12xDDR IB connection using IPoIB(sp) does not provide an adequate margin of error to harvest the 5 GB/s potential of the DS5300; therefore this solution as shown may provide an aggregate data rate slightly less than 5 GB/s. However, a TbE connection can be added to each node and accessed via GPFS subnets or NFS to more fully utilize the BW potential of the DS5300.

GX direct
12X DDR

GX pass-thru

Ethernet LAN (Administration)


NOTE: Order the P6-p520 in the 19" form factor so that they do not require special frames.

P6-p575
System Architecture
The P6-p575 is used as a storage server in HPC oriented pSeries clusters using Infiniband. This diagram illustrates those features most useful to its function as a storage server.
Only logical connections are illustrated to reduce diagram complexity. There are actually 16 physical connections between the quad groups.

Clock Rate 4.7 GHz

32 P6 Cores

P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
2:1 GX 2:1 GX

P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory

P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
4:1 GX

P6 DCM P6 DCM cache cache memory memory P6 DCM P6 DCM cache cache memory memory
2:1 GX

4:1 GX Bus Data Rates Burst simplex < 4700 MB/s duplex < 9400 MB/s Sustained simplex < 3700 MB/s duplex < 5600 MB/s 2:1 GX Bus Data Rates Burst simplex < 9400 MB/s duplex < 18,800 MB/s Sustained simplex < 7500 MB/s duplex < 11,300 MB/s Technical Notes The GX bus is 32 bits wide Rules of thumb: - Sustained simplex rates ~= 80% of simplex burst rate - Sustained duplex rates ~= 60% of duplex burst rate

4:1 GX 2:1 GX

p5-IOC-2 (A)
"monk"
PCI-E 16x PXI-E 8x P C I E
Top
P C I E or X Bottom

p5-IOC-2 (B)
I/O Bridge
8:1 GX PCI-X

I/O Bridge
PCI-X2

"monk"
PCI-E 8x P C I E
P C I E or X Bottom

PCI-E 16x

PCI-X2

Galaxy-2

Galaxy-2

Galaxy-2

Galaxy-1
T b E T b E
IB 12x IB 12x

disk 0 disk 1 Top

4X 4X IB IB

4X 4X IB IB

4X 4X IB IB

Using the monk adapter eliminates the bottom PCI adapter.

PCI Riser

RIO Ports Only

PCI Riser

Using the monk adapter eliminates the bottom PCI adapter.

8:1 GX Bus for the RIO ports
- Burst: simplex < 2400 MB/s, duplex < 4800 MB/s
- Sustained: simplex < 1900 MB/s, duplex < 2900 MB/s
IB Performance Comments
- 4X DDR IB port BW: simplex < 1500 MB/s, duplex < 2600 MB/s
- Protocol limitations: AIX supports IPoIB(sp), a high performance version of IPoIB: simplex < 1250 MB/s, duplex < 2150 MB/s

COMMENTS: The 8:1 bus servicing the RIO ports severely restricts the data rate possible using 12x SDR IB. While this server has limited I/O connectivity, its I/O BW is outstanding. The monk IB ports in particular provide the greatest potential for high speed I/O.

Galaxy-2
4X 4X IB IB

Obsidian SAS

P6-p575
Physical View
"the whole enchilada"

COMMENT: The monk 4X DDR IB HCAs are not shown. If they were, the bottom PCI-E slot would not be available.

P6-p575
RIO Architecture
GX++ Bus @ 5.0 GHz Burst simplex < 10.0 GB/s duplex < 20.0 GB/s Sustained simplex < 8.0 GB/s duplex < 12.0 GB/s 2 x 12xDDR
Single IB 12X link Burst simplex < 6.0 GB/s duplex < 12.0 GB/s Sustained simplex < 4.0 GB/s duplex < 6.0 GB/s

12X HUB

COMMENTS: A P6-p595 provides 4 GX card slots per node. With 8 nodes per CEC, there is a max of 32 GX cards. Maximum Bandwidth Configuration: attach up to 16 PCI-E drawers in a dual loop configuration (as shown). Maximum Capacity Configuration: attach up to 32 PCI-E drawers in a single loop configuration.

Physical Dimensions Height: 2U Width: 24 inches

12X HUB
Per Planar (HUB and bridge limited) Burst simplex < 10.0 GB/s duplex < 20.0 GB/s Sustained simplex: write < 5.0 GB/s, read < 6.0 GB/s duplex < 9.0 GB/s

Model - 5802
Planar 2

Planar 1

IB 12X to PCI-E 8X Bridge 0 1 2 3

IB 12X to PCI-E 8X Bridge 0 1 2 3

IB 12X to PCI-E 8X Bridge 0 1 2 3

IB 12X to PCI-E 8X Bridge 0 1 2 3

IB 12X to PCI-E 8X Bridge 0 1 2 3

IB 12X to PCI-E 8X Bridge 0 1 2 3

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

P C I E X8

PCI-E X8

Burst rate simplex < 2.0 GB/s duplex < 4.0 GB/s

Sustained rate simplex < 1.2 GB/s duplex < 1.8GB/s

Internal Disks: Support up to 26 SAS SFF drives in 1, 2, or 4 groups.

P6-p575
Example Configuration
Server benchmark test needed.

Performance Analysis
Peak sustained DCS9900 performance
streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s

IB Switch (LAN Only)

GbE Switch (Administration)

IB 4xDDR

P6-p575 NSD Server


TbE TbE GbE GbE

IB 4xDDR IB 4xDDR

IB 4xDDR IB - LAN via IPoIB(sp)


Potential peak data rate per port < 1250 MB/s
limited by the IPoIB(sp) protocol
IB 4xDDR
TbE

P6-p575 NSD Server


TbE GbE GbE

IB 4xDDR IB 4xDDR

Required peak data rate per port < 1400 MB/s. The peak of 1250 MB/s per IB port comes close, but is insufficient to harvest the full BW potential of the couplet. Additional IB ports are needed to fully utilize the BW potential of the couplet.

DCS9900 (2U) RAID Controller C1

1 2
host ports drive ports

3 4
host ports

GbE GbE

IB SAN Switch is not recommended.

IB 4xDDR IB - SAN via SRP


Potential peak data rate per IB port < 760 MB/s Even though the couplet host side ports are IB, they can not exceed 760 MB/s. Therefore, to harvest the full BW potential of the couplet, all 8 host side ports must be used. COMMENT: The P6-p575 is best suited for use as an NSD server in p575 or p595 clusters with an IB based LAN. Otherwise, use the P6-p520.

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

60-Bay Disk Tray (4U)

Disk trays
Minimum required to saturate DCS9900 BW

o o o o

160 x SAS disks: 72 TB or 300 x SATA disks: 300 TB

60-Bay Disk Tray (4U)

The work horse...

x3650 M2
System Architecture
The x3650 M2 is a common and cost-effective storage server for GPFS in System x environments. This diagram illustrates those features most useful to its function as a storage server.

x3650 M2 (2u)
3 DIMMs 3 DIMMs 2 DIMMs

Xeon 5500
Nehalem quad core
PCIe/x16 (512 MB/s1) PCIe/x16 (512 MB/s1)

Xeon 5500
Nehalem quad core

3 DIMMs 3 DIMMs 2 DIMMs

Riser

PCIe x8 PCIe/x8
Riser

Riser options:
1. single PCIe x16 adapter 2. two PCIe x8 adapters 3. two PCI-X 133 MHz adapters

Memory DIMMs: Best performance is achieved using multiples of 6 DIMMs; fewer DIMMs implies greater BW per DIMM. DIMM sizes: 1, 2, 4, or 8 GB. GPFS does not require a large memory capacity for the NSD servers; 6 GB of RAM is adequate if the x3650 M2 is only used as an NSD server.

PCIe x8 PCIe/x8
GbE GbE GbE GbE

I/O Bridge

PCIe/x2 (1 GB/s1) PCIe/x2 (1 GB/s1)

2 optional extra GbE ports

South Bridge

PCIe/x4 (2 GB/s1)

SAS Controller

SAS Backplane
supports upto 12 x 2.5" SAS disks or SSDs

1. Listed bus rates are theoretical duplex rates assuming 512 MB/s per link. Production data rates will be less.
2. Peak duplex rates for PCIe x8 adapters: Gen 1 adapters < 3.2 GB/s, Gen 2 adapters < 6.4 GB/s. These are the data rates as they would be measured from an application perspective; actual data rates with overhead are much greater.
3. Aggregate I/O rate over 4 x PCIe x8 adapters < 10 GB/s
* See http://en.wikipedia.org/wiki/PCIe for details on the PCI Express standard

x3650 M2
Example Configuration
Ethernet Switch: GbE - System Administration
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
DCS9900 (2U) RAID Controller C1
8 x IB DDR connections IB SAN switch not recommended

IB 4xDDR

IB 4xDDR

1 2
host ports drive ports

3 4
host ports

GbE GbE

GbE

GbE

GbE

GbE

IB 4xDDR

IB 4xDDR

DCS9900 (2U) RAID Controller C2

1 2
host ports

drive ports

3 4
host ports

GbE GbE

GbE

GbE

NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

IB 4xDDR

60-Bay Disk Tray (4U)

IB 4xDDR

IB 4xDDR

GbE

GbE

Disk trays
Minimum required to saturate DCS9900 BW

IB Switch (GPFS)

o o o o

160 x SAS disks: 72 TB or 300 x SATA disks: 300 TB

60-Bay Disk Tray (4U)

COMMENTS - DCS9900 Host Connections: Dual port IB 4xDDR HCAs are necessary since the DCS9900 host side ports can deliver at most 760 MB/s. Sharing the LAN-based IB switch is not recommended, especially if there are more than 32 NSD servers. The host ports can either be directly attached to the servers or a separate IB switch can be used. While IB 4xDDR (RDMA) can deliver rates up to 1500 MB/s over a LAN, in practice IB 4xDDR (SRP) delivers closer to 1300 MB/s over a SAN. The peak data rate for this solution may therefore be closer to 5.2 GB/s.

COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.

Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s LAN: 4xDDR IB HCA (RDMA) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1400 MB/s Host connections: 4xDDR IB HCA (SRP) Potential peak data rate per IB connection < 760 MB/s

The road less traveled...

x3550 M2
System Architecture
The x3550 M2 may be a cost effective storage server for GPFS in some cases. Its main limitation is a lack of PCIe slots. This diagram illustrates the features most useful to its function as a storage server.

x3550 M2 (1u)
3 DIMMs 3 DIMMs 2 DIMMs

Xeon 5500
Nehalem quad core
PCIe/x16 (8 GB/s1) PCIe/x16 (8 GB/s1)

Xeon 5500
Nehalem quad core

3 DIMMs 3 DIMMs 2 DIMMs

Riser

PCIe x16
Riser options:
Riser 1. single PCIe x16 adapter 2. two PCI-X 133 MHz adapters

Memory DIMMs: Best performance is achieved using multiples of 6 DIMMs; fewer DIMMs implies greater BW per DIMM. DIMM sizes: 1, 2, 4, or 8 GB. GPFS does not require a large memory capacity in the NSD servers; 6 GB of RAM is adequate if the x3550 M2 is used only as an NSD server.

PCIe x16

I/O Bridge

PCIe/x2 (1 GB/s1) PCIe/x2 (1 GB/s1)

GbE GbE GbE GbE


2 optional extra GbE ports

South Bridge

PCIe/x4 (2 GB/s1)

SAS Controller

SAS Backplane
supports up to 6 x 2.5" SAS disks or SSDs

1. Listed bus rates are theoretical duplex rates assuming 512 MB/s per link; production data rates will be less.
2. Peak duplex rates for PCIe x8 adapters: Gen 1 adapters < 3.2 GB/s, Gen 2 adapters < 6.4 GB/s. These are the data rates as they would be measured from an application perspective; the raw bus rates including protocol overhead are greater.
3. Aggregate I/O rate over 4 x PCIe x8 adapters < 10 GB/s.
* See http://en.wikipedia.org/wiki/PCIe for details on the PCI Express standard

x3550 M2
Example Configuration
Ethernet Switch (Administration)
NSD Server-01
GbE GbE

x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs

TbE

FC8

DCS9900 (2U) RAID Controller C1


8 x FC8 connections SAN switch not required

1 2
host ports drive ports

3 4
host ports

GbE GbE

NSD Server-02
GbE GbE

TbE

FC8

NSD Server-03
GbE GbE

TbE FC8 TbE FC8

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

60-Bay Disk Tray (4U)

NSD Server-04
GbE GbE

x3550 M2 8 cores, 6 DIMMs

NSD Server-05
GbE GbE

x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs

Disk trays
TbE FC8

Minimum required to saturate DCS9900 BW

o o o o

160 x SAS disks: 72 TB or 300 x SATA disks: 300 TB

NSD Server-06
GbE GbE

60-Bay Disk Tray (4U)


TbE FC8

COMMENT: More disks (for a total of 1200) can be added to this solution but it will not increase performance.

NSD Server-07
GbE GbE

TbE

FC8

NSD Server-08
GbE GbE

x3550 M2 8 cores, 6 DIMMs

TbE

FC8

COMMENT: Do not underestimate the I/O capability of the x3550 M2. It has the same buses and motherboard as the x3650 M2, just fewer I/O ports. For example, an NSD server configuration similar to the previous x3650 M2 example could be used effectively instead of the one illustrated in this example.

Peak sustained DCS9900 performance streaming data rate < 5.6 GB/s noncached IOP rate < 40,000 IOP/s TbE (10 Gbit Ethernet Adapter) Potential peak data rate per TbE < 725 MB/s Required peak data rate per TbE < 700 MB/s FC8 (single port 8 Gbit/s Fibre Channel) Potential peak data rate per FC8 < 760 MB/s Required peak data rate per FC8 < 700 MB/s

Blade Center (BC)


Features Useful for Storage Service
Scalability
BC chassis supports up to 14 NSD servers and includes the power, cooling, networking and management infrastructure.
6 x TbE 6 x TbE BC- H 6 x GbE, 3 x TbE 6 x GbE, 3 x TbE COMMENT: This is a maxed out configuration. Typically a smaller number of NSD servers is sufficient.

Management efficiency
Nodes, cabling, switches, and management modules form integrated package Management modules provide a common interface for managing all BC components

Midplane

Space efficiency
A single 9U chassis supports up to 14 NSD servers (plus associated infrastructure!)


Power efficiency
BC power modules are as much as 50% more efficient than the smaller power supplies used in rack-mounted servers.
Midplane 6 x FC8 6 x FC8
Blade
Cores RAM

Price
By comparison, a BladeCenter requires a larger initial investment if only a small storage server infrastructure is needed, but the incremental cost of scaling out is small.

FC8 < 750 MB/s

I/O Ports
TbE module: in: 14 x TbE out: 6 x TbE Mixed Ethernet module: in: 14 x GbE out: 6 x GbE, 3 x TbE FC module: in: 14 x FC4 out: 6 x FC8

HS21 or HS21XM
4 cores RAM: 4 to 8 GB is adequate up to 32 GB is possible

Potential storage I/O bandwidth (BW)


Server to client BW (Ethernet): TbE < 12 GB/s, GbE < 1 GB/s. Server to storage controller BW (FC): FC8 < 8 GB/s. The external FC ports gate the effective BW to 8 GB/s; this means that only 12 of the 14 blades can be used effectively as NSD servers. The "extra" 2 blades can be used as spares or for other purposes.

I/O Ports
PCI-E
2 x TbE

Sustained I/O Rates PCI-E (8x) < 1400 MB/s TbE < 725 MB/s GbE < 80 MB/s PCI-X < 700 MB/s FC4 < 380 MB/s

GbE GbE PCI-X


2 x FC4

BladeCenter Configuration
Using External Nodes as Storage Servers
The recommended best practice for using GPFS with blades is to use external nodes as the NSD servers.

LAN (GbE and TbE)

GbE GbE

ANALYSIS Storage 2 x DS3400


NSD Server-01 x3650 M2 8 cores, 6 DIMMs
TbE 2 x FC4 2 x FC4

GbE

GbE

NSD Server-02 x3650 M2 8 cores, 6 DIMMs

TbE

2 x FC4 2 x FC4

disk: 15Krpm SAS 48 disks 4+P RAID 5 8 arrays + 8 hot spares usable capacity < 14 TB

Controller-A

Controller-B

Blades (56) GbE to each blade


up to 80 MB/s per blade

DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)

Aggregate BW
write < 1300 MB/s read < 1450 MB/s

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

Average BW
write < 20 MB/s per blade read < 25 MB/s per blade

Controller-A

Controller-B

DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

BladeCenter Configuration
Using Blades as NSD Servers
While not as effective as external servers, blades can be used as NSD servers.

LAN (GbE) SAN (FC4) - 24 ports


4 x NSD Server (blades)

ANALYSIS Storage 2 x DS3400


disk: 15Krpm SAS 48 disks 4+P RAID 5 8 arrays + 8 hot spares usable capacity < 14 TB

Blades (56) GbE to each blade


up to 80 MB/s per blade

Aggregate BW
write < 1300 MB/s read < 1450 MB/s

Average BW
write < 20 MB/s per blade read < 25 MB/s per blade

4 x NSD Server (blades)

Controller-A

Controller-B

DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)

4 x NSD Server (blades)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

Controller-A

Controller-B

COMMENT: Given a GbE LAN, it is necessary to use about 20 nodes as NSD servers (see the sizing sketch below). This requires a SAN switch and 1 FC4 per NSD server. NSD servers can also be used to run applications as GPFS clients. Blades have less utility as storage servers due to their more limited I/O capabilities.
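The rough sizing behind the NSD server count above: divide the aggregate storage bandwidth by what a single GbE-attached NSD server can deliver to clients. A minimal sketch, using the ~1450 MB/s aggregate read rate quoted for the 2 x DS3400 and ~80 MB/s per GbE:

    import math

    aggregate_read_MBps = 1450   # 2 x DS3400, read
    per_server_LAN_MBps = 80     # one GbE per blade NSD server

    servers = math.ceil(aggregate_read_MBps / per_server_LAN_MBps)
    print(servers, "NSD servers needed")   # ~19, i.e. about 20 with a little headroom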

4 x NSD Server (blades)

DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

Blade Configuration
Using Blades as NSD Servers
Blades can use GPFS as a SAN file system, but since blade clusters tend to be large, a larger SAN and special SAN tuning is necessary.
LAN (GbE) SAN (FC4) - 64 ports

ANALYSIS Storage 2 x DS3400


disk: 15Krpm SAS 48 disks 4+P RAID 5 8 arrays + 8 hot spares usable capacity < 14 TB

Blades (56) FC4 per blade


up to 380 MB/s per blade

Aggregate BW
write < 1300 MB/s read < 1450 MB/s

Average BW
write < 20 MB/s per blade read < 25 MB/s per blade

Controller-A

Controller-B

DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

Controller-A

Controller-B

COMMENT: Requires a larger SAN (1 FC4 per blade) along with a SAN switch. Set the queue depth to 1 or 2. There are no hard rules saying that GPFS cannot be used for a large SAN, nor are there rules regarding the size of a GPFS SAN, but SANs spanning more than 32 nodes are less common for GPFS.

DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

10. GPFS Configurations

This section contains numerous examples of actual and proposed GPFS configurations illustrating the versatility of GPFS.

They show both ordinary and unusual configurations.

Tinkertoy Computer
On display at Museum of Science, Boston

Disclaimers
The configurations shown in this section are only examples illustrating and suggesting possible GPFS configurations. In some cases, they merely illustrate how systems have been configured, not necessarily how they should be configured. These slides are not intended to be "wiring diagrams"; rather, they illustrate basic concepts when integrating various components into an overall solution. Unless stated otherwise, "feeds and speeds" are based on realistic upper bound estimates as measured by the application, but under ideal benchmarking conditions. Performance will vary "according to actual driving conditions".

Balance
The I/O Subsystem Design Goal

Ideally, an I/O subsystem should be balanced. There is no point in making one component of an I/O subsystem fast while another is slow. Moreover, overtaxing some components of the I/O subsystem (e.g., HBAs) may disproportionately degrade performance.

However, this goal cannot always be perfectly achieved. A common imbalance occurs when capacity is more important than bandwidth; the aggregate bandwidth based on the number of disks may then exceed the aggregate bandwidth supported by the electronics of the controllers and/or the number of HBAs and storage servers.
"Performance is inversely proportional to capacity." -- Todd Virnoche

GPFS Building Blocks

A convenient design strategy for GPFS solutions is to define a "storage building block", which is the smallest increment of storage and servers by which a storage system can grow. A storage solution then consists of 1 or more storage building blocks. This allows customers to conveniently expand their storage solution in increments of building blocks (i.e., a "build as you grow" strategy). This approach is feasible because GPFS scales linearly in the number of disks, storage controllers, NSD servers, GPFS clients, and so forth; a rough sketch of this scaling follows.
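As a rough illustration of the build-as-you-grow idea, the sketch below scales a hypothetical building block linearly; the figures are illustrative placeholders rather than a quote of any specific configuration in this section.

    # Hypothetical building block (illustrative numbers only).
    building_block = {"usable_TB": 14.4, "write_MBps": 1300, "read_MBps": 1600, "nsd_servers": 2}

    def scale(block, n):
        # GPFS scales roughly linearly in building blocks, so multiply each figure by n.
        return {key: value * n for key, value in block.items()}

    for n in (1, 2, 4):
        print(n, "building block(s):", scale(building_block, n))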

But the Building Blocks Are Getting Larger

FC and SAS disks: 450 GB/disk
SATA disks: 2 TB/disk
Storage controllers: 0.5 PB to 1.0 PB
Storage servers: several GB/s
This presents a challenge for smaller storage systems.

Building Block #1A


Performance Optimized
Ethernet Switch
NSD Server-01 x3650 M2 8 cores, 6 DIMMs
TbE 2 x FC4 2 x FC4

NSD Server: x3650-M2


8 Cores, 6 GB RAM 2 dual-port 4 Gb/s FC HBAs (2xFC4)
at most 760 MB/s per adapter

GbE

GbE

Single 10 GbE (TbE) adapter per node


at most 750 MB/s per adapter (Myricom adapter)

GbE

GbE

NSD Server-02 x3650 M2 8 cores, 6 DIMMs

TbE

2 x FC4 2 x FC4

Disk Controller: DS3400 with EXP3000


12 disks per DS3400 plus 12 disks per EXP3000
SAS disk, 450 GB/disk @ 15Krpm

LUNs: 4+P RAID5 sets


Controller-A Controller-B

Aggregate Capacity and Performance


Capacity
48 disks @ 450 GB/disk; raw = 21 TB, usable = 14.4 TB; includes 8 hot spares
This is excessive, but there are only 4 hot spares per DS3400

DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

Performance
streaming: write < 1300 MB/s, read < 1600 MB/s IOP: write < 18,000 IOP/s, read < 22,000 IOP/s

Alternative: Capacity Optimized


Controller-A Controller-B

DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)

Use 4 drawers of 1 TB SATA disk per DS3400 Capacity


84 disks @ 1 TB/disk in 8+2P RAID 6 configuration; raw = 84 TB, usable = 64 TB; includes 4 hot spares. Performance: TBD (perhaps 1200 MB/s?)

ESM-A

ESM-B

EXP3000
12 x 15 Krpm SAS disks (450 GB/disk)

COMMENT:
Usable capacity could be increased to 72 TB using 4+P RAID5 arrays, but this is not a best practice.

Building Block #1A


2 Building Blocks
NSD Server-01 x3650 M2 8 cores, 6 DIMMs
Controller-A Controller-B

TbE

2 x FC8

DS3400-01
12 x 15 Krpm SAS disks (450 GB/disk)

GbE

GbE

ESM-A

ESM-B

GbE

GbE

NSD Server-02 x3650 M2 8 cores, 6 DIMMs NSD Server-03 x3650 M2 8 cores, 6 DIMMs

TbE

2 x FC8

EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)

TbE 2 x FC8 TbE 2 x FC8

Controller-A

Controller-B

GbE

GbE

GbE

GbE

NSD Server-04 x3650 M2 8 cores, 6 DIMMs

COMMENT: Using 2xFC8 per NSD server instead of 4xFC4 per NSD server with a SAN switch simplifies cabling.

DS3400-02
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)

Ethernet Switch

FC8 SAN Switch (24 ports)

FC4

Controller-A

Controller-B

DS3400-03
12 x 15 Krpm SAS disks (450 GB/disk)

ESM-A

ESM-B

Aggregate Capacity and Performance


Capacity
88 disks @ 450 GB/disk; raw < 38 TB, usable < 28 TB; includes 4 hot spares per DS3400
This is excessive, but there are only 4 hot spares per DS3400

EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)

Controller-A

Controller-B

DS3400-04
12 x 15 Krpm SAS disks (450 GB/disk)

Performance
streaming: write < 2500 MB/s, read < 3000 MB/s IOP: write < 35,000 IOP/s, read < 40,000 IOP/s
ESM-A ESM-B

EXP3000
10 x 15 Krpm SAS disks (450 GB/disk)

WARNING: Scaling beyond 2 building blocks (i.e., 4 x DS3400) is not recommended when performance is critical because RAID rebuilds over multiple DS3400s significantly impede performance. If scaling beyond this is required, then deploy multiple GPFS file systems or storage pools to limit the impact of RAID rebuilds.

Building Block #1B


Maximizing Capacity and Harvesting Unused Server Bandwidth
General Idea: x3650 M2 bandwidth is not fully utilized in building block #1a. This larger building block supports 2X more DS3400s in order to efficiently use the server bandwidth.
WARNING: Scaling beyond 1 building block (i.e., 4 x DS3400) is not recommended. See WARNING on previous page.
Controller-A Controller-B

Ethernet Switch
NSD Server-01* x3650 M2 8 cores, 6 DIMMs

Requires tie-breaker disks for quorum if only one building block is deployed.

Controller-A

Controller-B

DS3400-04
12 x SATA disks (1 TB/disk)

IB 4xDDR

2 x FC8
ESM-A ESM-B

GbE

GbE

2 x FC8

EXP3000
12 x SATA disks (1 TB/disk)

GbE GbE

NSD Server-02* x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

2 x FC8 2 x FC8

ESM-A

ESM-B

EXP3000
10 x SATA disks (1 TB/disk)
ESM-A ESM-B

IB Switch SAN Switch (24 ports)


FC4
Controller-A

FC8 FC4

EXP3000
10 x SATA disks (1 TB/disk)

FC4
Controller-B

FC4
Controller-A Controller-B

DS3400-04
12 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A

DS3400-04
12 x SATA disks (1 TB/disk)
ESM-B ESM-A

DS3400-04
12 x SATA disks (1 TB/disk)
ESM-B

EXP3000
12 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A

EXP3000
12 x SATA disks (1 TB/disk)
ESM-B ESM-A

EXP3000
12 x SATA disks (1 TB/disk)
ESM-B

EXP3000
10 x SATA disks (1 TB/disk)
ESM-A ESM-B ESM-A

EXP3000
10 x SATA disks (1 TB/disk)
ESM-B ESM-A

EXP3000
10 x SATA disks (1 TB/disk)
ESM-B

EXP3000
10 x SATA disks (1 TB/disk)

EXP3000
10 x SATA disks (1 TB/disk)

EXP3000
10 x SATA disks (1 TB/disk)

NSD Server: x3650-M2


8 Cores, 6 GB RAM 2 dual-port 8 Gb/s FC HBAs (2xFC8)
at most 1500 MB/s per adapter

Capacity Optimized
Use 4 drawers of 1 TB SATA disk per DS3400 Capacity per DS3400
42 disks @ 1 TB/disk in 8+2P RAID 6 configuration; raw = 42 TB, usable = 32 TB; includes 2 hot spares

Aggregate Capacity
raw = 168 TB, usable = 128 TB; includes 8 hot spares

IB HCA (4xDDR2)
at most 1500 MB/s per HCA

Aggregate Performance
streaming rate < 3 GB/s

Building Block #1A vs. #1B


Performance vs. Capacity
15Krpm FC disk @ 450 GB/disk
Capacity raw < 38 TB usable < 28 TB Performance streaming rate write < 2500 MB/s read < 3000 MB/s IOP rate write < 35,000 IOP/s read < 40,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 89 MB/s / TB read < 107 MB/s / TB IOP rate write < 1250 IOP/s / TB read < 1430 IOP/s / TB Floor Space+

SATA @ 1 TB/disk
Capacity raw < 168 TB usable < 128 TB Performance streaming rate write < 2500 MB/s read < 3000 MB/s IOP rate* write < 15,000 IOP/s read < 20,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 19 MB/s / TB read < 23 MB/s / TB IOP rate write < 117 IOP/s / TB read < 156 IOP/s / TB Floor Space+
Racks (42u x 19"): 1 Usable Capacity per rack: 128 TB/rack

Racks (42u x 19"): 1 Usable Capacity per rack: 29 TB/rack

FOOTNOTES: SATA IOP rates need validation testing (n.b., they are a SWAG ;->). The capacity-per-rack ratio is misleading in this case since a rack is not fully utilized for this solution.
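The performance-to-usable-capacity ratios above are simply the quoted streaming rates divided by usable capacity; a small sketch reproducing them:

    # Streaming rates (MB/s) and usable capacity (TB) quoted for the two building blocks.
    options = {
        "#1A (15Krpm, 450 GB)": {"write": 2500, "read": 3000, "usable_TB": 28},
        "#1B (SATA, 1 TB)":     {"write": 2500, "read": 3000, "usable_TB": 128},
    }

    for name, o in options.items():
        print(f"{name}: write {o['write'] / o['usable_TB']:.0f} MB/s per TB, "
              f"read {o['read'] / o['usable_TB']:.0f} MB/s per TB")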

Building Block #2A


Performance vs. Capacity
IB Switch (GPFS)
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
IB 4xDDR 2xFC8 FC8

The plain vanilla Linux configuration

controller A

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

GbE

GbE

DS5300
1 2 3 4
GbE GbE

GbE

GbE

IB 4xDDR

2xFC8

1 2 3 4 5 6 7 8

controller B

GbE

GbE

NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

2xFC8

Disk Drawers
option #1: 128 x 15Krpm FC disks (8 x EXP5000) option #2: 480 x SATA disks (8 x EXP5060)

IB 4xDDR

2xFC8

Performance Analysis
DS5300 streaming data rate
128 x 15Krpm disks: write < 4.3 GB/s, read < 5.4 GB/s 480 x SATA disks: write < 4.2 GB/s, read < 5.3 GB/s

GbE

GbE

Ethernet Switch (Administration)


FOOTNOTE: The 15Krpm IOP rates assume good locality. Assuming poor locality, these rates could be: write < 9,000 IOP/s, read < 15,000 IOP/s.

DS5300 IOP rate


128 x 15Krpm disks: write < 26,000 IOPs, read < 35,000 IOP/s* 480 x SATA disks: write < 7,000 IOP/s, read < 24,000 IOP/s

potential aggregate IB rate: 4 x 12xDDR < 5.0 GB/s


1500 MB/s per 12xDDR is possible, and 1350 MB/s is required

potential aggregate FC8 rate: 8 x FC8 < 6.0 GB/s


750 MB/s per FC8 is possible, but only 675 MB/s is required

Capacity Analysis
15Krpm FC Disk
128 disks @ 450 GB/disk 24 x 4+P RAID 5 arrays + 8 hot spares raw capacity < 56 TB, usable capacity < 42 TB

SATA disk
480 disks @ 2 TB/disk 48 x 8+2P RAID 6 arrays raw capacity < 960 TB, usable capacity < 768 TB

Building Block #2B


Performance vs. Capacity
The P6-520 offers only 12xDDR, while 4xDDR is more common, so cables supporting 12xDDR -> 4xDDR conversion are available.

The plain vanilla AIX configuration

Performance Analysis
DS5300 streaming data rate
128 x 15Krpm disks: write < 4.3 GB/s, read < 5.4 GB/s 480 x SATA disks: write < 4.2 GB/s, read < 5.3 GB/s

IB LAN*
P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

PCI-E #1 #2 #3
2x F C 8

PCI-X2 #4 #5

DS5300 IOP rate


128 x 15Krpm disks: write < 26,000 IOPs, read < 35,000 IOP/s* 480 x SATA disks: write < 7,000 IOP/s, read < 24,000 IOP/s

GX direct
12X DDR

GX pass-thru

PCI-X2 #4 #5
8 x FC8 SAN switch not required

potential aggregate IB rate: 4 x 12xDDR < 5.0 GB/s


1250 MB/s per 12xDDR is possible, and 1350 MB/s is required

potential aggregate FC8 rate: 8 x FC8 < 6.0 GB/s


750 MB/s per FC8 is possible, but only 675 MB/s is required
controller A

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

PCI-E #1 #2 #3
2x F C 8

8 7 6 5 4 3 2 1 8 7 6 5 4 3 2 1 5 6 7 8

GbE GbE

GX direct
12X DDR

GX pass-thru

FC8

DS5300
1 2 3 4
GbE GbE

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

1 2 3 4 5 6 7 8

controller B

PCI-E #1 #2 #3
2x F C 8

PCI-X2 #4 #5

FOOTNOTE: The 15Krpm IOP rates assume good locality. Assuming poor locality, these rates could be: write < 9,000 IOP/s, read < 15,000 IOP/s.

GX direct
12X DDR

GX pass-thru

Disk Drawers
option #1: 128 x 15Krpm FC disks (8 x EXP5000) option #2: 480 x SATA disks (8 x EXP5060)

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
GbE GbE

#1
2x F C 8

PCI-E #2 #3

PCI-X2 #4 #5

Capacity Analysis
15Krpm FC Disk
128 disks @ 450 GB/disk 24 x 4+P RAID 5 arrays + 8 hot spares raw capacity < 56 TB, usable capacity < 42 TB

GX direct
12X DDR

GX pass-thru

SATA disk Ethernet LAN (Administration)


NOTE: Order the P6-p520 in the 19" form factor so that they do not require special frames.

480 disks @ 2 TB/disk 48 x 8+2P RAID 6 arrays raw capacity < 960 TB, usable capacity < 768 TB

Building Block #2A or #2B


4 Building Blocks - Performance vs. Capacity
o o

NSD Server-01
DS5300-01 Disk Enclosures
FC Disk 8 x EXP5000 128 disks - or SATA Disk 8 x EXP5060 480 disks

o
RACK #4 RACK #3 RACK #2

DS5300-02

NSD Server-02
Disk Enclosures
FC Disk 8 x EXP5000 128 disks

NSD Server-03 NSD Server-04 NSD Server-05 NSD Server-06 NSD Server-07 NSD Server-08 NSD Server-09
I B L A N

RACK #1 client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

- or SATA Disk 8 x EXP5060 480 disks

DS5300-03 Disk Enclosures


FC Disk 8 x EXP5000 128 disks - or SATA Disk 8 x EXP5060 480 disks

DS5300-04 Disk Enclosures


FC Disk 8 x EXP5000 128 disks - or SATA Disk 8 x EXP5060 480 disks

NSD Server-10 NSD Server-11 NSD Server-12 NSD Server-13 NSD Server-14 NSD Server-15 NSD Server-16

client node client node client node client node client node client node

Building Block #2A vs. #2B


4 Building Blocks - Performance vs. Capacity
15Krpm FC disk @ 450 GB/disk
Capacity raw < 224 TB usable < 168 TB Performance streaming rate write < 16 GB/s read < 20 GB/s IOP rate write <104,000 IOP/s read < 140,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 97 MB/s / TB read < 122 MB/s / TB IOP rate write < 620 IOP/s / TB read < 830 IOP/s / TB Racks

SATA @ 1 TB/disk
Capacity raw < 3840 TB usable < 3072 TB Performance streaming rate write < 16 GB/s read < 20 GB/s IOP rate write < 28,000 IOP/s read < 96,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 5.3 MB/s / TB read < 6.7 MB/s / TB IOP rate write < 9.1 IOP/s / TB read < 31 IOP/s / TB Racks
Storage Racks (42u x 19"): 5 Server Racks (42u x 19"): 5

Storage Racks (42u x 19"): 4 Server Racks (42u x 19"): 5

Building Block #3A


Performance Optimized
Performance Analysis
IB Switch (LAN Only)
IB 4xDDR IB 4xDDR
+

GbE Switch (Administration)


P6-p575 NSD Server

IB 4xDDR IB 4xDDR

TbE

TbE

GbE

GbE

IB 4xDDR IB 4xDDR

P6-p575 NSD Server


TbE TbE GbE

3 4

IB 4xDDR IB 4xDDR

GbE

DCS9900 Performance Streaming data rate write < 5.7 GB/s read < 4.4 GB/s Noncached IOP rate (4K transactions) write < 40,000 IOP/s read < 65,000 IOP/s LAN: 4xDDR IB HCA (RDMA)+ Potential peak data rate per port < 1250 MB/s
Limited by the IPoIB protocol.

DCS9900 (2U) RAID Controller C1

1 2
host ports drive ports

GbE GbE

Required peak data rate per port < 700 MB/s


The reason this value is so low is that the NSD servers are configured with 4 IB LAN ports.+

host ports

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

SAN: 4xDDR IB HCA (SRP) Potential peak data rate per host connection < 780 MB/s
Limited by the busses in the DCS9900.

Required peak data rate per host connection < 715 MB/s

60-Bay Disk Tray (4U)

Capacity Analysis
15Krpm FC Disk 160 disks @ 450 GB/disk 16 x 8+2P RAID 6 tiers raw capacity < 70 TB usable capacity < 56 TB
FOOTNOTES: 4 IB LAN ports per NSD server is overkill, but 2 IB LAN ports are not quite enough. Since peak performance is the objective of this design, the "extra" IB LAN ports are recommended. If you need more than 300 SAS disks to meet capacity requirements, a SATA solution may be sufficient; n.b., data is secure on SATA given the DCS9900 RAID 6 architecture.

5 Disk Trays*
Min required to saturate couplet performance

160 x 15Krpm SAS disks

60-Bay Disk Tray (4U)

Building Block #3A


4 Building Blocks, Performance Optimized
GbE administrative network not shown

Analysis
Capacity

IB LAN Switch

DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB
IB 4X DDR

P6p575-01 P6p575-02 P6p575-03 P6p575-04 P6p575-05 P6p575-06

DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB

DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB

P6p575-07 P6p575-08

NSD Servers
DCS9900 Couplet
5 disk trays (160 SAS disks) 450 GB/disk @ 15Krpm usable capacity = 56 TB

raw < 280 TB usable < 224 TB Performance streaming rate write < 20 GB/s read < 16 GB/s IOP rate write < 160,000 IOP/s read < 260,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 91 MB/s / TB read < 77 MB/s / TB IOP rate write < 714 IOP/s / TB read < 1160 IOP/s / TB Racks
Storage Racks (45u x 19"): 2 Server Racks: 1

Building Block #3B


Balanced Capacity/Performance
Ethernet Switch (Administration)
NSD Server-01 x3650 M2 8 cores, 6 DIMMs NSD Server-02 x3650 M2 8 cores, 6 DIMMs
DCS9900 (2U) RAID Controller C1
8 x FC8 host connections

IB 4xDDR

2xFC8

1 2
host ports drive ports

3 4
host ports

GbE GbE

GbE

GbE

GbE

GbE

IB 4xDDR

2xFC8 SAN switch not required

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

60-Bay Disk Tray (4U)

GbE

GbE

NSD Server-03 x3650 M2 8 cores, 6 DIMMs NSD Server-04 x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

2xFC8

IB 4xDDR

2xFC8

GbE

GbE

Close to minimum required to saturate couplet performance

5 Disk trays

300 x SATA disks

IB Switch (GPFS)

60-Bay Disk Tray (4U)

Performance Analysis
DCS9900 Performance Streaming data rate write < 4.8 GB/s read < 3.1 GB/s Noncached IOP rate write < 47,000 IOP/s read < 33,000 IOP/s LAN: 4xDDR IB HCA (RDMA) Potential peak data rate per HCA < 1500 MB/s Required peak data rate per HCA < 1200 MB/s SAN: 2xFC8 (dual port 8 Gbit/s Fibre Channel) Potential peak data rate per 2xFC8 < 1500 MB/s Required peak data rate per 2xFC8 < 1200 MB/s

Capacity Analysis
SATA 300 disks @ 1 TB/disk 30 x 8+2P RAID 6 tiers raw capacity < 300 TB usable capacity < 240 TB

Building Block #3B


4 Building Blocks, Balanced Capacity/Performance
GbE administrative network not shown

Analysis
Capacity raw < 1200 TB usable < 960 TB Performance streaming rate write < 18 GB/s read < 12 GB/s IOP rate write < 180,000 IOP/s read < 130,000 IOP/s Performance to Usable Capacity Ratio streaming rate write < 19 MB/s / TB read < 12 MB/s / TB IOP rate write < 187 IOP/s / TB read < 135 IOP/s / TB Racks
Storage Racks (45u x 19"): 2; Server Racks (42u x 19"): 1

DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB

IB LAN Switch
FC8 IB 4X DDR

x3650 M2 #01 x3650 M2 #02 x3650 M2 #03 x3650 M2 #04 x3650 M2 #05 x3650 M2 #06 x3650 M2 #07 x3650 M2 #08 x3650 M2 #09 x3650 M2 #10 x3650 M2 #11 x3650 M2 #12 x3650 M2 #13 x3650 M2 #14

DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB

DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB

DCS9900 Couplet 5 disk trays (300 disks) 1 TB / SATA disk usable capacity = 240 TB

x3650 M2 #15 x3650 M2 #16

NSD Servers

Building Block #3C


Capacity Optimized
Ethernet Switch (Administration)
NSD Server-01
GbE GbE

x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs x3550 M2 8 cores, 6 DIMMs

TbE

FC8

DCS9900 (2U) RAID Controller C1


8 x FC8 connections SAN switch not required

1 2
host ports drive ports

3 4
host ports

GbE GbE

NSD Server-02
GbE GbE

TbE

FC8

NSD Server-03
GbE GbE

TbE

FC8

DCS9900 (2U) RAID Controller C2

1 2
host ports drive ports

3 4
host ports

GbE GbE

60-Bay Disk Tray (4U)

NSD Server-04
GbE GbE

TbE

FC8

NSD Server-05
GbE GbE

TbE

FC8

NSD Server-06
GbE GbE

TbE

FC8

20 Disk trays: close to minimum required to saturate couplet performance

1200 x SATA disks

60-Bay Disk Tray (4U)

NSD Server-07
GbE GbE

TbE

FC8

NSD Server-08
GbE GbE

TbE

FC8

Performance Analysis
DCS9900 Performance Streaming data rate write < 5.4 GB/s read < 3.5 GB/s Noncached IOP rate* write < 52,000 IOP/s read < 33,000 IOP/s

FC8 (single port 8 Gbit/s Fibre Channel) Potential peak data rate per FC8 < 760 MB/s Required peak data rate per FC8 < 700 MB/s

Capacity Analysis
SATA 1200 disks @ 2 TB/disk 120 x 8+2P RAID 6 tiers
FOOTNOTES: Validation testing needed

TbE (10 Gbit Ethernet Adapter) Potential peak data rate per TbE < 725 MB/s Required peak data rate per TbE < 700 MB/s

raw capacity < 2400 TB usable capacity < 1920 TB

Multi-tiered Storage
Example: Building Blocks 2A and 3B
GbE administrative network not shown

DS5300 12 disk trays (192 disks) 450 GB / disk @ 15Krpm usable capacity = 63 TB

IB LAN Switch
FC8 IB 4X DDR

x3650 M2 #01 x3650 M2 #02 x3650 M2 #03 x3650 M2 #04 x3650 M2 #05 x3650 M2 #06 x3650 M2 #07 x3650 M2 #08 x3650 M2 #09 x3650 M2 #10 x3650 M2 #11 x3650 M2 #12

DCS9900 Couplet 10 disk trays (600 disks) 1 TB / SATA disk usable capacity = 480 TB

DCS9900 Couplet 10 disk trays (600 disks) 1 TB / SATA disk usable capacity = 480 TB

NSD Servers

SAN Configurations

The concept of integrating storage servers and controllers into building blocks does not generalize as well for SAN file systems. The following pages illustrate how GPFS can be deployed using a SAN configuration.
COMMENT: If the configuration is small enough, a SAN switch (e.g., Brocade or McData) is not needed.

SAN #1
Linux/Blades
LAN (GbE) SAN*
FC4: 56 ports, FC8: 8 ports

Blades (56) Ports per blade


1 x GbE 1 x FC4
3 4
3 4

1 2
1 2

FC4 (4 Gbit/sec)
up to 380 MB/s per blade

Max Aggregate FC4 BW


~= 20 GB/s

5 x 60-disk Drawer
SATA Disk 300 x disks (1 TB) 30 x 8+2P RAID 6 capacity < 240 TB*

Average I/O BW per blade for this configuration


write < 90 MB/s per blade read < 50 MB/s per blade

Since there are more than 32 hosts attached to the SAN, reduce the queue depth setting to a value <= 4; one way to check this on Linux is sketched below.
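A hedged sketch of one way to inspect (and optionally lower) the per-device queue depth on a Linux SAN client follows. The sysfs attribute shown is common but driver dependent, and some HBAs require a driver module parameter instead, so treat this as illustrative rather than a definitive procedure.

    import glob

    TARGET_DEPTH = 4   # per the guideline above for SANs with more than 32 hosts

    for path in glob.glob("/sys/block/sd*/device/queue_depth"):
        with open(path) as f:
            current = int(f.read().strip())
        print(path, "current depth:", current)
        if current > TARGET_DEPTH:
            # Uncomment to apply (requires root); confirm the method with the HBA vendor first.
            # with open(path, "w") as f:
            #     f.write(str(TARGET_DEPTH))
            print("  would reduce to", TARGET_DEPTH)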

Storage DCS9900
data rate write < 5 GB/s, read < 3 GB/s disk: SATA 300 disks 8+2P RAID 6 raw capacity < 300 TB usable capacity < 240 TB

SAN #2A
AIX/System P - Optimize IOP Performance
P6-p595 FC Ports
FC8 = 8 Gbit/s;
usable BW < 760 MB/s

DS5300-01 EXP5000 Drawers

S A N S W I T C H

FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

16 x FC8 ports per system; aggregate data rate at most 12 GB/s. Configured with enough RIO drawers, the peak sustained BW of a P6-p595 is much greater than this.

DS5300 ANALYSIS
Assume 4+P RAID 5 Data Rate per DS5300 write < 4.5 GB/s read < 5.0 GB/s aggregate write < 9.0 GB/s read < 10.0 GB/s IOP Rate per DS5300 write < 30 Kiop/s read < 150 Kiop/s aggregate write < 60 Kiop/s read < 300 Kiop/s Capacity per DS5300 raw < 130 TB usable < 103 TB aggregate raw < 260 TB usable < 206 TB

DS5300-02 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB* Capacity is specified as usable capacity.

Quoted data rates are a conservative estimate, especially for the read rates. Validation is required.

p6-595 128 core, 256 GB RAM


RIO - 5802
16

Quoted IOP rates are derived from benchmark tests using different configurations and are provided for informational purposes only. Validation is required.

COMMENTS: Since the objective is to optimize the IOP rate per DS5300, faster (15Krpm) but smaller (300 GB/disk) FC disks were chosen. Max IOP performance requires using all of the disks supported by a single DS5300 (i.e., 448) and specialized tuning (e.g., "short stroking"); this tuning will decrease the usable capacity. The number of FC8 ports is configured to also support peak streaming BW. Best practice: configure with at least 4 partitions for use with GPFS.

SAN #2B-1
AIX/System P - Optimize IOP Performance
P6-p595 FC Ports
16 x FC8x ports per system at most 12 GB/s per system

DS5300-01 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

DS5300-02 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

S A N S W I T C H

DS5300-03 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

DS5300-04 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

p6-595 128 core, 256 GB RAM


RIO - 5802
16

FC8

p6-595 128 core, 256 GB RAM


RIO - 5802
16

DS5300 ANALYSIS
Aggregate Data Rate write < 18 GB/s read < 20 GB/s Aggregate IOP Rate write < 120 Kiop/s read < 600 Kiop/s

Aggregate Capacity raw < 525 TB, usable < 410 TB

COMMENT: Since the objective is to optimize the IOP rate per DS5300, faster (15Krpm) but smaller (300 GB/disk) FC disks were chosen. Max IOP performance requires using all of the disks supported by a single DS5300 (i.e., 448) and specialized tuning (e.g., "short stroking"); this tuning will decrease the usable capacity. The number of FC8 ports is configured to also support peak streaming BW.
Capacity is given as usable capacity FC8 = 8 Gbit/s; usable BW < 760 MB/s

Will fewer FC8 ports still allow peak IOP rate if streaming rate is not important?

SAN #2B-2
AIX/System P - Multi-tiered Solution
P6-p595 FC Ports
24 x FC8x ports per system at most 18 GB/s per system

DS5300-01 EXP5000 Drawers

DS5300-02 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

S A N S W I T C H

FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*

DS5300-03 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*
1 2
1 2

DS5300-04 EXP5000 Drawers


FC Disk (15krpm) 28 drawers 448 x disks (300 GB) 88 x 4+P RAID 5 8 x hot spares capacity < 103 TB*
1 2
1 2

p6-595 128 core, 256 GB RAM


RIO - 5802
12 12

p6-595 128 core, 256 GB RAM


RIO - 5802
12

FC8

3 4
3 4

3 4
3 4

12

DS5300 ANALYSIS
Aggregate Data Rate write < 28 GB/s read < 26 GB/s Aggregate IOP Rate write < 200 Kiop/s read < 660 Kiop/s Aggregate Capacity raw < 1.7 PB, usable < 1.4 PB Comment DS5300 optimizes IOP rate DCS9900 optimizes capacity

10 x 60-disk Drawer
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*

10 x 60-disk Drawer
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*

Usable capacity FC8 = 8 Gbit/s; usable BW < 760 MB/s

LAN vs. SAN


An Example
IB LAN
IB can be replaced by Ethernet. In that case, use 2 x TbE in place of IB HCAs in the NSD servers.

Ethernet LAN SAN*


FC4: 56 ports, FC8: 8 ports x3650 M2 Set queue depth <= 4

1 2
1 2

3 4
3 4

x3650 M2

x3650 M2

5 x 60-disk Drawer
SAS Disk 160 x disks 16 x 8+2P RAID 6
3 4
3 4

x3650 M2

1 2
1 2

Analysis
Average performance per node is the same: write < 96 MB/s, read < 78 MB/s.
Peak performance for any one node: SAN at most 380 MB/s; LAN at most 1500 MB/s.
Network considerations: a SAN file system requires 2 networks and a smaller queue depth; a LAN file system uses a simpler network and allows a larger queue depth.
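The identical per-node averages fall out of dividing the shared controller bandwidth by the number of attached hosts; a small sketch, assuming couplet streaming rates of roughly 5.4 GB/s write and 4.4 GB/s read (close to the DCS9900 figures quoted elsewhere in this section) and 56 hosts:

    hosts = 56
    write_MBps, read_MBps = 5400, 4400   # assumed couplet streaming rates

    print("average write per node ~", write_MBps // hosts, "MB/s")   # ~96 MB/s
    print("average read  per node ~", read_MBps // hosts, "MB/s")    # ~78 MB/s

    # Peak for any single node differs: about 380 MB/s over one FC4 in the SAN case
    # versus about 1500 MB/s over IB 4xDDR in the LAN case.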

5 x 60-disk Drawer
SAS Disk 160 x disks (1 TB) 16 x 8+2P RAID 6

The FC8 host connections could be replaced with IB host connections. In that case, the DCS9900 could even be attached to the IB LAN, but that only increases the IB switch port count with little added benefit.

Other Miscellaneous GPFS Configurations

The following pages contain other examples of GPFS configurations, further illustrating the versatility of GPFS.

Mixed AIX/Linux Environment


System P and System X with Mixed IB/Ethernet Fabric
Ethernet Switch (TbE and GbE)
Requires GPFS subnets

IB Switch
AIX Rack #1 AIX Rack #2
P6-p575
Nodes 19-32

Linux Rack #3
x3550-M2
nodes 33-64
client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

Linux Rack #4
x3550-M2
nodes 65-96
client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

FC8

NSD Servers and Storage


P6-p520 #1 P6-p520 #2 P6-p520 #3 P6-p520 #4

P6-p575
Nodes 5-18

client node client node client node client node client node

client node client node client node client node client node client node client node client node client node client node client node client node client node client node

DCS9900

client node client node client node

SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*

client node client node client node client node client node client node

DCS9900 Data Rate


write < 5.4 GB/s read < 3.5 GB/s

Capacity
raw: 600 TB usable: 480 TB

COMMENTS: This is an egalitarian network in the sense that


any client node (pSeries or xSeries) can consume up to the max BW of its IB adapter; if all of the pSeries nodes are quiescent, then the xSeries nodes can use all of the potential aggregate DCS9900 BW; if all of the xSeries nodes are quiescent, then the pSeries nodes can use all of the potential aggregate DCS9900 BW

BW per client node


92 client nodes write: ~= 60 MB/s per node read: ~= 40 MB/s per node

COMMENTS: Switch Fabric


IB cannot be used for all nodes since AIX uses IPoIB and Linux uses RDMA. Alternatively, all nodes could use a homogeneous Ethernet fabric.

Mixed AIX/Linux Environment


System P and System X with a Mixed LAN
Common Ethernet Fabric
This network connects to all nodes but is not shown in the diagram.

IB Switch
AIX Rack #1
P6-p575
Nodes 1-14

AIX Rack #2
P6-p575
Nodes 14-28

3 GPFS Subnets 1. p575 client nodes 2. x3550-M2 client nodes 3. all NSD nodes

Ethernet Switch (TbE and GbE)


Linux Rack #3
x3550-M2
nodes 29-56
NSD node NSD node

Linux Rack #4
x3550-M2
nodes 57..88
TbE TbE TbE TbE
client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

client node client node client node client node client node client node client node client node client node client node client node client node client node client node

NSD node NSD node client node client node client node client node client node client node client node client node client node client node client node client node
SATA Disk 10 drawers 600 x disks (1 TB) 60 x 8+2P RAID 6 capacity < 480 TB*

FC8 Switch

NSD node NSD node


client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

DCS9900

Close-up of NSD node configurations


P6-p575 NSD Server
GbE GbE
GbE

IB 12xDDR

2 x FC8

GbE

NSD Server x3650 M2 8 cores, 6 DIMMs

IB 4xDDR

2 x FC8

As designed, each set of nodes can use up to half of the potential DCS9900 BW, but neither set of nodes can use more than half of the potential DCS9900 BW.

Mixed AIX/Linux Environment


System P SAN with System X NSD Cluster
Channel Bonded TbE 1400 MB/s per node

There is no IB switch connection between the p595s!

GbE 80 MB/s per node

Ethernet Switch
Linux Rack #1 Linux Rack #2
x3550-M2
nodes 33-64
client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

P6-p595
128 Cores 256 GB RAM

P6-p595
128 Cores 256 GB RAM

x3550-M2
nodes 1-32
client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node client node

DS5300

T b E

2x FC 8

RIO
5802

T b E

2x FC 8

T b E

2x FC 8

RIO
5802

T b E

2x FC 8

FC Disk (15krpm) 12 drawers 192 x disks (450 GB) 36 x 4+P RAID 5 12 x hot spares capacity < 63 TB

DS5300 Data Rate p595 BW Distribution


write < 4.5 GB/s read < 5.0 GB/s

Capacity
raw: 84 TB usable: 63 TB

SAN configuration at most 2.5 GB/s per p595 if x3550-M2s are idle at most 1.1 GB/s per p595 if x3550-M2s are busy

x3550-M2 Data Rate Distribution


NSD configuration at most 2.8 GB/s over all x3550s if p595s are idle at most 1.4 GB/s over all x3550s if p595s are busy

COMMENT: By design, under load this system is load balanced between both classes of nodes. The TbE network can provide up to half of the potential DS5300 BW to the x3550-M2 nodes leaving the other half for use locally on the p595 nodes.

GPFS on Blue Gene


BG/P Architecture
Full System 72 Racks - 72x32x32 Torus

BG/P scales up to 256 racks Peak rate 3.56 PF/s


Terminology: a "compute card" is also called a "node".

Full Rack

Cabled 8x8x16 Torus

32 Node Cards

Node Card 32 compute cards - 4x4x2 torus 2 I/O cards 13.9 TF/s 2 TB 8 to 64 x TbE Compute Card 1 chip per card 20 DRAMs Chip 4 cores @ 850 MHz 13.6 GF/s 2.0 GB DDR2 13.6 GF/s 8 MB EDRAM
2 compute cards fit back to back

1 PF/s 144 TB

435 GF/s 64 GB 0 to 2 x TbE

Formula: 0.85 GHz * 4 cores * 4 pipes per core * 1 FLOP/pipe = 13.6 GF/s

GPFS on Blue Gene


BG/P Networking

3 Networks in BG/P
1. 3D Torus for point-to-point communications 2. Global tree for reduction, all-to-one communications and file I/O between the I/O and compute nodes 3. 10 Gbit/sec Ethernet (TbE) for file I/O, host interface, control and monitoring

3D Torus Each node is connected to its 6 nearest neighbors.

GPFS on Blue Gene


BG/P Storage I/O

An I/O card is similar to a compute card except that it has a single TbE port. Each node card can be configured with 0, 1, or 2 I/O cards. Each rack can be configured with 8 to 64 I/O cards (default = 16 I/O cards). I/O cards connect to compute cards over the tree network. Each I/O card acts as a storage client; external nodes act as storage servers.

Node Card
I/O Cards

2 x TbE ports

GPFS on Blue Gene


BG/P Storage I/O
Compute Node
COMMENT: GPFS does not run on the compute nodes.

I/O Node
GPFS Client

NSD Servers / Ethernet Switch

Application
records POSIX calls

CIOD
records

libc

mmfsd

Server #1

CNK
Compute Node Kernel tree packets

Linux Kernel
tree packets

BG/P ASC

BG/P ASC

Server #2

BG/P Tree Network

BG/P I/O Node Functions


Interface to/from the control system Proxy for the compute nodes Proxy for the debug server

Disk

GPFS on Blue Gene


BG/P Storage Building Block
Ethernet Switch

redo with DS5300

TbE BW per Server 2 TbE per server data rate per server < 1500 MB/s FC BW per Server 2 dual port HBAs per server data rate per server < 1560 MB/s Use RIO drawers to avoid overloading the common GX bus for TbE and PCI-E slots.

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
TbE TbE

#1

PCI-E #2 #3

PCI-X #4 #5

GX direct

GX pass-thru

P6-p520
NSD server 4-way, 4.2 GHz, 8 GB RAM
TbE TbE

#1

PCI-E #2 #3

PCI-X #4 #5

GX direct

GX pass-thru

DS4800 aggregate data rate


write < 2.2 GB/s read < 3.0 GB/s

PCI-X2
2x FC 4

IB 12x

PCI-X2
2x FC 4

PCI-X2
2x FC 4

IB 12x

PCI-X2
2x FC 4

aggregate capacity
raw: 37.5 TB usable: 28.0 TB
controller A

Alternative: Use 1 DCS9550 instead of 2 DS4800s

4 3 2 1

4 3 2 1

controller A

4 3 2 1

4 3 2 1

DS4800
1 2 3 4 1 2 3 4
controller B

DS4800
1 2 3 4 1 2 3 4
controller B

EXP810-1
4 disk drawers cabling not shown
o o o

EXP810-1
4 disk drawers cabling not shown
o o o

EXP810-4

EXP810-4

GPFS on Blue Gene


Single BG/P Frame Configuration
Balanced Design
8 TbE connections from the BW source; 8 TbE connections to the BW sink

redo with DS5300

NSD - 1 P6-p520, TbE, 2 x FC4 NSD - 2 P6-p520, TbE, 2 x FC4


RIO-1 7314-G30 RIO-2 7314-G30

BG/P Rack 1024 Compute Nodes 8 I/O Nodes


need at least 8 I/O nodes can have at most 64 I/O nodes default: 16 I/O nodes
I/O Node - 1 I/O Node - 2 I/O Node - 3 I/O Node - 4 I/O Node - 5 I/O Node - 6 I/O Node - 7 I/O Node - 8

GbE Switch

GPFS Parameters
page pool = 4 GB maxMBpS >= 2000 MB/s block size = 1024 KB

NSD - 3 P6-p520, TbE, 2 x FC4 NSD - 4 P6-p520, TbE, 2 x FC4


RIO-3 7314-G30 RIO-4 7314-G30

DS4800 Parameters
RAID config = 4+P; read ahead multiplier = 0; write caching = off; write mirroring = off; read caching = on; segment size = 256 KB; cache block size = 16 KB
DS4800-01 RAID arrays 1..12 4 x EXP810 DS4800-02 RAID arrays 13..24 4 x EXP810 DS4800-03 RAID arrays 25..36 4 x EXP810 DS4800-04 RAID arrays 37..48 4 x EXP810

Bandwidth
Aggregate
write < 4.0 GB/s read < 5.6 GB/s

BW per compute node


write < 4.0 MB/s per compute node read < 6.0 MB/s per compute node
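The per-compute-node figures are just the aggregate rates spread across the 1024 compute nodes in the rack; a small sketch:

    compute_nodes = 1024
    aggregate_write_GBps, aggregate_read_GBps = 4.0, 5.6

    print("write per compute node ~ %.1f MB/s" % (aggregate_write_GBps * 1024 / compute_nodes))
    print("read  per compute node ~ %.1f MB/s" % (aggregate_read_GBps * 1024 / compute_nodes))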

Alternative:
This is an "I/O poor" design using the only 8 I/O nodes (the minimum requirement). We could make this an "I/O rich" design by adding all 32 I/O nodes (the maximum allowed), but to be usefull we would need to increase the number of building blocks to 8.

The following pages contain some successful legacy designs that are still relevant today.

Smaller Building Blocks Using a Different Granularity

The previous building blocks all assume the existence of a high BW switch fabric (i.e., TbE or IB). However, many users have existing networks based on GbE only (n.b., no TbE switch ports). This leads to a building block with a different granularity.

Smaller Building Blocks


Optimized for NFS and NO TbE

Basic Building Block


2 NSD nodes: x3550, dual core, dual socket (4 CPUs) per NSD node; 16 GB RAM per NSD node; 1 dual port FC HBA @ 4 Gb/s per NSD node; 2 internal GbE ports per NSD node; dual port GbE adapter per NSD node. 2 SAN switches: SAN32B (32 ports), 4 Gb/s fabric. 1 disk controller: DS4800 using 4 Gb/s host side and drive side ports. 2 disk enclosures ("drawers"): EXP810 with at most 16 disks per drawer. Disk options:

COMMENT
If a tape backup system is added to the storage cluster, use 2 x3650s with the following configuration for each x3650: dual core, dual socket 16 GB RAM 2 internal GbE ports dual port GbE dual port FC HBA for disk 4 Gb/s dual port FC HBA for tape 4 Gb/s

SATA/2 @ 500 GB/disk 10 Krpm FC @ 146 or 300 GB/disk 15 Krpm FC @ 73 or 146 GB/disk

Smaller Building Blocks


1 Building Block
Ethernet Switch (1 GbE ports for NFS and System Administration) Ethernet Switch (1 GbE port for GPFS)
Active Passive Active
P1 P2

Sustained NFS performance:


Active
P1 P2

Passive

x3550-01
G b E G b E

NSD server 4-way 16 GB RAM

PCI-Ex Slots

P1 P2

x3550-02
G b E G b E

Active

4 3 2 1

NSD server 4-way 16 GB RAM

PCI-Ex Slots

P1 P2

at most 240 MB/s per building block

Passive

Active

SAN Switch (SAN32B)

SAN Switch (SAN32B)

COMMENTS
The 2 Ethernet ports dedicated to the NFS and sysadm networks form a 2-way channel bond, but it has 2 IP addresses.
sustained peak BW < 240 MB/s

controller A

4 3 2 1

ANALYSIS
Each x3550
"dual core, dual socket" (4 CPUs) 16 GB RAM 2 PCI-Express slots per node
1 dual port HBA @ 4 Gb/s 1 dual port GbE adapter

DS4800
1 2 3 4 1 2 3 4
controller B

EXP810

2 built-in GbE ports at most 380 MB/s per HBA 2 x 2-way "Ether channels"
at most 150 MB/s / Ether channel

The 2 ports dedicated to GPFS are not a channel-bond; using ethernet protocols, they are configured as an active/passive bond under the same IP address.
sustained peak BW < 80 MB/s the GPFS network is only used for GPFS overhead traffic (e.g., tokens, heartbeat, etc.) and thus minimal BW is used

EXP810

Disk Enclosures
2 EXP810 enclosures at most 16 disks per enclosure
6 x 4+P RAID 5 arrays

assume 32 x 300 GB/disk @ 10 Krpm


write: 540 MB/s (read caching off) read: 540 MB/s (write caching off)

assume 32 x 146 GB/disk @ 15 Krpm


write: 700 MB/s (read caching off) read: 700 MB/s (write caching off)

assume 32 x 500 GB/disk SATA


write: 180 MB/s (read caching off) read: 360 MB/s (write caching off)

9.4 TB (raw)

4.5 TB (raw)

16 TB (raw)

Smaller Building Blocks


2 Building Blocks
Active Connections

GPFS Network
Passive Connections

SAN Switch (SAN32B)

ANALYSIS
2 Building Blocks
at most 480 MB/s NFS BW limited by the GPFS GbE adapters

NFS and System Administration

SAN Switch (SAN32B)

G b E G b E

Disk Enclosures
4 EXP810 enclosures at most 16 disks per enclosure
12 x 4+P RAID 5 arrays
controller A 4 3 2 1 4 3 2 1

x3550-01
NSD server

P1 P2

P1 P2

assume 64 x 500 GB/disk SATA


write: 720 MB/s (read caching off) read: 720 MB/s (write caching off)

DS4800
1 2 3 4 1 2 3 4 controller B

G b E

G b E

x3550-02
NSD server

P1 P2

P1 P2

32 TB (raw) assume 64 x 300 GB/disk @ 10 Krpm


write: 1000 MB/s (read caching off) read: 1400 MB/s (write caching off)

EXP810
G b E G b E

x3550-03
NSD server

P1 P2

P1 P2

19 TB (raw) assume 64 x 146 GB/disk @ 15 Krpm


write: 1100 MB/s (read caching off) read: 1400 MB/s (write caching off)

EXP810

9 TB (raw)

G b E

G b E

x3550-04
NSD server

P1 P2

P1 P2

EXP810

EXP810

Smaller Building Blocks


6 Building Blocks
ANALYSIS
6 Building Blocks
at most 1400 MB/s NFS BW
assume 192 x 500 GB/disk SATA
write: 1400 MB/s (read caching off) read: 1100 MB/s (write caching off)

assume 192 x 300 GB/disk @ 10 Krpm


write: 1100 MB/s (read caching off) read: 1400 MB/s (write caching off)

96 TB (raw)
GPFS Network NFS and System Administration

56 TB (raw)

G b E

G b E

x3550-01
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-04
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-07
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-10
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-02
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-05
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-08
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-11
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-03
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-06
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-09
NSD server

P1 P2

P1 P2

G b E

G b E

x3550-12
NSD server

P1 P2

P1 P2

SAN Switch (SAN32B)

SAN Switch (SAN32B)

controller A

4321

4 321

DS4800
1234 1234
controller B

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

EXP810

Linux Cluster Using Internal SCSI Drives


16 16

GbE Switch
(at least 32 ports)

x335-01 through x335-16 (each node: 2 internal SCSI disks, onboard GbE links to the switch)

x335-17 through x335-32 (each node: 2 internal SCSI disks, onboard GbE links to the switch)

This was a POC test done by a customer. Each node is both a compute client and an NSD node. GPFS was built on the 2nd internal SCSI disk. They now use it in production on clusters of 128 nodes (GPFS 2.3, RH 9.1). Feeds and speeds:
internal SCSI disk ~= 30 MB/s, aggregate ~= 1 GB/s


Risks
many single points of failure: if a node crashes, the GPFS file system is unavailable until the node is back on-line; if a disk fails, the file system will be corrupted and data will be lost. NOTE: GPFS robustness design requires "twin tailed disk"


Advantage
very inexpensive excellent performance scaling


COMMENTS: This configuration is not recommended since it presents a single point of failure risk. While less than optimal, this risk can be eliminated using GPFS replication.


Blades Using Internal IDE Drives


Blade Center
blade 01 through blade 14

Blade Specs
IBM HS20 dual Xeon @ 2.8 GHz 4 GB RAM 2 IDE drives (40 GB at 5400 RPM)

GPFS Configuration Parameters


Block size = 256 KB Pagepool = 64 MB NSD configuration (via internal GbE network) file system scope limited to single chassis


Disk System Specs


internal IDE drives (see Blade Specs) 1 full disk and a 30 GB partition from the other drive in each blade is used for the file system 28 LUNs JBOD

GbE Ports

Benchmark Results
FC Ports
bonnie (see http://linux.maruhn.com/sec/bonnie.html) single task read rate = 80 MB/s aggregate (1 task per blade) read rate = 560 MB/s (i.e., 40 MB/s per blade) baseline test (read from a single local disk using ext2) read rate = 30 MB/s
COMMENT: the primary application is blast

COMMENTS: This configuration is not recommended since it presents a single point of failure risk. While less than optimal, this risk can be eliminated using GPFS replication.

Heterogenous Cluster

System

1536-node, 100 TF pSeries cluster; 2 PB GPFS file system (one mount point); 500 RAID controller pairs, 11000 disk drives; 126 GB/s parallel I/O measured to a single file (134 GB/s to multiple files)

125 NSD/VSD nodes

SAN Organization with Excellent Scaling


40 Nodes
Itanium (Linux); inter-node: GbE; 3 FC HBAs/node
1 connection via each FC switch

3 Switches 2 Gb FC

. . .

Frame-15

All disks directly attached to servers via FC switches.


switch 01

Frame-01
4 x FAStT600 8 x EXP700
FAStT600-01 EXP700

switch 02

to Frame-15 8 connections

EXP700 FAStT600-02 EXP700 EXP700 FAStT600-03 EXP700 EXP700 FAStT600-04 EXP700

. . .
to Frame-01 8 connections

switch 03

Aggregate BW = 15 GB/s (sustained) goto: http://www.sdsc.edu/Press/2004/11/111204_SC04.html

EXP700

SAN Organization with Excellent Scaling


40 Linux Nodes (SDSC Booth) 3 FC HBAs per Node 15 Storage frames 60 FAStT600s 2520 disks 240 LUNs 8+P 4 LUNs per FAStT600 73 GB/disk @ 15 Krpm Sustained Aggregate Rate
15 GB/s 380 MB/s per node 256 MB/s per FAStT600

Written Exercise #1
Suppose you have been asked to design the storage subsystem. The cluster will be running Linux with 256 4-way compute nodes (i386 or x86-64). The application job mix will be varied. The message passing traffic will vary from light to moderate (bursts of large messages that are latency tolerant) to heavy (numerous small packet messages that are latency intolerant for the duration of the job, or jobs that will have large packet transfers upon startup and close to termination that are latency tolerant); the customer believes that node message passing BW will be at most 50 to 80 MB/s. The storage I/O will also be variable. Some jobs will require lots of BW at the beginning and end of the job using a streaming access pattern (large records, sequential access), others will require sustained, moderate access over the life of the job using a streaming pattern, and 1 job will require sustained but light access over the life of the job with an access pattern of small records irregularly distributed over the seek offset space. The jobs on the cluster are parallel. Finally, there are about 200 users with Windows or Linux based PCs in their offices that must access this cluster's file system. Typically, the aggregate file system BW for the cluster will be in the neighborhood of 1 GB/s, though aggregate burst rates could be as high as 2 GB/s. Individual nodes must be able to sustain storage I/O BW up to 60+ MB/s, though more typical node rates are less than 5 MB/s. The file system must start at 50 TB, but be expandable to 100 TB in the future.

Design the cluster by specifying - network (GbE, Myrinet, IB or mixed) and its topology - storage nodes - disk and controllers What additional information do you need in order to make a better specification?
The dog ate my homework!

Written Exercise #2
Suppose you have been asked to design a new storage subsystem to be shared by 2 clusters. The first cluster is a new one that consists of 32 P6-p575 nodes (32 cores per node with 64 GB of RAM) using IB (4x DDR) for the LAN. The other cluster is the legacy system that was designed for written exercise #1. The new storage subsystem needs to be accessible by both clusters, though it will be primarily used by the new pSeries cluster (85% usage) as a scratch file system; the legacy cluster's accesses to the new file system will mostly be reads. The new pSeries cluster must also be able to access the storage subsystem in the legacy cluster, though it will only account for 15% of the storage workload on that storage subsystem, with most of the accesses being reads.

The job mix is varied on the new pSeries cluster; it will be used for parallel jobs with heavy message passing requirements, and the storage I/O access pattern will be largely streaming oriented (large records sequentially accessed in large files); however, there will be one job flow that must access many small files (2K to 256K, with the average being 8K). The small file workload will account for 25% of the overall workload on the pSeries cluster. The new storage system must be able to support high data rates (8 GB/s) for the streaming workload and high IOP rates for the small file workload (80,000 files per second). The capacity of the new storage system must be 250 TB.

Design the cluster by specifying:
- storage network (GbE, Myrinet, IB or mixed) and its topology
- storage nodes
- disk and controllers
What additional information do you need in order to make a better specification?

Better excuse... My homework got lost in the Ether


11. GPFS System Administration

Let's take a look at some selected sysadm details. Much of this information is also relevant to programmers. This discussion is not intended to be exhaustive; rather it is intended to provide general guidance and examples (i.e., "give me an example and I will figure out how to do it"). The emphasis is more on concept than syntax. Nor are all options explained. See the manuals for further details.
COMMENT: Unless stated otherwise, specific examples are based on GPFS 3.1. For the most part, there is little or no change between GPFS versions 2.3, 3.1 and 3.2 for these commands.

Where to Find Things in GPFS


Some useful GPFS directories
/usr/lpp/mmfs
/bin... commands (binary and scripts)
most GPFS commands begin with "mm"

/gpfsdocs... pdf and html versions of basic GPFS documents
/include... include files for GPFS specific APIs, etc.
/lib... GPFS libraries (e.g., libgpfs.a, libdmapi.a)
/samples... sample scripts, benchmark codes, etc.

/var/adm/ras
error logs
files... mmfs.log.<time stamp>.<hostname> (a new log is created every time GPFS is restarted)
links... mmfs.log.latest, mmfs.log.previous (see the example after this list)

/tmp/mmfs
used for GPFS dumps; the sysadm must create this directory (see the dataStructureDump attribute under mmchconfig)

/var/mmfs
GPFS configuration files

same directory structure for both AIX and Linux systems
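
For quick problem determination it is often enough to watch the latest log; a minimal sketch using the paths above (the node names are illustrative, and the loop assumes the passwordless ssh setup described later in this section):

tail -f /var/adm/ras/mmfs.log.latest                     # watch the GPFS log on the local node
for n in gpfs_node1 gpfs_node2 gpfs_node3 gpfs_node4     # or grab the tail of the log from several nodes
do
  echo "===== $n ====="
  ssh $n "tail -5 /var/adm/ras/mmfs.log.latest"
done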

Today's trivia question...


Question: What does mmfs stand for? Answer: Multi-Media File System... predecessor to GPFS in the research lab

GPFS Information and Downloads from the Web


GPFS FAQs (for versions 3.1, 3.2, 3.3)
http://publib.boulder.ibm.com/clresctr/library/gpfsclustersfaq.html
In addition to FAQs, this web page has links to technical documentation for CSM, GPFS, RSCT, etc.

GPFS upgrades
https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/home.html Once the base version of GPFS is installed, upgrades can be freely downloaded and installed for AIX and Linux

Comments on Selected GPFS Manuals


GPFS Documentation
Concepts, Planning and Installation Guide
One of the most helpful manuals on GPFS... it provides an excellent conceptual overview of GPFS. If this were a university class, this manual would be your assigned reading. :->

Administration and Programming Reference and Advanced Administration Guide


Documents GPFS related administrative procedures and commands as well as an API guide for GPFS extensions to the POSIX API. The command reference is identical to man pages.

Problem Determination Guide


Many times, GPFS error messages in the mmfs.log files have an error number. You can generally find these referenced in this guide with a brief explanation regarding the cause of the message. They will often point to likely earlier error messages helping you to find the cause of the problem as opposed to its symptom.

DMAPI Guide
Documentation available online at...


http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html#aix_rsctpd22wo

Available with the GPFS SW distribution in /usr/lpp/mmfs/gpfsdocs


note to sysadm's... Be sure to install this directory! Also install the man pages!

IBM Redbooks and Redpapers


http://www.redbooks.ibm.com/... do a search on GPFS

GPFS provides a number of commands to list parameter settings, configuration components and other things. I call these the "mmls" or "mm list" commands.
COMMENT: By default, nearly all of the mm commands require root authority to execute. However, many sysadm's reset the permissions on mmls commands to allow programmers and others to execute them as they are very useful for the purposes of problem determination and debugging.

Selected "mmls" Commands


mmlsfs <file system device name>
[root@gpfs_node1 gpfs_install]# mmlsfs gpfs1        <-- you could also type "mmlsfs /dev/gpfs1"
flag  value           description
----  --------------  -----------------------------------------------------
 -s   roundRobin      Stripe method
 -f   131072          Minimum fragment size in bytes
 -i   512             Inode size in bytes
 -I   32768           Indirect block size in bytes
 -m   2               Default number of metadata replicas
 -M   2               Maximum number of metadata replicas
 -r   1               Default number of data replicas
 -R   2               Maximum number of data replicas
 -j   scatter         Block allocation type
 -D   posix           File locking semantics in effect
 -k   posix           ACL semantics in effect
 -a   1048576         Estimated average file size
 -n   32              Estimated number of nodes that will mount file system
 -B   4194304         Block size
 -Q   none            Quotas enforced
      none            Default quotas enabled
 -F   91798862        Maximum number of inodes
 -V   9.03            File system version. Highest supported version: 9.03
 -u   yes             Support for large LUNs?
 -z   no              Is DMAPI enabled?
 -E   yes             Exact mtime mount option
 -S   no              Suppress atime mount option
 -K   whenpossible    Strict replica allocation option
 -P   system          Disk storage pools in file system
 -d   nsd_meta1;nsd_meta2;nsd_meta3;nsd_meta4;nsd_lun1;nsd_lun2;nsd_lun3;nsd_lun4;nsd_lun5;nsd_lun6;nsd_lun7;nsd_lun8;nsd_lun9;nsd_lun10;nsd_lun11;nsd_lun12;  Disks in file system
 -A   yes             Automatic mount option
 -o   none            Additional mount options
 -T   /gpfs1          Default mount point
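
If only one or two attributes are of interest, mmlsfs also accepts individual flags to restrict the report; a small sketch using the same device:

mmlsfs gpfs1 -B        # report only the block size
mmlsfs gpfs1 -m -r     # report only the default metadata and data replication settings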

Selected "mmls" Commands


mmlsconfig
[root@gpfs_node1 gpfs_install]# mmlsconfig
Configuration data for cluster gpfs_node1:
------------------------------------------
clusterName gpfs_node1        <-- unless explicitly specified, the cluster name is the hostname of the primary cluster configuration server
clusterId 13882392465829736019
clusterType lc
autoload yes
useDiskLease yes
maxFeatureLevelAllowed 913
maxblocksize 4096k
pagepool 256m
maxMBpS 2000
[gpfs1]
takeOverSdrServ yes

COMMENTS: Lists configuration parameters applying to the cluster. Generally it only lists configuration parameters that have been changed from their defaults.

File systems in cluster gpfs_node1: -----------------------------/dev/gpfs1

An undocumented alternative to mmlsconfig that provides more info


mmfsadm
2 modes: console mode or command line mode

example using command line mode: mmfsadm dump config

lists all configuration parameters (including defaults); provides more information than mmlsconfig
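
Because mmfsadm dump config prints every parameter, it is usually filtered; a hedged example (the parameter names are taken from the mmlsconfig output above):

mmfsadm dump config | grep -i pagepool
mmfsadm dump config | egrep -i "maxMBpS|maxblocksize"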

Selected "mmls" Commands


mmlscluster
[root@gpfs_node1 gpfs_install]# mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         gpfs_node1
  GPFS cluster id:           13882392465829736019
  GPFS UID domain:           gpfs_node1
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    gpfs_node1
  Secondary server:  gpfs_node2

 Node  Daemon node name  IP address      Admin node name  Designation
-----------------------------------------------------------------------
   1   gpfs_node1        192.168.42.101  gpfs_node1       quorum
   2   gpfs_node2        192.168.42.102  gpfs_node2       quorum
   3   gpfs_node3        192.168.42.103  gpfs_node3       quorum
   4   gpfs_node4        192.168.42.104  gpfs_node4       quorum

Selected "mmls" Commands


mmgetstate -a
[root@gpfs_node2 gpfs_install]# mmgetstate -a

 Node number  Node name    GPFS state
------------------------------------------
      1       gpfs_node1   arbitrating
      2       gpfs_node2   active
      3       gpfs_node3   active
      4       gpfs_node4   down

If the mmgetstate command is issued too soon after the mmstartup command, some nodes will be listed in the "arbitrating" state, meaning only that the daemon has not had enough time to start. This is common in very large clusters (it may take a couple of minutes for all daemons to start in this case). If the state is "down", that generally means that there is a problem or that the daemon has not yet been started.
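
If a startup script needs to wait for the daemons rather than eyeball this output, a simple poll of mmgetstate works; a minimal sketch, assuming a 4-node cluster and the output format shown above:

expected=4                                  # number of nodes expected to reach the active state
until [ "$(mmgetstate -a | grep -c ' active')" -ge "$expected" ]
do
  sleep 10                                  # give the daemons time to finish arbitrating
done
echo "all $expected GPFS daemons are active"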

Selected "mmls" Commands


mmlsdisk <file system device name>
display current configuration and state of the disks in a file system
[root@gpfs_node1 gpfs_install]# mmlsdisk gpfs1
disk         driver   sector  failure  holds     holds                         storage
name         type     size    group    metadata  data   status  availability  pool
------------ -------- ------ -------- --------- ------ ------- -------------  -------
nsd_lun1     nsd      512    4001     no        yes    ready   up             system
nsd_lun2     nsd      512    4001     no        yes    ready   up             system
nsd_lun3     nsd      512    4001     no        yes    ready   up             system
nsd_lun4     nsd      512    4002     no        yes    ready   up             system

status
  suspended: indicates that data is to be migrated off this disk
  being emptied: transitional status in effect while a disk deletion is pending
  replacing: transitional status in effect for the old disk while a replacement is pending
  replacement: transitional status in effect for the new disk while a replacement is pending

availability
  up: the disk is available to GPFS for normal read and write operations
  down: no read and write operations can be performed on this disk
  recovering: an intermediate state for disks coming up; during this state GPFS verifies and corrects data; read operations can be performed, but write operations cannot
  unrecovered: the disk was not successfully brought up

Other Selected "mmls" Commands


mmlsattr <file name>
query file attributes

mmlsmgr <file system device name(s)>


display which node is the file system manager for the specified file systems

mmlsnsd
display current NSD information in the GPFS cluster
-X: cool new option for GPFS 3.2. Maps the NSD name to its disk device name in /dev on the local node and, if applicable, on the NSD server nodes. Using the -X option is a slow operation and is recommended only for problem determination.

[root]# mmlsnsd -X -d "hd3n97;sdfnsd;hd5n98"
 Disk name  NSD volume ID     Device       Devtype  Node name                Remarks
 -------------------------------------------------------------------------------------------
 hd3n97     0972846145C8E927  /dev/hdisk3  hdisk    c5n97g.ppd.pok.ibm.com   server node,pr=no   (AIX)
 hd3n97     0972846145C8E927  /dev/hdisk3  hdisk    c5n98g.ppd.pok.ibm.com   server node,pr=no
 hd5n98     0972846245EB501C  /dev/hdisk5  hdisk    c5n97g.ppd.pok.ibm.com   server node,pr=no
 hd5n98     0972846245EB501C  /dev/hdisk5  hdisk    c5n98g.ppd.pok.ibm.com   server node,pr=no
 sdfnsd     0972845E45F02E81  /dev/sdf     generic  c5n94g.ppd.pok.ibm.com   server node         (Linux)
 sdfnsd     0972845E45F02E81  /dev/sdm     generic  c5n96g.ppd.pok.ibm.com   server node

GPFS provides a number of commands needed to create the file system.

These commands of necessity require root authority to execute.

Selected "mm" Commands


mmcrcluster
create a GPFS cluster
parameters and options:
  -n  specifies a file with a list of node descriptors of the form NodeName:NodeDesignations
      NodeName is the IP address or hostname
      NodeDesignations specifies client or server, quorum or non-quorum
  -p  primary GPFS cluster configuration server node
  -s  secondary GPFS cluster configuration server node
  -R  specify remote file copy command (e.g., rcp or scp)
  -r  specify remote shell command (e.g., rsh or ssh)
      The remote copy and remote shell commands must adhere to the same syntax format as the rcp and rsh commands, but may implement an alternate authentication mechanism.
Selected "mm" Commands


mmstartup -a
manually start up the mmfsd daemons on all nodes in the cluster
setting autoload=yes via the mmchconfig command will launch mmfsd automatically when nodes are rebooted
if mmfsd cannot start, you will see runmmfs running and a lot of messages in /var/adm/ras/mmfs.log.latest

mmshutdown -a
unmount all GPFS file systems and shut down the mmfsd daemons on all nodes
always do this before rebooting nodes if possible

If you do not need to do this on all nodes, the -W or -w parameters allow you to specify which nodes in the cluster to start up or shut down mmfsd on.

Selected "mm" Commands


mmcrnsd (version 3.1)
Creates and globally names Network Shared Disks for use by GPFS.
The mmfsd daemon must be running to execute mmcrnsd (i.e., do mmstartup first).
mmcrnsd -F disk.lst
(version 3.1) disk.lst is a "disk descriptor file" whose entries are in the format
DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup:DesiredName:StoragePool
  DiskName: the disk name as it appears in /dev
  PrimaryServer: the name of the primary NSD server node
  BackupServer: the name of the backup NSD server node
    Specifying primary and secondary servers is recommended even in SAN mode.
  DiskUsage: dataAndMetadata (default) or dataOnly or metadataOnly
  FailureGroup: GPFS uses this information during data and metadata placement to assure that no two replicas of the same block are written in such a way as to become unavailable due to a single failure. All disks that are attached to the same adapter or NSD server should be placed in the same failure group. Applies only to GPFS in non-SAN mode.
  DesiredName: specify the name you desire for the NSD to be created. Default format: gpfs<integer>nsd
  StoragePool: specify the name of the storage pool that the NSD is assigned to; this parameter is used by the mmcrfs command

disk.lst is modified for use as the input file to the mmcrfs command

-v no
verify the disk is not already formatted as an NSD; a value of no means do NOT verify

Selected "mm" Commands


mmcrnsd (version 3.2 change)
Creates and globally names Network Shared Disks for use by GPFS.
The mmfsd daemon must be running to execute mmcrnsd (i.e., do mmstartup first).
mmcrnsd -F disk.lst
LOOK: the descriptor format now has 2 colons where the separate primary/backup server fields used to be.
disk.lst is a "disk descriptor file" whose entries are in the format
DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
  DiskName: the disk name as it appears in /dev
  ServerList: a comma separated list of NSD server nodes. You may specify up to eight NSD servers in this list. The defined NSD will preferentially use the first server on the list. If the first server is not available, the NSD will use the next available server on the list.
  DiskUsage: dataAndMetadata (default) or dataOnly or metadataOnly
  FailureGroup: GPFS uses this information during data and metadata placement to assure that no two replicas of the same block are written in such a way as to become unavailable due to a single failure. All disks that are attached to the same adapter or NSD server should be placed in the same failure group. Applies only to GPFS in non-SAN mode.
  DesiredName: specify the name you desire for the NSD to be created. Default format: gpfs<integer>nsd
  StoragePool: specify the name of the storage pool that the NSD is assigned to; this parameter is used by the mmcrfs command

disk.lst is modified for use as the input file to the mmcrfs command

-v no
verify the disk is not already formatted as an NSD; a value of no means do NOT verify
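
For comparison with the version 3.1 example on the next page, a version 3.2 style descriptor file might look like the following (the device, server and NSD names are illustrative; note the empty field, hence the 2 colons, where the separate backup server used to go):

> cat disk.lst
/dev/sdb:gpfs1,gpfs2::dataAndMetadata:4001:nsd_lun1:
/dev/sdc:gpfs1,gpfs2::dataAndMetadata:4001:nsd_lun2:
/dev/sdd:gpfs2,gpfs1::dataAndMetadata:4002:nsd_lun3:
/dev/sde:gpfs2,gpfs1::dataAndMetadata:4002:nsd_lun4:
> mmcrnsd -F disk.lst -v no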

Selected "mm" Commands


Disk Descriptor Files Are Both Input and Output Files
WARNING: The mmcrnsd command modifies the disk.lst file. Therefore, make a backup copy of it.

Version 3.1 Example
[root@gpfs1 gpfs_install]# cat disk.lst
/dev/sdb:gpfs1:gpfs2:::nsd_lun1:
/dev/sdc:gpfs1:gpfs2:::nsd_lun2:
/dev/sdd:gpfs2:gpfs1:::nsd_lun3:
/dev/sde:gpfs2:gpfs1:::nsd_lun4:
[root@gpfs_node1 gpfs_install]# mmcrnsd -F disk.lst -v no
. . .
[root@gpfs_node1 gpfs_install]# cat disk.lst
# /dev/sdb:gpfs1:gpfs2:::nsd_lun1:
nsd_lun1:::dataAndMetadata:4001::
# /dev/sdc:gpfs1:gpfs2:::nsd_lun2:
nsd_lun2:::dataAndMetadata:4001::
# /dev/sdd:gpfs2:gpfs1:::nsd_lun3:
nsd_lun3:::dataAndMetadata:4002::
# /dev/sde:gpfs2:gpfs1:::nsd_lun4:
nsd_lun4:::dataAndMetadata:4002::

NOTE: This is a Linux example. Under AIX disk names are generally of the form hdisk<x>

Selected "mm" Commands


mmcrvsd
only useful with pSeries/AIX systems using the HPS (i.e., "federation") or SP switch (i.e., "colony"); it replaces IP with a more efficient protocol
GPFS no longer requires RSCT, but RSCT must be installed to use this protocol

syntax is similar to mmcrnsd


see Administration and Programming Reference for details (n.b., it has several more optional parameters)

The mmcrvsd output disk descriptor file can no longer be used as input to the mmcrfs command to build the file system. It is necessary to create NSDs (via the mmcrnsd command) using the output disk descriptor file from the mmcrvsd command after creating the VSDs.

mmcrlv
no longer required and no longer exists. If you do create LVs manually using crlv, GPFS will not configure properly!

Selected "mm" Commands


mmcrfs <mountpoint> <device name> <options>
Create a GPFS file system -F specifies a file containing a list of disk descriptors (one per line)
this is the output file from mmcrnsd

-A  yes -> mount after starting mmfsd, no -> manually mount, automount -> mount at first use (default is yes)
-B  block size (16K, 64K, 256K (default), 512K, 1024K, 2048K, 4096K)
    If you choose a block size larger than 256K, you must run mmchconfig to change the value of maxblocksize to a value at least as large as the block size.
-E  specifies whether or not to report exact mtime values
-m  default number of copies (1 or 2) of i-nodes and indirect blocks for a file
-M  default max number of copies of inodes, directories, indirect blocks for a file
-n  estimated number of nodes that will mount the file system
-N  max number of files in the file system (default = sizeof(file system)/1M)
-Q  activate quotas when the file system is mounted (default = NO)
-r  default number of copies of each data block for a file
-R  default maximum number of copies of data blocks for a file
-S  suppress the periodic updating of the value of atime
-v  verify that specified disks do not belong to an existing file system
-z  enable or disable DMAPI on the file system (default = no)
-D  specify nfs4 to allow "deny-write open lock" to block writes for NFS V4 exported file systems (default = posix)
-k  specify the authorization protocol; the options are <posix | nfs4 | all>

Typical example
mmcrfs /fs fs -F disk.lst -A yes -B 1024k -v no
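
As noted above, a block size larger than 256K requires raising maxblocksize first; a hedged sketch using the same disk.lst (check the mmchconfig restrictions in the Administration and Programming Reference before changing maxblocksize on a running cluster):

mmchconfig maxblocksize=4096k
mmcrfs /gpfs1 gpfs1 -F disk.lst -A yes -B 4096k -n 32 -v no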

GPFS provides a number of commands to change configuration and file system parameters after being initially set.

I call these the "mmch" or "mm change" commands. There are some GPFS parameters which are initially set only by default; the only way to modify their value is using the appropriate mmch command. n.b., there are restrictions regarding the changes that can be made to many of these parameters; be sure to consult the Concepts, Planning and Installation Guide for tables outlining which parameters can be changed and under which conditions they can be changed. See the Administration and Programming Reference manual for further parameter details.

Selected "mmch" Commands


mmchconfig <attributes> <parameters>
change GPFS default configuration attributes

parameters
  node list: specify a node file using -n <node file> or a comma separated list of nodes on the command line; the default is all nodes in the cluster
  -i  changes take effect immediately and are permanent
  -I  changes take effect immediately but do not persist after GPFS is restarted
  These parameters do not apply to all attributes; carefully review the Administration and Programming Guide for details.

attributes (a selection of the more common or nettlesome ones)
  autoload: start mmfsd automatically when nodes are rebooted. Valid values are yes or no.
  dataStructureDump: the default is /tmp/mmfs
    do not use a GPFS directory (it may not be available)
    warning: files can be large (200 MB or more)... be sure to delete them when done
  designation: explicitly designate client, manager, quorum, or nonquorum nodes
  maxblocksize: default is 1024K; n.b., the mmcrfs block size (-B) cannot exceed this
  maxMBpS: data rate estimate (MB/s) of how much data can be transferred in or out of 1 node
    The value is used in calculating the amount of I/O that can be done to effectively prefetch data for readers and write-behind data from writers. By lowering this value, you can artificially limit how much I/O one node can put on all of the disk servers. This is useful in environments in which a large number of nodes can overrun a few storage servers. The default is 150 MB/s, which can severely limit performance on HPS ("federation") based systems.

pagepool: minimum = 4M, default = 64M, max = 50% of memory


Unnecessarily large pagepools yield diminishing returns.

maxFilesToCache: values in range 1 to 100,000


size = 2.5 KB * maxFilesToCache

maxStatCache: values in range 1 to 100,000


size = 176 B * maxStatCache

tiebreakerDisks: to use this feature, provide a list of disk names (i.e., their NSD names)
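
A few hedged examples combining the attributes above (the values are illustrative, not tuning recommendations; the NSD names are those used elsewhere in this section):

mmchconfig pagepool=256M
mmchconfig maxMBpS=2000
mmchconfig maxblocksize=4096k
mmchconfig tiebreakerDisks="nsd_lun1;nsd_lun2;nsd_lun3"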

Selected "mmch" Commands


mmchfs <device name> <options>
Change attributes of an existing GPFS file system
Files created after issuing the mmchfs assume the new attributes. Existing files are not affected.

Options the same as for mmcrfs


-A, -E, -D, -k, -m, -Q, -r, -S, -z

Options not available under mmcrfs


-F  changes the max number of files that can be created
-T  change the mount point of the file system, starting at the next mount of the file system
-V  change the file system format to the latest format supported by this version
-W  assign a new device name to the file system (i.e., change the name in /dev)

Options available under mmcrfs, but not available under mmchfs
  -B, -M, -n, -N, -R, -v
  changing these parameters requires rebuilding the FS
Carefully review the following documents for more details:
  Concepts, Planning and Installation Guide
  Administration and Programming Reference
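
A few hedged examples using the device name from earlier in this section (the new values are illustrative):

mmchfs gpfs1 -F 2000000      # raise the maximum number of files (inodes)
mmchfs gpfs1 -T /gpfs1_new   # change the mount point, effective at the next mount
mmchfs gpfs1 -S yes          # suppress atime updates (takes effect after remounting)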

Selected "mmch" Commands


mmchlicense

GPFS provides dynamic means to add and/or remove many components. This gives the sysadm a convenient means to grow the current infrastructure or to re-allocate resources to other places by deleting them from an existing system.

mmadddisk and mmdeldisk

GPFS provides the means to dynamically add and remove disks from a GPFS cluster.
mmadddisk <device> -F disk.lst -r -v {yes | no}
mmdeldisk <device> -F disk.lst -r -c
<device> is the GPFS device in /dev
disk.lst entries are in the form DiskName:::DiskUsage:FailureGroup
see documentation for details

-v: verify that the disk does not already belong to a file system
  if using the yes option and the disk has had a file system on it before, GPFS will not add it to the file system
-r: rebalance disks (see also the -N parameter in the documentation)
  this can take a lot of overhead in a dynamic environment and is probably unnecessary

notes and caveats


mmadddisk: if an NSD does not exist for the disk, you must first create it
mmdeldisk: if the disk is bad and cannot be read, use -c
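
For example, growing the earlier 4-LUN file system by one more LUN might look like this (the device, server and NSD names are illustrative; the descriptor uses the version 3.1 format shown earlier):

> cat new_disk.lst
/dev/sdf:gpfs1:gpfs2:dataAndMetadata:4001:nsd_lun5:
> mmcrnsd -F new_disk.lst -v no          # create the NSD first; mmcrnsd rewrites new_disk.lst
> mmadddisk gpfs1 -F new_disk.lst -r     # then add the new NSD to the file system and rebalance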

mmaddnode and mmdelnode

GPFS provides the means to dynamically add and remove nodes from a GPFS cluster.
mmaddnode -n node.lst
add nodes to an existing cluster and create mount points and device entries on the new nodes
under some circumstances (e.g., re-installing a node), it may be necessary to copy the mmsdrfs file to the new nodes (n.b., get the copy from the primary cluster configuration server... see mmlscluster)

mmdelnode -n node.lst
remove nodes from an existing cluster
notes and caveats
  primary/secondary cluster configuration servers
  primary/secondary NSD servers
  must first unmount the GPFS file system
  use caution when deleting quorum nodes
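
A hedged sketch following the syntax shown above (the node names are illustrative):

> cat node.lst
gpfs_node5:client
gpfs_node6:client
> mmaddnode -n node.lst        # add the two new client nodes
  ... later, after unmounting GPFS on those nodes ...
> mmdelnode -n node.lst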

There are a number of ways to measure file system performance. There are some very simple techniques that provide useful insight while there are other more elaborate alternatives. Some are common and some are unique to GPFS.

Measuring Performance

Benchmarking
  (taxonomy from Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 2nd ed., Morgan Kaufmann, 1996, pp. 20-21)
  Real programs with instrumentation
    not recommended for HPC
  Synthetic benchmarks
    good ones: IOR, xdd, bonnie, iozone
    GPFS specific benchmarks: gpfsperf, ibm.v4b
  Kernels
    nsdperf
    very few of these exist for measuring file system performance
  Toy benchmarks
    dd, home grown varieties

Measurement Tools
System tools: iostat, nmon
GPFS commands: mmpmon, nsdperf
Controller tools

Measuring Bandwidth
iostat

Measuring I/O time within the application using timing functions like rtc() or gettimeofday() is useful from a job perspective, but it does not accurately measure actual I/O rates (e.g., it can overlook locking delays, include PVM message passing overhead, and ignore variance).

The iostat command shows actual disk activity (this is the AIX version).
  iostat <time interval> <number of samples>
Use dsh to collect from multiple nodes (rsh can also be used):
  export WCOLL=/wcoll
  dsh -a 'iostat 10 360 > `hostname -s`.iostat'

See man pages for more options. See also the vmstat command for CPU oriented measures.

Measuring Bandwidth
iostat
flash008> iostat 10 360

Annotations: hdisk0 and hdisk1 hold a local JFS directory; hdisk3 and hdisk4 are mounted locally on this VSD server node; hdisk2 and hdisk5 are mounted locally on this VSD server node only in failover mode. The first report after boot shows cumulative Kb_read/Kb_wrtn totals; one representative 10-second interval is shown below, and the remaining intervals repeat the same format.

tty:  tin 0.0  tout 0.0    avg-cpu:  % user 0.1   % sys 3.4   % idle 96.5   % iowait 0.0

Disks:    % tm_act    Kbps     tps    Kb_read    Kb_wrtn
hdisk1       0.0       0.0     0.0         0          0
hdisk0       0.0       0.0     0.0         0          0
hdisk3      56.7    8111.9    32.0         0      81200
hdisk5       0.0       0.0     0.0         0          0
hdisk2       0.0       0.0     0.0         0          0
hdisk4      57.0    8062.3    31.9         0      80704

Measuring Bandwidth
iostat
Meaning of iostat columns:
  %usr    - percent application CPU time
  %sys    - percent kernel CPU time
  %idle   - percent of CPU idle time during which there were no outstanding disk I/O requests
  %iowait - percent of CPU idle time during which there were outstanding disk I/O requests
  %tm_act - percent of time that the hdisk was active (i.e., bandwidth disk utilization)
  Kbps    - volume of data read and/or written to the hdisk in kilobytes per second
  tps     - transfers (i.e., I/O requests) per second to the hdisk
  Kb_read - total data read from the given hdisk over the last time interval in KB
  Kb_wrtn - total data written to the given hdisk over the last time interval in KB

n.b., Kbps = (Kb_read + Kb_wrtn)/time_interval

Measuring Latency mmpmon

mmpmon is a performance monitoring tool. It can be used to


display I/O statistics per mounted file system for each file system on a node display aggregate I/O statistics from multiple nodes
see the nlist option

display latency measurements in a histogram format

Command syntax, with output in human readable form


mmpmon -i command_file
the commands specifying what is displayed, and how it is displayed, are given in a command file; see chapter 5 in the GPFS 3.1 Advanced Administration Guide for details

Miscellaneous
Starting with version 3.1, it can collect statistics from multiple nodes
It requires root access
Up to 5 instances of mmpmon can be run on 1 node at one time

It is effective, but still a little cumbersome to use
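
Besides latency histograms, mmpmon can report simple per-file-system I/O counters; a hedged sketch (fs_io_s is a standard mmpmon request; the -r and -d options repeat the request at a fixed interval):

> cat command_file
fs_io_s
> mmpmon -i command_file -r 10 -d 1000      <-- repeat the request 10 times, 1000 ms apart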

Measuring Latency mmpmon

Latency Histogram Example


> cat command_file                       <-- be sure to login as root while an application is running
rhist on                                 <-- enable histograms (n.b., rhist off disables histograms)
> mmpmon -i command_file
mmpmon node 192.168.1.2 name gandalf rhist on OK

> cat command_file
rhist nr 64k;256k;1m; 1;10;30;100        <-- form: rhist nr <packet sizes> <space> <time intervals>; time is specified in units of milliseconds
> mmpmon -i command_file
mmpmon node 192.168.1.2 name gandalf rhist nr 64k;256k;1m; 1;10;30;100 OK

> cat command_file                       <-- display histogram data for non-empty bins
rhist s
> mmpmon -i command_file
mmpmon node 10.10.111.4 name c4f01p1g rhist s OK read timestamp 1159397896/443573
size range 0 to 65536 count 512
  latency range 0.0 to 1.0 count 511
  latency range 30.1 to 100.0 count 1
size range 262145 to 1048576 count 4096
  latency range 0.0 to 1.0 count 3945
  latency range 1.1 to 10.0 count 27
  latency range 10.1 to 30.0 count 27
  latency range 30.1 to 100.0 count 43
  latency range 100.1 to 0.0 count 54
mmpmon node 10.10.111.4 name c4f01p1g rhist s OK write timestamp 1159397896/443616
size range 262145 to 1048576 count 2048
  latency range 0.0 to 1.0 count 1636
  latency range 1.1 to 10.0 count 214
  latency range 10.1 to 30.0 count 162
  latency range 30.1 to 100.0 count 34
  latency range 100.1 to 0.0 count 2
c4f01p1m:/u/cmis/home/padenr/ibm.v3g/ibm_mpi/mmpmon >

Measuring Latency
Multi-node mmpmon Usage

Latency Histogram Example Across Multiple Nodes


When using the nlist option, you must include the node list in the command file together with the associated command each time you issue that command.
> cat command_file_1
nlist new n01 n02 n03 n04                <-- apply the rhist nr and rhist on requests to nodes n01 n02 n03 n04
rhist nr 64k;256k;1m; 1;10;30;100
rhist on

> cat command_file_2
nlist new n01 n02 n03 n04                <-- apply rhist s to nodes n01 n02 n03 n04
rhist s

> cat command_file_3
nlist new n01 n02 n03 n04                <-- apply rhist off to nodes n01 n02 n03 n04
rhist off

Measuring Bandwidth
nsdperf

Measuring Bandwidth
Code Instrumentation

Problem Determination Tools

Other Topics
Waiters (i.e., waiting threads)
see p. 48, 89, of Problem Determination Guide

mmfsadm

see p. 9 of the Problem Determination Guide
mmfsadm dump config
mmfsadm dump all
WARNING: Creates a file up to 100's of MB in size

mmfsadm dump tscomm


list of GPFS sockets and their status

mmfsadm dump threads


active thread tracebacks (useful to find threads that are looping instead of waiting)

mmfsadm cleanup
alternative to mmshutdown; designed to recycle mmfsd on a node without hanging

mmdsh -N { Node[,Node,...] | Nodefile | nodeclass} { command }


nodeclass
all, clientnodes, managernodes, mount, nonquorumnodes, nsdnodes, quorumnodes

gpfs.snap
see pp. 6-8, of Problem Determination Guide

Creating Trace Reports under Loss of Quorum Condition


If a node loses GPFS quorum, it will execute the following user defined script:
/var/mmfs/etc/mmQuorumLossExit
This script must exist on all nodes in the cluster. For example...

Launch mmtrace on all nodes in the cluster. When a node loses quorum, GPFS executes the following script, which recycles the tracing (thereby generating a trace report). The tracefile is called lxtrace.trc.<hostname>.
> cat mmQuorumLossExit
echo `hostname` LOST GPFS QUORUM
echo RECYCLING mmtrace
date
/usr/lpp/mmfs/bin/mmtrace
COMMENT: A GPFS trace is a fixed size (default is 16M... set the environment variable TRCFILESIZE to change the size). Trace data wraps around once it hits the end of the file. The time duration represented by the tracefile is proportional to its size.

GPFS Security

GPFS Security Model


GPFS administration is based on an environment of trust; this facilitates performance and convenience. It relies on external measures to ensure security. Version 3.3: the scope of trust can be explicitly limited (see adminMode below). Prior to version 3.3: the scope of trust spans the entire cluster.

Security vs. performance and convenience... a classic example of being caught between a rock and a hard place

GPFS Security
Defining Administrative Domain

GPFS Security
Administrative Access

It is necessary to properly configure security in order to administer GPFS; this includes the following...
Provide standard root access to designated system administrators.
  most GPFS commands require root authority
Establish an authentication method between nodes in the GPFS cluster.
  a subset of nodes must allow root level communication without the use of a password and without any extraneous messages
  common choices are ssh/scp and rsh/rcp
Designate a remote communication program for remote shell and remote file copy commands.

  designated using the mmcrcluster and mmchcluster commands
  the selected option must use the rsh/rcp CLI syntax

GPFS uses remote shell and remote file copy commands to do things like...
GPFS commands executed by a system administrator on a given node propagate configuration information to, and perform administrative tasks on, other nodes in the cluster. GPFS automatically communicates changes of system state across the nodes of a cluster.

Further information can be found in


GPFS Administration and Programming Reference, version 3.3, pp. 1-3.

GPFS Security - Traditional


Example: Restricting Access to Trusted Environment
NON-TRUSTED
Nodes outside the trusted network are not part of the GPFS cluster.
These nodes can be given restricted access to the GPFS cluster via NFS.
Examples: login nodes, job scheduler nodes, desktop nodes (in users' offices)

TRUSTED
mmchconfig adminMode=allToAll

The domain of trust can be extended over a WAN via GPFS multi-cluster. Use OpenSSL for access security. (n.b., root squash option)

[Diagram summary] A campus wide network (NO root access) connects the non-trusted nodes to the trusted GPFS cluster. The trusted cluster consists of client frames #1 through #8, each holding 32 client nodes behind an Ethernet switch, plus 8 NSD servers (NSD Server-01 ... NSD Server-08) attached via an Ethernet switch to a DS5300 with EXP5060 drawers #1 through #8 holding 480 x SATA disks (1 TB/disk, 7200 rpm).
COMMENTS:
  Only the nodes within the trusted network have direct access to the GPFS file system.
  User accounts do NOT exist on nodes in the trusted network.
  User access is indirect via job schedulers, login nodes, etc.
FOOTNOTE: Prior to version 3.3, this was the only option for a GPFS cluster.

Scope of trust = all nodes in GPFS cluster

GPFS Security - New


Example: Restricting Access to Trusted Environment
NON-TRUSTED
Nodes outside the trusted network are not part of the GPFS cluster.
These nodes can be given restricted access to the GPFS cluster via NFS.
Examples: login nodes, job scheduler nodes, desktop nodes (in users' offices)

TRUSTED
mmchconfig adminMode=central

[Diagram summary] The hardware is the same as on the previous page: a campus wide network connecting the non-trusted nodes, client frames #1 through #8 (32 clients each), 8 NSD servers, and a DS5300 with EXP5060 drawers #1 through #8 holding 480 x SATA disks (1 TB/disk, 7200 rpm). The difference is that with adminMode=central there is NO root access on the campus network and NO root access on the client frames; only the administrative nodes are trusted (see below).
COMMENTS:
  Only the nodes within the trusted network have direct access to the GPFS file system.
  User accounts do NOT exist on nodes in the trusted network.
  User access is indirect via job schedulers, login nodes, etc.
FOOTNOTE: mmchconfig adminMode is a new feature in version 3.3.

Scope of trust = administrative nodes only

GPFS Security
Example Configuring Passwordless ssh/scp Authentication
[root@nsd1 ~]# cd .ssh                                 <-- create the /root/.ssh directory if it does not exist
[root@nsd1 .ssh]# ssh-keygen -t rsa -f id_rsa          <-- generate the public/private key pair (the other option is dsa)
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):            <-- leave these responses blank to avoid passwords
Enter same passphrase again:
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.
The key fingerprint is:
dc:74:17:46:0e:ea:ad:96:50:df:d3:bf:99:86:d6:c8 root@nsd1
[root@nsd1 .ssh]# cat id_rsa.pub >> authorized_keys    <-- append the public key file to the authorized_keys file
[root@nsd1 .ssh]# ssh nsd1                             <-- be sure you can ssh to yourself without a password; all nodes must be able to do this
The authenticity of host 'nsd1 (172.31.1.78)' can't be established.
RSA key fingerprint is d8:4a:cd:96:45:25:34:19:34:fa:23:98:36:c0:ed:7e.
Are you sure you want to continue connecting (yes/no)? yes        <-- this is normal the first time you ssh to a node
Warning: Permanently added 'nsd1,172.31.1.78' (RSA) to the list of known hosts.
Last login: Thu Oct 9 17:01:06 2008 from nsd1
[root@nsd1 .ssh]# exit
Connection to nsd1 closed.
[root@nsd1 .ssh]# dir
total 20
-rw------- 1 root root  391 Oct 9 17:06 authorized_keys    <-- be sure the permissions are 600 and the owner/group is root
-rw------- 1 root root 1675 Oct 9 17:05 id_rsa              <-- be sure the permissions are 600 and the owner/group is root
-rw-r--r-- 1 root root  391 Oct 9 17:05 id_rsa.pub          <-- be sure the permissions are 644 and the owner/group is root
-rw-r--r-- 1 root root  398 Oct 9 17:06 known_hosts
[root@nsd1 .ssh]#

The known_hosts file is generated "automagically" when a remote node first logs into the local node via ssh. In this example, it was created when we "sshed" to ourselves and answered "yes".

GPFS Security
Example Configuring Passwordless ssh/scp Authentication
It is necessary for all nodes in the GPFS cluster to have ssh keys. It is common practice to generate the keys on one node and copy them to all other nodes in the GPFS cluster.

[root@nsd1 .ssh]# for i in 2 3 4
> do
> scp authorized_keys id_rsa id_rsa.pub known_hosts nsd$i:.ssh
> done
The authenticity of host 'nsd2 (172.31.1.79)' can't be established.
RSA key fingerprint is 48:db:31:71:76:4f:25:f0:37:b1:62:29:d6:87:5e:4e.
Are you sure you want to continue connecting (yes/no)? yes        <-- answering "yes" to this request causes ssh to "automagically" append encrypted public keys to the local known_hosts file; subsequent logins to any of these remote nodes will no longer encounter this request
Warning: Permanently added 'nsd2,172.31.1.79' (RSA) to the list of known hosts.
root@nsd2's password: ********                                    <-- the first ssh access to remote nodes requires a password; after properly copying the keys to these other nodes, a password challenge will no longer happen
authorized_keys                100%  391   0.4KB/s   00:00
id_rsa                         100% 1675   1.6KB/s   00:00
id_rsa.pub                     100%  391   0.4KB/s   00:00
known_hosts                    100%  796   0.8KB/s   00:00
The authenticity of host 'nsd3 (172.31.1.80)' can't be established.
RSA key fingerprint is e9:96:bc:31:a6:7f:e5:29:92:06:f3:ac:3d:5a:2b:3c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'nsd3,172.31.1.80' (RSA) to the list of known hosts.
root@nsd3's password: ********
authorized_keys                100%  391   0.4KB/s   00:00
id_rsa                         100% 1675   1.6KB/s   00:00
id_rsa.pub                     100%  391   0.4KB/s   00:00
known_hosts                    100% 1194   1.2KB/s   00:00
The authenticity of host 'nsd4 (172.31.1.81)' can't be established.
RSA key fingerprint is e2:3d:1b:3f:ef:6f:b8:bd:5e:0a:ab:e0:56:1b:83:39.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'nsd4,172.31.1.81' (RSA) to the list of known hosts.
root@nsd4's password: ********
authorized_keys                100%  391   0.4KB/s   00:00
id_rsa                         100% 1675   1.6KB/s   00:00
id_rsa.pub                     100%  391   0.4KB/s   00:00
known_hosts                    100% 1592   1.6KB/s   00:00
[root@nsd1 .ssh]#

GPFS Security
Example Configuring Passwordless ssh/scp Authentication
[root@nsd1 .ssh]# ssh nsd1 ssh nsd2 ssh nsd3 ssh nsd4 ssh nsd1 date
Host key verification failed.
This is a simple test to be sure ssh is configured properly.
It failed because the known_hosts file was incomplete on the other nodes in the GPFS cluster.

TRICK: Since the known_hosts file on the local node is now complete after copying the keys to all of the other nodes in the GPFS cluster, simply copy it again to all of the other nodes in the GPFS cluster.
[root@nsd1 .ssh]# for i in 2 3 4
> do
> scp known_hosts nsd$i:.ssh
> done
known_hosts                    100% 1592   1.6KB/s   00:00
known_hosts                    100% 1592   1.6KB/s   00:00
known_hosts                    100% 1592   1.6KB/s   00:00
[root@nsd1 .ssh]# ssh nsd1 ssh nsd2 ssh nsd3 ssh nsd4 ssh nsd1 date
Thu Oct 9 17:14:45 EDT 2008
The test completed properly this time.
This test is not foolproof, however.

[root@nsd1 .ssh]# ls -al
drwx------  2 root root 4096 Oct  3 18:23 .ssh     <-- be sure the permissions are 700 and the owner/group is root
drwxr-x--- 12 root root 4096 Sep 24 03:25 root     <-- be sure the permissions are 750 and the owner/group is root
[root@nsd1 ~]#

WARNING: Some implementations of ssh/scp may not allow passwordless access if the permissions are not set properly.

COMMENT: This is a tedious process! For large clusters, automated tools are used to do this task.

The following pages are a potpourri of practical sysadm and tuning experience (often learned late at night under duress :->).

Who says an old dog can't learn new tricks?!?!

Read/Modify/Write Penalty
Choosing N in N+P in RAID 5 Configurations

RAID 5 "disk arrays" have N data disks and 1 parity disk


This is called "N+P" (e.g., 4+P)

A RAID 5 "stripe" is N * segment_size where segment_size is the size of the block of data written to 1 physical disk in the RAID 5 array
If segment_size = 256K with 4+P RAID 5 array, then the stripe_size = 1024K

GPFS block_size should equal the RAID 5 stripe size for best performance. Since the GPFS block_size is not arbitrary (it must be one of a limited set of power-of-2 sizes), this implicitly restricts the choices for N and the segment_size if optimum performance is to be achieved. For example...
On a DS4000 system, N = { 4 | 8 }
If GPFS block_size = 1024K, then
  if N == 4, then segment_size = 256K
  if N == 8, then segment_size = 128K
If GPFS block_size = 256K, then
  if N == 4, then segment_size = 64K
  if N == 8, then segment_size = 32K

If N+P and the GPFS block_size are consistent, then block_size == N * segment_size and the full stripe is written in one operation
  this yields the best performance
  this is sometimes called a "full stride write"
If they are not consistent, then it is necessary to read the RAID 5 stripe, update it, and write it back
  this significantly reduces performance

atime and mtime


mmchfs -E { yes | no }
mmchfs -S { yes | no }

The first ls does not update atime; atime is updated when a file is actually accessed (ls accesses the directory). I do not have data on the cost of atime - it really depends on the workload. The reason for recommending -E no (for mtime) is that on some systems we have observed an impact of mtime in shared file updates (and variability in performance). My guess would have been to expect atime to be a lesser issue than mtime (what is the workload that makes them concerned about the performance impact of atime updates?).
-E no means no exact mtime. For suppressing atime updates you would say -S yes.
These changes require remounting the file system.
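
In command form, the settings discussed above look like this (gpfs1 is the example device used earlier in this section; remember to remount afterwards):

mmchfs gpfs1 -S yes     # suppress atime updates
mmchfs gpfs1 -E no      # do not maintain exact mtime values
mmlsfs gpfs1 -E -S      # verify the settings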

Importance of Stable LAN and SAN

It's the network's fault!

It's GPFS's fault!

MMFS_PHOENIX Error Log Message


AIX: errpt -a Linux: less /var/log/messages

Common error message in the OS error log


Mar 15 15:55:15 bm-dell-10 mmfs: Error=MMFS_PHOENIX, ID=0xAB429E38, Tag=10383382: Reason code 668 Failure Reason Lost membership in cluster bm-dell-10. Unmounting file systems.

What is PHOENIX?

It is the "high availability" layer in GPFS today. Replaces the RSCT service used by GPFS in its AIX days

Significance of error message


MMFS_PHOENIX messages occur when a node joins, leaves, or changes cluster membership
this could merely be a normal response to an external event (e.g., mmdelnode) this could be a response to anomalous event (e.g., removing node due to loss of network access)

Testing Adapters and LUNs Using dd


OBSERVATIONS: A slow or improperly configured LAN adapter (e.g., Ethernet, Myrinet, IB) may adversely affect GPFS performance. Use dd to isolate the performance of a given adapter as follows:
dd if=/dev/zero bs=1024k count=1024 | ssh <hostname> dd of=/dev/null

A slow or improperly configured LUN in a GPFS file system can slow down performance for the entire file system. Use dd to isolate the performance of a given LUN. For example, read a SCSI device in Linux:
time dd if=/dev/sdc of=/dev/null bs=1024K count=2048

for example, read a raw device in AIX


time dd if=/dev/rhdisk2 of=/dev/null bs=1024K count=2048

Use caution when writing to SCSI device... it will "clobber" a file system
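
To compare all of the LUNs in one pass, the read test can simply be looped over the device names (the sd<x> names below are illustrative; reads are harmless, writes are not):

for d in sdb sdc sdd sde
do
  echo "=== /dev/$d ==="
  time dd if=/dev/$d of=/dev/null bs=1024k count=2048
done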

Mixed GPFS Code Levels


Co-existence defined
Nodes with different GPFS code levels may be active in the same cluster and simultaneously access the same file system.

Co-existence makes it possible to


upgrade GPFS within a cluster without shutting down GPFS on all nodes (this is also called "rolling upgrades")
mount GPFS file systems from other GPFS clusters that may be running a different GPFS code level

Beginning with GPFS 2.3.0.6


Nodes running different 2.3 maintenance levels may co-exist
Nodes running the 2.3 and 3.1 releases cannot co-exist

Beginning with GPFS 3.1


Release-to-release co-existence is officially supported
  this includes co-existence of maintenance levels as well as release levels; e.g., 3.1 and 3.2 may co-exist
Once all nodes are upgraded to the latest version, it is necessary to run the following commands:
  mmchconfig release=LATEST
  mmchfs -V all

See the GPFS: Concepts, Planning, and Installation Guide for further information.

Suspending GPFS

If it is necessary to take down a node to repair something, do the following...


> mmfsctl <FS name> suspend
  (do something)
> mmfsctl <FS name> resume

Fine Print:
Use the mmfsctl command to issue control requests to a particular GPFS file system. The command is used to temporarily suspend the processing of all application I/O requests, and later resume them, as well as to synchronize the file systems configuration state between peer clusters in disaster recovery environments.

SAN Congestion
An Example
This analysis is based on the Brocade SilkWorm 48000 with a 4 Gb/s FC fabric
  2 SAN switches with 2 x 32 port blades
  2 ASICs per blade with 16 ports per ASIC
  total ports available = 128; total ports used = 56

Empirical tests show that the effective ASIC BW < 1100 MB/s
  test code: dd to raw disks, reading 4096 records with sizeof(record) = 1M
  effective BW is the BW measured by the application
  total BW through the ASIC is 2200 MB/s (accounting for the data-in and data-out streams)

Properly distribute host and controller connections across all ASICs to avoid ASIC saturation
  see the cabling example on the next page
  requires using all DS4800 host side ports in this example
  BEST PRACTICE: deploy cabling to avoid all "inter-ASIC" traffic
  For completeness, each ASIC connects to a control processor enabling 32 Gb/s simplex or 64 Gb/s duplex inter-ASIC communication; however, the electronics of the ASIC do not appear to be able to handle that much aggregate BW.

This SAN cabling issue does not impact the standard GPFS NSD configuration.
  In the standard NSD configuration, a SAN is not necessary. Moreover, each host port typically will be accessed by only 1 HBA, i.e., there is a 1:1 HBA to host port ratio. In this multi-cluster VSD configuration, there is a 3:1 HBA to host port ratio.

SAN Congestion
The View from the Perspective of fcs0
Legend: DS0<1|2|3|4>, Array ID {1...12}
[Diagram summary] Zoning: 6 nodes (node1 ... node6), each with 4 FC HBAs (fcs0, fcs1, fcs2, fcs3), connect through the two ASICs of a Brocade SilkWorm 48000 blade to 4 x DS4800 controllers (DS01 ... DS04). Each DS4800 has controllers A and B with host ports 1-4, and its 12 RAID arrays are split into partition #1 (arrays 1, 2, 3, 7, 8, 9) and partition #2 (arrays 4, 5, 6, 10, 11, 12). The zoning maps each HBA to controller host ports on the same ASIC (e.g., fcs0 on node1 to host port A1), so that I/O to the arrays in partition #1 stays within a single ASIC.
Notice that there is no data transmission between the 2 ASICs when accessing RAID arrays in partition #1.

SAN Congestion
The Complete Cabling View
Legend: DS0<1|2|3|4>, Array ID {1...12}
[Diagram summary] The same configuration as on the previous page, but now every DS4800 host port (A1-A4 and B1-B4 on each of DS01 ... DS04) is cabled and zoned to the node HBAs, with the connections still arranged so that each HBA reaches host ports on its own ASIC.
Notice that all host ports are being used. This does not increase BW (cf using only 4 host ports per DS4800), but this makes it possible to avoid inter-ASIC traffic.

Non-unique Device Names


Due to sysadm errors, it is possible for a given LUN (RAID array) to have multiple device names.
e.g., /dev/sdy and /dev/sdba are the same RAID array

This can be seen when trying to create a new NSD on /dev/sdy


[root@gpfs01 gpfslpp]# cat disk.lst.more /dev/sdy:::::nsd14 [root@gpfs01 gpfslpp]# mmcrnsd -F disk.lst.more mmcrnsd: Processing disk sdy mmcrnsd: Disk descriptor /dev/sdy:::::nsd14 refers to an existing NSD nsd11

Doing an mmlsnsd -f <device> -m shows nsd11 is assigned to /dev/sdba


nsd11 C0A8010142828D9C /dev/sdba gpfs01 directly attached

To confirm this, dump the NSD record on the LUN to stdout


[root@gpfs01 gpfslpp]# dd if=/dev/sdy count=10

The dump is mostly binary, but the following text record can be seen
NSD descriptor for /dev/sdba created by GPFS Wed May 11 17:56:17 2005

Under Linux, this discrepancy can be seen by comparing the output between
  mmlsnsd -f <device> -m        <-- output omitted due to excessive length
and
  [root@gpfs01 gpfslpp]# ps -ef | grep mmfsd | grep -v grep
  root 19625 19510 0 May11 ? 00:00:40 /usr/lpp/mmfs/bin//mmfsd
  [root@gpfs01 gpfslpp]# lsof -p 19625        <-- output omitted due to excessive length

which lists the devices associated with mmfsd

Fine Grain Directory Locking


The problem
Multiple nodes changing the contents of a single directory at the same time hurts performance.
  multiple nodes changing the same directory block simultaneously force the nodes to serialize operations to maintain block consistency
  alternative schemes: if an application runs on multiple nodes where every node creates a file, it is recommended to either...
    precreate all files (empty) on one node before all the other nodes open/access their files (see the sketch below), or
    have each node create its file in its own private directory

very common in digital media and bio-informatics applications

This issue was corrected in patch release 3.2.1.6 (Sep 08)
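
As recommended above, one workaround is to precreate the empty files from a single node before the parallel job opens them; a minimal sketch (the path and file count are illustrative):

# run on ONE node only, before the parallel job starts
for i in $(seq 0 255)
do
  touch /gpfs1/run_dir/out.$i
done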

LTG and VSD Buddy Buffers


AIX Tuning Parameter Affecting GPFS when Using VSD

When using large pSeries clusters with the HPS (i.e., "federation") or SP (i.e., "colony") switch, VSD provides a more efficient switch protocol than TCP/IP. But be sure to set the following AIX tuning parameters appropriately for GPFS.

Set the LTG size to be >= GPFS blocksize (default = 128K)
  e.g., if the blocksize is 1024K, then increase the LTG size to 1024K
  requires AIX 5.2 or later (early AIX 5.2 releases may require a patch?)
  modify the mmvsdhelper script (see /usr/lpp/mmfs/bin)

Set the buddy buffer size to be >= GPFS blocksize (default = 256K)
  e.g., if the blocksize is 1024K, then set the buddy buffer size to 1024K
  modify via smitty

Consistent GID/UID

Linux Memory Management Issue


If memory is oversubscribed, Linux "shoots" large memory users.
  mmfsd is a common target since, under Linux, the pagepool is accounted as belonging to mmfsd
  Reducing the risk: reduce the size of the pagepool

Installing GPFS Under CentOS


GPFS is officially tested with RHEL and SUSE Linux only, but it generally works with other RHEL-like distributions.
  CentOS is the most common; others include Scientific Linux and Rocks.
Non-RHEL distributions may require some "tweaking".


e.g., make Autoconfig fails when building the portability layer

CentOS 5.4 example.


Change the Red Hat version identifier
# echo "Red Hat Enterprise Linux Server release 5.4 (Tikanga)" > /etc/redhat-release

Install RPMs for kernel-devel and compat-libstdc++-33
Fix a "broken" symbolic link


The link may appear as follows...
# ls -la /lib/modules/2.6.18-164.el5/build* lrwxrwxrwx 1 root root 46 Feb 2 16:14 build -> ../../../usr/src/kernels/2.6.18-164.el5-x86_64

It should look like this...


# ls -la /lib/modules/2.6.18-164.el5/build* lrwxrwxrwx 1 root root 44 Feb 4 13:08 build -> /usr/src/kernels/2.6.18-164.11.1.el5-x86_64/

Fix it with the following steps...


# unlink /lib/modules/2.6.18-164.el5/build # ln -s /usr/src/kernels/2.6.18-164.11.1.el5-x86_64/ /lib/modules/2.6.18-164.el5/build
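
With those fixes in place the portability layer can be rebuilt; a hedged sketch of the usual GPL layer build steps for GPFS 3.x (the paths are the defaults; consult /usr/lpp/mmfs/src/README for the exact procedure for your release):

cd /usr/lpp/mmfs/src
export SHARKCLONEROOT=/usr/lpp/mmfs/src
make Autoconfig        # generate the build configuration (this is the step that reads /etc/redhat-release)
make World             # build the GPFS portability (GPL) kernel modules
make InstallImages     # install the built kernel modules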

Queue Depth
What is the queue depth?
Storage controllers can process up to a maximum number of concurrent I/O operation requests, sometimes called the maximum command queue depth (MQD).
DS5300: MQD = 4096 requests (plus a small number of active requests)
  up to 2048 requests per RAID controller
  up to 2048 requests per port, up to the maximum allowed on a RAID controller

DCS9900: MQD = 4608 = 4096 queued requests + 512 active requests


up to 576 requests per port (n.b., you must use all ports to get all 4608 requests)

When the MQD has been reached, the controller will respond with a "queue full" status until some number of active requests have been processed and there is room for new requests.
Storage adapter (e.g., HBA) drivers provide a device queue depth (DQD) parameter controlling the number of I/O requests submitted to a disk device on a given host.
  the DQD sets the number of I/O requests per device (e.g., sd<char> or hdisk<int>)

Purpose of the queue depth parameters


If a cluster submits more I/O requests than a storage controller can process, erratic behavior occurs (e.g., lost I/O requests). The DQD is used to limit the amount of I/O received by a storage controller.

How it works
Parameters:
  NN = number of nodes submitting IOPs to a storage controller
  NLUN = number of LUNs per node
For reliable operation, set the DQD such that MQD > NN * NLUN * DQD
This formula ignores the fact that Linux may break a GPFS "packet" into several transactions. This is especially true for larger block sizes, and it only makes the problem worse!

Consequences for a SAN configuration


For a large SAN, the DQD should be set small
  e.g., set DQD = 1 for a cluster with 128 nodes and 24 LUNs
Problems:
  a small DQD limits the number of I/O requests per node
  setting DQD to 1 limits the size of the SAN cluster

Scratch Paper: Consider a cluster with 128 nodes and a DS5300 (MQD = 4096). (See the sketch below for checking the DQD a node is actually using.)
Set DQD = 1 and let NLUN = 24: NN * NLUN * DQD = 128 * 24 * 1 = 3072 < MQD = 4096: OK!
Set DQD = 1 and let NLUN = 32: NN * NLUN * DQD = 128 * 32 * 1 = 4096, which is not < MQD = 4096: Ouch!
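A minimal sketch for checking the device queue depth a Linux host is actually using and sanity-checking the formula; the device names and sysfs paths are examples and may vary by HBA driver.

# Show the current DQD for each SCSI disk on this node (paths are examples)
for d in /sys/block/sd*/device/queue_depth; do
    echo "$d = $(cat $d)"
done
# Quick check of the MQD formula for an example configuration
NN=128; NLUN=24; DQD=1
echo "NN*NLUN*DQD = $((NN * NLUN * DQD))   (must stay below the controller MQD)"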

12. GPFS Configuration Example

The following pages provide an actual example of installing and configuring GPFS under Linux using 4 x3650 NSD servers and a DCS9550 storage controller and disk enclosures. The steps for doing this under AIX are very similar; differences are explained in the annotations. This example can be used as a hands-on guide for a lab exercise. Note the following:
red arial font is used for annotations blue courier font is used to highlight commands and parameters black courier font is used for screen text COMMENT: This example is based on GPFS 3.1, but the steps for GPFS 3.2 are nearly identical. Key differences are highlighted in context.

Lab Exercise
Install GPFS from media
if using Linux, build portability layer

Configure GPFS appropriate for lab cluster As time allows

experiment with "mmls" and "mmch" commands examine /var/adm/ras/mmfs.log.<extension> examine /var/mmfs look at /var/mmfs/gen/mmsdrfs file run dd or other benchmark tests monitor performance using iostat, vmstat, SMclient
iostat, vmstat not installed by default in Linux must have Windows or AIX client to run SMclient

examine packet size and latency using mmpmon
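As an illustration for the last item (not part of the original lab script), mmpmon can be driven from a small input file; fs_io_s and io_s are standard mmpmon requests, and the file path is an example.

# Collect per-file-system and node-wide I/O statistics once
cat > /tmp/mmpmon.cmds <<EOF
fs_io_s
io_s
EOF
/usr/lpp/mmfs/bin/mmpmon -i /tmp/mmpmon.cmds -r 1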

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


PHYSICAL CONFIGURATION
(Diagram: 16 x3650-M2 nodes, numbered 01..16, connect to a GbE Ethernet switch and an FC8 SAN switch; all nodes are SAN clients of a DDN couplet with 60-disk trays.)

GPFS Topology: SAN Node Configuration
16 nodes
  12 x x3650: 2 sockets, 4 cores, 8 GB RAM, 2 x GbE, 2 x FC8
  4 x x3650-M2
Linux
  Distribution: CentOS 5.4*
  Kernel: 2.6.18-164.11.1.el5

Storage Configuration
DDN Couplet
  Host Connections: 16 x FC8; Drive Connections: 40 x SAS
  20 x SSD: 4 x 4+P RAID 5, 60 GB per SSD
  224 x SAS: 450 GB/disk, 15 Krpm, 56 x 4+P RAID 5

FOOTNOTE: * Since this is CentOS and not RHEL, it's necessary to create a configuration file as follows so that the portability layer build procedures will work. [root@node-01]# echo "Red Hat Enterprise Linux Server release 5.4 (Tikanga)" > /etc/redhat-release

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS

Outline of Steps to Install and Configure GPFS


1. Establish administrative control and scope
   a. e.g., enable ssh with passwordless root access to designated nodes
2. Install the GPFS code
   a. Base version (e.g., 3.3.0.0)
   b. PTF version (e.g., 3.3.0.4)
3. Build the portability RPM (it is only necessary to build the portability layer under Linux)
4. Configure a GPFS file system
   a. Create the cluster
   b. Declare client and server licenses
   c. Change global GPFS parameters and start the GPFS daemon
   d. Create the NSDs
   e. Create and mount the file system

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


Steps to Install the GPFS Code under Linux
Example installation directory: /gpfs_install/gpfs_3.3.0.0
Alternatively, copy the RPMs to all nodes.
Installing GPFS under AIX and Windows is quite different; see the note below for references.

1. Create an NFS mounted installation directory for extracting the RPMs

2. Copy base version RPMs to the installation directory and extract the RPMs on all nodes in the GPFS cluster.
Sample RPM names (see the install example below):
gpfs.base-3.3.0-0.x86_64.rpm
gpfs.docs-3.3.0-0.noarch.rpm       (contains the man pages; they will be installed in /usr/share/man/)
gpfs.gpl-3.3.0-0.noarch.rpm
gpfs.gui-3.3.0-0.x86_64.rpm        (it is not necessary to install the GUI)
gpfs.msg.en_US-3.3.0-0.noarch.rpm
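For reference, a typical install of the base RPMs on one node might look like the following sketch; the directory path matches the example above, and the exact file set depends on what you choose to install.

cd /gpfs_install/gpfs_3.3.0.0
rpm -ivh gpfs.base-3.3.0-0.x86_64.rpm \
         gpfs.docs-3.3.0-0.noarch.rpm \
         gpfs.gpl-3.3.0-0.noarch.rpm \
         gpfs.msg.en_US-3.3.0-0.noarch.rpm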

3. Download the latest update package. This comes as a tar/gzip file from
https://www14.software.ibm.com/webapp/set2/sas/f/gpfs/download/home.html gpfs-3.3.0-4.x86_64.update.tar.gz This file contains a different version of the same RPMs as the base version.

4. Copy this file to the installation directory, then gunzip/untar this file, and extract the RPMs on all nodes in the GPFS cluster.
Example installation directory: /gpfs_install/gpfs_3.3.0.4
The steps for doing this are more thoroughly documented in chapter 5 of the GPFS Concepts, Planning and Installation Guide The steps for installing the GPFS code under AIX and Windows are also documented in this guide. It can be found at
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfs_com_faq.html

This link may not take you directly to the current GPFS FAQ, but by drilling down, you can get there.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


Steps to Install the GPFS Code under Linux: Building Portability RPM
5. Select a node in the cluster and do the following:
cd /usr/lpp/mmfs/src
make Autoconfig
make World
make InstallImages
make rpm
Use echo $? to verify that each make operation completes normally.
Sample rpm name: gpfs.gplbin-2.6.18-164.11.1.el5-3.3.0-4.x86_64.rpm

6. Copy the portability rpm to all nodes and install it.
7. Warnings and Caveats
If a cluster has mixed architectures and/or kernel levels, it is necessary to build a portability rpm for each instance and copy it to like nodes.
Required Linux patches for GPFS can be found at: http://www.ibm.com/developerworks/opensource/
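The build sequence from step 5 can also be scripted with explicit error checking; this is an illustrative sketch, not part of the official procedure.

cd /usr/lpp/mmfs/src || exit 1
for target in Autoconfig World InstallImages rpm; do
    make $target
    if [ $? -ne 0 ]; then
        echo "make $target failed" >&2
        exit 1
    fi
done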

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# cat node_spec.lst
node-01:manager-quorum
node-02:manager-quorum
node-03:manager-quorum
node-04
node-05
node-06
node-07
node-08
node-09
node-10
node-11
node-12
node-13
node-14
node-15
node-16

The manager-quorum nodes must be licensed as servers; the remaining nodes must be licensed as clients.
The administrative domain spans all nodes (i.e., the traditional security model).

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmcrcluster -n node_spec.lst -p node-01 -s node-02 -R /usr/bin/scp -r /usr/bin/ssh Create the Thu Feb 4 21:56:06 EST 2010: mmcrcluster: Processing node node-01 GPFS cluster. Thu Feb 4 21:56:06 EST 2010: mmcrcluster: Processing node node-02 Thu Feb 4 21:56:06 EST 2010: mmcrcluster: Processing node node-03 Thu Feb 4 21:56:07 EST 2010: mmcrcluster: Processing node node-04 Thu Feb 4 21:56:07 EST 2010: mmcrcluster: Processing node node-05 Thu Feb 4 21:56:08 EST 2010: mmcrcluster: Processing node node-06 Thu Feb 4 21:56:08 EST 2010: mmcrcluster: Processing node node-07 Thu Feb 4 21:56:09 EST 2010: mmcrcluster: Processing node node-08 Thu Feb 4 21:56:09 EST 2010: mmcrcluster: Processing node node-09 Thu Feb 4 21:56:10 EST 2010: mmcrcluster: Processing node node-10 Thu Feb 4 21:56:10 EST 2010: mmcrcluster: Processing node node-11 Thu Feb 4 21:56:11 EST 2010: mmcrcluster: Processing node node-12 Thu Feb 4 21:56:11 EST 2010: mmcrcluster: Processing node node-13 Thu Feb 4 21:56:12 EST 2010: mmcrcluster: Processing node node-14 Thu Feb 4 21:56:12 EST 2010: mmcrcluster: Processing node node-15 Thu Feb 4 21:56:13 EST 2010: mmcrcluster: Processing node node-16 mmcrcluster: Command successfully completed mmcrcluster: Warning: Not all nodes have proper GPFS license designations. Use the mmchlicense command to designate licenses as needed. mmcrcluster: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

This is a common and routine message. GPFS is using ssh and scp to copy configuration information to all of the nodes asynchronously. It may be that the node on which the command is executed may complete before all of the other nodes are ready.

mmcrcluster parameters -n: list of nodes to be included in the cluster -p: primary GPFS cluster configuration server node -s: secondary GPFS cluster configuration server node -R: remote copy command (e.g., rcp or scp) -r: remote shell command (e.g., rsh or ssh)

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmlscluster          List the cluster to verify that it was created as intended.

GPFS cluster information
========================
GPFS cluster name:         node-01
GPFS cluster id:           12402633858572060870
GPFS UID domain:           node-01
Remote shell command:      /usr/bin/ssh
Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers: ----------------------------------Primary server: node-01 Secondary server: node-02 Node Daemon node name IP address Admin node name Designation ----------------------------------------------------------------------------------------1 node-01 172.31.1.200 node-01 quorum-manager 2 node-02 172.31.1.201 node-02 quorum-manager 3 node-03 172.31.1.202 node-03 quorum-manager 4 node-04 172.31.1.203 node-04 5 node-05 172.31.1.204 node-05 6 node-06 172.31.1.205 node-06 7 node-07 172.31.1.206 node-07 8 node-08 172.31.1.207 node-08 9 node-09 172.31.1.210 node-09 10 node-10 172.31.1.211 node-10 11 node-11 172.31.1.212 node-11 12 node-12 172.31.1.213 node-12 13 node-13 172.31.1.214 node-13 14 node-14 172.31.1.215 node-14 15 node-15 172.31.1.216 node-15 16 node-16 172.31.1.217 node-16

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# cat license_server.lst node-01 node-02 node-03 [root@node-01 GPFS_install]# mmchlicense server --accept -N license_server.lst The following nodes will be designated as possessing GPFS server licenses: node-01 node-02 node-03 mmchlicense: Command successfully completed mmchlicense: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

[root@node-01 GPFS_install]# mmchlicense client --accept -N license_client.lst
The following nodes will be designated as possessing GPFS client licenses:
node-04  node-05  node-06  node-07  node-08  node-09  node-10
node-11  node-12  node-13  node-14  node-15  node-16
mmchlicense: Command successfully completed
mmchlicense: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

mmchlicense parameters
  server: server license type
  client: client license type
  --accept: suppress the license prompt (implies you accept the license terms)
  -N: list of nodes for a given license type
It is necessary to explicitly declare both license types.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmlslicense
Summary information
---------------------
Number of nodes defined in the cluster:                       16
Number of nodes with server license designation:               3
Number of nodes with client license designation:              13
Number of nodes still requiring server license designation:    0
Number of nodes still requiring client license designation:    0

mmlslicense parameters
  -L: Displays the license type for each node, using an * to designate a node with a license out of compliance.
[root@node-01 GPFS_install]# mmlslicense -L Node name Required license Designated license ------------------------------------------------------------------node-01 server server node-02 server server node-03 server client * node-04 client client node-05 client node * Summary information --------------------Number of nodes defined in the cluster: 5 Number of nodes with server license designation: 2 Number of nodes with client license designation: 2 Number of nodes still requiring server license designation: 1 Number of nodes still requiring client license designation: 1

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmchconfig maxMBpS=2000,maxblocksize=4m,pagepool=256m,autoload=yes,adminMode=allToAll
Verifying GPFS is stopped on all nodes ...                    Change selected global parameters.
mmchconfig: Command successfully completed
mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

[root@node-01 GPFS_install]# mmlsconfig                       List the global parameters to verify that they are set as intended.
Configuration data for cluster bm-dell-10:
------------------------------------------
clusterName node-01
clusterId 12402633858572060870
autoload yes
minReleaseLevel 3.3.0.2
dmapiFileHandleSize 32
maxMBpS 2000
maxblocksize 4m
pagepool 256m
adminMode allToAll

File systems in cluster node-01:
--------------------------------
(none)

The mmchconfig parameters are
  maxMBpS: Limits the LAN BW per node. To get the peak rate, set it ~= 2X the desired BW; do NOT set it excessively large.
  maxblocksize: Maximum file system block size allowed. This parameter can not be easily changed.
  pagepool: Size of the GPFS cache.
  autoload: yes -> start mmfsd when a node is rebooted.
  adminMode: allToAll -> all nodes allow passwordless root access; client -> a subset of nodes allows passwordless root access.
NOTES: mmchconfig parameters can be set differently on different nodes using the -N option. There are many more mmchconfig parameters possible, most of which are undocumented.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmstartup -a Thu Feb 4 21:58:56 EST 2010: mmstartup: Starting GPFS ... [root@node-01 GPFS_install]# mmgetstate -a Node number Node name GPFS state -----------------------------------------1 node-01 active 2 node-02 active 3 node-03 active 4 node-04 active 5 node-05 active 6 node-06 active 7 node-07 active 8 node-08 active 9 node-09 active 10 node-10 active 11 node-11 active 12 node-12 active 13 node-13 active 14 node-14 active 15 node-15 active 16 node-16 active
Start the GPFS daemon (aka, mmfsd) on all nodes in the cluster.

Be sure mmfsd is active on all nodes before proceeding.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# cat disk.lst dm-3:::metadataOnly::ssd0 dm-4:::metadataOnly::ssd1 dm-5:::metadataOnly::ssd2 dm-6:::metadataOnly::ssd3 dm-7:::dataOnly::sas4 dm-8:::dataOnly::sas5 dm-9:::dataOnly::sas6 dm-10:::dataOnly::sas7 dm-11:::dataOnly::sas8 dm-12:::dataOnly::sas9 dm-13:::dataOnly::sas10 dm-14:::dataOnly::sas11
Since GPFS is configured as a SAN topology, primary and backup NSD servers are not specified. The Linux multi-pathing driver is used.
Create an NSD specification file. The format for each line is as follows:
f1:f2:f3:f4:f5:f6:f7
where
  f1 = scsi device
  f2 = comma separated NSD server list (there can be up to 8 NSD servers)
  f3 = NULL (retained for legacy reasons)
  f4 = usage
  f5 = failure group
  f6 = NSD name
  f7 = storage pool name
Fields left blank are filled with a default value.
Back up this specification file since it is an input/output file for mmcrnsd.

[root@node-01 GPFS_install]# cp disk.lst disk.lst.orig
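For comparison, in a LAN (NSD server) topology each line would also name its servers in field f2; the line below is a purely illustrative example of the f1..f7 format and is not part of this SAN configuration (the server and pool names are hypothetical).

# hypothetical LAN-topology line: device dm-7, served by nsd01 and nsd02,
# dataOnly usage, failure group 1, NSD name sas4, storage pool pool1
dm-7:nsd01,nsd02::dataOnly:1:sas4:pool1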

[root@node-01 GPFS_install]# mmcrnsd -F disk.lst -v no
mmcrnsd: Processing disk dm-3
mmcrnsd: Processing disk dm-4
mmcrnsd: Processing disk dm-5
mmcrnsd: Processing disk dm-6
mmcrnsd: Processing disk dm-7
mmcrnsd: Processing disk dm-8
mmcrnsd: Processing disk dm-9
mmcrnsd: Processing disk dm-10
mmcrnsd: Processing disk dm-11
mmcrnsd: Processing disk dm-12
mmcrnsd: Processing disk dm-13
mmcrnsd: Processing disk dm-14
mmcrnsd: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

The mmcrnsd parameters are
  -F: name of the NSD specification file (n.b., the file is changed by this command; keep a backup!)
  -v: check if this disk is part of an existing GPFS file system or ever had a GPFS file system on it (n.b., if it does/did and the parameter is yes, then mmcrnsd will not create it as a new NSD)

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# cat disk.lst # dm-3:::metadataOnly::ssd0 ssd0:::metadataOnly:-1:: # dm-4:::metadataOnly::ssd1 ssd1:::metadataOnly:-1:: # dm-5:::metadataOnly::ssd2 ssd2:::metadataOnly:-1:: # dm-6:::metadataOnly::ssd3 ssd3:::metadataOnly:-1::
This shows how the mmcrnsd command modifies the NSD specification file.

# dm-7:::dataOnly::sas4 sas4:::dataOnly:-1:: # dm-8:::dataOnly::sas5 sas5:::dataOnly:-1:: # dm-9:::dataOnly::sas6 sas6:::dataOnly:-1:: # dm-10:::dataOnly::sas7 sas7:::dataOnly:-1:: # dm-11:::dataOnly::sas8 sas8:::dataOnly:-1:: # dm-12:::dataOnly::sas9 sas9:::dataOnly:-1:: # dm-13:::dataOnly::sas10 sas10:::dataOnly:-1:: # dm-14:::dataOnly::sas11 sas11:::dataOnly:-1::

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmlsnsd -X
Verify that the NSDs were properly created.

Disk name NSD volume ID Device Devtype Node name Remarks ------------------------------------------------------------------------------------sas10 AC1F01C84B6C5755 /dev/dm-13 dmm node-01 Since GPFS is configured as SAN sas11 AC1F01C84B6C5756 /dev/dm-14 dmm node-01 topology in this example, the node sas4 AC1F01C84B6C574F /dev/dm-7 dmm node-01 names are not unique. In a LAN configuration, there is one line for sas5 AC1F01C84B6C5750 /dev/dm-8 dmm node-01 each LUN and each node where it is sas6 AC1F01C84B6C5751 /dev/dm-9 dmm node-01 mounted. sas7 AC1F01C84B6C5752 /dev/dm-10 dmm node-01 sas8 AC1F01C84B6C5753 /dev/dm-11 dmm node-01 mmlsnsd parameters sas9 AC1F01C84B6C5754 /dev/dm-12 dmm node-01 -X: list extended NSD information ssd0 AC1F01C84B6C574B /dev/dm-3 dmm node-01 ssd1 AC1F01C84B6C574C /dev/dm-4 dmm node-01 ssd2 AC1F01C84B6C574D /dev/dm-5 dmm node-01 ssd3 AC1F01C84B6C574E /dev/dm-6 dmm node-01

[root@bm-dell-10 GPFS_install]# mmlsnsd

Assume this command was issued after a file system is built.

File system Disk name NSD servers --------------------------------------------------------------------------gpfs1 sas10 (directly attached) By omitting the -X parameter, a different view is presented. gpfs1 sas11 (directly attached) For example, if a file system exists, it would then show the LUN to file system mapping. If GPFS used a LAN topology, gpfs1 sas4 (directly attached) it would show primary and backup NSD servers. gpfs1 sas5 (directly attached) gpfs1 sas6 (directly attached) gpfs1 sas7 (directly attached) gpfs1 sas8 (directly attached) gpfs1 sas9 (directly attached) gpfs1 ssd0 (directly attached) gpfs1 ssd1 (directly attached) gpfs1 ssd2 (directly attached) gpfs1 ssd3 (directly attached)

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmcrfs /gpfs1 gpfs1 -F disk.lst -A yes -B 256k -v no -n 32
The following disks of gpfs1 will be formatted on node node-01:
    ssd0: size 192937984 KB
    ssd1: size 192937984 KB
    ssd2: size 192937984 KB
    ssd3: size 192937984 KB
    sas4: size 2264924160 KB
    sas5: size 2264924160 KB
    sas6: size 2264924160 KB
    sas7: size 2264924160 KB
    sas8: size 2264924160 KB
    sas9: size 2264924160 KB
    sas10: size 2264924160 KB
    sas11: size 2264924160 KB
Formatting file system ...
Disks up to size 21 TB can be added to storage pool 'system'.
Creating Inode File
Creating Allocation Maps
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool 'system'
Completed creation of file system /dev/gpfs1.
mmcrfs: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

Parameters for mmcrfs
  /gpfs1: mount point
  gpfs1: device entry in /dev for the file system
  -F: output file from the mmcrnsd command
  -A: mount the file system automatically every time mmfsd is started
  -B: actual block size for this file system; it can not be larger than the maxblocksize set by the mmchconfig command
  -v: check if this disk is part of an existing GPFS file system or ever had a GPFS file system on it (n.b., if it does/did and the parameter is yes, then mmcrfs will not include this disk in this file system)
  -n: estimated number of nodes that will mount this file system (see the note below)

The optimum value for the actual block size is both application and controller dependent. Experimentation is recommended to determine the best choice for this value. The options are 16k, 64k, 128k, 256k, 512k, 1M, 2M, 4M.
The /etc/fstab is automatically updated by this command.

COMMENT: Do not forget to set the -n parameter. Since it provides an estimate of the number of nodes that will mount the file system, try to estimate future growth without wildly overestimating. While it can be off quite a bit with minimal impact, after it crosses a certain threshold performance can be severely impacted (e.g., performance will be impacted when it is off by an order of magnitude and the file system is over 70% capacity), and this parameter can not be easily changed. If you configure GPFS with a SAN topology on a cluster that you anticipate will exceed 32 nodes, seek technical assistance from IBM.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmlsfs gpfs1 Verify file system status.. flag value description ---- ---------------- -----------------------------------------------------f 8192 Minimum fragment size in bytes -i 512 Inode size in bytes -I 16384 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j scatter Block allocation type -D nfs4 File locking semantics in effect -k all ACL semantics in effect -a 1048576 Estimated average file size -n 32 Estimated number of nodes that will mount file system -B 262144 Block size -Q none Quotas enforced none Default quotas enabled -F 18448390 Maximum number of inodes -V 11.05 (3.3.0.2) File system version -u yes Support for large LUNs? -z no Is DMAPI enabled? -L 4194304 Logfile size -E yes Exact mtime mount option -S no Suppress atime mount option -K whenpossible Strict replica allocation option -P system Disk storage pools in file system -d ssd0;ssd1;ssd2;ssd3;sas4;sas5;sas6;sas7;sas8;sas9;sas10;sas11 Disks in file system -A yes Automatic mount option -o none Additional mount options -T /gpfs1 Default mount point

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmlsdisk gpfs1          Verify disk status.
disk         driver   sector  failure  holds     holds
name         type     size    group    metadata  data   status  availability  storage pool
------------ -------- ------  -------  --------  -----  ------  ------------  ------------
ssd0         nsd      512     -1       yes       no     ready   up            system
ssd1         nsd      512     -1       yes       no     ready   up            system
ssd2         nsd      512     -1       yes       no     ready   up            system
ssd3         nsd      512     -1       yes       no     ready   up            system
sas4         nsd      512     -1       no        yes    ready   up            system
sas5         nsd      512     -1       no        yes    ready   up            system
sas6         nsd      512     -1       no        yes    ready   up            system
sas7         nsd      512     -1       no        yes    ready   up            system
sas8         nsd      512     -1       no        yes    ready   up            system
sas9         nsd      512     -1       no        yes    ready   up            system
sas10        nsd      512     -1       no        yes    ready   up            system
sas11        nsd      512     -1       no        yes    ready   up            system

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


[root@node-01 GPFS_install]# mmmount /gpfs1 -a
Fri Feb 5 12:50:17 EST 2010: mmmount: Mounting file systems ...
[root@bm-dell-10 GPFS_install]# chmod 777 /gpfs1            Permissions propagate to mount points on all nodes.
[root@bm-dell-10 GPFS_install]# touch /gpfs1/test_file      Sanity check...
[root@bm-dell-10 GPFS_install]# dir /gpfs1
total 0
-rw-r--r-- 1 root root 0 Feb 5 12:51 test_file
[root@node-01 GPFS_install]# time dd if=/dev/zero of=/gpfs1/buggs_bunny bs=256k count=16384          Another sanity check. This is NOT a benchmark!
16384+0 records in
16384+0 records out
4294967296 bytes (4.3 GB) copied, 7.50776 seconds, 572 MB/s

real    0m7.511s
user    0m0.010s
sys     0m2.102s

[root@node-01 GPFS_install]# dir /gpfs1 total 4194304 -rw-r--r-- 1 root root 4294967296 Feb 5 13:01 buggs_bunny -rw-r--r-- 1 root root 0 Feb 5 12:51 test_file [root@node-01 GPFS_install]# cat /etc/fstab /dev/VolGroup00/LogVol00 / LABEL=/boot /boot tmpfs /dev/shm devpts /dev/pts sysfs /sys proc /proc /dev/VolGroup00/LogVol01 swap /dev/gpfs1 /gpfs1 gpfs

ext3 defaults 1 1 ext3 defaults 1 2 tmpfs defaults 0 0 devpts gid=5,mode=620 0 0 sysfs defaults 0 0 proc defaults 0 0 swap defaults 0 0 rw,mtime,atime,dev=gpfs1,autostart 0 0

mmcrfs automatically adds a GPFS stanza to the fstab file

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS

What if I screw up? Clean up and start over again.


Option #1: The proper way to do it. (Don't jump! It's easier the second time!)

1. Unmount the GPFS file system
   [root@gpfs1 gpfs_install]# mmumount /gpfs1 -a
2. Delete the GPFS file system
   [root@gpfs1 gpfs_install]# mmdelfs gpfs1
3. Delete the GPFS NSDs
   [root@gpfs1 gpfs_install]# for i in `seq 1 24`
   > do
   > mmdelnsd nsd_lun$i
   > done
4. Shutdown the GPFS daemons
   [root@gpfs1 gpfs_install]# mmshutdown -a
5. Delete the GPFS cluster
   [root@gpfs1 gpfs_install]# mmdelnode -a

Properly deleting the file system ensures that the file system descriptors are deleted from the disks so that they will not create issues upon a subsequent file system creation attempt. Properly deleting the NSDs ensures that the NSD descriptors are deleted so that they will not create issues upon a subsequent NSD creation attempt.

Option #2 But what if I really screw things up?


1. Unmount the GPFS file system and shutdown the GPFS daemons
   [root@gpfs1 gpfs_install]# mmumount /gpfs1 -a
   [root@gpfs1 gpfs_install]# mmfsadm cleanup
2. Delete selected configuration files on all nodes
   [root@gpfs1 gpfs_install]# rm -f /var/mmfs/etc/mmfs.cfg
   [root@gpfs1 gpfs_install]# rm -f /var/mmfs/gen/*
   [root@gpfs1 gpfs_install]# rm -f /var/mmfs/tmp/*

Deleting /var/mmfs/tmp may not be necessary; if so, skip this step since it keeps backup copies of the mmsdrfs file.
WARNING: Use with extreme caution! Once the configuration files have been deleted, the GPFS cluster no longer exists and any data on the disks will likely be lost. Use this method only when Option #1 fails.

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


Verifying Multipath Driver Settings
[root@node-01 GPFS_install]# multipath -ll LUN11 (360001ff08000a000000000258bb1000b) dm-14 DDN,SFA 10000 [size=2.1T][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 1:0:0:12 sdz 65:144 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 2:0:0:12 sdn 8:208 [active][ready] LUN10 (360001ff08000a000000000248bb0000a) dm-13 DDN,SFA 10000 [size=2.1T][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 2:0:0:11 sdm 8:192 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 1:0:0:11 sdy 65:128 [active][ready]

SAS Drives

.... partial listing ..... LUN1 (360001ff08000a0000000001b8ba70001) dm-4 DDN,SFA 10000 [size=184G][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 1:0:0:2 sdp 8:240 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 2:0:0:2 sdd 8:48 [active][ready] LUN0 (360001ff08000a0000000001a8ba60000) dm-3 DDN,SFA 10000 [size=184G][features=1 queue_if_no_path][hwhandler=0][rw] \_ round-robin 0 [prio=50][active] \_ 2:0:0:1 sdc 8:32 [active][ready] \_ round-robin 0 [prio=10][enabled] \_ 1:0:0:1 sdo 8:224 [active][ready]

SSD Drives

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


Script to Modify the Linux Transfer Size
[root@node-01 bin]# cat setDDNioparams.sh
#!/bin/bash
#
# setDDNioparams
#
maxsectkb=8192
readaheadkb=8192
nrrequests=512
DDNdevstmp="/var/tmp/DDNdevs.tmp"
DDNdevslog="/var/tmp/DDNdevs.log"
lsscsi > $DDNdevstmp
while read line; do
    devtype=`echo $line | cut -d" " -f3`
    if [ "$devtype" = "DDN" ]; then
        devname=`echo $line | cut -d"/" -f3`
        curmaxsectkb=`cat /sys/block/$devname/queue/max_sectors_kb`
        curreadaheadkb=`cat /sys/block/$devname/queue/read_ahead_kb`
        curnrrequests=`cat /sys/block/$devname/queue/nr_requests`
        if [ $curmaxsectkb -ne $maxsectkb -o \
             $curreadaheadkb -ne $readaheadkb -o \
             $curnrrequests -ne $nrrequests ]; then
            echo $maxsectkb > /sys/block/$devname/queue/max_sectors_kb
            echo $readaheadkb > /sys/block/$devname/queue/read_ahead_kb
            echo $nrrequests > /sys/block/$devname/queue/nr_requests
            echo "Device=$devname" | tee -a $DDNdevslog
            echo "Old max_sectors_kb=$curmaxsectkb" | tee -a $DDNdevslog
            echo "New max_sectors_kb=$maxsectkb" | tee -a $DDNdevslog
            echo "Old read_ahead_kb=$curreadaheadkb" | tee -a $DDNdevslog
            echo "New read_ahead_kb=$readaheadkb" | tee -a $DDNdevslog
            echo "Old nr_requests=$curnrrequests" | tee -a $DDNdevslog
            echo "New nr_requests=$nrrequests" | tee -a $DDNdevslog
            echo | tee -a $DDNdevslog
        fi
    fi
done < $DDNdevstmp

GPFS v3.3

EXAMPLE: Installing and Configuring GPFS


Script to Modify the Linux Transfer Size
WARNING: These values revert to their defaults every time a node is rebooted. Therefore it is necessary to reset them every time a node is rebooted (see the sketch below).

[root@node-01 queue]# pwd
/sys/block/sdm/queue
[root@node-01 queue]# ls -l
total 0
drwxr-xr-x 2 root root    0 Feb 5 10:32 iosched
-rw-r--r-- 1 root root 4096 Feb 5 13:28 iostats
-r--r--r-- 1 root root 4096 Feb 5 13:28 max_hw_sectors_kb
-rw-r--r-- 1 root root 4096 Feb 5 13:28 max_sectors_kb
-rw-r--r-- 1 root root 4096 Feb 5 13:28 nr_requests
-rw-r--r-- 1 root root 4096 Feb 5 13:28 read_ahead_kb
-rw-r--r-- 1 root root 4096 Feb 5 13:28 scheduler
[root@node-01 queue]# cat max_sectors_kb
512
[root@node-01 queue]# cat nr_requests
128
[root@node-01 queue]# cat read_ahead_kb
128
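One common way to reapply such settings after a reboot (an illustrative approach, not from the original material, and the script path is an example) is to invoke the tuning script from a boot-time hook such as /etc/rc.d/rc.local on RHEL-like systems.

# Run the tuning script at the end of the boot sequence (path is an example)
echo "/usr/local/bin/setDDNioparams.sh" >> /etc/rc.d/rc.local
chmod +x /etc/rc.d/rc.local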

13. Advanced Features in GPFS

The following pages examine newer features that are discussed in the Advanced Administration Guide.

Beginning with version 3.1 (and moving forward), GPFS is exploiting existing features and creating new features that make it more than an HPC file system... GPFS is becoming a general purpose clustered file system where HPC is a key and pervasive feature.

GPFS will never forget HPC!

The following pages examine some of the new (or newly exploited) GPFS features making it more suitable as a general purpose file system. Today, these features include
  ILM with integrated HSM
  Robust NFS/CIFS support
  Scale-out File System (SoFS) - TBD
  Storage virtualization
  Disaster recovery
  Snapshots - TBD
  GPFS SNMP support - TBD

Information Lifecycle Management (ILM)


GOALS
Manage data over its life cycle ("cradle to grave")
Keep active data on the highest performing media and inactive data on tape or low cost, high capacity disk
Migration of data is automatic and transparent to the client
Lower levels can serve as backup for higher levels

Tier-1: Performance optimized disk (e.g., FC, SAS disk); scratch space; frequent use, smaller capacity, high BW/low latency, more expensive
Tier-2: Capacity optimized (e.g., SATA); infrequently used files
Tier-3: Local tape libraries
Tier-4: Remote tape libraries
Tiers 2-4: infrequent use, larger capacity, lower BW, higher latency, less expensive

Archive vs. Backup


Archive: Maintain Only 1 Copy of File
By definition, an archive requires multiple tiers of storage and maintains only 1 copy of a file in one of the tiers Example: in a combined disk/tape archive, a file resides either on disk or on tape, but not both!

Backup: Maintain a Second Copy of a File


A backup system maintains a second copy of a file. Best practice guideline: the files of an archive should be backed up, including files archived on tape. some archive products integrate backup into the archive function

GPFS ILM is primarily an archive tool, but...


GPFS ILM policy supports file replication while similar, replication is not the same thing as backup HSM products integrated with GPFS support both archive and backup functions

Information Lifecycle Management in GPFS


ILM provides
Storage pool - group of LUNs Fileset - define subtrees of a file system Policies - for rule based management of files inside the storage pools

Examples of policy rules


Place new files on fast, reliable storage, move files as they age to slower storage, then tape Place media files on video-friendly storage (fast, smooth), other files on cheaper storage Place related files together; e.g., for failure containment

The system pool must contain the metadata.

(Diagram: storage pools named System, Gold, Silver, and Bronze on a storage network, with tape behind them.)

Comments
One global name space across pools of independent storage
Files in the same directory can be in different pools
Including tape requires additional SW (e.g., TSM)
The non-system pools can only contain user data.

Storage Pools
ILM manages sets of storage called "storage pools". What is a storage pool?
A named subset of disks and tapes (within the context of GPFS, appropriate SW such as HPSS is needed to include tape)
Each file is assigned to a storage pool based upon policy rules
  placement policies (where to place files upon creation)
  migration policies (moving files from one pool to another)
  deletion policies (removing files from the storage system)

What are they good for?


Tiered storage (files aged to slower/cheaper disk)
Dedicated storage (e.g., per user, per project, or per directory subtree)
Failure containment
  to limit the amount of data lost due to a failure
  to bound the performance impact of RAID rebuild
Appropriate use of special-purpose storage
  different RAID levels
  enterprise grade disk vs. consumer-grade disk
  multimedia friendly storage

GPFS Filesets

What they are:


A named subtree of a GPFS file system Somewhat like a distinct file system, i.e. a fileset can be unlinked without deleting it, and it can subsequently be linked using its name as a handle

What they are good for:


Filesets can have quotas associated with them (global; not per-pool).

Fileset quotas are independent of user and group quotas

Filesets can be used to restrict the effect of policies to specific files

Side effects:
Unlinked filesets can confuse programs that scan the file system (e.g., incremental backup programs)
Moving and linking between filesets is not allowed, in keeping with their being like little file systems
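As a brief illustration (the file system and fileset names are examples), a fileset is created, linked into the namespace, and later unlinked with:

# Create a fileset in file system gpfs1, link it at /gpfs1/projectA, then unlink it
mmcrfileset gpfs1 projectA
mmlinkfileset gpfs1 projectA -J /gpfs1/projectA
mmunlinkfileset gpfs1 projectA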

GPFS ILM/HSM Integration


GPFS Integrates its ILM Policies with Tape-Based HSM Products
GPFS extends its Information Lifecycle Management (ILM) functionality to integrate with HSM (Hierarchical Storage Management) products. A single set of policies is used to move data between GPFS storage pools and tape storage pools.
Supported HSM products include
  High Performance Storage System (HPSS)
  Tivoli Storage Manager (TSM)
Cool feature: very fast file scans
  1 million files in 13 seconds
  1 billion files in 75 minutes

Tape is not dead!

A lab with insufficient tape BW was forced to use DHL to move 200 TB of data on disk!
Never underestimate the BW in a pickup load of magnetic tape! ... or a cargo plane for that matter.

HPSS/GPFS Integration Project


GPFS/HPSS is a collaborative project to develop synergy between IBM's General Parallel File System and the HPSS Collaboration's High Performance Storage System.
Purpose
To create a hierarchical disk and tape storage solution with unequaled parallelism and scalability Extend GPFS Information Lifecycle Management functionality to include tape

Participants
HPSS Collaboration member NERSC/Lawrence Berkeley Lab IBM Research, Almaden Lab IBM GPFS Product Development in Poughkeepsie NY IBM HPSS Development and Support in Houston TX

What is HPSS?
HPSS: High Performance Storage System

(Diagram: client computers on a LAN with a core server, backup core server, and metadata disks; SAN-attached tape-disk movers, disk arrays, and robotic tape libraries.)

HPSS is a disk and tape hierarchical storage system with a cluster architecture similar in many ways to the GPFS architecture. HPSS can be used alone as a cluster hierarchical storage system or as the tape component of GPFS.

Versatile native HPSS interfaces:
  traditional HPSS APIs
  Linux file system interface
  new GridFTP interface available

A rugged DB2 metadata engine assures reliability and quick recovery.

Like GPFS, HPSS supports horizontal scaling by adding disks, tape libraries, movers, and core servers to reach:
  10s of petabytes
  100s of millions of files
  gigabytes per second

Jointly developed by the five US Department of Energy labs and IBM.

Integrated GPFS/HPSS Architecture


Deployment styles: GPFS centric, mixed GPFS and HPSS, or HPSS centric.

GPFS/HPSS is software that connects GPFS and HPSS together under the GPFS ILM policy framework.
GPFS/HPSS agents (processes and daemons) run on the GPFS Session Node and I/O Manager Nodes.
GPFS/HPSS uses DB2 to contain a reference table that maps between GPFS file system objects and HPSS storage objects.
GPFS/HPSS is distributed with and supported by HPSS.

(Diagram: GPFS client nodes and NSD server nodes run GPFS/HPSS agents and connect to the HPSS core server(s), which hold DB2 tables for GPFS/HPSS and HPSS; HPSS movers drive the HPSS disk cache, GPFS storage pools, and tape libraries. GPFS NSD nodes and HPSS movers can share the same physical nodes.)

GPFS/HPSS ILM Semantics


(Diagram: in the GPFS cluster, mmapplypolicy generates an ILM policy list that is passed to the GPFS/HPSS agents on the GPFS client and NSD nodes; mmfsmigrate/mmfsrecall/mmfspurge/mmfslist map to hpssmigrate/hpssrecall/hpsspurge/hpsslist, which drive the HPSS movers and HPSS storage in the HPSS cluster.)

Functionally, HPSS uses the ILM policy lists from GPFS in order to move data between disk and tape.

Sample GPFS ILM Policy Statements


Storage pool name (corresponds to class of storage)

Initial Placement

RULE 'SlowDBase' SET STGPOOL 'sata' FOR FILESET('dbase') WHERE NAME LIKE '%.data'
RULE 'SlowScratch' SET STGPOOL 'sata' FOR FILESET('scratch') WHERE NAME LIKE '%.mpg'
RULE 'default' SET STGPOOL 'system'

Rule name

Fileset name (corresponds to subdirectory)

Qualifiers

Movement by Age
RULE 'MigData' MIGRATE FROM POOL 'system' THRESHOLD(80,78) WEIGHT( TIME_SINCE_LAST_ACCESS ) TO POOL 'sata' FOR FILESET('data')
RULE 'HsmData' MIGRATE FROM POOL 'sata' THRESHOLD(95,80) WEIGHT( TIME_SINCE_LAST_ACCESS ) TO POOL 'hsm' FOR FILESET('data')        (rule to move files to HPSS)
RULE 'Mig2System' MIGRATE FROM POOL 'sata' WEIGHT(ACCESS_TIME) TO POOL 'system' LIMIT(85) FOR FILESET('user','root') WHERE DAYS_SINCE_LAST_ACCESS_IS_LESS_THAN( 2 )

Lock in place
RULE 'ExcDBase' EXCLUDE FOR FILESET('dbase')

Life expiration
RULE 'DelScratch' DELETE FROM POOL 'sata' FOR FILESET('scratch') WHERE DAYS_SINCE_LAST_ACCESS_IS_MORE_THAN( 90 )
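To give rules like these effect, the placement rules are installed with mmchpolicy and the migration/deletion rules are run (or scheduled) with mmapplypolicy; the file system name and policy file path below are illustrative.

# Install the placement policy and run the migration/deletion rules once
mmchpolicy gpfs1 /gpfs_install/policy.rules
mmapplypolicy gpfs1 -P /gpfs_install/policy.rules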

GPFS/HPSS Building Block


(Diagram: an Ethernet switch providing TbE and 1 GbE ports connects four 2U x3650 NSD servers (x3650-01..04), each 4-way with 4 GB RAM, a dual port FC4 HBA, and a dual port TbE adapter in the PCI-Ex slots. Aggregate BW per server over TbE for NSD service (GPFS): roughly peak < 700-780 MB/s, sustained < 600-700 MB/s; the HPSS mover path is peak < 200 MB/s, sustained < 100 MB/s. The 4 Gb/s host-side FC links attach a DCS9550 couplet (two 2U RAID controllers) with 10 disk trays in 16-bay 3U chassis, plus eight TS1120 tape drives (TS1120-01..08). Drive-side cabling is not shown.)

In terms of sustained rates assuming well designed I/O application in a production environment, applications will be able to draw up to 450 MB/s per server.

ANALYSIS
Servers
  4 x3650s: 2 dual core sockets, at least 4 GB RAM, 2 dual port 4 Gb/s HBAs (380 MB/s per port), 1 TbE (10 GbE) adapter per node (800 MB/s per adapter)
Disk and Disk Controller
  DCS9550 couplet, 10 disk trays with 16 disks/tray (FC, 300 GB/disk @ 15 Krpm)
  35 LUNs, 8+P RAID sets
  capacity: raw = 48 TB, usable < 38 TB
  peak BW: write < 2.8 GB/s, read < 2.6 GB/s
  sustained BW: write < 2.6 GB/s, read < 2.2 GB/s
Tape Drives
  8 x TS1120 (Jaguar): peak BW per drive < 100 MB/s, sustained BW < 50 MB/s

A Multi-tiered GPFS/HPSS Solution


(Diagram: an Ethernet switch with 1 GbE and 10 GbE links connects combined NSD servers and HPSS movers (x3650-01..12), HPSS core servers (x3650-13, x3650-14 with a DS3200 holding 12 disks in RAID 10), three DCS9550 couplets on a 4 Gb/s FC network, and a TS3500 tape library (1 x L23 library controller frame, 7 x D23 expansion frames) with 24 x TS1120 tape drives. The GbE network for the disk controllers and servers is not shown.)

Capacity and bandwidth summary
  Disk capacity: raw 522 TB, usable 422 TB
  Tape capacity: 2 PB
  Aggregate FS BW: 6.6 GB/s
  Application I/O BW: 5.2 GB/s
  Tape BW: 1.2 GB/s (requires 1.2 GB/s of FS BW)

Disk building blocks
  DCS9550 couplet with 10 disk trays, 300 GB 15 Krpm drives, raw capacity = 96 TB (requires 2 frames)
  DCS9550 couplet with 5 disk trays, 1 TB SATA drives, raw capacity = 240 TB (requires 1 frame)
  DCS9550 couplet with 5 disk trays, 1 TB SATA drives, raw capacity = 240 TB (requires 1 frame)

Tape cartridges: 700 GB per cartridge; minimum needed 2860; maximum available 2997
WARNING: If any more tape drives are added to this configuration without increasing the number of servers, it will be necessary to add a SAN switch for the tape drives.

What is Tivoli Storage Manager?

Tivoli Storage Manager (TSM) is a comprehensive software suite that manages storage. It provides
  Backup / restore
  Archive / retrieve
  Disaster recovery
  Database & application protection
  Space management (HSM)
  Bare machine recovery
  Continuous data protection
  Content management

It is a client/server design with separate server products and client products implementing this list of functions.

TSM Architecture
(Diagram: an administration user interface and TSM clients (servers, clients, and application systems) sit on a local area network; the TSM server with its database, log, and storage repository manages TSM storage pools over a storage area network.)

COMMENT: It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.

TSM Archive
(Diagram: the TSM client archives to and retrieves from the TSM server, which stores data in tiered storage pools: disk, copy disk, on-site tape, DVD/CD, optical, other, copied to offsite tape.)

Archive Features
  Long-term storage
  Point in time copy
  Retention period
  Policy managed
  Index archives with descriptive metadata to expedite locating historical information
  Allows focus to be placed on active data
    recover only active data
    reduce backup time by focusing on active files only

It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.

TSM Backup
(Diagram: the TSM client backs up to and restores from the TSM server, which stores data in tiered storage pools: disk, copy disk, on-site tape, DVD/CD, optical, other, copied to offsite tape.)

Backup Features
  Progressive incremental backup
    back up only new/changed files, avoiding wasteful full backups
    data tracked at the file level
    accurately restores files to a point in time
  Adaptive subfile differencing
  Volume level backup
  Multiple versions kept
  Policy managed
  System assisted restore
  Automated scheduling

It is a common practice to combine the TSM server and client functions into a common node in GPFS clusters.

COMMENT: For TSM, a recommended best practice is to explicitly backup archived data.

Integrated GPFS/TSM Architecture

(Diagram: the GPFS policy daemon monitors the policy file and the high/low thresholds (HT/LT), starts the policy engine, which scans the file system and generates a candidate list; dsmmigrate uses the candidate list and migrates the data to the TSM server.)

The Process
1. The GPFS policy daemon monitors HT/LT based on the enabled policy.
2. The policy daemon starts the policy engine.
3. The policy engine scans the file system and generates a candidate list based on the enabled migration policy.
4. dsmmigrate is called and migrates all files in the candidate list to the TSM server.

Setting up Integrated TSM/GPFS Migration


To enable the TSM/HSM client to integrate with the GPFS policy engine it is necessary to
  Install GPFS 3.2 and TSM/HSM 5.5 as described in the product documentation.
  Disable the HSM automigration capability. This is done with the following commands:
    dsmmigfs add -HT=100 <file_system>
    dsmmigfs update -HT=100 <file_system>

Add an external HSM Pool to your placement policy file. Example:


RULE EXTERNAL POOL 'hsmpool' EXEC '/var/mmfs/etc/mmpolicyExec-hsm.sample' OPTS '-v'

Add a threshold migration rule to your placement policy file Example:


RULE 'HsmData' MIGRATE FROM POOL 'StoragePool1' THRESHOLD(90,80,70) WEIGHT( CURRENT_TIMESTAMP - ACCESS_TIME ) TO POOL 'hsmpool' WHERE FILE_SIZE > 1024

Install placement policy with


mmchpolicy <filesystem> <policy file>
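Putting the pieces together, a minimal sketch of the policy file contents and its installation might look like the following; the file system name and file path are illustrative, and the rules simply restate the examples above.

# contents of /var/mmfs/etc/hsm-policy.rules (example path):
#   RULE EXTERNAL POOL 'hsmpool' EXEC '/var/mmfs/etc/mmpolicyExec-hsm.sample' OPTS '-v'
#   RULE 'HsmData' MIGRATE FROM POOL 'StoragePool1' THRESHOLD(90,80,70)
#        WEIGHT( CURRENT_TIMESTAMP - ACCESS_TIME ) TO POOL 'hsmpool' WHERE FILE_SIZE > 1024
mmchpolicy gpfs1 /var/mmfs/etc/hsm-policy.rules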

A Two-tiered GPFS/TSM Solution


(Diagram: four NSD servers (x3650 M2, 8 cores, 6 DIMMs, each with a 4xDDR IB HCA, 2 x FC8, and GbE) attach to a DCS9900 couplet (RAID controllers C1 and C2) with five 60-bay disk trays holding 300 SATA disks (300 TB). An x3650-05 acts as the TSM client/server* (8 cores, 8 GB RAM) and drives a TS3500-L53/D53 tape library with TS1040 (LTO-4) drives through an FC switch. An admin Ethernet LAN and an IB LAN switch connect the servers.)

2nd Tier TSM Tape Archive
  10 x LTO-4 drives (TS1040): 1 FC4 port per tape drive, at most 120 MB/s per tape drive (assumes no compression)
  800 x 800 GB cartridges: usable capacity < 640 TB (assumes no compression)
  aggregate data rate < 1 GB/s
  provides 4 GB/s application to file system BW and 1 GB/s file system to TSM BW

Component performance
  4xDDR IB HCA (Host Channel Adapter): peak data rate per HCA < 1500 MB/s (requires RDMA)
  FC8 (single port 8 Gbit/s Fibre Channel): peak data rate for 2 FC8 HBAs < 1500 MB/s
  DCS9900: streaming data rate < 5000 MB/s, noncached IOP rate < 40,000+ IOP/s
In a production system there would normally be 2 TSM client/server systems; the active one and a passive one for redundancy.

A Three-tiered GPFS/TSM Solution


(Diagram: IB connections to the computational cluster; a 45u frame and two 42u frames hold an IB switch, 8 NSD servers, an active and a passive TSM server, a DCS9900 couplet with 10 DCS trays, a DS5300 with 8 EXP5000 enclosures, and a TS3500-L53/D53 tape library (5 frames: 1 x L53, 4 x D53) with 10 x TS1040 drives on an FC switch dedicated to tape. Tape: 800 GB/cartridge uncompressed, 2000 cartridges, at most 120 MB/s per drive uncompressed, aggregate capacity < 1.6 PB, data rate < 1.0 GB/s.)

COMMENTS
Tier 1: scratch storage used for application processing.
Tier 2: archive storage indirectly accessed by applications.
Tier 3: archive/backup storage indirectly accessed by applications.
Footnotes
The passive TSM client/server is a "hot spare" backup for the active TSM client/server.
It is assumed that the NSD server and TSM client/server nodes are x3650 M2. Alternatively, the P6-p520 could be used instead. Likewise, the IB LAN could be replaced by TbE where each server has a channel bonded 2xTbE.
There is up to 2.5 GB/s of unused bandwidth in the tier 2 storage. If applications directly access this storage to create data, then additional tape bandwidth is needed to archive and/or back up that data. This will require more TSM client/server nodes, which means creating a 2nd TSM repository, or selecting a more powerful node that can handle the increased bandwidth.
Generally, the archive rate equals the data creation rate, which is assumed to be 80% of the write rate for this example (n.b., not all written data is retained).

(Remaining diagram labels: DCS trays #4..#10, TS1040-01..10, EXP5000 #3..#8, DS5300, FC switch (tape only).)

Tier 1 - scratch
  Capacity < 58 TB (128 x 450 GB drives, 15 Krpm FC drives)
  Application bandwidth: write < 1.0 GB/s, read < 2.0 GB/s
  ILM bandwidth: transfer < 1.0 GB/s, restore < 0.5 GB/s

Tier 2 - archive
  Capacity < 600 TB (600 x 1 TB SATA drives)
  ILM bandwidth: receive from tier 1 < 1.0 GB/s, transfer to tier 3 < 1.0 GB/s, restore to tier 1 < 0.5 GB/s
  Unused bandwidth < 2.5 GB/s

Tier 3 - archive/backup
  Capacity 1.6 PB
  ILM bandwidth: receive < 0.8 GB/s, restore < 0.2 GB/s

HPSS vs. TSM


Which is Best?

Capacity
HPSS is sized and priced for systems with over 1 PB of storage. TSM has an upper limit of 1 PB per TSM instance.

Backup

HPSS integrates backup into the archive function. TSM requires a separate backup procedure in addition to the archive function.

Parallelism
HPSS is designed as a parallel archive tool; it supports multiple tape servers (i.e., "tape movers") per HPSS instance. TSM is not parallel; to scale TSM beyond a single server requires multiple TSM instances.

Metadata management
HPSS requires a separate metadata subsystem (e.g., 2 "core servers" plus external disk storage for its metadata database). TSM integrates the metadata operations into its server operations.

Market segment
HPSS was designed for the high end HPC market by a consortium of HPC labs. TSM was designed for commercial applications, but is commonly adapted to scientific and technical environments.

NFS/GPFS Integration

Yesterday...
NFS with GPFS on Linux was a painful ("Ouch!") combination.

Today...
Improved performance, robustness, and server farm features for NFS.
Clustered NFS (CNFS)
  Provides high availability NFS server functionality using GPFS
  Only under Linux: RHEL and SuSE, including Linux on pSeries (LoP)
  Serves most any NFS client (examples: AIX, Linux, Solaris, etc.)
Clustered NFS (CNFS)


CNFS
Provides a High Availability implementation of NFS under Linux.
Two primary uses
  can be used like a NAS device providing robust NFS service
  facilitates a transition from a current NFS deployment to GPFS

Create a GPFS cluster whose nodes can provide NFS service
  generally use the NSD servers as the NFS servers

The CNFS feature provides the following
  Monitoring: all NSD/NFS server nodes monitor key components of the "stack" and, upon a failure, initiate failover recovery
  Failover: exploits IP address failover and GPFS failover to perform NFS recovery
  Load balancing: utilizes DNS round robin
  Honors NFS lock consistency: NLM locks are passed through to the GPFS lock manager
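A minimal CNFS setup sketch, assuming GPFS 3.3 syntax; the shared root directory, node names, and IP addresses are examples only.

# Designate a small GPFS directory for CNFS state, then enable CNFS on the serving nodes
mmchconfig cnfsSharedRoot=/gpfs1/.cnfs
mmchnode --cnfs-interface=10.0.0.101 -N node-01
mmchnode --cnfs-interface=10.0.0.102 -N node-02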

CNFS
"A Picture is Worth a Thousand Words"

(Diagram: NFS clients 1..N access the GPFS cluster over an Ethernet fabric with DNS round robin load balancing. Inside the GPFS cluster, each of 12 disk/tape server nodes runs mmfsd, nfsd, and a monitor utility, and attaches over a SAN to tier 1 and tier 2 DS4800/EXP810 storage and HSM tape drives.)

COMMENTS
GPFS and the CNFS features are entirely contained within the GPFS cluster. New and legacy NFS clients access the GPFS storage via NFS.
  clients use "plain vanilla" NFS (i.e., no special features other than DNS RR load balancing)
  clients can use AIX, Linux, Solaris or any other Unix based OS
  offices with Windows based systems running CIFS can access GPFS under regular CIFS protocols (i.e., the HA feature is not available for CIFS)

CNFS
"A Picture is Worth a Thousand Words"
(Diagram: Cluster #1 consists of legacy blades and rack optimized nodes with an internal Myrinet LAN and an external Ethernet LAN; Cluster #2 is the legacy NFS storage (NFS #1 and NFS #2 with storage frames); Cluster #3 is HS-21 (Xeon) blades acting as GPFS clients that natively mount the GPFS file system. All reach the GPFS cluster through an Ethernet fabric with DNS round robin load balancing applied to the NFS clients; the GPFS cluster's 12 disk/tape server nodes each run mmfsd, nfsd, and a monitor utility, and attach over a SAN to tier 1 and tier 2 DS4800/EXP810 storage and HSM tape drives.)

The legacy cluster retains its original Myrinet network for message passing, but it is now part of the GPFS cluster and natively mounts the GPFS file system over Ethernet. (It can still access the legacy NFS file system.)
The new cluster is initially deployed as part of the GPFS cluster and natively mounts the GPFS file system over Ethernet. (It can also access the legacy NFS file system.)

TBD

The core of GPFS continues to operate on Unix UID/GID values. Windows GPFS nodes perform the task of mapping to Windows SIDs: explicit Unix-Windows ID maps are defined in Active Directory; implicit (default) maps for Windows SIDs are created from a reserved range of UID/GID values; and unmapped Unix IDs are cast into a foreign domain for Windows. Explicit maps persist only in the Active Directory. Implicit maps persist in the file system. (So how did we do on that explainable principle?)

CNFS Details

Monitoring
Every node in the CNFS cluster runs an NFS utility that monitors GPFS, NFS, and networking components on the node. Upon failure detection and based on your configuration, the monitoring utility might invoke a failover.

Failover

As part of GPFS recovery, the CNFS cluster failover mechanism is invoked. It transfers the NFS serving load that was served by the failing node to another node in the CNFS cluster. Failover is done using recovery groups to help choose the preferred node for takeover. The failover mechanism is based on IP address failover. In addition, it guarantees NFS lock (NLM) recovery.

Load balancing
CNFS supports failover of all of a node's load together (all of its NFS IP addresses) as one unit to another node. However, if no locks are outstanding, individual IP addresses can be moved to other nodes for load balancing purposes. CNFS is based on round robin DNS for load balancing of NFS clients among the NFS cluster nodes.

GPFS Supports Storage Virtualization


Storage virtualization: software provides a consistent "look and feel" for different disk technologies. GPFS provides a common command interface to manage any supported storage technology.
Commands of particular interest: the ILM commands, mmadddisk, mmdeldisk, mmaddnode, mmdelnode
GPFS client access is made consistent by the NSD layer
GPFS can be used as a virtualization tool for NFS clients (does not require CNFS)

Another storage virtualization tool: SVC
SVC is not necessary to achieve virtualization if GPFS is being used.

Storage Virtualization
An Abstract Example with GPFS Using Best Practices
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance:
  1. Use GPFS ILM, placing each disk system into its own storage pool.
  2. Segregate different disk systems into separate file systems.
Segregation is not required, but is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.

(Diagram: many GPFS clients, each seeing NSDs nsd1..nsd36, connect over a node switch fabric (user data, metadata, tokens, heartbeat, etc.) to six NSD servers. Through an optional SAN switch fabric, the servers attach to three different disk controller systems A, B, and C (LUNs labeled A1..A12, B1..B16, C1..C8). The servers work in primary/backup pairs: one pair serves L1..L6 / L7..L12, one pair serves L13..L20 / L21..L28, and one pair serves L29..L32 / L33..L36. A SAN switch may be required depending on the disk controllers and the controller/server topology.)

The two virtualization choices map the LUNs as follows:
  1. One file system /fs with pool_a: LUNs L1..L12, pool_b: LUNs L13..L28, pool_c: LUNs L29..L36
  2. Separate file systems /fs_a: LUNs L1..L12, /fs_b: LUNs L13..L28, /fs_c: LUNs L29..L36

Storage Virtualization
Another Abstract Example with GPFS Using Best Practices
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance:
  1. Use GPFS ILM, placing each disk system into its own storage pool.
  2. Segregate different disk systems into separate file systems.
Segregation is not required, but is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.

(Diagram: many GPFS clients, each seeing NSDs nsd1..nsd36, connect over a node switch fabric (user data, metadata, tokens, heartbeat, etc.) to just two NSD servers, which attach through an optional SAN switch fabric to disk controller systems A, B, and C. One server is primary for L1..L18 and backup for L19..L36; the other is primary for L19..L36 and backup for L1..L18. A SAN switch may be required depending on the disk controllers and the controller/server topology.)

This example shows that it is not necessary to segregate storage systems between servers.
It is also possible to mix disks with different connection technologies: FC disk, internal SCSI or IDE, IB.

The two virtualization choices map the LUNs as follows:
  1. One file system /fs with pool_a: LUNs L1..L12, pool_b: LUNs L13..L28, pool_c: LUNs L29..L36
  2. Separate file systems /fs_a: LUNs L1..L12, /fs_b: LUNs L13..L28, /fs_c: LUNs L29..L36

Storage Virtualization
GPFS Can Provide Storage Virtualization to Non-GPFS Clusters
This example illustrates two possible ways to virtualize very different types of storage technologies.
Segregate different disk systems to improve performance:
1. Use GPFS ILM, placing each disk system into its own storage pool
2. Segregate the different disk systems into separate file systems
Segregation is not required; it is merely a "best practice".
COMMENT: GPFS can virtualize almost any block device under a common rubric.

[Diagram: six NFS clients access two GPFS NSD server nodes over a local area network; each server runs both nfsd and mmfsd and serves nsd1..nsd36 (one server primary for L1..L18, the other for L19..L36). GPFS can be used as a storage virtualization tool for NFS clients, with or without CNFS (a minimal export sketch follows). The servers attach, optionally through a SAN switch fabric, to Disk Controller Systems A, B and C; a SAN switch may be required depending on the disk controllers and the controller/server topology. The storage is again presented either as storage pools within one file system (option 1) or as separate file systems (option 2):
pool_a in /fs (option 1) or /fs_a (option 2): LUNs L1..L12
pool_b in /fs (option 1) or /fs_b (option 2): LUNs L13..L28
pool_c in /fs (option 1) or /fs_c (option 2): LUNs L29..L36]
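For the NFS case, the GPFS file system is simply exported from the server nodes like any local file system. The minimal sketch below assumes a hypothetical mount point (/fs) and client subnet (10.1.0.0/16) and does not use CNFS, which would add clustered IP failover on top of this.

#!/bin/bash
# On each GPFS node that will serve NFS (hypothetical mount point and subnet):
mmmount fs -a                    # ensure the GPFS file system is mounted cluster-wide

# Export it over NFS like any local file system; a fixed fsid keeps the export
# consistent when more than one server exports the same GPFS file system
echo "/fs 10.1.0.0/16(rw,sync,no_root_squash,fsid=745)" >> /etc/exports
exportfs -ra

# On an NFS client:
# mount -t nfs server1:/fs /mnt/fs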

Operational Continuity under Adverse Conditions


"Disaster Recovery"

General Concept
Redundant storage technology is deployed that enables operational continuity in the event of a disaster or other unrecoverable error. This is achieved by maintaining duplicate copies of a data set at two different locations, each with a redundant storage system, and enabling the "other" storage system to take over responsibility. This is commonly called "disaster recovery" in the literature.

Two Proposed Alternatives


Software-based solution: synchronous mirroring using GPFS replication
Hardware-based solution: asynchronous mirroring using DS4000 Enhanced Remote Mirroring

TWAIN Technology Without An Interesting Name

Operational Continuity under Adverse Conditions


First Alternative
Synchronous Mirroring Using GPFS Replication
Software replication using GPFS mirroring
the replica is not a sector image copy

Reduces risk of permanent data loss or inconsistency


Provides automatic failover
Failback procedures vary according to the cause and magnitude of the failure
Allows an active/active configuration
Synchronous operation explicitly exposes BW and latency issues
Does not require inter-SAN connectivity or local SAN switches
Included in GPFS at no extra cost

Synchronous Mirroring Using GPFS Replication


Best Practice
Maintain 3 sites at geographically distributed locations, or
maintain 3 sub-sites at the same geographic location but supplied by power from 3 independent sources
Maintaining Quorum
The tiebreaker site (node-z1, an NSD server) guarantees quorum under degraded operation. It does not participate in regular operations. Its disk contains only the file system descriptor information (descOnly) needed to maintain quorum. The tiebreaker site is not required, but it improves your chances of surviving an outage automatically by 50%.
Q designates quorum nodes

Requirement
This infrastructure is deployed as a single GPFS cluster (n.b., it cannot use the GPFS multi-cluster feature)

[Diagram: Site X (active, failure group 1) and Site Y (active, failure group 2) each contain two client nodes (node-x1/x2, node-y1/y2) and two NSD servers (node-x3/x4, node-y3/y4) attached over Ethernet to a DS4800 (DS4800-X, DS4800-Y) with EXP810 expansion drawers; quorum nodes are marked Q, and a local SAN switch is not required at either site. Site Z (node-z1, failure group 3) is the tiebreaker and connects only to the IP network.]

GPFS uses mirroring to synchronously copy user data and metadata to both sites (e.g., the write() call blocks until the data is copied to both sites):
minimizes the risk of permanent data loss
must provide sufficient BW between sites so regular I/O operations are not impeded
must have extra BW to allow quick recovery (n.b., restriping) following a failure
does not require SAN switches or SAN connectivity between sites
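A hedged, abbreviated sketch of how such a configuration might be created: disks at Site X go into failure group 1, disks at Site Y into failure group 2, and the tiebreaker disk is descOnly in failure group 3; the file system is created with two data and two metadata replicas. All names and device paths are hypothetical, and the descriptor syntax should be verified for your GPFS level.

#!/bin/bash
# Hypothetical disk descriptors for the two active sites plus the tiebreaker
cat > /tmp/dr_disks.lst <<EOF
/dev/sdx1:node-x3:node-x4:dataAndMetadata:1:nsd_x1:system
/dev/sdy1:node-y3:node-y4:dataAndMetadata:2:nsd_y1:system
/dev/sdz1:node-z1::descOnly:3:nsd_z1:system
EOF
mmcrnsd -F /tmp/dr_disks.lst

# Two replicas of data and metadata; GPFS places one copy in each failure group (site)
mmcrfs /fs fs -F /tmp/dr_disks.lst -m 2 -M 2 -r 2 -R 2

# After a failed site is repaired, restripe to restore full replication
# mmrestripefs fs -r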

Operational Continuity under Adverse Conditions


Second Alternative
DS4800 Enhanced Remote Mirroring (ERM)
ERM is hardware-based mirroring provided by the IBM DS4000
Three alternatives:

metro mirroring: full synchronous copy requiring a response from the secondary storage device before continuing; distance limited to metropolitan areas
global mirror: asynchronous copy with guaranteed in-order delivery and the ability to create a consistency group of LUNs that will be mirrored together; distance limited (but not as much as metro mirroring)
global copy: asynchronous copy with no guaranteed in-order delivery and no consistency groups; very long distances are theoretically possible

Enhanced Remote Mirroring


Hardware mirroring integrated with GPFS
focus on the asynchronous methods

Asynchronous operation avoids latency issues


caution: BW must be able to keep up with the data creation rate, and also complete re-synchronization in a timely manner following an outage

Failover can be automated with scripts
Failback procedures vary according to the cause and magnitude of the failure
Deployed in an active/passive configuration
LUNs at the secondary site cannot be written to

Increased risk of permanent data loss


Due to asynchronous operation, if the primary site fails, data that has not been replicated to the secondary site will be lost.

Requires inter-SAN connectivity and local SAN switches
Premium feature that must be licensed

Enhanced Remote Mirroring


Best Practice
Maintain 3 sites at geographically distributed locations, or
maintain 3 sub-sites at the same geographic location but supplied by power from 3 independent sources
Maintaining Quorum
The tiebreaker site (node-z1, an NSD server) guarantees quorum under degraded operation. It does not participate in regular operations. Its disk contains only the file system descriptor information (descOnly) needed to maintain quorum. The tiebreaker site is not required, but it improves your chances of surviving an outage automatically by 50%. (Q designates quorum nodes.)

Requirement
This infrastructure is deployed as a single GPFS cluster (n.b., it cannot use the GPFS multi-cluster feature)

[Diagram: Site X is the primary site (active) and Site Y the secondary site (passive). Each site has two client nodes (node-x1/x2, node-y1/y2) and two NSD servers (node-x3/x4, node-y3/y4) attached over Ethernet and through a local SAN switch to a DS4800 (DS4800-X, DS4800-Y) with EXP810 expansion drawers; quorum nodes are marked Q. Site Z is the tiebreaker, connected only to the IP network. Clients at Site Y access LUNs at Site X using the NSD protocol; NSD servers at Site Y are idle except during an outage. The last host port on each controller is dedicated to inter-controller communication, giving a logical SAN connection between the two controllers of < 780 MB/s.]

Some terms: primary/secondary storage system, mirroring storage controller pair, mirror FC connection, primary/secondary logical drive, mirrored logical drive pair, mirror role, role reversal, write consistency group, full synchronization

Physical SAN connection alternatives:
1. direct FC within a data center
2. DWDM for some metro areas
3. FC-IP using a Brocade channel extender or Cisco FC-IP port
4. ATM/SONET using a channel extender

Enhanced Remote Mirroring


Integration with GPFS

ERM has not been tested with GPFS


While ERM has not been tested with GPFS, the mechanisms of ERM are embodied in the storage controller and completely independent of GPFS. For the most part, GPFS operations are unaware of the ERM functionality. There is no reason why GPFS should not work with ERM. A proof of concept (POC) test is recommended.

Operational Procedures
Each logical drive in a mirrored pair presents itself to the local host(s) as a SCSI device (e.g., /dev/sdb). Write requests can be received only by the primary logical drive in a mirrored pair. Read requests can be received by both the primary and secondary logical drives; reading secondary logical drives is primarily intended for administrative purposes. Two file systems are created for the mirrored storage controller pair: a primary and a secondary file system. During normal operation, applications access the primary file system. If access to the primary file system is lost, the secondary site must go through a role reversal, and the secondary file system must be unmounted and remounted. Full synchronization is needed after an outage.
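The outline below is a hypothetical failover sketch, not a tested procedure. The storage-side role reversal is performed with the DS4000 management tools and is shown only as a placeholder comment; fs_x and fs_y are hypothetical names for the primary and secondary file systems.

#!/bin/bash
# Hypothetical failover outline after losing the primary site (Site X).
# fs_x = primary file system, fs_y = secondary file system on the mirrored LUNs.

# 1. Make sure the (now unreachable) primary file system is unmounted at Site Y
mmumount fs_x -a

# 2. Promote the secondary logical drives to the primary role using the DS4000
#    management tools (Storage Manager GUI or SMcli); the exact commands depend
#    on the controller firmware and are deliberately not shown here.

# 3. Mount the secondary file system and redirect applications to it
mmmount fs_y -a

# 4. After the outage, a role reversal back to Site X and a full synchronization
#    are required before resuming normal operation at the primary site.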

Snapshots

TBD

GPFS SNMP support

TBD

14. Best Practices

Consider an eclectic assortment of GPFS "best practices" (some collective wisdom).

While these are simple, common sense things, they are easily overlooked, especially when you are working with legacy codes developed under different conditions and assumptions.

Miscellaneous Best Practices


Performance Hierarchy

1. large records in sequential order
2. large records in strided order, or small records in sequential order
3. large records in random order, or small records in strided order
4. small records in random order, with hints issued when reading
5. small records in random order without hints
NOTE: large records are >= GPFS block size; small records are < GPFS block size (e.g., 2K to 16K)

Miscellaneous Best Practices


Performance Hierarchy

Example illustrating how code can be rewritten to eliminate small record accesses. Suppose you are directly sorting a set of small, randomly or semi-randomly distributed records. Because the records are small, GPFS will perform poorly. Rewrite the sort code as follows (a sketch follows this list):
divide the file into N subsets and assign each subset to a node; choose the subset size so that it fits entirely within RAM
sort each subset; depending on file size, a node may need to sort several subsets
merge all of the subsets together

This improves performance by


performing all small record references in memory
sequentially accessing records on disk in the merge step

This is a variation of the mergesort algorithm
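As a minimal, single-node illustration of the idea (not a parallel implementation), the sketch below uses standard GNU tools to sort a large file of small records by splitting it into RAM-sized chunks, sorting each chunk in memory, and merging the sorted chunks with large sequential I/O. File names and the chunk size are hypothetical.

#!/bin/bash
# External merge sort of a large file of small records (hypothetical names and sizes)
IN=/gpfs/fs/big_records.txt
TMP=/gpfs/fs/sort_tmp
mkdir -p $TMP

# 1. Split the input into chunks small enough to sort entirely in memory
split -l 10000000 $IN $TMP/chunk_

# 2. Sort each chunk in RAM; in a cluster, each node would take a subset of chunks
for c in $TMP/chunk_*; do
    sort -o $c.sorted $c && rm $c
done

# 3. Merge the sorted chunks; the merge step reads and writes large sequential streams
sort -m $TMP/chunk_*.sorted -o /gpfs/fs/big_records.sorted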

Miscellaneous Best Practices


Read:Write Ratio for HPC Processing

General rule of thumb in CS textbooks: read 90%, write 10%. But this generalization is more typical of commercial applications than of technical HPC applications. For example, the ratio for many scientific applications is read 60% to 70%, write 30% to 40%. Therefore, plan accordingly.

Miscellaneous Best Practices


Response Time vs Bandwidth

Storage systems optimized for response time generally


have larger variance
complete less work per unit time
are best suited to support online users

Storage systems optimized for bandwidth generally


have smaller variance
complete more work per unit time
are best suited for batch systems

Miscellaneous Best Practices


Avoid "Gold Plating"

Rule of thumb
Configure a file system to handle peak performance up to 3 or 4 standard deviations above the mean load, to avoid "gold plating". (John Watts, IBM)
Programmers worried about performance will often over-architect a system.

When sizing the disk I/O subsystem, programmers...


should not ask: What is the peak load?
should ask: What are the highest mean loads? What are their standard deviations? What loads are you prepared to pay to meet? Once a month? Once a year?
(A small sizing sketch follows.)
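As a small, hypothetical illustration of this rule of thumb, the sketch below computes the mean and standard deviation from a log of observed throughput samples and sizes for mean + 3 sigma rather than the raw peak; the log format (one MB/s value per line) and file name are assumptions.

#!/bin/bash
# Hypothetical sizing helper: io_samples.txt holds one observed throughput sample (MB/s) per line
LOG=io_samples.txt

awk '{ n++; sum += $1; sumsq += $1*$1 }
     END {
         mean = sum / n
         sd   = sqrt(sumsq / n - mean * mean)
         printf "mean = %.1f MB/s, sd = %.1f MB/s\n", mean, sd
         printf "size for mean + 3*sd = %.1f MB/s (not the raw peak)\n", mean + 3 * sd
     }' $LOG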

Miscellaneous Best Practices


Mixing Home and Scratch Directories Under GPFS

Home Directory Usage


Many small files
Support online, interactive transactions
Large variance in access patterns, data rates and times of usage (e.g., used heavily during business hours and quiet during the night)
stat() calls are frequent (e.g., ls -l)
Rate of change is small (e.g., many files remain untouched for long periods of time)

Scratch Directory Usage


Principal working directory for applications
Support batch processing, often under a job scheduler
24x7 usage with consistent access patterns and data rates (n.b., this does not imply low variance... it is just a different kind of variance)
stat() calls are infrequent
Rate of change can be large; HSM systems move untouched files to lower storage tiers (e.g., tape)

Best Practice
Avoid mixing home and scratch directories under the same GPFS file system (a sketch of one approach follows)
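One hedged way to apply this is to give home and scratch their own file systems, tuned differently: a smaller block size for the many small home files and a large block size for scratch. The mount points, device names, block sizes and descriptor files below are hypothetical.

#!/bin/bash
# Hypothetical separation of home and scratch into two GPFS file systems
mmcrnsd -F /tmp/home_disks.lst
mmcrnsd -F /tmp/scratch_disks.lst

# Smaller block size for home (many small files, frequent stat()/ls -l)
mmcrfs /gpfs/home homefs -F /tmp/home_disks.lst -B 256K

# Large block size for scratch (large sequential batch I/O)
mmcrfs /gpfs/scratch scratchfs -F /tmp/scratch_disks.lst -B 4M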

Miscellaneous Best Practices


When to Use NFS

Where NFS works well
Where NFS is a challenge
NFS vs. GPFS

Need For Speed... is NFS badly named or what?

Miscellaneous Best Practices


Myth: GPFS is Hard to Manage

One of GPFS's salient features is that it has a million knobs... One of GPFS's problems is that it has a million knobs...

But do not worry!

Miscellaneous Best Practices


Myth: GPFS is Hard to Manage
[Diagram: example configuration - a 45U rack containing a DCS9900 couplet with 10 DCS disk trays, attached via FC8 to eight servers and a TbE/GbE Ethernet switch in a 42U rack.]
Only 6 Steps to install, build and configure a GPFS file system


1. Install from media (e.g., rpm or smitty)
2. For Linux only: build the portability layer (see /usr/lpp/mmfs/src/README)

3. Create the GPFS cluster
4. Start up the GPFS daemons
5. Create the logical disks (i.e., NSDs)
6. Create and mount the file system
An experienced sysadm can do this in as little as 5 to 10 minutes! (A command-level sketch follows.)
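A hedged, command-level sketch of those six steps for a small Linux cluster; the node names, the contents of the node and disk files, and the file system name are hypothetical, and packaging and licensing details vary by release.

#!/bin/bash
# 1. Install from media (Linux example)
rpm -ivh gpfs.base-*.rpm gpfs.gpl-*.rpm gpfs.docs-*.rpm gpfs.msg.en_US-*.rpm

# 2. Build the portability layer (Linux only; see /usr/lpp/mmfs/src/README)
cd /usr/lpp/mmfs/src && make Autoconfig && make World && make InstallImages

# 3. Create the cluster; nodes.lst lists node:designation entries (e.g., node1:quorum-manager)
mmcrcluster -N /tmp/nodes.lst -p node1 -s node2 -r /usr/bin/ssh -R /usr/bin/scp
# (On GPFS 3.3, also designate licenses, e.g. mmchlicense server --accept -N node1,node2)

# 4. Start the GPFS daemons on all nodes
mmstartup -a

# 5. Create the logical disks (NSDs) from a disk descriptor file
mmcrnsd -F /tmp/disks.lst

# 6. Create and mount the file system
mmcrfs /gpfs fs1 -F /tmp/disks.lst -B 1M
mmmount fs1 -a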


GPFS provides convenient sysadm tools


1. Adding and deleting disks
2. Adding and deleting nodes
3. Changing disk, cluster, configuration and file system attributes
4. Monitoring performance, including latency (see the mmpmon sketch below)
Many sysadm tasks like adding and deleting nodes or disks can be completed without shutting down GPFS, rebooting or interfering with production.
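For item 4, a hedged example of sampling I/O statistics with mmpmon; reset and fs_io_s are standard mmpmon requests, while the file names, repeat count and interval are arbitrary choices.

#!/bin/bash
# Sample per-file-system I/O statistics with mmpmon (run as root on a GPFS node)
cat > /tmp/pmon.in <<EOF
reset
fs_io_s
EOF

# Repeat the fs_io_s request 10 times, 5 seconds apart, in machine-readable (-p) form
mmpmon -i /tmp/pmon.in -r 10 -d 5000 -p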

GPFS is the simplest complex product you will ever use!

15. GPFS Road Map

16. Conclusion and Observation

GPFS is a best-of-class product with good features, but it is not a "silver bullet"
Without careful design, I/O can seriously degrade parallel efficiency (e.g., Amdahl's law)
Good I/O performance requires hard work, careful design and the intelligent use of GPFS
I/O is not the entire picture; improving I/O performance will uncover other bottlenecks
