QV4311 Student Guide
AIX Performance
Management I: Concepts and
Tools
Student Notebook
ERC 1.1
The information contained in this document has not been submitted to any formal IBM test and is distributed on an “as is” basis without
any warranty either express or implied. The use of this information or the implementation of any of these techniques is a customer
responsibility and depends on the customer’s ability to evaluate and integrate them into the customer’s operational environment. While
each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will
result elsewhere. Customers attempting to adapt these techniques to their own environments do so at their own risk.
Contents
Course Description ........................................ vii
Agenda ..................................................... ix
Audience
This course is intended for AIX technical support personnel,
performance benchmark personnel, and AIX system administrators.
Prerequisites
Students attending this course are expected to have AIX problem
determination skills. These skills can be obtained by attending the
following courses:
• AHQV011 - AIX Problem Determination I: Boot Issues
• AHQV012 - AIX Problem Determination II: LVM Issues
AHQV332 - POWER6 LPAR Configuration and Operations is
recommended
Objectives
On completion of this course, students should be able to:
- Define performance terminology
- Describe the methodology for tuning a system
- Identify the AIX tools to monitor and analyze an AIX system
- Use AIX tools to determine bottlenecks related to Central
Processing Unit (CPU), Virtual Memory Manager (VMM),
physical and logical I/O, and file systems
- Use AIX tools to demonstrate techniques to tune the
subsystems
Agenda
Unit 1 - Data Collection and Analysis
Exercise 1 - Data Collection and Analysis
Text highlighting
The following text highlighting conventions are used throughout this book:
Bold Identifies file names, file paths, directories, user names,
principals, menu paths and menu selections. Also identifies
graphical objects such as buttons, labels and icons that the
user selects.
Italics Identifies links to web sites, publication titles, is used where the
word or phrase is meant to stand out from the surrounding text,
and identifies parameters whose actual names or values are to
be supplied by the user.
Monospace Identifies attributes, variables, file listings, SMIT menus, code
examples and command output that you would see displayed
on a terminal, and messages from the system.
Monospace bold Identifies commands, subroutines, daemons, and text the user
would type.
References
SC23-5253 AIX Performance Management
SC23-5254 AIX Performance Tools Guide and Reference
AIX Commands Reference, Volumes 1-6
SG24-6478 AIX Practical Performance Tools and Tuning Guide
(Redbook)
© Copyright IBM Corp. 2009 Unit 1. Data Collection and Analysis 1-1
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Unit Objectives
Notes:
Introduction
The objectives in the visual above state what you should be able to do at the end of this
unit.
What Exactly is Performance?
• Performance is the major factor on which the productivity of a
system depends
• Performance is dependent on a combination of:
– Throughput
– Response time
• Acceptable performance is based on expectations:
– Expectations are the basis for quantitative performance goals
[Figure: a typical daily load curve from 7 a.m. to 6 p.m., showing the morning crunch, the lunch dip, the 4 o'clock panic, and the 5 o'clock cliff]
Notes:
Throughput is a measure of the amount of work over a period of time. Examples include
database transactions per minute or kilobytes of a file transferred per second.
Response time is the elapsed time between when a request is submitted to when the
response from that request is returned. Examples include how long a database query
takes or how long it takes to access a web page.
Throughput and response time are related. Sometimes you can have higher throughput
at the cost of response time or better response time at the cost of throughput. So,
acceptable performance is based on reasonable throughput combined with reasonable
response time. Sometimes a decision has to be made as to which is more important:
throughput or response time.
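To make the two metrics concrete, here is a small shell sketch. The transaction counts are invented illustration values, not from any measured workload:

```shell
# Hypothetical workload: 1200 database transactions completed in a
# 60-second measurement window, accumulating 90 seconds of total
# request wait time across all transactions.
txns=1200
window=60
total_wait=90

# Throughput: amount of work per unit time (transactions per second)
tps=$(awk -v t="$txns" -v s="$window" 'BEGIN { printf "%.1f", t / s }')

# Average response time: elapsed time per request (seconds)
avg_rt=$(awk -v w="$total_wait" -v t="$txns" 'BEGIN { printf "%.3f", w / t }')

echo "throughput:        $tps transactions/second"
echo "avg response time: $avg_rt seconds/transaction"
```

Pushing more transactions through the same window tends to increase the total wait time per request, which is exactly the throughput versus response time trade-off described above.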
Notes:
The performance of a computer system depends on four main components: CPU,
Memory, I/O, and Network.
Both hardware and software contribute to the entire system performance. You should
not depend on very fast hardware as the sole contributor of system performance. Very
efficient software on average hardware can cause a system to perform much better
(and probably be less costly) than poor software on very fast hardware.
Performance Metrics and Baseline
• Performance is measured through analysis tools
• Metrics that are measured include:
– CPU utilization
– Memory utilization and paging
– Disk I/O
– Network I/O
• Each metric can be subdivided into finer details
• Create a baseline measurement to compare against in
the future
Notes:
CPU utilization can be split into %user, %system, %idle, and %IOwait. Other CPU
metrics can include the length of the run queues, process/thread dispatches, interrupts,
and lock contention statistics.
Memory metrics include virtual memory paging statistics, file paging statistics, and
cache and TLB miss rates.
Disk metrics include disk throughput (kilobytes read/written), disk transactions
(transactions per second), disk adapter statistics, disk queues (if the device driver and
tools support them), and elapsed time caused by various disk latencies. The type of
disk access, random versus sequential, can also have a big impact on response times.
Network metrics include network adapter throughput, protocol statistics, transmission
statistics, network memory utilization, and much more.
You should create a baseline measurement when your system is running well and
under a normal load. This will give you a guideline to compare against when your
system seems to have performance problems.
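A minimal baseline-capture sketch in shell, assuming the monitoring commands named above are on the PATH (each one is skipped gracefully where unavailable); the directory and file names are this example's own invention:

```shell
#!/bin/sh
# Capture a baseline while the system is running well under a
# normal load, for comparison when problems are suspected later.
BASEDIR=/tmp/baseline.$(date +%Y%m%d)
mkdir -p "$BASEDIR"

# Sample each subsystem: CPU/memory (vmstat), disk I/O (iostat),
# CPU by processor (sar): 2-second interval, 2 samples each.
for cmd in "vmstat 2 2" "iostat 2 2" "sar 2 2"; do
    name=${cmd%% *}                      # command name becomes the file name
    $cmd > "$BASEDIR/$name.base" 2>/dev/null || \
        echo "$name not available" > "$BASEDIR/$name.base"
done
ls "$BASEDIR"
```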
Bottlenecks
Notes:
Because server performance is distributed across each server component and type of resource, it is essential to identify the most important factors, or bottlenecks, that affect the performance of a particular activity. Detecting a bottleneck within a server system depends on a range of factors, such as those shown in the visual.
A bottleneck is a particular performance issue that throttles the throughput of the system. It could be in any of the subsystems: CPU, memory, or I/O (including network I/O). The graphic in the visual above illustrates that there may be several performance bottlenecks on a system, and that some may not be discovered until other, more constraining, bottlenecks are discovered and solved.
Determine the Type of the Problem
• Determine the type of the problem:
– Is it a functional problem or purely a performance problem?
– Is it a trend or a sudden issue?
– Is the problem only at certain times?
• What do you do when someone reports a performance
problem?
– Know the nature of the problem
– Gather data and compare against the baseline
• Use AIX tools
• Use PerfPMR
• Document statistics regularly to spot trends for capacity
planning
• Document statistics during high workloads
Notes:
A functional problem is when the application, hardware or network is not behaving
properly. A performance problem is when the functions are being achieved but the
performance is slow. Sometimes functional problems lead to performance problems. In
these cases, rather than tune the system, it is more important to determine the root
cause of the problem and fix it.
It is quite common for support personnel to receive a problem report that says only that someone has a performance problem on the system, accompanied by some data to analyze. That little information is not enough to accurately determine the nature of a performance problem.
Notes:
There are many trade-offs related to performance tuning that should be considered.
The key is to ensure there is a balance between them.
The trade-offs are:
- Cost versus performance
In some situations, the only way to improve performance is by using more or faster
hardware. But, ask the question “Does the additional cost result in a proportional
increase in performance?”
- Conflicting performance requirements
If there is more than one application running simultaneously, there may be
conflicting performance requirements.
- Speed versus functionality
Resources may be increased to improve a particular area, but serve as an overall
detriment to the system. Also, you may need to make choices when configuring your
system for speed versus maximum scalability.
Performance Analysis Tools
CPU: vmstat, iostat, ps, sar, tprof, gprof, prof, time, timex, topas, trace, trcrpt, curt, splat, truss, cpupstat, lparstat, mpstat, smtctl
Memory: vmstat, lsps, svmon, filemon, topas, trace, trcrpt, truss, lparstat
System I/O subsystem: iostat, vmstat, lsps, lsattr, lsdev, lspv, lslv, lsvg, fileplace, filemon, lvmstat, topas, trace, trcrpt, truss, sar
Notes:
Notes:
It is important to collect a variety of data that show statistics regarding the various
system components. In order to make this easy, a set of tools supplied in a package
called PerfPMR is available on a public ftp site. The following URL can be used to
download your version using a web browser:
ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr
The goal is to collect a good base of information that can be used by AIX technical
support specialists or development lab programmers to get started in analyzing and
solving the performance problem. This process may need to be repeated after analysis
of the initial set of data is completed.
Capturing Data with PerfPMR
• Create a directory to collect the PerfPMR data
• Run perfpmr.sh 600 to collect the standard data
• perfpmr.sh will collect information by:
– Running trace for 5 seconds
– Gathering 600 seconds of general system performance data
– Collecting hardware and software configuration information and putting it into a file named config.sum
– Attempting to collect additional data by:
• Running iptrace for 10 seconds
• Running tcpdump for 10 seconds
• Running filemon for 60 seconds
• Running tprof for 60 seconds
• Answer the questions in PROBLEM.INFO
UNIX Software Service Enablement © Copyright IBM Corporation 2009
Notes:
Create a data collection directory and cd into this directory. Allow at least
12 MB/processor of unused space in whatever file system is used.
If there is not enough space in the file system, perfpmr.sh will print a message similar
to:
perfpmr.sh: There may not be enough space in this filesystem
perfpmr.sh: Make sure there is at least 44 Mbytes
To run PerfPMR, type in the command perfpmr.sh. One of the scripts perfpmr.sh
calls is monitor.sh. monitor.sh calls several scripts to run performance monitoring
commands. By default, each of these performance monitoring commands called by
monitor.sh will collect data for 10 minutes (600 seconds). This default time can be
changed by specifying the number of seconds to run as the first parameter to
perfpmr.sh. You can also run the PerfPMR scripts individually.
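The capture procedure just described could be scripted as below. This is a guarded sketch: perfpmr.sh is the AIX PerfPMR script, so on a system without it the sketch simply reports that fact; the directory name is arbitrary.

```shell
#!/bin/sh
# Create a data collection directory (allow at least 12 MB per
# processor of free space) and run the standard 600-second capture.
DATADIR=/tmp/perfdata.$(date +%Y%m%d)
mkdir -p "$DATADIR"
cd "$DATADIR" || exit 1

if command -v perfpmr.sh >/dev/null 2>&1; then
    perfpmr.sh 600                 # 600 seconds of general system data
    status="collection complete in $DATADIR"
else
    status="perfpmr.sh not installed on this system"
fi
echo "$status"
```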
Notes:
PerfPMR collects its data into many different files. The types of files created are listed
on the visual.
The .int data is most useful for metrics analysis. The .sum data is most useful for overall or configuration-type data. The .before and .after data are metrics captured before the testcase begins and at the end of the test interval. These are good for determining starting and delta values for what occurred over the life of the test interval.
monitor.sh
• The monitor.sh script invokes commands and scripts to
collect performance data
Notes:
The perfpmr.sh script calls the monitor.sh script. The monitor.sh script invokes
commands and other scripts to gather performance data. The monitor.sh script can be
run by itself.
The monitor.sh script captures before and after data by invoking the following
commands and scripts: lsps -a, lsps -s, vmstat -i, vmstat -v, and svmon.sh. The
svmon command captures and analyzes a snapshot of virtual memory. The svmon
commands that the svmon.sh script invokes are svmon -G, svmon -Pns, and svmon -S.
The monitor.sh script invokes the following scripts to monitor system data for the
amount of time given in the perfpmr.sh or monitor.sh command: nfsstat.sh (unless
the -n flag is used), netstat.sh (unless the -n flag is used), ps.sh, vmstat.sh,
emstat.sh (unless the -e flag is used), mpstat.sh (unless the -m flag is used),
lparstat.sh (unless the -l flag is used), sar.sh, iostat.sh, and pprof.sh.
Notes:
The PerfPMR package is distributed as a compressed tar file. Obtain the latest version
of PerfPMR from the website ftp://ftp.software.ibm.com/aix/tools/perftools/perfpmr.
When you install PerfPMR, a link will be created in /usr/bin to the perfpmr.sh script.
The PerfPMR process is described in a README file provided in the PerfPMR
package.
PerfPMR should be installed when the system is initially set up and tuned. Then, you
can get a baseline measurement from all the performance tools. When you suspect a
performance problem, PerfPMR can be run again and the results compared with the
baseline measurement.
It is also recommended that you run PerfPMR before and after hardware and software changes. If your system is performing fine, and you then upgrade it and begin to have problems, it is difficult to identify the problem without a baseline to compare against.
The topas Command
Topas Monitor for host: woolf222 EVENTS/QUEUES FILE/TTY
Tue Feb 3 19:43:13 2009 Interval: 10 Cswitch 165 Readch 1373
Syscall 949.2K Writech 335
CPU User% Kern% Wait% Idle% Reads 949.3K Rawin 0
ALL 22.4 77.6 0.0 0.0 Writes 0 Ttyout 64
Forks 0 Igets 0
Network KBPS I-Pack O-Pack KB-In KB-Out Execs 0 Namei 5
Total 0.2 1.0 0.4 0.1 0.1 Runqueue 1.2 Dirblk 0
Waitqueue 0.0
Disk Busy% KBPS TPS KB-Read KB-Writ MEMORY
Total 0.0 0.0 0.0 0.0 0.0 PAGING Real,MB 1024
Faults 17 % Comp 68.8
FileSystem KBPS TPS KB-Read KB-Writ Steals 0 % Noncomp 11.1
Total 1.1 1.0 1.1 0.0 PgspIn 0 % Client 11.1
PgspOut 0
Name PID CPU% PgSp Owner PageIn 0 PAGING SPACE
cpuprog 503892 99.7 0.1 root PageOut 0 Size,MB 512
getty 213180 0.1 0.5 root Sios 0 % Used 1.1
topas 262368 0.0 1.3 root % Free 99.9
java 204836 0.0 70.0 pconsole NFS (calls/sec)
gil 57372 0.0 0.9 root SerV2 0 WPAR Activ 0
java 114916 0.0 37.9 root CliV2 0 WPAR Total 0
rpc.lock 81986 0.0 1.2 root SerV3 0 Press: "h"-help
ksh 290824 0.0 0.5 root CliV3 0 "q"-quit
rmcd 266382 0.0 2.5 root
aixmibd 225438 0.0 1.1 root
sendmail 217244 0.0 1.1 root
xmgc 45078 0.0 0.4 root
Notes:
The topas command reports selected statistics about the activity on the local system.
This tool can be used to provide a full screen of a variety of performance statistics.
The topas tool displays a continually changing screen of data rather than a sequence of
interval samples, as displayed by such tools as vmstat and iostat. Therefore, topas is
most useful for online monitoring and the other tools are useful for gathering detailed
performance monitoring statistics for analysis.
If you're running topas in a partition and issue a dynamic LPAR command that changes the system configuration, topas must be stopped and restarted to view accurate data.
The topas command can show many performance statistics at the same time. The
output consists of two fixed parts and a variable section.
Notes:
Like topas, the nmon tool is helpful in presenting important performance tuning
information on one screen and dynamically updating it.
Another tool, the nmon_analyser, takes files produced by nmon and turns them into
spreadsheets containing high quality graphs ready to cut and paste into performance
reports.
The nmon and nmon_analyser tools come with AIX 5.3 TL09, AIX 6.1 TL02, and Virtual I/O Server (VIOS) 2.1, and are installed by default.
Exercise 1: Data Collection and Analysis
Notes:
Review Questions (1 of 3)
1. Use these terms with the following statements:
metrics, baseline, performance goals,
throughput, response time
Notes:
Review Questions (2 of 3)
2. The four components of system performance are:
–
–
–
–
3. After tuning a resource or system parameter and monitoring the
outcome, what is the next step in the tuning process? _________
____________________________________________________
Notes:
Review Questions (3 of 3)
10. True or False: You can dynamically change the topas and nmon displays.
Notes:
Unit Summary
Notes:
References
SC23-5253 AIX Performance Management
SC23-5254 AIX Performance Tools Guide and Reference
AIX Commands Reference, Volumes 1-6
SG24-6478 AIX Practical Performance Tools and Tuning Guide
(Redbook)
Unit Objectives
After completing this unit, you should be able to:
• Describe the performance tuning process
• List the tools available for tuning
• Find help on the performance tunables
• Define the types of performance tunables
• Describe, display, and change performance tunables
• Describe the error log entry when a restricted tunable is
changed permanently
Notes:
What is Performance Tuning?
Notes:
Performance tuning is one aspect of performance management. The definition of performance tuning sounds simple and straightforward, but it's actually a complex process.
Performance tuning involves managing your resources. Resources could be logical
(queues, buffers, etc.) or physical (real memory, disks, CPUs, network adapters, etc.).
Resource management involves the various tasks listed here. We will examine each of
these tasks later.
Tuning must always be based on performance analysis. While there are
recommendations as to where to look for performance problems, what tools to use, and
what parameters to change, what works on one system may not work on another. So
there is no cookbook approach available for performance tuning that will work for all
systems.
Notes:
The wheel graphic in the visual above represents the phases of a more formal tuning
project. Experiences with tuning may range from the informal to the very formal where
reports and reviews are done prior to changes being made. Even for informal tuning
actions, it is essential to plan, gather data, develop a recommendation, implement, and
document.
Performance Tuning Tools
Notes:
The table in the visual shows the tuning commands that can be used for each
subsystem.
Tuning Commands
• Tunable commands include:
– vmo manages Virtual Memory Manager tunables
– ioo manages I/O tunables
– schedo manages CPU scheduler/dispatcher tunables
– no manages network tunables
– nfso manages NFS tunables
– raso manages reliability, availability, serviceability tunables
• Tunables are the parameters the tuning commands
manipulate
• Tunables can be managed from:
– SMIT
– Web-based System Manager
– Command line
• All tunable commands have the same syntax
Notes:
There are six tunable commands (vmo, ioo, schedo, no, nfso, and raso) that are used
to display and change tuning parameters. These actions can be done through SMIT
panels, Web-based System Manager plug-ins, and the tunable commands.
All six tuning commands (vmo, ioo, schedo, no, nfso and raso) use a common syntax
and are available to directly manipulate the tunable parameter values. Available options
include making permanent changes and displaying detailed help on each of the
parameters that the command manages.
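Because the six commands share one syntax, they can be driven uniformly. The guarded loop below (these are AIX-only commands, so each one is skipped where absent) illustrates the point:

```shell
#!/bin/sh
# The six tuning commands accept the same flags, so a single loop
# can query them all. On a non-AIX system each one is reported absent.
checked=0
for cmd in vmo ioo schedo no nfso raso; do
    if command -v "$cmd" >/dev/null 2>&1; then
        "$cmd" -h 2>/dev/null | head -2    # usage statement for this command
    else
        echo "$cmd: not available on this system"
    fi
    checked=$((checked + 1))
done
echo "commands checked: $checked"
```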
Types of Tunables
• There are two types of tunables (AIX 6.1):
– Restricted Tunables
• Should not be changed unless recommended by AIX
development or development support
• Dynamic change will show a warning message
• Permanent change must be confirmed
• Permanent changes will cause an error log entry at boot
time
– Non-Restricted Tunables
• Can have restricted tunables as dependencies
• Migration from AIX 5.3 to AIX 6.1 will keep the old tunable values
Notes:
Beginning with AIX 6.1, many of the tunables are considered restricted. Restricted
tunables should not be modified unless told to do so by AIX development or support
professionals.
The restricted tunables are not displayed by default.
When migrating to AIX 6.1, the old tunable values will be kept. However, any restricted
tunables that are not at their default AIX 6.1 value will cause an error log entry.
Notes:
Syntax of Tunable Commands
• All tuning commands have the same syntax:
command [ -p | -r ] -D
command [ -p | -r ] [-F] -a
command -h [ Tunable ]
Notes:
The descriptions of the flags are:
Flag Description
-p   Makes the change apply to both current and reboot values
-r   Forces the change to go into effect on the next reboot
-o   Displays or sets individual parameters
-d   Resets an individual tunable to its default value
-D   Resets all tunables to their default values
-F   Forces display of the restricted tunable parameters when the -a, -L, or -x options are specified alone on the command line to list all tunables
-a   Displays all parameters
-h   Displays help information for a tunable
-L   Lists attributes of one or all tunables
-x   Lists characteristics of one or all tunables, one per line, in a spreadsheet-type format
Tunables Documentation
• The -h flag displays information for the tuning commands
vmo, ioo, schedo, raso, no, and nfso:
– command -h displays the usage statement for the command
– command -h <tunable> displays the tunable's purpose,
values (default, range, type, unit), and tuning information
• Beginning with AIX 6.1, none of the AIX manuals or man pages
contain documentation on the performance tunables
Notes:
The -h flag of the tuning commands displays help about the tunable parameter, if one is specified. Otherwise, the command usage statement is displayed.
Prior to AIX 6.1, the performance tunables were described in the documentation and
man pages of the related command (schedo, vmo, ioo, raso, no, and nfso). The
documentation could not keep up with the changes being made to the tunable values
(default and range), or the addition of new tunables.
In AIX 6.1, to keep the tunables information up to date, the tunables descriptions can
only be found from the tuning command itself. System documentation is fairly static, so
it was hard to keep up with the many tunables available, the adjustments to default
tunable values, changes to the tunable value ranges, and new tunables. The tunable
information is now dynamically retrieved from the kernel providing more accurate help.
This ensures a single method to know what functions a command currently has.
Displaying Tunable Values
• The no, nfso, vmo, ioo, raso, and schedo tuning commands all
support the following syntax to display tunables:
– To display a single tunable:
command -o tunable (display current value)
command -L tunable (display tunable attributes)
– To display the current values for all the command's non-restricted
tunables:
command -a
– To display the current values for all the command's tunables:
command -F -a
– To display the tunable attributes for all the command's non-restricted
tunables:
command -L
– To display the tunable attributes for all the command's tunables:
command -F -L
Notes:
The -o flag displays the current value of the given tunable.
The -a flag displays the current, reboot (when used with the -r option), or permanent (when used with the -p option) values for all tunable parameters.
The -F flag forces display of the restricted tunable parameters when the -a, -L, or -x options are specified alone on the command line to list all tunables.
The -L flag lists the attributes of one or all tunables (current, default, boot, minimum, maximum, unit, type, and dependencies).
Notes:
The vmo -a command will display the current value of the VMM tunables.
When the -F flag is not specified, restricted tunables are not displayed unless they are specifically named as a parameter (for example, vmo -o maxclient%).
As shown in the visual, when the -F flag is included with the vmo -a command (vmo -a -F), the non-restricted tunables are displayed first, followed by the restricted tunables. Note that the line ##Restricted tunables is displayed before the restricted tunables are listed.
The restricted tunables will not be shown by default (without the -F flag) regardless of
whether they have been modified or not.
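The difference between the two listings can be seen with a guarded sketch (vmo is AIX-only, so the sketch degrades gracefully on other systems):

```shell
#!/bin/sh
# Compare the default listing (non-restricted tunables only) with
# the forced full listing that includes restricted tunables.
if command -v vmo >/dev/null 2>&1; then
    default_count=$(vmo -a | wc -l)      # non-restricted only
    full_count=$(vmo -F -a | wc -l)      # restricted included
    summary="full listing adds $((full_count - default_count)) restricted lines"
else
    summary="vmo not available on this system"
fi
echo "$summary"
```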
Displaying Attributes of Tunables
# vmo -L
Notes:
The -L option of the tunable commands (vmo, ioo, schedo, no and nfso) can be used to
print out the attributes of a single tunable or all the tunables.
The output of the command with the -L option shows the current value, default value,
value to be set at next reboot, minimum possible value, maximum possible value, unit,
type, and dependencies.
Changing Tunable Values
• To change a tunable value permanently, use the -p flag with the -o flag:
# vmo -p -o minperm%=7
Setting minperm% to 7 in nextboot file
Setting minperm% to 7
• To change a tunable at the next reboot, use the -r flag with the -o flag:
# vmo -r -o minperm%=6
Setting minperm% to 6 in nextboot file
Warning: changes will take effect only at next reboot
• To change all the tunable values to their default values, use the -D flag:
# vmo -D
Notes:
The -o option of the tunable commands (vmo, ioo, schedo, no and nfso) is used to
change a tunable value to the new value specified.
Any change (with -o, -d or -D) to a parameter of type Mount will result in a message
being displayed to warn the user that the change is only effective for future mount
operations.
Any change (with -o, -d or -D flags) to a parameter of type Connect will result in inetd
being restarted, and a message displaying a warning to the user that the change is only
effective for future socket connections.
Any attempt to change (with -o, -d or -D) a parameter of type Bosboot or Reboot
without -r, will result in an error message.
Any attempt to change (with -o, -d or -D but without -r) the current value of a
parameter of type Incremental with a new value smaller than the current value, will
result in an error message.
Changing Restricted Tunable Values
Restricted tunables should NOT be changed without
approval from AIX Development or AIX Support!
• Changing a restricted tunable dynamically
– Warning message is written that states a restricted tunable has
been modified
# vmo -o maxperm%=95
Setting maxperm% to 95
Warning: a restricted tunable has been modified
Notes:
CAUTION!
Restricted tunables should not be modified unless told to do so by AIX development
or support professionals.
The resulting TUNE_RESTRICTED error log entry looks like this:
Description
RESTRICTED TUNABLES MODIFIED AT REBOOT
Probable Causes
SYSTEM TUNING
User Causes
TUNABLE PARAMETER OF TYPE RESTRICTED HAS BEEN MODIFIED
Recommended Actions
REVIEW TUNABLE LISTS IN DETAILED DATA
Detail Data
LIST OF TUNABLE COMMANDS CONTROLLING MODIFIED RESTRICTED TUNABLES AT REBOOT, SEE FILE
/etc/tunables/lastboot.log
vmo
Notes:
When the system is rebooted, any restricted tunables in the /etc/tunables/nextboot file
that were modified from their default values (by using a tuning command specifying the
-r or -p flag) will cause an error log entry with a label of TUNE_RESTRICTED.
The /usr/lib/perf/tunerrlog command creates the TUNE_RESTRICTED error log
entry. The tunerrlog command is a new performance command that is included in the
bos.perf.tune package.
The tunerrlog command is called by /usr/sbin/tunrestore -R (which is in
/etc/inittab).
Tunables Files
• The /etc/tunables directory centralizes the tunable files
Notes:
The parameter values tuned by vmo, schedo, ioo, no, and nfso are stored in files in
/etc/tunables.
Tunables files currently support a stanza for each of the six tunable commands (schedo, vmo, ioo, no, nfso, and raso), plus a special info stanza. The six command stanzas contain the tunable parameters managed by the corresponding command (see the command's man pages for the complete parameter lists).
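Because the stanza format is simple attribute = "value" pairs grouped under a command name, it is easy to inspect with standard tools. The sketch below extracts one value from a sample file whose contents are modeled on the nextboot example in this unit:

```shell
#!/bin/sh
# Build a small sample file in the stanza format, then pull one
# vmo value out of it with awk.
SAMPLE=/tmp/nextboot.sample
cat > "$SAMPLE" <<'EOF'
info:
    AIX_level = "6.1.1.1"
vmo:
    maxperm% = "94"
    minperm% = "6"
schedo:
ioo:
EOF

# The value is the second quote-delimited field on the line
minperm=$(awk -F'"' '/minperm% =/ { print $2 }' "$SAMPLE")
echo "minperm% is set to $minperm"
```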
nextboot File
Example of a nextboot file:
info:
AIX_level = "6.1.1.1"
Kernel_type = "MP64"
Last_validation = "2009-01-18 14:24:43 CST (current, reboot)"
vmo:
maxperm% = "94"
minperm% = "6"
schedo:
ioo:
raso:
Notes:
The nextboot file is automatically applied at boot time and only contains the list of
tunables to change. It does not contain all parameters. The bosboot command also
gets the value of Bosboot type tunables from this file. It contains all tunable settings
made permanent.
lastboot File
# cat lastboot
info:
Logfile_checksum = "1323389206"
Description = "Full set of tunable parameters after last boot"
AIX_level = "6.1.2.1"
Kernel_type = "MP64"
Last_validation = "2009-02-03 20:43:47 CST (current, reboot)"
...
schedo:
...
vmo:
...
minfree = "960" # DEFAULT VALUE
minperm = "14301" # STATIC (never restored)
minperm% = "6"
nokilluid = "0" # DEFAULT VALUE
npskill = "256" # DEFAULT VALUE
...
ioo:
...
raso:
...
no:
...
net_malloc_police = "16384" # RESTRICTED not at default value
...
nfso:
...
Notes:
The lastboot file is automatically generated at boot time. It contains the full set of tunable parameters, with their values as of the beginning of this boot. Default values are marked with # DEFAULT VALUE. Restricted parameters have a blank second column or the phrase # RESTRICTED not at default value (depending on the TL of AIX 6).
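The # DEFAULT VALUE marker makes it easy to list only the tunables that differ from their defaults. The sketch below runs against sample data modeled on the lastboot excerpt above:

```shell
#!/bin/sh
# List the tunables in a lastboot-style file that are NOT at their
# default value (lines lacking the "# DEFAULT VALUE" marker).
LASTBOOT=/tmp/lastboot.sample
cat > "$LASTBOOT" <<'EOF'
    minfree = "960" # DEFAULT VALUE
    minperm% = "6"
    nokilluid = "0" # DEFAULT VALUE
EOF

grep -v '# DEFAULT VALUE' "$LASTBOOT"
changed=$(grep -cv '# DEFAULT VALUE' "$LASTBOOT")
echo "tunables not at default: $changed"
```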
lastboot.log File
# cat /etc/tunables/lastboot.log
Restoring no values
===================
Warning: a restricted tunable has been modified
Setting net_malloc_police to 65536
Notes:
The lastboot.log file should be the only file in /etc/tunables that is not in the stanza format described here. It is automatically generated at boot time and contains the log of the creation of the lastboot file; that is, any parameter change made is logged. Any change that could not be made (possible if the nextboot file was created manually and not validated with tuncheck) is also logged. (tuncheck will be covered soon.)
Managing Tunables Files
• Commands to manipulate the tunables files in /etc/tunables
are:
– tuncheck
Used to validate the parameter values in a file
– tunrestore
Changes tunables based on parameters in a file
– tunsave
Saves tunable values to a stanza file
– tundefault
Resets tunable parameters to their default values
Notes:
tuncheck command
The tuncheck command validates a tunables file. All tunables listed in the specified file
are checked for range and dependencies. If a problem is detected, a warning is issued.
tunrestore command
The tunrestore command is used to change all tunable parameters to values stored in
a specified file.
tunsave command
The tunsave command saves the current state of the tunables parameters in a file.
tundefault command
The tundefault command resets all tunable parameters to their default values. It
launches all the tuning commands (ioo, vmo, schedo, no and nfso) with the -D flag.
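Taken together, these commands form a save, validate, restore cycle. A minimal sketch
(the file name mytunables is a hypothetical example; the tun* commands exist only on
AIX, so the sketch checks for them first):

```shell
#!/bin/sh
# Save the current tunables, validate the file, then restore from it.
# AIX-only commands; the guard lets the sketch degrade on other systems.
if command -v tunsave >/dev/null 2>&1; then
    tunsave -f /etc/tunables/mytunables     # save current tunable values
    tuncheck -f /etc/tunables/mytunables    # check ranges and dependencies
    tunrestore -f /etc/tunables/mytunables  # apply the values from the file
else
    echo "tunables commands not available on this system"
fi
```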
Notes:
Review Questions
1. True or False: In AIX 6.1, help for the performance tunables
is available in the man pages.
2. Which tunable command flag will show the restricted
tunables:
a) -h
b) -s
c) -F
d) -R
3. True or False: A confirmation must always be given when
permanently changing a restricted tunable.
4. True or False: An error log entry is created when a
restricted tunable is changed permanently.
5. True or False: When a system is rebooted, the lastboot file
flags any tunables that have been changed since the
system booted.
UNIX Software Service Enablement © Copyright IBM Corporation 2009
Notes:
Unit Summary
• Tune one tunable, or one set of tunables, at a time and check whether performance improved
• The documentation is only found in the help message of the tuning
commands
• The -h flag displays information for the tuning commands
vmo, ioo, schedo, raso, no, nfso
• Restricted tunables should not be changed unless recommended by AIX
development or development support
• Restricted tunables are not shown by the tuning commands unless the -F flag
is used
• Changing restricted tunables:
– Dynamically: will show a warning message
– Permanently: must be confirmed
• Permanent changes to restricted tunables will cause an error log entry at
boot time
Notes:
Introduction
This unit identifies the tools to help determine CPU bottlenecks. It also
demonstrates techniques to tune CPU-related issues on your system.
Unit Objectives
After completing this unit, you should be able to:
• Describe processes and threads
• Describe how process priorities affect CPU scheduling
• Use the nice and renice commands to change process priorities
• Describe the simultaneous multi-threading concept and its effect
on performance monitoring and tuning
• View logical processors
• Use smtctl to enable/disable simultaneous multi-threading and
view simultaneous multi-threading statistics
• Use the output of the following AIX tools to determine symptoms of
a CPU bottleneck:
- vmstat, sar, ps and topas
References
SC23-5253 AIX Performance Management
SC23-5254 AIX Performance Tools Guide and Reference
AIX Commands Reference, Volumes 1-6
SG24-6478 AIX Practical Performance Tools and Tuning Guide
(Redbook)
SG24-7559 IBM AIX Version 6.1 Differences Guide (Redbook)
Unit Objectives
After completing this unit, you should be able to:
• Describe processes and threads
• Describe how process priorities affect CPU scheduling
• Use the nice and renice commands to change process
priorities
• Describe the simultaneous multi-threading concept and its
effect on performance monitoring and tuning
• View logical processors
• Use smtctl to enable/disable simultaneous multi-threading
and view simultaneous multi-threading statistics
• Use the output of the following AIX tools to determine
symptoms of a CPU bottleneck:
– vmstat, sar, ps and topas
Notes:
Introduction
The objectives in the visual above state what you should be able to do at the end of this
unit.
[Flowchart: High CPU usage? If yes, are the processes supposed to be using that much
CPU? If so, tune the applications / operating system. If usage is low, is the CPU
supposed to be idle?]
Notes:
This flowchart illustrates the CPU-specific monitoring and tuning strategy. If the system
is not meeting the CPU performance goal, you need to find the root cause for why the
CPU subsystem is constrained. It may be simply that the system needs more physical
CPUs, but it could also be because of errant applications or processes gone awry. If the
system is behaving normally but is still showing signs of a CPU bottleneck, tuning
strategies may help to get the most out of the CPU resources.
If you see unusually high CPU usage when monitoring, the next question to ask is,
“What processes are accumulating CPU time?” “Are they supposed to be accumulating
so much CPU time?” If they are, then perhaps there are some tuning strategies you can
use to tune the application or the operating system to make sure that important
processes get the CPU they need to meet the performance goal.
Another scenario is that you’re not meeting performance goals and the CPUs are fairly
idle or not working as much as they should. This points to a bottleneck in another area
of the computer system.
Processes and Threads
[Diagram: a single-threaded process runs its program as one thread (Thread 1) on one
CPU; a multi-threaded process runs Thread 1, Thread 2, and Thread 3 concurrently on
CPU 0, CPU 1, and CPU 2.]
Notes:
Process
A process is the entity that the operating system uses to control the use of system
resources. A process is started by a command, shell program or another process.
Thread
Each process is made up of one or more kernel threads. A thread is a single sequential
flow of control. A single-threaded process can only handle one operation at a time,
sequentially. Multiple threads of control allow an application to overlap operations, such
as reading from a terminal and writing to a file. AIX schedules and dispatches CPU
resources at the thread level. In general, when we refer to threads in this course, we will
be referring to the kernel threads within a process.
[Diagram: process states over time: SIDL during creation; A (active), in which threads
move among the R (ready-to-run), RUNNING, S (sleeping), and T (stopped) states; and
Z (zombie) at exit.]
Notes:
Before a process is created, it needs a slot in the process and thread tables; at this
stage it is in the SNONE state. While a process is undergoing creation, waiting for
resources (memory) to be allocated, it is in the SIDL state (I (idle) state).
When a process is in an A state, one or more of its threads are in the R (ready-to-run)
state. Threads of a process in this state have to contend for the CPU with all other
ready-to-run threads. Only one thread can have the use of the CPU at a time; this is the
running thread for that processor. A thread will be in an S state if a thread is waiting on
an event or I/O. Instead of wasting CPU time, it sleeps and relinquishes control of the
CPU. A thread may be stopped via the SIGSTOP signal, and started again via the
SIGCONT signal; while suspended it is in the T state. This has nothing to do with
performance management.
The Z state: When a process dies (exits) it becomes a zombie.
Run Queues
[Diagram: a global run queue plus a local run queue for each CPU; every run queue
holds prioritized threads in 256 priority-ordered queues, numbered 0 through 255.]
Notes:
There is a run queue structure for each CPU as well as a global run queue. The run
queue is divided further into queues that are priority ordered (one queue per priority
number). The per-CPU run queues are called local run queues. When a thread has
been running on a CPU, it will tend to stay on that CPU’s run queue. If that CPU is busy,
then the thread can be dispatched to another idle CPU and will be assigned to that
CPU’s run queue. When a CPU performs idle load balancing (i.e., a CPU is idle, and
tries to steal work from another CPU), it will steal threads that are less favored, since
the highly favored threads will run soon enough on the busy CPU. If the higher favored
threads were moved, they would suffer cache misses and performance would be worse.
Less favored threads are moved, since even though they will suffer cache misses, they
still end up running sooner than they would have if they'd remained on the busy CPU
run queue.
The dispatcher picks the best priority thread in the run queue when a CPU is available.
When a thread is first created, it is assigned to the global run queue. It stays on that
queue until assigned to a local run queue.
Notes:
A priority is a number assigned to a thread used to determine the order of scheduling
when multiple threads are runnable. A process priority is the most favored priority of any
one of its threads. The initial process/thread priority is inherited from the parent
process.
The kernel maintains a priority value (sometimes termed the scheduling priority) for
each thread. The priority value is a positive integer and varies inversely with the
importance of the associated thread. That is, a smaller priority value indicates a more
important thread. When a CPU is looking for a thread to run, it chooses the
dispatchable thread with the smallest priority value.
A thread can be fixed-priority or nonfixed-priority. The priority value of a fixed-priority
thread is constant, while the priority value of a nonfixed-priority thread can change
depending on its CPU usage.
Real-time thread priorities are lower than 40. Real-time applications should run with a
fixed priority and a numerical value less than 40 so that they are more favored than
other applications.
Changing Priority with nice/renice
• Initial priority of a non-fixed thread is 40 + nice
• Nice has a default value of:
– 20 for a foreground process
– 24 for a ksh background process
• Nice value can range from 0 to 39
– The higher the nice value the lower its priority
– Only root can make its priority more favorable
• Nice value can be set when a process is started by using the nice
command
– Example: Add 10 to the nice value:
# nice -n 10 command
• Nice value can be changed for a running process using the renice
command
– Example: Add 10 to the nice value:
# renice -n 10 -p PID
Notes:
A user can use the nice and renice commands to change the nice value for a process.
The nice value is used by the system to calculate the current priority of a running
process. It is added to the base user priority of 40 for non-fixed priority threads and is
irrelevant for fixed priority threads. The default nice value is 20, which results in a
priority of 60 (the nice value added to the user base priority of 40). Some shells (such
as ksh) automatically add 4 to the default nice value if a process is started in the
background (using &). Only the root user can change the priority to a more favored
priority.
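The arithmetic above can be checked directly. A small sketch of the initial-priority
formula (initial priority = base 40 + nice), using the default foreground and ksh
background nice values:

```shell
#!/bin/sh
# Initial priority of a non-fixed thread: base user priority 40 plus nice.
base=40
fg_nice=20    # default nice value for a foreground process
bg_nice=24    # default nice value for a ksh background process (20 + 4)

echo "foreground initial priority: $((base + fg_nice))"   # 60
echo "background initial priority: $((base + bg_nice))"   # 64
```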
The nice command does not return an error message if you attempt to increase a
command's priority without the appropriate authority. Instead, the command's priority is
not changed, and the system starts the command as it normally would.
The renice command alters the nice value of a specific process, all processes with a
specific user ID, or all processes with a specific group ID.
nice/renice Examples
• nice Examples:
Command Action Relative Priority
nice -10 foo Add 10 to current nice value Lower priority (disfavored)
nice -n 10 foo Add 10 to current nice value Lower priority (disfavored)
nice --10 foo Subtract 10 from current nice value Higher priority (favored)
nice -n -10 foo Subtract 10 from current nice value Higher priority (favored)
• renice Examples:
Command Action Relative Priority
renice 10 -p 563 Add 10 to default nice value Lower priority (disfavored)
renice -n 10 -p 563 Add 10 to current nice value Lower priority (disfavored)
renice -10 -p 563 Subtract 10 from default nice value Higher priority (favored)
renice -n -10 -p 563 Subtract 10 from current nice value Higher priority (favored)
Notes:
The -Increment flag to the nice command is equivalent to the -n Increment flag. Both
forms adjust a command's nice value up or down; you can specify a positive or
negative number. Positive increment values reduce priority (disfavor). Negative
increment values increase priority (favor). Only users with root authority can specify a
negative increment.
With the renice command, the way the increment value is used depends on whether
the -n flag is specified. If -n is specified, then the increment value is added to the
current nice value. If the -n flag is not specified, then the increment value is added to
the default value of 20 to get the effective nice value.
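The two behaviors can be expressed numerically. A sketch, assuming a hypothetical
process whose current nice value is 25 and an increment of 10:

```shell
#!/bin/sh
# renice increment handling:
#   with -n:    new nice = current nice value + increment
#   without -n: new nice = default nice value (20) + increment
current=25     # hypothetical current nice value of the target process
default=20
increment=10

echo "with -n:    new nice = $((current + increment))"   # 35
echo "without -n: new nice = $((default + increment))"   # 30
```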
Viewing Process Priorities
# ps -elk
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
303 A 0 0 0 120 16 -- 15004190 384 - 0:01 swapper
200003 A 0 1 0 0 60 20 10001480 708 - 0:00 init
303 A 0 8196 0 0 255 -- 17006190 384 - 20:31 wait
303 A 0 12294 0 0 17 -- 19008190 448 - 0:00 sched
303 A 0 16392 0 0 16 -- 1b00a190 512 f100080009786c08 - 0:00 lrud
303 A 0 49176 0 0 255 -- 1d02c190 384 - 20:11 wait
303 A 0 53274 0 0 255 -- 1f02e190 384 - 20:35 wait
303 A 0 57372 0 0 255 -- 1030190 384 - 20:14 wait
303 A 0 61470 0 0 36 -- 2033190 448 - 0:00 netm
303 A 0 65568 0 0 37 -- 4035190 960 * - 0:01 gil
303 A 0 69666 0 0 16 -- 9038190 512 3f2af70 - 0:00 wlmsched
40201 A 0 81986 0 0 60 20 170a6190 448 - 0:00 lvmbb
240001 A 0 106618 1 0 60 20 1c14d480 552 * - 0:00 syncd
240001 A 0 180346 151706 0 60 20 f1de480 376 - 0:00 syslogd
240001 A 0 192764 204958 0 60 20 1fb2e480 824 f100070000159c78 pts/2 0:00 ksh
200001 A 0 262372 192764 34 87 24 1cb4d480 92 pts/2 52:37 myprog
200001 A 0 286896 192764 35 87 24 1fb4e480 92 pts/2 52:32 myprog2
200001 A 0 290950 356386 0 60 20 15b64480 732 pts/1 0:00 ps
# ps -L 192764 -l
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
240001 A 0 192764 204958 0 60 20 1fb2e480 824 f100070000159c78 pts/2 0:00 ksh
200001 A 0 262372 192764 40 90 24 1cb4d480 92 pts/2 55:02 myprog
200001 A 0 286896 192764 40 90 24 1fb4e480 92 pts/2 54:55 myprog2
Notes:
The priority values shown on the visual are those of the most favored thread for each
process. To view all processes, run the command: ps -el.
To view the most favored thread for each process including kernel processes, run the
command: ps -elk.
Beginning with AIX 5.3, the -L <PIDlist> option generates a list of descendants of
each PID that has been passed to it in the Pidlist variable.
The priority is listed under the PRI column. If the value under NI is --, this indicates that
it is a fixed priority.
You can use the ps command with the -l flag to view a command's nice value. The nice
value appears under the NI heading in the ps command output. If the nice value in ps is
--, the process is running at a fixed priority.
Another column in the ps output is important, the C (CPU usage) column. This
represents the CPU utilization of all the process’s threads, incremented each time the
system clock ticks and a thread is found to be running.
Notes:
Context Switches
• A context switch is when one thread is taken off a CPU and
another thread is dispatched onto the same CPU
• Context switches are normal for multi-processing systems:
– What is abnormal? Check against baseline
– High context switch rate could be indication of lock contention
• Use vmstat, sar, or topas to see context switches
• Example:
# vmstat 1 5
Notes:
A context switch (also known as process switch or thread switch) is when a thread is
dispatched to a CPU and the previous thread on that CPU was a different thread from
the one currently being dispatched. Context switches occur for various reasons. The
most common reason is where a thread has used up its timeslice or has gone to sleep
waiting on a resource (such as waiting on an I/O to complete or waiting on a lock) and
another thread takes its place.
High context switch rates may be an indication of a resource contention issue such as
application or kernel lock contention.
The rate is given in switches per second. It’s not uncommon to see the context switch
rate be approximately the same as the device interrupt rate (the in column in vmstat).
A context switch occurs when:
- A thread has to wait for a resource (voluntarily)
- A “higher priority” thread wakes up (involuntarily)
- The thread has used up its timeslice (10 ms by default)
• System mode:
– System mode is when the CPU is executing code in the kernel
– CPU time spent in kernel mode is reflected as system time in the
output of commands such as vmstat, topas, iostat, and sar
– Context switch time, system calls, device interrupts, NFS I/O,
and anything else in the kernel is counted as system time
Notes:
User time is simply the percentage of time the CPUs are spending executing code in
the applications or shared libraries. System time is the percentage of time the CPUs
execute kernel code. System time can be because the applications are executing
system calls which enter the applications into the kernel, or can be because there are
kernel threads running that only execute in kernel mode, or can be because interrupt
handler code is currently being run. When using monitoring tools, add up the user and
the system CPU utilization percentage to see the total CPU utilization.
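As a worked example using the vmstat sample shown earlier in this unit (us=28,
sy=72), the total utilization is simply the sum of the two columns:

```shell
#!/bin/sh
# Total CPU utilization is user time plus system time (values taken from
# the vmstat example earlier in this unit).
us=28
sy=72
echo "total CPU utilization: $((us + sy))%"   # 100%
```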
The use of a system call by a user mode process allows a kernel function to be called
from user mode. This is considered a mode switch. Mode switching is when a thread
switches from user mode to kernel or system mode. Switching from user to system
mode and back again is normal for applications. System mode does not just represent
operating system housekeeping functions.
Mode switches should be differentiated between the context switches seen in the output
of vmstat (cs column) and sar (cswch/s).
What is Simultaneous Multi-Threading?
• Two hardware threads can run on one physical processor at the
same time
• One processor appears as two logical processors to the operating
system
• On an LPAR with shared processors, the logical processors will be
twice the number of virtual (not physical) processors
• Simultaneous multi-threading is a means of converting thread-level
parallelism (multiple CPUs) to instruction-level parallelism (same
CPU)
[Diagram: one physical CPU provides two hardware threads (Thread0 and Thread1);
the AIX layer sees them as two logical processors, logical CPU0 and logical CPU1.]
Notes:
Simultaneous multi-threading (SMT) is the ability of a single physical processor to
concurrently execute instructions from more than one hardware thread. There are two
hardware threads per physical processor, so additional instructions can run at the same
time. Since instructions from any of the threads can be fetched by the processor in a
given cycle, the processor is no longer limited by the instruction level parallelism of the
individual threads.
Simultaneous multi-threading also allows instructions from one thread to utilize all the
execution units if the other thread encounters a long latency event. For instance, when
one of the threads has a cache miss, the second thread can continue to execute.
Each hardware thread is supported as a separate logical processor by the operating
system. So, a dedicated partition that is created with one physical processor is
configured by the operating system as a logical two-way when simultaneous
multi-threading is enabled. This is independent of the partition type, so a shared
partition with one virtual processor is configured as a logical two-way. Beginning with
AIX 5.3, SMT is enabled by default on hardware that supports it.
Notes:
Simultaneous multi-threading is a good choice when the overall throughput is more
important than the throughput of an individual thread.
Simultaneous multi-threading is not always advantageous. Any workload where the
majority of individual software threads highly utilize any resource in the processor or
memory will benefit very little from simultaneous multi-threading.
Where simultaneous multi-threading is not beneficial, POWER5 and later systems
support single-threaded execution mode. In this mode, the system gives all the physical
resources to the active thread.
If simultaneous multi-threading is not beneficial, it can be disabled.
The process of putting an active thread into a dormant state is known as snoozing. In
dedicated processor partitions, if there are not enough tasks available to run on both
hardware threads of a processor, the operating system’s idle process will be selected to
run on the idle hardware thread.
Viewing Processor and Attribute Information
• List processors with the lsdev command:
– lsdev lists physical or virtual processors:
# lsdev -Cc processor
proc0 Available 00-00 Processor
proc2 Available 00-02 Processor
Notes:
The lsdev command lists processors that the operating system sees, and their AIX
location codes. When a partition is using dedicated processors, lsdev shows physical
processors. When a partition is using shared processors, lsdev shows virtual
processors.
The lsattr command shows the processor attributes:
- The smt_enabled attribute indicates whether simultaneous multi-threading is
enabled or not.
- The smt_threads attribute shows the number of simultaneous multi-threading
threads per physical (for dedicated processor partitions) or virtual processor (on
shared processor partitions).
The numbers shown by the bindprocessor -q command are the logical CPU numbers
for the AIX instance. These don't necessarily correspond with the processors shown in
the lsdev command output.
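As a sketch, the attributes and logical CPU numbers described above might be queried
as follows (AIX-only commands; bindprocessor is used as the guard because it does
not exist on other systems, and proc0 is assumed to be the first processor reported by
lsdev):

```shell
#!/bin/sh
# Query SMT attributes and logical CPU numbers (AIX-only, guarded).
if command -v bindprocessor >/dev/null 2>&1; then
    lsattr -El proc0 -a smt_enabled -a smt_threads   # SMT attributes
    bindprocessor -q     # logical CPU numbers seen by this AIX instance
else
    echo "bindprocessor not available on this system"
fi
```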
Notes:
Beginning with AIX 5.3, simultaneous multi-threading is enabled by default and
supported by AIX. You may dynamically change the simultaneous multi-threading
setting with the smtctl command or with the SMIT menu subsystem.
The smtctl command provides privileged users and applications the ability to control
utilization of processors with simultaneous multi-threading support. With this command,
you can enable or disable simultaneous multi-threading system-wide, either
immediately or the next time the system boots.
The smtctl command does not rebuild the boot image. If you want your change to
persist across reboots, the bosboot command must be used to rebuild the boot image.
Beginning with AIX 5.3, the boot image has been extended to include an indicator that
controls the default simultaneous multi-threading mode.
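As a hedged sketch of the sequence described above (AIX-only; smtctl -m selects the
mode, -w now applies it immediately while -w boot defers it to the next boot, and
bosboot -a rebuilds the boot image so the setting persists):

```shell
#!/bin/sh
# Change the SMT mode and make it persistent (AIX-only, guarded).
if command -v smtctl >/dev/null 2>&1; then
    smtctl                   # display the current SMT settings
    smtctl -m off -w now     # disable SMT immediately
    smtctl -m on -w boot     # enable SMT at the next boot ...
    bosboot -a               # ... and rebuild the boot image to persist it
else
    echo "smtctl not available on this system"
fi
```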
Viewing smtctl Settings
# smtctl
Notes:
Timing Commands
• Time commands show:
– Elapsed time
– CPU time spent in user mode
– CPU time spent in system mode
# /usr/bin/time <command> <command arguments>
real 9.30
user 3.10
sys 1.20
Notes:
Timing commands
Use the timing commands to understand the performance characteristics of a single
program and its synchronous children. The output from /usr/bin/time and timex is
in seconds. The output of the Korn shell's built-in time command is in minutes and
seconds; the C shell's built-in time command uses yet another format.
Monitoring CPU Usage with vmstat
# vmstat 5 3
System configuration: lcpu=4 mem=1024MB
kthr memory page faults cpu
----- ------------- ---------------------- --------------- ------------
r b avm fre re pi po fr sr cy in sy cs us sy id wa
19 2 127005 758755 0 0 0 0 0 0 1692 10464 1070 48 52 0 0
19 2 127096 758662 0 0 0 0 0 0 1397 71452 1059 28 72 0 0
19 2 127100 758656 0 0 0 0 0 0 1361 72624 1001 28 72 0 0
Notes:
Using vmstat with intervals during the execution of a workload will provide information
on paging space activity, real memory use, and CPU utilization. vmstat data can be
retrieved from the PerfPMR monitor.int file.
# sar -u 2
Notes:
The sar command is the System Activity Report tool and is standard for UNIX systems.
The sar command can collect data in real-time and postprocess the data in real-time or
after the fact. sar data can be retrieved from the PerfPMR monitor.int file.
sar -q reports queue statistics. A blank value in any column indicates that the
associated queue is empty.
The -q option can indicate whether you just have many jobs running (runq-sz) or have
a potential paging bottleneck.
A large number of runnable threads does not necessarily indicate a CPU bottleneck. If
the performance goals are being met and the system is running the threads quickly,
then it doesn’t matter if this number seems high.
The sar -u report in the visual displays the system-wide statistics. The -u flag
information is expressed as percentages so the system-wide information is simply the
average of each individual processor's statistics. Also, the I/O wait state is defined
system-wide and not per processor.
Using the sar -P Command
• Reports system activity information from selected
cumulative activity counters
# sar -P ALL 2 1
Notes:
If the sar -P flag is given, the sar command reports activity which relates to the
specified processor or processors. If -P ALL is given, the sar command reports
statistics for each individual processor, followed by system-wide statistics in the row that
starts with the hyphen.
The visual above shows a system running with the same workload, first with SMT
disabled, then with it enabled. Notice that the logical CPU number doubled with SMT
enabled. Also notice the new statistic of physc or physical CPU consumed with SMT
enabled. This shows how much of a CPU was consumed by the logical processor (the
measurement of fraction of time a logical processor was getting physical processor
cycles).
The example in the visual was created on a partition with dedicated processors. When
the partition has shared processors, an additional column is displayed (%entc). The
%entc column reports the percentage of entitled capacity consumed.
Interval: 2
Logical Partition: Wed Feb 4 16:30:25 2009
Dedicated SMT ON Online Memory: 1024.0
Partition CPU Utilization Online Virtual CPUs: 2 Online Logical CPUs: 4
%user %sys %wait %idle %hypv hcalls
19 81 0 0 0.0 127
===============================================================================
LCPU minpf majpf intr csw icsw runq lpa scalls usr sys _wt idl pc
Cpu0 0 0 105 112 62 0 100 722709 19 81 0 0 0.49
Cpu1 0 0 101 82 55 0 99 731461 19 81 0 0 0.51
Cpu2 0 0 254 102 65 0 100 728143 19 81 0 0 0.50
Cpu3 0 0 102 78 48 0 100 728527 19 81 0 0 0.50
Notes:
The topas output has been modified to show statistics by logical processor.
The visual above shows output from a system with two dedicated processors and
simultaneous multi-threading enabled which is why we see four logical processors.
Using the mpstat Command
• The mpstat command displays performance statistics for logical
processors:
– Shows the distribution of work between logical processors
– Percentage for each logical processor is sum of %user and %sys
# mpstat -s 1 1

System configuration: lcpu=2 ent=0.2 mode=Uncapped

     Proc0
     0.39%
 cpu0      cpu1
0.30%     0.09%
Notes:
If SMT is enabled, the mpstat -s command displays logical processors usage as
shown in the visual above.
With dedicated processors, the two logical processor utilization metrics always add up
to a whole processor. In the dedicated processor example, logical processor cpu0 is
49.72% busy and logical processor cpu1 is 50.22% busy; cpu0 and cpu1 are the
hardware threads of proc0. Logical processor cpu2 is 49.83% busy and logical
processor cpu3 is 50.20% busy; cpu2 and cpu3 are the hardware threads of proc1.
If the partition were using shared processors, the percentages would add up to the
actual overall CPU time consumed in the period, not to the whole number of allocated
processors. With shared processors, you are given an overall percentage for the
processor.
# ps aux
USER PID %CPU %MEM SZ RSS TTY STAT STIME TIME COMMAND
root 262372 9.0 0.0 92 96 pts/2 A 14:58:25 51:00 ./myprog
root 286896 8.9 0.0 92 96 pts/2 A 14:58:15 50:55 ./myprog2
root 376848 8.9 0.0 92 96 pts/2 A 14:58:28 50:47 ./tstcase
root 335904 8.8 0.0 92 96 pts/2 A 14:58:12 50:18 ./statpgm
root 372976 8.4 0.0 92 96 pts/2 A 15:18:38 40:48 ./minep
root 294918 8.2 0.0 92 96 pts/2 A 15:18:33 40:14 ./extst
root 53274 2.8 0.0 384 384 - A 14:16:53 20:35 wait
root 8196 2.8 0.0 384 384 - A 14:16:53 20:31 wait
root 57372 2.8 0.0 384 384 - A 14:16:53 20:14 wait
root 49176 2.7 0.0 384 384 - A 14:16:53 20:11 wait
pconsole 250038 0.0 6.0 52936 52940 - A 14:17:30 0:14 /usr/java5/bin/j
root 311456 0.0 5.0 40216 40220 - A 14:17:24 0:07 /usr/java5/bin/j
Notes:
To locate the processes dominating CPU usage, the ps command is a useful tool.
The ps command, run periodically, will display the CPU time under the TIME column and
the ratio of CPU time to real time under the %CPU column. Keep in mind that the CPU
usage shown is the average CPU utilization of the process since it was first created.
Therefore, if a process consumes 100% of the CPU for five seconds and then sleeps for
the next five seconds, the ps report at the end of ten seconds would report 50% CPU
time. This can be misleading because right now the process is not actually using any
CPU time.
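The averaging can be illustrated with the figures from the paragraph above: a process
that consumed 5 seconds of CPU over a 10-second lifetime reports 50%, even though it
is using no CPU at the moment:

```shell
#!/bin/sh
# ps %CPU is cumulative: CPU time divided by elapsed time since creation.
cpu_seconds=5        # CPU consumed during the first five seconds
elapsed_seconds=10   # the process has existed for ten seconds
echo "reported %CPU: $((cpu_seconds * 100 / elapsed_seconds))"   # 50
```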
CPU Monitoring Strategy Summary
[Flowchart: START: monitor CPU usage with vmstat, sar, topas, and time, and compare
with the goals. If CPU usage is high, locate the dominant processes (ps, topas); if they
are supposed to be using that much CPU, tune the applications / operating system. If
the CPUs are idle when they should not be, determine the cause of the idle time by
tracing and check the memory and disk subsystems.]
Notes:
Tools such as vmstat, sar, and topas help determine whether a system is CPU bound.
If it is determined that a system is CPU bound, then you need to find out which
processes or applications are dominating the CPU usage. This could be accomplished
by running the ps command periodically.
Once the culprit is pinpointed, then it must be determined if the behavior is abnormal
(unexpected application behavior) or not. If not abnormal, a variety of methods can be
used to improve the performance of the application. They include specific coding
techniques, special libraries, and compiler options. Profilers can help you determine
where in the application to concentrate your efforts. It should be emphasized that tuning
is an iterative process, not only at the overall system level, but also at the application
level. Fixing abnormal behavior may involve changes to the application or to how the
application is invoked. Examples of abnormal behavior would be “runaway processes”
where an application is in a loop executing on a non-existent terminal.
And, sometimes the CPUs are idle and they shouldn’t be. Tracing can reveal the reason
why.
Notes:
Review Questions (1 of 2)
1. What is the difference between a process and a thread?
___________________________________________________
___________________________________________________
2. The default scheduling policy is called: _________________
3. The default scheduling policy applies to fixed or non-fixed priorities?
_________________
4. Priority numbers range from ____ to ____.
5. True/False The higher the priority number the more favored the
thread will be for scheduling.
6. List at least two tools to monitor CPU usage:
–
–
Notes:
Review Questions (2 of 2)
7. True or False: All applications will run faster with simultaneous multi-
threading enabled.
Notes:
Unit Summary
Notes:
References
SC23-5253 AIX Performance Management
SC23-5254 AIX Performance Tools Guide and Reference
AIX Commands Reference, Volumes 1-6
SG24-6478 AIX Practical Performance Tools and Tuning Guide
(Redbook)
SG24-7559 IBM AIX Version 6.1 Differences Guide (Redbook)
© Copyright IBM Corp. 2009 Unit 4. Virtual Memory Performance Monitoring 4-1
Course materials may not be reproduced in whole or in part
without the prior written permission of IBM.
Student Notebook
Unit Objectives
Notes:
VMM Terminology
[Diagram: program text and file data (persistent segments) are backed by the JFS
filesystem; working segments, such as shared library data, are backed by paging
space.]
• Computational memory:
– Working segments
– Program text
• Non-computational (file memory):
– Persistent segments
– Client segments
Notes:
The virtual memory system is composed of the real memory plus physical disk space
where portions of a file that are not currently in use are stored.
The pages of a persistent segment have permanent storage locations on disk. Files
containing data or executable programs are mapped to persistent segments.
The client segments are used for all file system file caching except for JFS and GPFS.
(GPFS uses its own mechanism.)
Working segments are transitory and exist only during their use by a process. They
have no permanent disk storage location and are therefore stored on disk paging space
if their page frames are stolen.
Computational memory, also known as computational pages, consists of the pages
that belong to working storage segments or program text (executable files) segments.
File memory, also known as file pages or non-computational memory, consists of the
remaining pages. These are usually pages from permanent data files in persistent
storage (persistent or client segments).
Notes:
The Virtual Memory Manager (VMM) coordinates and manages all the activities
associated with the virtual memory system. It is responsible for allocating real memory
page frames and resolving references to pages that are not currently in real memory.
The VMM maintains a list of unallocated page frames that it uses to satisfy page faults,
called the free list. In most environments, the VMM must occasionally add to the free list
by stealing some page frames owned by running processes. The virtual memory pages
whose page frames are to be reassigned are selected by the VMM’s page stealer. The
VMM thresholds determine the number of frames reassigned.
When a process exits, its working storage is freed up immediately and its associated
memory frames are put back on the free list. However, any files the process may have
opened can stay in memory. When a file system is unmounted, any cached file pages
are freed.
Page stealing occurs when the lrud kernel process selects a currently allocated real
memory page frame to be placed on the free list.
Page Replacement
Initial PFT (excerpt):

Physical   Segment   Ref.   Modified?
Address    Type      Bit
aaa1       W         On     Yes
aaa2       W         Off    Yes
aaa3       W         On     No
aaa4       W         Off    No
bbb1       P         On     Yes
bbb2       P         Off    Yes
bbb3       P         On     No
bbb4       P         Off    No
ccc1       C         On     Yes
ccc2       C         Off    Yes
ccc3       C         On     No
ccc4       C         Off    No

Pages added to the free list: aaa2 and aaa4 (working pages; the modified
aaa2 is first written to paging space), bbb2 and bbb4 (persistent pages;
the modified bbb2 is first written back to the JFS file system), and ccc2
and ccc4 (client pages; the modified ccc2 is first written back to the
JFS2/NFS file system).

Resulting PFT (excerpt):

Physical   Segment   Ref.   Modified?
Address    Type      Bit
aaa1       W         Off    Yes
aaa3       W         Off    No
bbb1       P         Off    Yes
bbb3       P         Off    No
ccc1       C         Off    Yes
ccc3       C         Off    No
Notes:
A process requires real memory pages to execute. When a process references a virtual
memory page that is on disk (because it either has been paged out or has yet to be read
in), the referenced page must be paged in. If the memory is already nearly full, this may
cause one or more pages to be paged out to make room, creating I/O traffic and
delaying the progress of the process.
The VMM uses the page stealer to steal page frames that have not been recently
referenced, and thus would be unlikely to be referenced in the near future. A successful
page stealer allows the operating system to keep enough processes active in memory
to keep the CPU busy.
Pinned page frames or pinned memory are pages that cannot be stolen.
The VMM uses a Page Frame Table (PFT) to keep track of what page frames are in
use. The PFT includes flags to signal which pages have been referenced and which
have been modified. If the page stealer encounters a page that has been referenced,
then it does not steal that page at that time, but instead resets the reference flag for that
page.
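The steal-or-keep decision can be sketched against the example PFT from the visual (a simplified model of the reference-bit logic, not AIX's actual lrud implementation; the page names and values are taken from the example table):

```shell
# Initial PFT from the visual: address, segment type, reference bit, modified?
pft='aaa1 W On Yes
aaa2 W Off Yes
aaa3 W On No
aaa4 W Off No
bbb1 P On Yes
bbb2 P Off Yes
bbb3 P On No
bbb4 P Off No
ccc1 C On Yes
ccc2 C Off Yes
ccc3 C On No
ccc4 C Off No'

# One pass of the stealer: frames whose reference bit is Off are stolen
# (modified ones are written out first); referenced frames survive the
# pass but have their reference bit reset.
stolen=$(echo "$pft" | awk '$3 == "Off" { print $1 }' | xargs)
kept=$(echo "$pft"   | awk '$3 == "On"  { print $1 }' | xargs)

echo "stolen: $stolen"   # aaa2 aaa4 bbb2 bbb4 ccc2 ccc4
echo "kept:   $kept"     # aaa1 aaa3 bbb1 bbb3 ccc1 ccc3
```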
VMM Thresholds (1 of 2)
• The following vmo parameters ensure there are pages on the free list:
– minfree - default 960 pages
– maxfree - default 1088 pages
• The percentage of real memory that can be used by file pages (non-computational segments) is controlled by the following vmo parameters:
– minperm%
• AIX 5.2/5.3 - default 20%
• AIX 6.1 - default 3%
– maxperm%
• AIX 5.2/5.3 - default 80%
• AIX 6.1 - default 90%
– maxclient%
• AIX 5.2/5.3 - default 80%
• AIX 6.1 - default 90%
Notes:
VMM Thresholds
Several numerical thresholds define the objectives of the VMM. When one of these
thresholds is breached, the VMM takes appropriate action to bring the state of memory
back within bounds. These thresholds are:
- minfree specifies the minimum acceptable number of real memory page frames on
the free list
- maxfree specifies the maximum size to which the free list will grow by VMM page
stealing
- minperm% specifies the point below which the page stealer will steal file or
computational pages regardless of repaging rates
- maxperm% specifies the point above which the page stealer steals only file pages
- maxclient% specifies maximum percentage of RAM that can be used for caching
client pages
VMM Thresholds (2 of 2)
• Other vmo parameters that affect page replacement are:
– strict_maxclient (default 1)
– strict_maxperm (default 0)
– lru_file_repage
• AIX 5.2/5.3 - default 1
• AIX 6.1 - default 0
Notes:
The strict_maxclient and strict_maxperm tunables are restricted tunables and
should NOT be changed unless directed by IBM AIX Development or Support.
When strict_maxclient is set to 1 (the default), then the maxclient% value will be a
hard limit on how much of RAM can be used as a client file cache.
If strict_maxperm is set to 1 (not the default), then the maxperm% value will be a hard
limit on how much of RAM can be used as a persistent file cache. With the default of
strict_maxperm set to 0, maxperm% is a soft limit.
When lru_file_repage is set to 0 (the default in AIX 6.1), the repage rates are
ignored and the page stealer tries to steal only file pages, as long as numperm is
greater than minperm. If lru_file_repage is set to 1 (not the default in AIX 6.1), then
the repage rates may be considered when deciding what type of page to steal.
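The lru_file_repage=0 behavior can be sketched as a small decision function (a simplified model; the numbers passed in are illustrative percentages, not live system values):

```shell
# With lru_file_repage=0, repage rates are ignored: the stealer targets
# only file pages while numperm is above minperm, otherwise any page type.
steal_target() {
    numperm=$1
    minperm=$2
    if [ "$numperm" -gt "$minperm" ]; then
        echo "file pages only"
    else
        echo "any page type"
    fi
}

steal_target 27 3   # numperm 27% > minperm 3%  -> file pages only
steal_target 2 3    # numperm 2% <= minperm 3%  -> any page type
```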
When to Steal Pages Based on Free Pages
•The following vmo parameters ensure there are pages on the free list:
– minfree (default 960 pages)
– maxfree (default 1088 pages)
(Figure: the free list shrinks as pages are allocated; when the number of free pages
falls below minfree, page stealing starts, and it continues until the free list grows
back to maxfree.)
(Figure: AIX 5.2/5.3 with maxclient at its 80% default versus AIX 6.1 with maxclient at
90%. In both cases, page stealing starts when numclient exceeds maxclient minus
minfree and stops when numclient falls below maxclient minus maxfree.)
Notes:
If strict_maxclient=1 (the default), the page stealer may start before the free list
reaches minfree number of pages. When the number of client pages exceeds the
value of maxclient minus minfree, then page stealing starts.
When the number of client pages drops below the value of maxclient minus maxfree,
then page stealing stops.
The visual shows a comparison of what happens in AIX 5.2/5.3 versus AIX 6.1. Note
that in AIX 6.1, the page stealer will start later than it would in AIX 5.2/5.3 and stops
earlier than it would in AIX 5.2/5.3. This allows more client pages to remain in memory
in AIX 6.1.
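As a worked example of these thresholds (using the 262144-page system shown later in this unit and the AIX 6.1 defaults; the integer arithmetic is an approximation of what the kernel computes):

```shell
mem=262144                       # real memory, in 4 KB pages
maxclient=$(( mem * 90 / 100 ))  # maxclient% = 90 (AIX 6.1 default)
minfree=960                      # vmo defaults
maxfree=1088

start=$(( maxclient - minfree ))  # stealing starts above this numclient
stop=$((  maxclient - maxfree ))  # stealing stops below this numclient

echo "maxclient=$maxclient start=$start stop=$stop"
```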
What Types of Pages are Stolen?
lru_file_repage = 1 (default in AIX 5.2/5.3):
– numperm > maxperm (80%): tries to steal only file pages
– minperm (20%) < numperm < maxperm: repage rates decide which type of page is stolen
– numperm < minperm: steals file or computational pages regardless of repage rates

lru_file_repage = 0 (default in AIX 6.1):
– numperm > minperm: tries to steal only file pages (maxperm defaults to 90%)

Note: File pages here mean BOTH client and persistent pages
# svmon -G
              size       inuse        free         pin     virtual
memory      262144      259018        3126      108991      187230
pg space    131072        1876
Notes:
The vmstat -I command reports virtual memory statistics including file page ins (fi)
and file page outs (fo) per second, paging space page ins (pi) and paging space page
outs (po) for working pages, number of pages scanned (sr), number of pages stolen or
freed (fr).
The svmon -G command gives a snapshot of the overall picture of memory use
including the total amount of real memory (size), number of free memory frames
(free), number of memory frames containing working segment pages (work field in the
in use), number of memory frames containing persistent segment pages (pers field in
the in use), and the number of memory frames containing client segment pages
(clnt). These four fields add up to the total real memory.
In the vmstat output, avm stands for Active Virtual Memory. The avm value in the vmstat
output and the virtual value in the svmon -G output show the active number of 4 KB
virtual memory pages in use at that time.
The fre value in the vmstat output and the free field in the svmon -G output indicate
the average number of 4 KB pages that are currently on the free list.
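As a sanity check on the svmon -G sample above, inuse plus free should equal the total memory size (a parsing sketch against the captured line, not live svmon output):

```shell
# Captured "memory" line from svmon -G: size inuse free pin virtual
line='memory 262144 259018 3126 108991 187230'
set -- $line
size=$2; inuse=$3; free=$4

total=$(( inuse + free ))
echo "inuse + free = $total (size = $size)"   # 262144 on both sides
```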
How Many File Pages are in Memory?
# vmstat -v
262144 memory pages
238362 lruable pages
3013 free pages
1 memory pools
109201 pinned pages
80.0 maxpin percentage
3.0 minperm percentage
90.0 maxperm percentage
27.3 numperm percentage
65249 file pages
0.0 compressed percentage
0 compressed pages
27.3 numclient percentage
90.0 maxclient percentage
65249 client pages
0 remote pageouts scheduled
32 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
2484 filesystem I/Os blocked with no fsbuf
26 client filesystem I/Os blocked with no fsbuf
0 external pager filesystem I/Os blocked with no fsbuf
Notes:
In a particular workload, it might be more important to avoid file I/O. In another
workload, keeping computational segment pages in memory might be more important.
To get the file page and other statistics, use the vmstat -v command. If PerfPMR was
run, the output is in vmstat_v.before and vmstat_v.after.
Note: The numperm value can be less than numclient because text pages are classified
as computational.
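The numperm percentage in the sample can be reproduced from the raw counts (an illustrative calculation; it assumes numperm% is computed against lruable pages and truncated to one decimal place):

```shell
filepages=65249   # "file pages" from the vmstat -v sample
lruable=238362    # "lruable pages"

numperm=$(awk -v f="$filepages" -v l="$lruable" \
    'BEGIN { printf "%.1f", int(f / l * 1000) / 10 }')
echo "numperm = ${numperm}%"   # 27.3, matching the report
```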
Notes:
A successful page replacement keeps the memory pages of all currently active
processes in RAM, while the memory pages of inactive processes are paged out.
However, when RAM is over-committed, it becomes difficult to choose good candidates
for page out, because most pages will be referenced in the near future by the currently
running processes. The result is that pages that will soon be needed still get paged out
and then paged in again later. When this happens, continuous paging in and paging out
(thrashing) may occur. The system spends most of its time paging instead of executing
useful instructions, and none of the active processes make any significant progress.
Use the svmon -G command to get the amount of memory being used and compare that
to the amount of real memory. To do this:
- The total amount of real memory is shown in the memory size field
- The amount of memory being used is the total of the virtual pages shown in the
memory virtual field, the persistent pages shown in the in use pers field, and the
client pages shown in the in use clnt field.
Exercise 4:
Virtual Memory Performance Monitoring
Review Questions (1 of 2)
1. The virtual memory system is composed of
________________ and ______________________
2. Virtual memory is divided into the three segment types:
_____________, _____________, and _____________
3. What type of segments are paged out to paging space?
__________________
4. Segments are classified as either __________________
or ___________________
5. The two major functions of the VMM are:
–
–
6. The name of the kernel process that implements the page
replacement algorithm is _______
Review Questions (2 of 2)
7. List the vmo parameter that matches the description:
a. Specifies the minimum number of frames on the free list when the
VMM starts to steal pages to replenish the free list _______
b. Specifies the number of frames on the free list at which page
stealing stops ______________
c. Specifies the point below which the page stealer will steal file or
computational pages regardless of repaging rates ___________
d. Specifies the point above which the page stealing algorithm steals
only file pages ______________
e. Specifies the maximum percentage of RAM that can be used for
caching client pages ______________
f. Specifies whether the maxclient value will be a hard limit on how
much of RAM can be used as a client file cache
_______________
g. Specifies whether the maxperm value will be a hard limit on how
much of RAM can be used as a persistent file cache
_______________
h. Specifies whether or not to consider repage rates when deciding
what type of page to steal ________________
Unit Summary
• The amount of virtual memory that is in use at any given time
can be larger than real memory. The VMM must store the
surplus virtual memory on disk.
• From the performance standpoint, the VMM has two
objectives:
– Minimize the overall CPU time and I/O bandwidth cost for
virtual memory
– Minimize the response time cost of page faults
• To fulfill these objectives, the VMM:
– Maintains a free list of page frames that are available to
satisfy a page fault
– Uses a page replacement algorithm to determine which
virtual memory pages currently in memory will have their
page frames reassigned to the free list
References
SC23-5253 AIX Performance Management
SC23-5254 AIX Performance Tools Guide and Reference
AIX Commands Reference, Volumes 1-6
SG24-6478 AIX Practical Performance Tools and Tuning Guide
(Redbook)
Unit Objectives
After completing this unit, you should be able to:
• Identify factors related to physical and logical volume
performance and file systems
• Use performance tools to identify I/O bottlenecks
• Describe how file fragmentation affects file system I/O
performance
• List guidelines for accurate I/O measurements
• Measure read and write throughput
• Define and create JFS and JFS2 logs
• Reorganize a file system
LVM Terminology
(Figure: the application layer (JFS/JFS2 file systems or raw logical volumes) sits above
the logical layer (volume group and logical volumes), which maps onto the physical
layer (physical disks and disk arrays).)
Notes:
The logical volume layer sits between the application and physical layers. The
application layer consists of the file systems or raw logical volumes. The physical layer
consists of the physical disks, their device drivers, and any disk arrays that may already
be configured. The LVM maps data between the application layer and physical storage.
Even physical volumes are part of the logical layer, since the physical layer contains
only the actual hardware.
Physical disk drives, storage arrays, and virtual disks are known as physical
volumes in LVM. All of the physical volumes in a volume group are divided into physical
partitions. All the physical partitions within a volume group are the same size, although
different volume groups can have different physical partition sizes. A volume group is
made up of one or more physical volumes. Within each volume group, one or more
logical volumes are defined. Logical volumes are groups of information located on
physical volumes. Each logical volume consists of one or more logical partitions.
Logical partitions are the same size as the physical partitions within a volume group.
Each logical partition is mapped to one, two or three physical partitions.
Notes:
When a logical volume is created, you can specify which physical volumes to use.
The intra-disk allocation policy choices are based on the five regions of a disk where
physical partitions can be located.
The inter-disk allocation policy specifies the number of disks on which the physical
partitions of a logical volume are located.
A logical volume can have from 1 to 3 copies. You can also decide how to handle
recovery of a mirrored logical volume with the Mirror Write Consistency setting.
The strictness policy defines the rule of whether each logical partition copy must be
on a separate physical volume.
Relocate LV during reorganization specifies whether to allow the relocation of the
logical volume during reorganization.
Write verify sets an option that causes the disk to verify each write by reading the data back after it is written.
Logical volume serialization serializes overlapping I/Os.
Causes of Poor I/O Performance
• Fragmentation (file, file system, or logical volume)
• MWC writes
• Write Verify enabled
• Excessive disk seeks
• Saturated devices (disks, adapters, buses)
• Locality of data (hot partitions, hot disks)
• Slow disk subsystem
Notes:
Fragmentation can occur at the file, file system, or the logical volume level.
Mirror Write Consistency Check writes can seriously hurt performance. This is more
of an issue with random writes. If it’s a problem, then consider using passive MWC.
Write verify can also be very expensive because after the write, the data has to be
read and verified.
Disk seeks can cause poor performance since this is the slowest part of a physical
disk. Disk seeks can be caused by the application, by fragmentation, or by concurrent
accesses to multiple data sets on the same physical disk drive by different threads.
Saturated devices (disks, adapters, or buses) can cause poor performance due to lack
of throughput.
Locality of data is important because certain areas or partitions of a disk may be more
frequently accessed than others.
Sometimes the disk subsystem may be slow for one reason or another. It could be
software or hardware issues.
Notes:
Rather than migrating entire logical volumes from one disk to another in an attempt to
rebalance the workload, if we can identify the individual hot logical partitions, then we
can focus on migrating just those to another disk. The lvmstat utility can be used to
monitor the utilization of individual logical partitions of a logical volume. By default,
statistics are not kept on a per partition basis. These statistics can be enabled with the
lvmstat -e option. You can enable statistics for:
- All logical volumes in a volume group with lvmstat -e -v vgname
- Per logical volume basis with lvmstat -e -l lvname
The first report generated by lvmstat provides statistics concerning the time since the
system was booted. Each subsequent report covers the time since the previous report.
All statistics are reported each time lvmstat runs. The report consists of a header row
followed by a line of statistics for each logical partition or logical volume depending on
the flags specified.
lvmstat Example
• The following shows how to list the top logical partitions of a
logical volume:
# lvmstat -l lv03 -e
# lvmstat -l lv03 -c 10
Notes:
The visual shows enabling lvmstat to gather statistics for the logical volume, lv03. It
then gathers and reports the statistics for the 10 busiest logical partitions on lv03.
The report has the following fields:
Field Description
Log_part Logical partition number
mirror# Mirror copy number of the logical partition
iocnt Number of read and write requests
Kb_read The total number of kilobytes read
Kb_wrtn The total number of kilobytes written
Kbps The amount of data transferred in kilobytes per second
Migration Example
# lvmstat -v datavg -e
# lvmstat -v datavg
Logical Volume iocnt Kb_read Kb_wrtn Kbps
lv00 2099 26564 25364 0.12
lv01 1682 0 253 0.11
lv02 39 0 156 0.00
# lvmstat -l lv00
Log_part mirror# iocnt Kb_read Kb_wrtn Kbps
2 1 1848 12760 12416 0.06
8 1 684 10624 9480 0.03
7 1 556 1196 733 0.01
3 1 507 2836 210 0.03
Notes:
The first command, lvmstat -v datavg -e, enables LVM statistics gathering on
datavg. It also enables statistics gathering on all logical volumes in datavg.
To get the LVM statistics on datavg, use the command: lvmstat -v datavg
The activity is highest on lv00. To take a closer look at lv00, use the command:
lvmstat -l lv00.
At this point, we may want to consider migrating lv00’s logical partition 2 to another
disk. This can be done with the command: migratelp lv00/2 hdisk2.
Note: We are assuming that LP2 and LP8 of lv00 are on the same disk. They may not
be. In reality, we would need to confirm that using lslv -m lv00 or look at the mapping
of the storage array.
sar -d
# sar -d 1 2
AIX leguin221 1 6 00066BA2D900 02/09/09
System configuration: lcpu=2 drives=8 mode=Capped
15:31:47 device %busy avque r+w/s Kbs/s avwait avserv
15:31:48 hdisk0 0 0.0 0 0 0.0 0.0
hdisk1 100 0.0 282 1128 0.0 4.0
hdisk2 0 0.0 0 0 0.0 0.0
hdisk3 0 0.0 0 0 0.0 0.0
hdisk4 0 0.0 0 0 0.0 0.0
hdisk5 0 0.0 0 0 0.0 0.0
cd0 0 0.0 0 0 0.0 0.0
hdisk6 91 0.0 6045 24180 0.0 0.2
Notes:
The -d flag of sar provides real time disk I/O statistics. The fields listed by sar -d are:
- %busy - Reports the portion of time device was busy servicing a transfer request.
- avque - Reports the average number of requests outstanding from the adapter to
the device during the time interval.
- r+w/s - The number of read/write transfers from or to device.
- Kbs/s -The amount of data transferred to the drive in KB per second.
- avwait - The average time (in milliseconds) that transfer requests waited idly on
the queue for the device. Prior to AIX 5.3, this was not supported. If you see large
numbers in the avwait column, try to distribute the workload on other disks.
- avserv - The average time (in milliseconds) to service each transfer request
(includes seek, rotational latency, and data transfer times) for the device. Prior to
AIX 5.3, this was not supported.
Note: %busy is the same as %tm_act in iostat, and r+w/s is equal to tps in iostat.
Using iostat
# iostat 5 2
Notes:
The iostat command is used for monitoring system input/output device loading by
observing the time the physical disks are active in relation to their average transfer
rates. It does not provide data for file systems or logical volumes. The iostat
command generates reports that can be used to change the system configuration to
better balance the input/output load between physical disks and adapters.
Beginning with AIX 5.3, the collection of disk input/output statistics is disabled by default
to improve performance. To enable the collection of this data, type:
chdev -l sys0 -a iostat=true
To display the current settings, type:
lsattr -E -l sys0 -a iostat
If the collection of disk input/output history is disabled and iostat is called without an
interval, the iostat output displays the message Disk History Since Boot Not
Available instead of disk statistics.
What is iowait?
• iowait is a form of idle time
• The iowait statistic is simply the percentage of time the
CPU is idle AND there is at least one I/O still in progress
(started from that CPU)
• The iowait value seen in the output of commands like
vmstat, iostat, and topas is the iowait percentages
across all CPUs averaged together
• High I/O wait does not mean that there is definitely an I/O
bottleneck
• Zero I/O wait does not mean that there is not an I/O
bottleneck
• A CPU in I/O wait state can still execute threads if there are
any runnable threads
Notes:
To summarize it in one sentence, iowait is the percentage of time the CPU is idle AND
there is at least one I/O in progress. Each CPU can be in one of four states:
- user
- sys
- idle
- iowait
Performance tools such as vmstat, iostat, sar, etc. print out these four states as a
percentage. The sar tool can print out the states on a per CPU basis (-P flag) but most
other tools print out the average values across all the CPUs. Since these are
percentage values, the four state values should add up to 100%.
Monitoring Adapter I/O Throughput
• iostat -a shows adapter throughput
• Disks are listed following the adapter to which they are
attached
# iostat -a
System configuration: lcpu=2 drives=8 paths=8 vdisks=0 tapes=0
tty: tin tout avg-cpu: % user % sys % idle % iowait
0.0 3395.0 3.6 24.5 34.8 37.2
Adapter: Kbps tps Kb_read Kb_wrtn
sissas0 1128.0 282.0 128 1000
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk0 0.0 0.0 0.0 0 0
hdisk1 99.0 1128.0 282.0 128 1000
hdisk2 0.0 0.0 0.0 0 0
hdisk3 0.0 0.0 0.0 0 0
hdisk4 0.0 0.0 0.0 0 0
hdisk5 0.0 0.0 0.0 0 0
cd0 0.0 0.0 0.0 0 0
Adapter: Kbps tps Kb_read Kb_wrtn
fcs0 24300.0 6075.0 2720 21580
Disks: % tm_act Kbps tps Kb_read Kb_wrtn
hdisk6 91.0 24300.0 6075.0 2720 21580
Notes:
Adapter throughput
The -a option to iostat will combine the disks statistics to the adapter to which they
are connected. The adapter throughput will simply be the sum of the throughput of each
of its connected devices. With the -a option, the adapter will be listed first, followed by
its devices and then followed by the next adapter, followed by its devices, and so on.
The adapter throughput values can be used to determine if any particular adapter is
approaching its maximum bandwidth or to see if the I/O is balanced across adapters.
Notes:
System throughput
The -s option to iostat shows the system throughput. This is the sum of all the
adapter’s throughputs.
File System I/O Layers
(Figure: file system I/O passes from the logical file system (local or NFS) through the
Virtual Memory Manager (paging) and the Logical Volume Manager (disk space
management) to the physical disks.)
Notes:
There are a number of layers involved in file system storage and retrieval. It’s important
to understand what performance issues are associated with each layer. The
management tools used to monitor file system activity can provide data on each of
these layers.
The effect of a file’s placement on I/O performance diminishes when the file is buffered
in memory. When a file is opened in AIX, it is mapped to a persistent (JFS) or client
(JFS2) data segment in virtual memory. The segment represents a virtual buffer for the
file. The file’s blocks map directly to segment pages. The VMM manages the segment
pages, reading file blocks into segment pages upon demand (as they are accessed).
There are several circumstances that cause the VMM to write a page back to its
corresponding block in the file on disk.
Notes:
There’s a theory that anything that starts out with perfect order will, over time, become
disordered due to outside forces. This concept certainly applies to file systems. The
longer a file system is used, the more likely it will become fragmented. Also, the
dynamic allocation of resources (e.g., extending a logical volume) contributes to the
disorder. File system performance is also affected by physical considerations.
With fragmentation, sequential file access will no longer find contiguous physical disk
blocks. Random access may not find physically contiguous logical records and will have
to access more widely dispersed data. In both cases, seek time for file access grows.
Both JFS and JFS2 attach a VM segment to do I/O, so file data becomes cached in
memory and disk fragmentation does not affect access to the cached data.
Each read or write operation on a file system is done through system calls. System calls
for reads and writes define the size of the operation. The smaller the operation the more
system calls are needed to read or write the entire file. Therefore, more CPU time is
spent making the system calls. The read or write size should be a multiple of the file
system block size to reduce the amount of CPU time spent per system call.
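To see why the operation size matters, count the system calls needed to read a 100 MB file at two different read sizes (simple arithmetic, not a measurement):

```shell
filesize=$(( 100 * 1024 * 1024 ))    # 100 MB file

calls_4k=$(( filesize / 4096 ))      # one read per 4 KB
calls_1m=$(( filesize / 1048576 ))   # one read per 1 MB

echo "4 KB reads: $calls_4k system calls"   # 25600
echo "1 MB reads: $calls_1m system calls"   # 100
```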
How to Measure File System Performance
• General guidelines for accurate measurements
– System has to be idle
– System management tools like WLM should be turned off
– I/O subsystems should not be shared with other systems
– Files must not be cached in memory for read throughput measurement
– Writes must go to the file system disk
Notes:
File system operations require system resources such as CPU, memory, and I/O. The
result of a file system performance measurement will NOT be accurate if one or more of
these resources are in use by other applications. The same applies if one or more of
these resources is managed and/or the statistics are gathered with system
management tools like Workload Manager (WLM). Those tools should be turned off.
I/O subsystems can share disk space among several systems. The available bandwidth
might not be enough to achieve maximum file system performance if the I/O subsystem
is used by other systems during the performance measurement, thus it should not be
shared. When a file is cached in memory, a read throughput measurement does not
give any information about the file system throughput since no physical operation on the
file system takes place. The best way to assure that a file is not cached in memory is to
unmount then mount the file system on which the file is located. A write throughput
measurement does not give any information about file system performance if nothing is
written out to disk. Unless the application opens files in such a way that it doesn’t use
file system buffers (such as direct I/O), then each write to a file is done in memory and is
written out to disk by either a syncd or a write-behind algorithm.
Notes:
The dd command is a good utility to measure the throughput of a file system since it
allows you to specify the exact size for reads or writes as well as the number of
operations.
Example
The first set of sync commands flush all modified file pages in memory to disk. The time
between the first and the second date command is the amount of time the dd
command took to write the file into memory. The time between the first and third date
command is the total amount of time it took to write the file to disk.
In this example, dd completed after 3 seconds (23:03:20 - 23:03:17) and wrote about
33.3 MB per second and the total amount of time it took to write the data to the file
system is 4 seconds (23:03:21 - 23:03:17), about 25 MB per second.
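The measurement described in the example can be sketched as follows (a portable approximation that writes a small 10 MB scratch file from /dev/zero; the file name is hypothetical, and on AIX you would write a larger file to the file system under test, as in the notes):

```shell
testfile=/tmp/ddwrite.$$   # hypothetical scratch file
sync                       # flush anything already dirty

t0=$(date +%s)
dd if=/dev/zero of="$testfile" bs=1048576 count=10 2>/dev/null
t1=$(date +%s)             # dd done: data is (mostly) still in memory
sync                       # force the dirty file pages out to disk
t2=$(date +%s)             # data is now on disk

written=$(( $(wc -c < "$testfile") ))
echo "wrote $written bytes; dd took $(( t1 - t0 ))s, disk total $(( t2 - t0 ))s"
rm -f "$testfile"
```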
How to Measure Read Throughput
• Useful tools for file system performance measurements
are dd and time
real 0m1.16s
user 0m0.00s
sys 0m0.19s
Notes:
Example
The time command shows the amount of time it took to complete the read.
The read throughput in this example is about 86.2 MB per second (100 MB / 1.16
seconds real time).
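The throughput figure quoted above is just the file size divided by the real time reported by time (awk handles the floating-point division):

```shell
# 100 MB read in 1.16 s of real time
thr=$(awk -v mb=100 -v real=1.16 'BEGIN { printf "%.1f", mb / real }')
echo "read throughput: $thr MB/s"   # 86.2
```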
• Basic syntax:
filemon -O report-types -o output-file
• Runs in the background; stops with the trcstop command
• Uses the trace facility
Notes:
If an application is believed to be disk-bound, the filemon utility is useful to find out
where and why.
The filemon command uses the trace facility to obtain a detailed picture of I/O activity
during a time interval on the various layers of file system utilization, including the logical
file system, virtual memory segments, LVM, and physical disk layers. Data can be
collected on all the layers, or some of the layers. The default is to collect data on the
virtual memory segments, LVM, and physical disk layers.
By default, filemon runs in the background while other applications are running and
being monitored. When the trcstop command is issued, filemon stops and generates
its report.
The report begins with a summary of the I/O activity for each of the levels (the Most
Active sections) and ends with detailed I/O activity for each level (Detailed sections).
Each section is ordered from most active to least active.
When running PerfPMR, the filemon data is in the filemon.sum file.
filemon - Most Active Files Report
# filemon -O lv,lf,pv -o fmon.out
# trcstop
# cat fmon.out
Wed Feb 11 23:08:09 2009
System: AIX 6.1 Node: leguin221 Machine: 00066BA2D900
Notes:
The visual on this page shows the logical file output (lf) from the filemon report. The
logical file I/O includes reads, writes, opens, and seeks, which may or may not result in
actual physical I/O depending on whether the files are already buffered in memory.
Statistics are kept by file.
Output is ordered by #MBs read and/or written to a file.
By default, the logical file reports are limited to the 20 most active files. If the verbose
flag (-v) is added, activity for all files is reported. The -u flag can be used to generate
reports on files opened prior to the start of the trace daemon.
Look for the most active files to see usage patterns. If they are dynamic files, they may
need to be backed up and restored. The Most Active Files section shows the
bigfile1 file (read by the dd command) as the most active file, with one open and 101
reads. The number of writes (#wrs) is 1 less than the number of reads (#rds) because
the final read returns end-of-file.
filemon - Detailed File Stats Report
------------------------------------------------------------------------
Detailed File Stats
------------------------------------------------------------------------
FILE: /dev/null
opens: 1
total bytes xfrd: 104857600
writes: 100 (0 errs)
write sizes (bytes): avg 1048576.0 min 1048576 max 1048576 sdev 0.0
write times (msec): avg 0.003 min 0.003 max 0.005 sdev 0.000
Notes:
------------------------------------------------------------------------
Detailed Physical Volume Stats (512 byte blocks)
------------------------------------------------------------------------
Notes:
Fragmentation and Performance
(Diagram: a logical file's sequentially numbered blocks are mapped through i-nodes to
scattered, non-contiguous block locations in the physical file system.)
Notes:
While an operating system’s file is conceptually a sequential and contiguous string of
bytes, the physical reality might be very different. Fragmentation may arise from
multiple extensions to logical volumes, from cycles of allocation, release, and
reallocation within a file system, or simply from appending to a file while other
applications are also writing files in the same area. A file system is fragmented when its available
space consists of large numbers of small chunks of space, making it impossible to write
out a new file in contiguous blocks.
Access to files in a highly fragmented file system may result in a large number of seeks
and longer I/O response times (seek latency dominates I/O response time). For
example, if the file is accessed sequentially, a file placement that consists of many,
widely separated chunks requires more seeks than a placement that consists of one or
a few large contiguous chunks. If the file is accessed randomly, a placement that is
widely dispersed requires longer seeks than a placement in which the file’s blocks are
close together.
Notes:
To see the characteristics of a hot logical volume, the lslv command may be used with
the specified logical volume name. This will tell you what policies are in effect. Then, to
see if the intra-policy is being followed, the lslv -l command may be used. The
IN BAND column will indicate the percentage of physical partitions that are allocated in
the region specified by the intra-policy. The total region distribution is also displayed.
To see the allocation map for logical partitions on a specific disk, use the
lslv -p hdisk# lvname command. Replace the # with the number of the desired
physical disk.
To see the allocation map for a logical volume across all disks it occupies, use the
lslv -m lvname command.
Logical Volume Settings
# lslv lv00
LOGICAL VOLUME: lv00 VOLUME GROUP: testvg
LV IDENTIFIER: 00066ba20000d9000000011f5c6d5c5a.1 PERMISSION: read/write
VG STATE: active/complete LV STATE: closed/syncd
TYPE: jfs WRITE VERIFY: off
MAX LPs: 512 PP SIZE: 128 megabyte(s)
COPIES: 1 SCHED POLICY: parallel
LPs: 20 PPs: 20
STALE PPs: 0 BB POLICY: relocatable
INTER-POLICY: minimum RELOCATABLE: yes
INTRA-POLICY: middle UPPER BOUND: 32
MOUNT POINT: N/A LABEL: None
MIRROR WRITE CONSISTENCY: on/ACTIVE
EACH LP COPY ON A SEPARATE PV ?: yes
Serialize IO ?: NO
# lslv -l lv00
lv00:N/A
PV COPIES IN BAND DISTRIBUTION
hdisk1 013:000:000 15% 009:002:002:000:000
hdisk2 007:000:000 42% 003:003:001:000:000
Notes:
Using the output from lslv, you can compare the requested policies against the actual
implementation. The lslv -l output shows several characteristics of the logical
volume. The PerfPMR config.sum file lists the output of lslv for each logical volume.
The COPIES column shows the disks where the physical partitions reside. There are
three columns, one for each of the possible logical volume copies.
The IN BAND column shows the percentage of the partitions that met the intra-policy
criteria.
The DISTRIBUTION column shows the locations of the physical partitions of this logical
volume as numbers separated by a colon (:). Each of these numbers represents an
intra-policy location. The partitions outside the IN BAND percentage reside in other
regions of the disk and may be fragmented.
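As a check on the arithmetic, the IN BAND value is the middle-region count from the DISTRIBUTION column divided by the total in the COPIES column. For the hdisk1 row in the example output (2 of 13 partitions in the middle region):

```shell
# hdisk1: COPIES 013, DISTRIBUTION 009:002:002:000:000 (middle region = 2)
awk 'BEGIN { printf "%d%%\n", 100 * 2 / 13 }'   # prints 15%
```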
lslv -p
# lslv -p hdisk1 lv00
hdisk1:lv00:N/A
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 1-10
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 11-20
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 21-30
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 31-40
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 41-50
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 51-60
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 61-70
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 71-80
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 81-90
FREE FREE FREE FREE FREE FREE FREE FREE FREE 0001 91-100
0002 0003 0004 0005 0006 0015 0016 0017 FREE FREE 101-110
USED USED USED USED USED USED USED USED USED USED 111-120
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 121-130
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 131-140
FREE FREE FREE FREE FREE FREE FREE FREE FREE 0010 141-150
FREE 0011 FREE FREE FREE FREE FREE FREE FREE FREE 151-160
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 161-170
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 171-180
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 181-190
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 191-200
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 201-210
FREE FREE FREE FREE FREE FREE FREE FREE FREE 211-219
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 220-229
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 230-239
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 240-249
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 250-259
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 260-269
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 270-279
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 280-289
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 290-299
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 300-309
FREE FREE FREE FREE FREE FREE FREE FREE FREE FREE 310-319
0013 0014 FREE FREE FREE FREE FREE FREE FREE 320-328
Notes:
In the previous visual, even if all the partitions were in band, that would not guarantee
that they are contiguous. Therefore, look at the lslv -p data next.
Logical volume fragmentation occurs if logical partitions are not contiguous across the
disk. The lslv -p command shows the logical volume allocation map for the physical
volume given.
The state of the partition is listed as one of the following:
- USED indicates that the physical partition at this location is used by a logical volume
other than the one specified with lslv -p.
- FREE indicates that this physical partition is not used by any logical volume.
- STALE indicates that the specified partition is no longer consistent with other
partitions. The system lists the logical partition number with a question mark if the
partition is stale.
- Where it shows a number, this indicates the logical partition number of the logical
volume specified with the lslv -p command.
lslv -m
# lslv -m lv00
lv00:
LP PP1 PV1 PP2 PV2 PP3 PV3
0001 0100 hdisk1
0002 0101 hdisk1
0003 0102 hdisk1
0004 0103 hdisk1
0005 0104 hdisk1
0006 0105 hdisk1
0007 0051 hdisk2
0008 0052 hdisk2
0009 0055 hdisk2
0010 0150 hdisk1
0011 0152 hdisk1
0012 0300 hdisk2
0013 0320 hdisk1
0014 0321 hdisk1
0015 0106 hdisk1
0016 0107 hdisk1
0017 0108 hdisk1
0018 0145 hdisk2
0019 0146 hdisk2
0020 0149 hdisk2
Notes:
The lslv -m option shows the mapping of a logical volume. For each logical partition,
it gives the physical partition and physical volume where the logical partition resides.
Notes:
The fileplace tool displays the placement of a file’s blocks within a logical volume or
physical volumes. fileplace expects the name of the file to examine as an argument.
This tool can be used to detect file fragmentation.
By default, fileplace sends its output to the display, but the output can be redirected
to a file via normal shell redirection.
The example in the visual demonstrates how to use fileplace to determine whether a
file is fragmented.
The report generated by the -pv options displays the file’s placement in terms of
physical volume blocks for the physical volumes. The verbose part of the report is one
of the most important sections since it displays the efficiency and sequentiality of the
file.
Higher space efficiency and sequentiality provide better sequential file access.
Reorganizing the File System
• After identifying a fragmented file system, reduce the
fragmentation by:
1. Backing up the files (by name) in that file system
2. Deleting the contents of the file system (or
recreating it with mkfs)
3. Restoring the contents of file system
Notes:
File system fragmentation can be alleviated by copying the files to backup media,
recreating the file system with mkfs fsname (or deleting the contents of the file
system), and reloading the files into the new file system. This loads the files
sequentially and reduces fragmentation.
Some file systems or logical volumes should not be reorganized because the data is
either transitory (for example, /tmp), does not change much (for example, /usr and /),
or is not in a file system format (the log).
A full backup and restore is only needed when there is a lot of fragmented free space.
If only a few large, high-usage sequential files are fragmented and there is enough
contiguous free space, you can instead copy each file to a different file name, delete
the original file, and then rename the copy back to the original name.
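The single-file technique above can be sketched as follows (the file name is illustrative); the copy is written into contiguous free space, so the renamed result replaces the fragmented original:

```shell
cp /bigfs/bigfile1 /bigfs/bigfile1.new   # copy allocates fresh, contiguous blocks
rm /bigfs/bigfile1                       # remove the fragmented original
mv /bigfs/bigfile1.new /bigfs/bigfile1   # rename the copy to the original name
```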
Notes:
Using small fragment sizes is not recommended, but if a journaled file system has been
created with a fragment size smaller than 4 KB, after a period of time it becomes
necessary to query the amount of scattered, unusable free fragments. When many
small fragments are scattered, it is difficult to find enough contiguous free space.
To recover these small, scattered spaces, use smit or the defragfs command. Some
free space must be available for the defragmentation procedure to be used. The file
system must be mounted for read-write.
JFS and JFS2 Logs
• AIX uses a special logical volume called the log device as a
circular journal for recording modifications to file system
metadata
Notes:
JFS and JFS2 use a technique that duplicates transactions that are made to file system
metadata to the circular file system log. File system metadata includes the superblock,
i-nodes, indirect data pointers, and directories. All I/Os to the log are synchronous.
File system logs enable rapid and clean recovery of file systems if a system goes down.
However, there may be a performance trade-off. If an application is doing synchronous
I/O or is creating and/or removing many files in a short amount of time, then there may
be a lot of I/O going to the log logical volume. Information about I/Os to the log can be
recorded using the filemon command.
If you notice that a file system and its log device are both heavily utilized, it may be
better to put each one on a separate physical disk (assuming that there is more than
one disk in that volume group). This can be done using the migratepv command or via
SMIT.
JFS2 file systems have an option to have an inline log. An inline log allows you to create
the log within the same data logical volume. With an inline log, each JFS2 file system
can have its own log device without having to share this device.
• What to do:
- Create a new JFS or JFS2 log logical volume
(JFS) # mklv -t jfslog -y LVname VGname 1 PVname
(JFS2) # mklv -t jfs2log -y LVname VGname 1 PVname
Notes:
Overview
Placing the log logical volume on a physical volume different from your most active file
system’s logical volume will increase parallel resource usage assuming that the I/O
pattern on that file system causes JFS/JFS2 log transactions. If there is more than one
file system in the same volume group which is causing JFS/JFS2 log transactions, you
may get better performance by creating a separate JFS/JFS2 log for each of these file
systems. The downside of this is that if you have one JFS/JFS2 log for each file system
then you are potentially faced with storage waste, since the smallest each JFS/JFS2 log
can be is one physical partition.
The performance of disk drives differs. So, try to create the logical volume for a hot file
system on a fast drive (possibly one with fast write cache).
The Commands to Use for Monitoring I/O
• Look for the most active files, file systems, and logical volumes:
- Can “hot” file systems be better located on a physical drive,
or be spread across multiple physical drives? (filemon)
- Are “hot” files local or remote? (filemon)
- Is there enough memory to cache the file pages being
used by running processes? (svmon)
Notes:
Overview
When monitoring disk I/O, there are several areas to look at. The visual gives a list of
questions to ask to help determine your course of action.
Review Questions (1 of 2)
List the command/utility to do the following:
1. Monitor a trace of file system and I/O, and report on the file
and I/O access performance during that period: ___________
2. Show the logical volume allocation map for the physical
volume given: _______________
3. Report statistics for logical partitions and volumes:
____________
4. Show how the actual layout of a logical volume meets the
intra-allocation policy: _____________
5. Monitor system input/output device loading by observing the
time the physical disks are active in relation to their average
transfer rates: _____________
Notes:
Review Questions (2 of 2)
6. An I/O bottleneck may be solved by moving logical
partitions:
a) The ____________ command moves individual logical
partitions of a logical volume.
b) The ___________ command moves an entire logical
volume to another physical disk.
Notes:
Unit Summary
Notes: