HP-UX Performance and Tuning (H4262S)
HP Training
Student guide
Copyright 2004 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. The only warranties for HP products and services are set forth in the express warranty statements accompanying such products and services. Nothing herein should be construed as constituting an additional warranty. HP shall not be liable for technical or editorial errors or omissions contained herein. This is an HP copyrighted work that may not be reproduced without the written permission of HP. You may not use these materials to deliver training to any person outside of your organization without the written permission of HP. UNIX is a registered trademark of the Open Group. Printed in the USA HP-UX Performance and Tuning Student guide May 2004
Contents
Overview ........ 1

Module 1 Introduction
1-1. SLIDE: Welcome to HP-UX Performance and Tuning ........ 1-2
1-2. SLIDE: Course Outline ........ 1-3
1-3. SLIDE: System Performance ........ 1-4
1-4. SLIDE: Areas of Performance Problems ........ 1-6
1-5. SLIDE: Performance Bottlenecks ........ 1-8
1-6. SLIDE: Baseline ........ 1-10
1-7. SLIDE: Queuing Theory of Performance ........ 1-12
1-8. SLIDE: How Long Is the Line? ........ 1-13
1-9. SLIDE: Example of Queuing Theory ........ 1-14
1-10. SLIDE: Summary ........ 1-16
1-11. LAB: Establishing a Baseline ........ 1-17
1-12. LAB: Verifying the Performance Queuing Theory ........ 1-19

Module 2 Performance Tools
2-1. SLIDE: HP-UX Performance Tools ........ 2-2
2-2. SLIDE: HP-UX Performance Tools (Continued) ........ 2-3
2-3. SLIDE: Sources of Tools ........ 2-4
2-4. SLIDE: Types of Tools ........ 2-6
2-5. SLIDE: Criteria for Comparing the Tools ........ 2-8
2-6. SLIDE: Data Sources ........ 2-10
2-7. SLIDE: Performance Monitoring Tools (Standard UNIX) ........ 2-11
2-8. TEXT PAGE: iostat ........ 2-12
2-9. TEXT PAGE: ps ........ 2-14
2-10. TEXT PAGE: sar ........ 2-16
2-11. TEXT PAGE: time, timex ........ 2-18
2-12. TEXT PAGE: top ........ 2-19
2-13. TEXT PAGE: uptime, w ........ 2-21
2-14. TEXT PAGE: vmstat ........ 2-22
2-15. SLIDE: Performance Monitoring Tools (HP-Specific) ........ 2-25
2-16. TEXT PAGE: glance ........ 2-26
2-17. TEXT PAGE: gpm ........ 2-28
2-18. TEXT PAGE: xload ........ 2-30
2-19. SLIDE: Data Collection Performance Tools (Standard UNIX) ........ 2-31
2-20. TEXT PAGE: acct Programs ........ 2-32
2-21. TEXT PAGE: sar ........ 2-34
2-22. SLIDE: Data Collection Performance Tools (HP-Specific) ........ 2-36
2-23. TEXT PAGE: MeasureWare/OVPA and DSI Software ........ 2-37
2-24. TEXT PAGE: PerfView/OVPM ........ 2-39
2-25. SLIDE: Network Performance Tools (Standard UNIX) ........ 2-41
2-26. TEXT PAGE: netstat ........ 2-42
2-27. TEXT PAGE: nfsstat ........ 2-44
2-28. TEXT PAGE: ping ........ 2-46
2-29. SLIDE: Network Performance Tools (HP-Specific) ........ 2-48
2-30. TEXT PAGE: lanadmin ........ 2-49
http://education.hp.com
2-31. TEXT PAGE: lanscan ........ 2-51
2-32. TEXT PAGE: nettune (HP-UX 10.x Only) ........ 2-53
2-33. TEXT PAGE: ndd (HP-UX 11.x Only) ........ 2-55
2-34. TEXT PAGE: NetMetrix (HP-UX 10.20 and 11.0 Only) ........ 2-57
2-35. SLIDE: Performance Administrative Tools (Standard UNIX) ........ 2-58
2-36. TEXT PAGE: ipcs, ipcrm ........ 2-59
2-37. TEXT PAGE: nice, renice ........ 2-61
2-38. SLIDE: Performance Administrative Tools (HP-Specific) ........ 2-63
2-39. TEXT PAGE: getprivgrp, setprivgrp ........ 2-64
2-40. TEXT PAGE: rtprio ........ 2-66
2-41. TEXT PAGE: rtsched ........ 2-67
2-42. TEXT PAGE: scsictl ........ 2-69
2-43. TEXT PAGE: serialize ........ 2-71
2-44. TEXT PAGE: fsadm ........ 2-72
2-45. TEXT PAGE: getext, setext ........ 2-74
2-46. TEXT PAGE: newfs, tunefs, vxtunefs ........ 2-75
2-47. TEXT PAGE: Process Resource Manager (PRM) ........ 2-77
2-48. TEXT PAGE: Work Load Manager (WLM) ........ 2-78
2-49. TEXT PAGE: Web Quality of Service (WebQoS) ........ 2-79
2-50. SLIDE: System Configuration and Utilization Information (Standard UNIX) ........ 2-80
2-51. TEXT PAGE: bdf, df ........ 2-81
2-52. TEXT PAGE: mount ........ 2-83
2-53. SLIDE: System Configuration and Utilization Information (HP-Specific) ........ 2-84
2-54. TEXT PAGE: diskinfo ........ 2-85
2-55. TEXT PAGE: dmesg ........ 2-86
2-56. TEXT PAGE: ioscan ........ 2-88
2-57. TEXT PAGE: vgdisplay, pvdisplay, lvdisplay ........ 2-90
2-58. TEXT PAGE: swapinfo ........ 2-92
2-59. TEXT PAGE: sysdef ........ 2-93
2-60. TEXT PAGE: kmtune, kcweb ........ 2-95
2-61. SLIDE: Application Profiling and Monitoring Tools (Standard UNIX) ........ 2-96
2-62. TEXT PAGE: prof, gprof ........ 2-97
2-63. TEXT PAGE: Application Response Measurement (ARM) Library Routines ........ 2-98
2-64. SLIDE: Application Profiling and Monitoring Tools (HP-Specific) ........ 2-99
2-65. TEXT PAGE: Transaction Tracker ........ 2-100
2-66. TEXT PAGE: caliper - HP Performance Analyzer ........ 2-101
2-67. SLIDE: Summary ........ 2-103
2-68. LAB: Performance Tools Lab ........ 2-104
Module 3 GlancePlus
3-1. SLIDE: This Is GlancePlus ........ 3-2
3-2. SLIDE: GlancePlus Pak Overview ........ 3-4
3-3. SLIDE: gpm and glance ........ 3-6
3-4. SLIDE: glance - The Character Mode Interface ........ 3-8
3-5. SLIDE: Looking at a glance Screen ........ 3-11
3-6. SLIDE: gpm - The Graphical User Interface ........ 3-13
3-7. SLIDE: Process Information ........ 3-15
3-8. SLIDE: Adviser Components ........ 3-17
3-9. SLIDE: adviser Bottleneck Syntax Example ........ 3-18
3-10. SLIDE: The parm File ........ 3-19
3-11. SLIDE: GlancePlus Data Flow ........ 3-21
3-12. SLIDE: Key GlancePlus Usage Tips ........ 3-23
3-13. SLIDE: Global, Application, and Process Data ........ 3-24
3-14. SLIDE: Can't Solve What's Not a Problem ........ 3-25
3-15. SLIDE: Metrics: "No Answers without Data" ........ 3-26
3-16. SLIDE: Summary ........ 3-27
3-17. SLIDE: HP GlancePlus Guided Tour ........ 3-28
3-18. LAB: gpm and glance Walk-Through ........ 3-29
Module 4 Process Management
4-1. SLIDE: The HP-UX Operating System ........ 4-2
4-2. SLIDE: Virtual Address Process Space (PA-RISC) ........ 4-4
4-3. SLIDE: Virtual Address Process Space (IA-64) ........ 4-6
4-4. SLIDE: Physical Process Components ........ 4-7
4-5. SLIDE: The Life Cycle of a Process ........ 4-9
4-6. SLIDE: Process States ........ 4-11
4-7. SLIDE: CPU Scheduler ........ 4-14
4-8. SLIDE: Context Switching ........ 4-16
4-9. SLIDE: Priority Queues ........ 4-17
4-10. SLIDE: Nice Values ........ 4-19
4-11. SLIDE: Parent-Child Process Relationship ........ 4-20
4-12. SLIDE: glance - Process List ........ 4-21
4-13. SLIDE: glance - Individual Process ........ 4-23
4-14. SLIDE: glance - Process Memory Regions ........ 4-24
4-15. SLIDE: glance - Process Wait States ........ 4-25
4-16. LAB: Process Management ........ 4-26

Module 5 CPU Management
5-1. SLIDE: Processor Module ........ 5-2
5-2. SLIDE: Symmetric Multiprocessing ........ 5-4
5-3. SLIDE: Cell Module ........ 5-5
5-4. SLIDE: Multi-Cell Processing ........ 5-6
5-5. SLIDE: CPU Processor ........ 5-8
5-6. SLIDE: CPU Cache ........ 5-11
5-7. SLIDE: TLB Cache ........ 5-12
5-8. SLIDE: TLB, Cache, and Memory ........ 5-14
5-9. SLIDE: HP-UX Performance-Optimized Page Sizes ........ 5-16
5-10. SLIDE: CPU Metrics to Monitor - Systemwide ........ 5-19
5-11. SLIDE: CPU Metrics to Monitor - per Process ........ 5-21
5-12. SLIDE: Activities that Utilize the CPU ........ 5-23
5-13. SLIDE: glance - CPU Report ........ 5-25
5-14. SLIDE: glance - CPU by Processor ........ 5-26
5-15. SLIDE: glance - Individual Process ........ 5-27
5-16. SLIDE: glance - Global System Calls ........ 5-28
5-17. SLIDE: glance - System Calls by Process ........ 5-29
5-18. SLIDE: sar Command ........ 5-30
5-19. SLIDE: timex Command ........ 5-32
5-20. SLIDE: Tuning a CPU-Bound System - Hardware Solutions ........ 5-33
5-21. SLIDE: Tuning a CPU-Bound System - Software Solutions ........ 5-35
5-22. SLIDE: CPU Utilization and MP Systems ........ 5-36
5-23. SLIDE: Processor Affinity ........ 5-37
5-24. LAB: CPU Utilization, System Calls, and Context Switches ........ 5-38
5-25. LAB: Identifying CPU Bottlenecks ........ 5-40
Module 6 Memory Management
6-1. SLIDE: Memory Management ........ 6-2
6-2. SLIDE: Memory Management - Paging ........ 6-4
6-3. SLIDE: Paging and Process Deactivation ........ 6-5
6-4. SLIDE: The Buffer Cache ........ 6-7
6-5. SLIDE: The syncer Daemon ........ 6-9
6-6. SLIDE: IPC Memory Allocation ........ 6-10
6-7. SLIDE: Memory Metrics to Monitor - Systemwide ........ 6-12
6-8. SLIDE: Memory Metrics to Monitor - per Process ........ 6-14
6-9. SLIDE: Memory Monitoring - vmstat Output ........ 6-16
6-10. SLIDE: Memory Monitoring - glance Memory Report ........ 6-18
6-11. SLIDE: Memory Monitoring - glance Process List ........ 6-19
6-12. SLIDE: Memory Monitoring - glance Individual Process ........ 6-20
6-13. SLIDE: Memory Monitoring - glance System Tables ........ 6-21
6-14. SLIDE: Tuning a Memory-Bound System - Hardware Solutions ........ 6-23
6-15. SLIDE: Tuning a Memory-Bound System - Software Solutions ........ 6-24
6-16. SLIDE: PA-RISC Access Control ........ 6-26
6-17. SLIDE: The serialize Command ........ 6-28
6-18. LAB: Memory Leaks ........ 6-30

Module 7 Swap Space Performance
7-1. SLIDE: Swap Space Management - Simple View ........ 7-2
7-2. SLIDE: Swap Space - After a New Process Executes ........ 7-4
7-3. SLIDE: The swapinfo Command ........ 7-5
7-4. SLIDE: Swap Space Management - Realistic View ........ 7-7
7-5. SLIDE: Swap Space - After a New Process Executes ........ 7-8
7-6. SLIDE: Swap Space - When Memory Equals Data Swapped ........ 7-10
7-7. SLIDE: Swap Space - When Swap Space Fills Up ........ 7-11
7-8. SLIDE: Pseudo Swap ........ 7-12
7-9. SLIDE: Total Swap Space - Calculation with Pseudo Swap ........ 7-14
7-10. SLIDE: Example Situation Using Pseudo Swap ........ 7-16
7-11. SLIDE: Swap Priorities ........ 7-17
7-12. SLIDE: Swap Chunks ........ 7-18
7-13. SLIDE: Swap Space Parameters ........ 7-19
7-14. SLIDE: Summary ........ 7-21
7-15. LAB: Monitoring Swap Space ........ 7-22

Module 8 Disk Performance Issues
8-1. SLIDE: Disk Overview ........ 8-2
8-2. SLIDE: Disk I/O - Read Data Flow ........ 8-4
8-3. SLIDE: Disk I/O - Write Data Flow (Synchronous) ........ 8-6
8-4. SLIDE: Disk Metrics to Monitor - Systemwide ........ 8-8
8-5. SLIDE: Disk Metrics to Monitor - per Process ........ 8-10
8-6. SLIDE: Activities that Create a Large Amount of Disk I/O ........ 8-12
8-7. SLIDE: Disk I/O Monitoring - sar -d Output ........ 8-14
8-8. SLIDE: Disk I/O Monitoring - sar -b Output ........ 8-16
8-9. SLIDE: Disk I/O Monitoring - glance Disk Report ........ 8-18
8-10. SLIDE: Disk I/O Monitoring - glance Disk Device I/O ........ 8-19
8-11. SLIDE: Disk I/O Monitoring - glance Logical Volume I/O ........ 8-20
8-12. SLIDE: Disk I/O Monitoring - glance System Calls per Process ........ 8-21
8-13. SLIDE: Tuning a Disk I/O-Bound System - Hardware Solutions ........ 8-22
8-14. SLIDE: Tuning a Disk I/O-Bound System - Perform Asynchronous Meta-data I/O ........ 8-24
8-15. SLIDE: Tuning a Disk I/O-Bound System - Load Balance across Disk Controllers ........ 8-26
8-16. SLIDE: Tuning a Disk I/O-Bound System - Load Balance across Disk Drives ........ 8-28
8-17. SLIDE: Tuning a Disk I/O-Bound System - Tune Buffer Cache ........ 8-30
8-18. LAB: Disk Performance Issues ........ 8-33
Module 9 HFS File System Performance
9-1. SLIDE: HFS File System Overview ........ 9-2
9-2. SLIDE: Inode Structure ........ 9-5
9-3. SLIDE: Inode Data Block Pointers ........ 9-6
9-4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd? ........ 9-8
9-5. SLIDE: File System Blocks and Fragments ........ 9-10
9-6. SLIDE: Creating a New File on a Full File System ........ 9-13
9-7. SLIDE: HFS Metrics to Monitor - Systemwide ........ 9-15
9-8. SLIDE: Activities that Create a Large Amount of File System I/O ........ 9-17
9-9. SLIDE: HFS I/O Monitoring - bdf Output ........ 9-18
9-10. SLIDE: HFS I/O Monitoring - glance File System I/O ........ 9-19
9-11. SLIDE: HFS I/O Monitoring - glance File Opens per Process ........ 9-20
9-12. SLIDE: Tuning a HFS I/O-Bound System - Tune Configuration for Workload ........ 9-22
9-13. SLIDE: Tuning a HFS I/O-Bound System - Use Fast Links ........ 9-25
9-14. LAB: HFS Performance Issues ........ 9-27

Module 10 VxFS Performance Issues
10-1. SLIDE: Objectives ........ 10-2
10-2. SLIDE: JFS History and Version Review ........ 10-5
10-3. SLIDE: JFS Extents ........ 10-9
10-4. SLIDE: Extent Allocation Policies ........ 10-11
10-5. SLIDE: JFS Intent Log ........ 10-13
10-6. SLIDE: Intent Log Data Flow ........ 10-16
10-7. SLIDE: Understand Your I/O Workload ........ 10-18
10-8. SLIDE: Performance Parameters ........ 10-20
10-9. SLIDE: Choosing a Block Size ........ 10-21
10-10. SLIDE: Choosing an Intent Log Size ........ 10-23
10-11. SLIDE: Intent Log Mount Options ........ 10-25
10-12. SLIDE: Other JFS Mount Options ........ 10-27
10-13. SLIDE: JFS Mount Option: mincache=direct ........ 10-31
10-14. SLIDE: JFS Mount Option: mincache=tmpcache ........ 10-33
10-15. SLIDE: Kernel Tunables ........ 10-35
10-16. SLIDE: Fragmentation ........ 10-37
10-17. TEXT PAGE: Monitoring and Repairing File Fragmentation ........ 10-40
10-18. SLIDE: Using setext ........ 10-50
10-19. SLIDE: I/O Tunable Parameters ........ 10-52
10-20. SLIDE: vxtunefs Command for Tuning VxFS ........ 10-54
10-21. SLIDE: /etc/vx/tunefstab Configuration ........ 10-56
10-22. SLIDE: Taking Snapshots and Performance ........ 10-58
10-23. LAB: JFS File System Tuning ........ 10-60
Module 11 Network Performance
11-1. SLIDE: The OSI Model ........ 11-2
11-2. SLIDE: NFS Read/Write Data Flow ........ 11-4
11-3. SLIDE: NFS on HP-UX with UDP ........ 11-6
11-4. SLIDE: NFS on HP-UX with TCP ........ 11-7
11-5. SLIDE: biod on Client ........ 11-9
11-6. SLIDE: TELNET ........ 11-11
11-7. SLIDE: FTP ........ 11-13
11-8. SLIDE: Metrics to Monitor - NFS ........ 11-15
11-9. SLIDE: Metrics to Monitor - Network ........ 11-18
11-10. SLIDE: Determining the NFS Workload ........ 11-20
11-11. SLIDE: NFS Monitoring - nfsstat Output ........ 11-23
11-12. SLIDE: Network Monitoring - lanadmin Output ........ 11-28
11-13. SLIDE: Network Monitoring - netstat -i Output ........ 11-31
11-14. SLIDE: glance - NFS Report ........ 11-32
11-15. SLIDE: glance - NFS System Report ........ 11-33
11-16. SLIDE: glance - Network by Interface Report ........ 11-34
11-17. SLIDE: Tuning NFS ........ 11-35
11-18. SLIDE: Tuning the Network ........ 11-37
11-19. SLIDE: Tuning the Network (Continued) ........ 11-39
11-20. LAB: Network Performance ........ 11-41

Module 12 Tunable Kernel Parameters
12-1. SLIDE: Kernel Parameter Classes ........ 12-2
12-2. SLIDE: Tuning the Kernel ........ 12-5
12-3. SLIDE: Kernel Parameter Categories ........ 12-8
12-4. SLIDE: File System Kernel Parameters ........ 12-9
12-5. SLIDE: Message Queue Kernel Parameters ........ 12-11
12-6. SLIDE: Semaphore Kernel Parameters ........ 12-13
12-7. SLIDE: Shared Memory Kernel Parameters ........ 12-15
12-8. SLIDE: Process-Related Kernel Parameters ........ 12-17
12-9. SLIDE: Memory-Related Kernel Parameters ........ 12-19
12-10. SLIDE: LVM-Related Kernel Parameters ........ 12-21
12-11. SLIDE: Networking-Related Kernel Parameters ........ 12-22
12-12. SLIDE: Miscellaneous Kernel Parameters ........ 12-23

Module 13 Putting It All Together
13-1. SLIDE: Review of Bottleneck Characteristics ........ 13-2
13-2. SLIDE: Performance Monitoring Flowchart ........ 13-4
13-3. SLIDE: Review - Memory Bottlenecks ........ 13-6
13-4. SLIDE: Correcting Memory Bottlenecks ........ 13-7
13-5. SLIDE: Review - Disk Bottlenecks ........ 13-8
13-6. SLIDE: Correcting Disk Bottlenecks ........ 13-9
13-7. SLIDE: Review - CPU Bottlenecks ........ 13-11
13-8. SLIDE: Correcting CPU Bottlenecks ........ 13-12
13-9. SLIDE: Final Review - Major Symptoms ........ 13-13
http://education.hp.com
Appendix A Applying GlancePlus Data
A-1. TEXT PAGE: Case Studies Using GlancePlus ............................... A-2
Solutions
Overview
Course Description
This course introduces students to the various aspects of monitoring and tuning their systems. Students are taught which tools to use, what symptoms to look for, and what remedial actions to take. The course also covers HP GlancePlus/Gpm and HP PerfRx. The course is designed to:
Introduce the subject of performance and tuning.
Describe how the system works.
Identify what tools we can use to look at performance.
Identify the symptoms we may encounter and what they indicate.
Course Goals
To educate the students on HP-UX performance monitoring
To enable them to identify bottlenecks and potential problems
To learn the appropriate remedial actions to take
Module 1 Introduction
List characteristics of a system yielding good user response time.
List characteristics of a system yielding high data throughput.
List three generic areas most often analyzed for performance.
List the four most common bottlenecks on a system.
Module 2 Performance Tools
Identify various performance tools available on HP-UX.
Categorize each tool as either real time or data collection.
List the major features of the performance tools.
Compare and contrast the differences between the tools.
Module 3 GlancePlus
Compare GlancePlus with other performance monitoring/management tools.
Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).
Module 4 Process Management
Describe the components of a process.
Describe how a process executes, and identify its process states.
Describe the CPU scheduler.
Describe a context switch and the circumstances under which context switching occurs.
Describe, in general, the HP-UX priority queues.
Module 5 CPU Management
Describe the components of the processor module.
Describe how the TLB and CPU cache are used.
List four CPU-related metrics.
Identify how to monitor CPU activity.
Discuss how best to use the performance tools to diagnose CPU problems.
Specify appropriate corrections for CPU bottlenecks.
Module 6 Memory Management
Describe how the HP-UX operating system performs memory management.
Describe the main performance issues that involve memory management.
Describe the UNIX buffer cache.
Describe the sync process.
Identify the symptoms of a memory bottleneck.
Identify global and process memory metrics.
Use performance tools to diagnose memory problems.
Specify appropriate corrections for memory bottlenecks.
Describe the function of the serialize command.
Module 7 Swap Space Performance
Describe the difference between swap usage and swap reservation.
Interpret the output of the swapinfo command.
Define and configure pseudo swap.
Define and configure swap space priorities.
Define and configure swchunk and maxswapchunks.
Module 8 Disk Performance Issues
List three ways disk space can be used.
List disk device files.
Identify disk bottlenecks.
Identify kernel system parameters.
Module 9 File System Performance
List three ways file systems are used.
List basic file system data structures.
Identify file system bottlenecks.
Identify kernel system parameters.
Module 10 VxFS Performance
Understand JFS structure and version differences.
Explain how to enhance JFS performance.
Set block sizes to improve performance.
Set intent-log size and rules to improve performance.
Understand and manipulate synchronous and asynchronous I/O.
Identify JFS tuning parameters.
Understand and control fragmentation issues.
Evaluate the overhead of online backup snapshots.
Module 11 Network Performance
List factors directly related to network performance.
Describe how to determine network workloads (server and client).
Evaluate UDP and TCP transport options.
Identify a network bottleneck.
List possible solutions for a network performance problem.
Module 12 Tunable Kernel Parameters
Identify which tunable parameters belong to which category.
Identify tunable kernel parameters that could impact performance.
Tune both static and dynamic tunable parameters.
Identify and characterize some network performance problems.
List some useful tools for measuring network performance problems and state how they might be applied.
Identify bottlenecks on other common system devices not associated directly with the CPU, disk, or memory.
Curriculum Path
Fundamentals of UNIX (H51434S)
        |
HP-UX System and Network Administration I (H3064S)
        |
HP-UX System and Network Administration II (H3065S)
Recommended
Agenda
The following agenda is only a guideline; the instructor may vary it if desired. The course will run until the afternoon of the fifth day. The last hour or so can be used to demonstrate more fully the performance offerings, such as HP PRM and HP PerfView.

Day 1:  1 Introduction;  2 Performance Tools
Day 2:  3 GlancePlus;  4 Process Management;  5 CPU Management
Day 3:  6 Memory Management;  7 Swap Space Performance;  8 Managing Disk Performance Issues
Day 4:  9 File System Performance;  10 VxFS Performance;  11 NFS Performance
Day 5:  12 Tunable Kernel Parameters;  13 Putting It All Together
Module 1 Introduction
Objectives
Upon completion of this module, you will be able to do the following:
List characteristics of a system yielding good user response time.
List characteristics of a system yielding high data throughput.
List three generic areas most often analyzed for performance.
List the four most common bottlenecks on a system.
Student Notes
Welcome to the HP-UX Performance and Tuning course. This course is designed to provide a high-level understanding of common performance problems and common bottlenecks found on an HP-UX system. The course uses HP performance tools to view activity currently on the system. While many tools can be used to analyze this activity, this course primarily utilizes the glance tool, which is specifically tailored for HP-UX systems.
Course Outline
Introduction to Performance
Performance Tools Overview
GlancePlus
Process Management
CPU Management
Memory Management
Swap Space Performance Issues
Disk and File System Performance Issues
HFS Performance Issues
VxFS Performance Issues
Network Performance Issues
Tuning the Kernel
Putting It All Together
Performance Recap
Student Notes
Topics covered in this course include:

System Internals
These modules include information related to how the system components (CPU, memory, file systems, and network) function and interact with each other. Just as a mechanic cannot tune a car's engine until he understands how it works, a system administrator cannot tune system resources properly until he has a good understanding of how those resources work.

Performance Tools
Many performance tools are available with HP-UX. Some tools come as standard equipment; other tools are additional add-on products. Some tools provide runtime monitoring; other tools perform data collection. We will review all of the tools.

Specialty Areas
These modules cover areas of special interest to customers in particular types of environments. Three specialty areas are covered at a high level: NFS and networking, databases, and application profiling.
System Performance
Computer System
Student Notes
Different computer systems have different requirements. Some systems may need to provide quick response time; other systems may need to provide a high level of data throughput.
Is it possible to get both good response time and high system throughput?
Application
Operating System
Hardware
Student Notes
This slide shows a hierarchical view of a computer system. The base of a computer is its hardware. Built on top of the hardware is the operating system (i.e. the operating system is dependent on the hardware in order to run). The application programs are built on top of the operating system (OS). All three of these areas can have performance problems.
Hardware
The hardware moves data within the computer system. If the hardware is slow, then, no matter how finely tuned the OS and applications are, the system will still be slow. Ultimately, the system is only as fast as the hardware can move the data. Items affecting the speed of the hardware include CPU clock speed, amount and speed of memory, type of disk controller (Fast/Wide SCSI or Single-Ended SCSI), and type of network card (FDDI or Ethernet).
Operating System
The operating system runs on top of the hardware. It controls how the hardware is utilized. The operating system decides which process runs on the CPU, how much memory to allocate for the buffer cache, whether I/O to the disks is performed synchronously or asynchronously, and so on. If the operating system is not configured properly, then the performance of the system will be poor. Items affecting how the operating system performs include process priorities and their nice values, the tunable OS parameters, the mount options used for file systems, and the configurations of network and swap devices.
Applications
The applications run on top of the operating system. Application programs include software such as database management systems, electronic design applications, and accounting-based applications. The performance of an application is dependent on the operating system and hardware, but it is also dependent on how the application is coded and how the application itself is configured. Items affecting the performance of the application include how the application data is laid out on disk, how many users are trying to use the application concurrently, and how efficiently the application uses the system's resources.
Questions
Performance Bottlenecks
(Slide graphic: network, CPU, processes, and memory)
Student Notes
Poor performance often results because a given resource cannot handle the demand being placed upon it. When the demand for a resource exceeds the availability of the resource, a bottleneck exists for that resource. Common resource bottlenecks are:

CPU: A CPU bottleneck occurs when the number of processes wanting to execute is constantly more than the CPU can handle. Basic symptoms of a CPU bottleneck are consistently high CPU utilization and multiple jobs in the CPU run queue.

Memory: A memory bottleneck occurs when the total number of processes on the system will not all fit into memory at one time (i.e. there are more processes than memory can hold). When this happens, pages of memory need to be copied out to the swap partition on disk to free space in memory. Basic symptoms of a memory bottleneck are high memory utilization and consistent I/O activity to the swap partition on disk.
Disk: A disk bottleneck occurs when the amount of I/O to a specific disk is more than the disk can handle. Basic symptoms of a disk bottleneck include high utilization of a disk drive and multiple I/O requests consistently in the disk I/O queue.

Network: A network bottleneck occurs when the amount of time needed to perform network-based transactions is consistently greater than expected. Basic symptoms of a network bottleneck include network collisions, network request timeouts, and packet retransmissions.
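The run-queue symptom of a CPU bottleneck lends itself to a simple mechanical check. The sketch below uses made-up sample values (in practice they would come from sar -q or the "r" column of vmstat) and an assumed CPU count:

```shell
# Count how many run-queue samples exceed the number of CPUs.
# Sample values and ncpu are illustrative assumptions, not real data.
runq_samples="1 5 3 0 5"
ncpu=2
over=0; total=0
for r in $runq_samples; do
  total=$((total + 1))
  if [ "$r" -gt "$ncpu" ]; then over=$((over + 1)); fi
done
echo "$over of $total samples exceed the CPU count ($ncpu)"
```

If most samples exceed the CPU count over a sustained period, the "multiple jobs in the CPU run queue, consistently" symptom above is present.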
Baseline
(Slide graphic: response time as a function of the number of concurrent users)
Student Notes
In order to quantify good versus poor performance, a customer needs to know what the best possible response time for a given workload can be. The procedure for calculating the best possible response time for a given workload is known as baselining. To calculate the baseline (i.e. the best possible response time) for a particular workload, the workload needs to be performed when no other activity is on the system. The intent is that when all resources are free, the workload will be able to execute as quickly as possible, thereby yielding the best possible response time. Once the baseline value is known, a relative measure is now available for determining how poorly the workload is performing. For example, assume a baseline value of 5 seconds for the workload shown on the slide. When five users are on the system, the response time for the workload increases to 7 seconds. The relative comparison shows response time taking 40% (or 2 seconds) more time to perform this workload when five users are on the system. We have just quantified the relative effect of having five users on the system relative to this particular workload.
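The relative comparison above is simple arithmetic; a small sketch using the hypothetical 5-second and 7-second figures from the example:

```shell
# Relative degradation against a baseline (figures from the example above)
baseline=5   # best-case response time in seconds, no other load
loaded=7     # response time with five users on the system
pct=$(( (loaded - baseline) * 100 / baseline ))
echo "Response time degraded by ${pct}% ($((loaded - baseline)) seconds)"
```

The same arithmetic applies to any workload once its baseline has been measured.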
The slide illustrates the typical behavior for a given workload. As more users concurrently utilize the system, the response time for a given workload gets worse. NOTE: In this class we will run baseline metrics using simplified "workload" simulation programs. Results will vary greatly with your applications.
(Slide graphic: average response time versus average resource utilization at 25, 50, 75, and 100 percent)
Student Notes
The queuing theory of performance states that the average response time of a given resource is directly linked to the average utilization of that resource. The slide shows a baseline value of X seconds for a given resource. According to the queuing theory, the users will experience this response time when the resource has an average utilization of 0 to 25%. When the average utilization of the resource reaches 75%, the average response time will double. As the average utilization approaches 100%, the average response time quadruples. The bottom line is, as the average utilization of the resource increases, the average response time gets worse and worse. Why does the average response time become poor as the average utilization of a resource increases?
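The shape of this curve can be approximated with the classic single-server (M/M/1) queuing formula, R = S / (1 - U), where S is the baseline service time and U the average utilization. This is a textbook approximation, not the exact curve on the slide, but it shows the same trend (S = 20 ms is an assumed figure):

```shell
# M/M/1 approximation of response time versus utilization: R = S / (1 - U)
# S = 20 ms is an assumed baseline service time for the sketch
awk 'BEGIN {
  S = 20
  for (u = 0.00; u <= 0.76; u += 0.25)
    printf "utilization %3.0f%%  ->  response %6.1f ms\n", u * 100, S / (1 - u)
}'
```

Note how the response time climbs slowly at first and then steeply as utilization approaches 100%, which is the behavior the slide describes.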
(Slide graphic: jobs waiting in a queue for a system resource)
Student Notes
The reason the average response time gets so poor as the average resource utilization increases is that the line waiting to get to the resource gets longer. As resource utilization increases, the number of jobs waiting on the resource also increases. When poor performance is experienced, it is most often because the queue has become long. A long queue causes jobs to spend most of their time waiting in line for the resource (CPU, memory, network, or disk), as opposed to being serviced by the resource. The slide shows four people waiting in line to get to a resource (think of a line in a bank with one bank teller). If it takes 5 minutes to service one customer, then the fourth person in line will wait 15 minutes before reaching the resource. Adding another 5 minutes to service the request brings the total response time to 20 minutes for the last person in line, as opposed to 5 minutes if the line had been empty. Of course, there is also overhead from switching from one customer to the next; this switching overhead is minimal in this example because the customers are handled in a serial fashion.
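The bank-teller arithmetic above can be written out directly:

```shell
# Queue wait for the last person in line: 5-minute service, 4 people waiting
service=5      # minutes to service one customer
position=4     # last person in line
wait=$(( (position - 1) * service ))     # time spent waiting in the queue
response=$(( wait + service ))           # total time until service completes
echo "wait=${wait} min, total response=${response} min"
```

With an empty queue the same customer would have been done in 5 minutes; the other 15 minutes are pure queuing delay.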
# sar -d 5 5

15:31:55   device   %busy   avque   r+w/s   blks/s   avwait   avserv
15:32:00   c0t6d0      81     3.4      31      248    59.31    21.20
           c0t5d0       5      .5       1       32     0.65    23.58
15:32:05   c0t6d0      84     3.5      34      245    71.64    24.04
           c0t5d0       3      .5       2        8     0.25    17.93
15:32:10   c0t6d0      68     2.9      31      248    51.36    18.55
           c0t5d0       1      .5       0        6     0.48    19.18
15:32:15   c0t6d0      71     2.7      30       30    62.88    24.16
           c0t5d0       0      .5       1        3     0.65    29.25
15:32:20   c0t6d0      69     2.7      29       29    61.70    24.14
           c0t5d0       0      .5       1        3     0.65    29.25
Student Notes
The slide above provides an example of the queuing theory for disk drives, as reported by the sar tool. The four fields to focus on are:

%busy    The percentage utilization of each disk
avque    The average number of I/O requests in the queue for that disk
avwait   The average amount of time a request spends waiting in that disk's queue
avserv   The average amount of time to service an I/O request (not including the wait time)

Analyzing the data shows a baseline of around 20 milliseconds to service an I/O request (the approximate average of the avserv column). The first line item shows a disk that is 81% utilized. Its total response time is the average wait plus the average service time, or approximately 80 milliseconds. This is four times longer than the baseline of 20 milliseconds. In fact, each snapshot shows the busy disk waiting in the queue for an amount of time greater than the amount of time needed to service the I/O request. To
see why the wait time is so high, look at the avque size. Notice the queue size is highest when the device is most busy. This is the basic concept of the performance queuing theory.
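The response-time arithmetic above (avwait + avserv) can be automated; the sketch below feeds the busy disk's figures, copied from the sar output, through awk:

```shell
# Total response time per I/O for the busy disk (c0t6d0):
# column 1 = avwait, column 2 = avserv, values copied from the sar -d output
printf '%s\n' \
  "59.31 21.20" \
  "71.64 24.04" \
  "51.36 18.55" \
  "62.88 24.16" \
  "61.70 24.14" |
awk '{ printf "response %6.2f ms\n", $1 + $2 }'
```

Every interval comes out near 70-95 ms against a roughly 20 ms service baseline, which is the queuing effect the text describes.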
Summary
Objective for the system:
Student Notes
To summarize this module, systems are tuned for response time or for throughput. This class focuses on tuning for best possible response time. Areas that affect response time are speed of the hardware, configuration of the operating system, and configuration of the application. This class focuses on the configuration of the operating system. Common bottlenecks with computer systems include CPU, memory, disk, and network. This class discusses all four bottlenecks. Baselines are an important measurement tool for quantifying performance. In the lab for this module, the student will establish CPU and disk I/O baselines. Finally, the queuing theory of performance states that the average response time increases as the average utilization of a resource increases. This is an important concept, which will be revisited throughout this course.
4. Time the execution of the med program. Make sure there is no activity on the system.
   # timex ./med
   Record execution time:   real: _____   user: _____   sys: _____

5. Time the execution of the short program. Make sure there is no activity on the system.
   # timex ./short
   Record execution time:   real: _____   user: _____   sys: _____

6. Time the execution of the diskread program.
   # timex ./diskread
   Record execution time:   real: _____   user: _____   sys: _____
7. In the case of the long, med, and short programs the real time is the sum of the usr and sys time (approximately). This is not the case with diskread. Explain why.
How long did the slowest program take to execute? ___________________
How did the CPU queue size change from step 2? ___________________

4. Time how long it takes for five short programs to execute.
   # timex ./short & timex ./short & timex ./short & \
     timex ./short & timex ./short &
How long did the slowest program take to execute? _____________________
How did the CPU queue size change from step 3? _____________________

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

6. Comment on the overhead of switching from one process to another.
Student Notes
Many performance tools are available for many different purposes. In the HP-UX operating system, there are over 50 different performance-related tools. Some tools provide real-time performance information, such as how busy the CPU is right now. Other tools collect data in the background and maintain a history of performance information. This module addresses all of the tools and the different functions they perform.
Student Notes
The objective of this module is to highlight all the performance tools available with HP-UX, to categorize them by function, and to describe how each tool is used. The module is intended to be a quick reference of performance tools, which the student can refer to when needing to select a tool for a specific task. NOTE: This module does not discuss how to interpret the output of the tools. Interpretation of the metrics is provided in later modules.
Sources of Tools
Standard tools (frequently found on many UNIX systems)
HP-specific tools (found only on HP-UX)
Optional tools (licensed and sold separately; generally available only on HP-UX)
Student Notes
Three types of tools are presented in this module: Standard Tools Standard tools are those frequently found on many UNIX systems, including HP-UX. The advantage of the standard tools is that their results can be compared with those being collected on other UNIX platforms. This provides an "apples for apples" comparison, which is desirable when comparing systems. The output from these standard tools (and some of the options) may vary slightly among UNIX systems. In addition, differences between the various UNIX implementations can affect the reliability of the metrics being output by the tools. Therefore, be careful to check the results with other tools or seek help before basing important tuning decisions on the value of one metric. HP-Specific Tools HP-specific tools are those which are found only on HP-UX operating systems. These tools are often tailored specifically to understand HP-UX implementations. These tools are generally not found on other UNIX implementations, as other
implementations are different from those of HP. Some of the HP-specific tools come with the base OS; others are purchased as optional tools. Optional Tools Optional tools are tools that are added to the operating system in addition to the standard tools. Some of the optional tools, such as the HP-PAK (Programmers Analysis Kit), may be included with add-on software, such as compilers for HP-UX. Other optional tools, like GlancePlus, PerfView, MeasureWare, NetMetrix, PRM (Process Resource Manager), and WLM (Work Load Manager), are purchased individually or in small bundles (GlancePlus Pak also includes a MeasureWare agent). Optional tools are typically licensed from HP. They offer many advantages over the standard tools including:
Types of Tools
Data Collection Performance
Run-Time Monitoring
System Configuration and Utilization
Performance Administration
Network Monitoring
Application Profiling and Monitoring
Student Notes
The tools covered in this section fall into six main categories: Run-Time Monitoring Tools These tools provide information as to the performance of the system now. The information is current and provides a real-time perspective as to the state of the system at the current moment. Data Collection Performance Tools These tools collect performance data in the background, summarize or average the data into a summary record, and log the summary record to a file or files on disk. They do not typically provide real-time data. Network Monitoring Tools These tools monitor performance, status, and packet errors on the network. They include both monitoring and configuration tools related to network management. Performance Administrative Tools A system administrator can use these tools to manage the performance of his system.
They typically do not report any data, but allow the current configuration of the system (and its components) to be changed to help improve performance.

System Configuration and Utilization Information Tools
These tools report current system configurations (such as LVM and file systems). They also report resource utilization statistics, such as disk and file system space and the number of processes.

Application Profiling and Monitoring Tools
These tools provide in-depth analysis of the behavior of a program. They monitor and trace the execution of a process, and report the resources used and calls made during its execution.
Student Notes
Each tool has strengths and weaknesses, advantages and disadvantages, and unique features. Some items to consider when selecting a tool are:

Source of Data
The collected data can come from a variety of sources, including the kernel, an application, or a specific daemon (like the midaemon).

Scope
The scope determines the level of detail provided by the tool. Most of the standard tools do not show process-level metrics. For example, they display global disk I/O rates, but do not show which process is generating the I/O or the disk on which the I/O is concentrated.

Cost
The cost sometimes determines whether the tool is an option. Many of the HP-specific tools have additional costs associated with them. (Many of these tools have evaluation copies available for a trial period.)

Intrusiveness
The intrusiveness relates to the overhead associated with running the tool. Some tools have significant overhead. A large user community using top, for example, may be responsible for generating large amounts of "monitoring" overhead on the system. Another example is the ps
command. It has little impact on most systems due to the low frequency at which it is executed. However, the ps command places fairly high overhead on the system during its execution.

Accuracy
The accuracy of the tool relates to the reliability of the data being reported. Many standard UNIX tools, like vmstat and sar, have been ported from other UNIX systems. The registers that they monitor may not always correspond to the registers that the kernel updates.

Others
Other factors can also have a significant impact on the tool you decide to use, including familiarity, metrics available, permissions required, and portability.

As the tools are presented in the upcoming pages, many of these items will be addressed.
Data Sources
(Slide graphic: data flowing from the kernel through scopeux, a socket, and logfiles to pv)
Student Notes
The standard tools read information from the UNIX counters and registers maintained in kernel memory (accessible via the /dev/kmem device file and the pstat() system call). These counters and registers are updated 10 times a second as a standard part of most UNIX system implementations. The data in the counters and registers are generally adequate for most performance jobs, but do not provide enough detail when in-depth tuning is needed. The optional tools for HP-UX use an additional source called kernel instrumentation (KI). The KI interface provides additional information beyond the UNIX kernel counters and registers. The KI interface gathers performance information on a system call basis, with every system call generated by every process being traced. The KI interface uses a proprietary measurement interface library to derive the additional metrics. These tools are frequently revised and updated to provide the highest levels of accuracy with the lowest possible overhead. The optional tools, such as Glance and MeasureWare, are KI-based tools when running on HP-UX systems, although they are available for other vendor systems as well. Additional information about KI-based tools (also known as resource and performance management (RPM) tools) can be obtained from the RPM web site at: www.hp.com/go/rpm
Student Notes
The slide shows run-time performance monitoring tools included with HP-UX. These tools provide current information about the performance of the system. These tools are standard UNIX performance tools, which are found on most other UNIX implementations. The Global Metrics column indicates whether the tool will show aggregate resource utilization without differentiating between specific resources. The Process Detail column indicates whether the tool will show resources being used by a single PID. The Alarming Capability column indicates whether the tool is capable of sending an alarm when one of the metrics exceeds a user-defined threshold.
Syntax
iostat [-t] [interval [count]]

-t         Report terminal statistics as well as disk statistics
interval   Display successive summaries at this frequency (in seconds)
count      Repeat the summaries this number of times
Key Metrics
The iostat metrics include:

bps    Blocks (kilobytes) transferred per second
sps    Number of seeks per second
msps   Average milliseconds per seek
With the advent of new disk technologies, such as data striping, where a single data transfer is spread across several disks, the average milliseconds per seek becomes impossible to compute accurately. At best it is only an approximation, varying greatly, based on several dynamic system conditions. For this reason and to maintain backward compatibility, the milliseconds per seek (msps) field is set to the value 1.0.
Examples
# iostat -t 5 1
      tty                    cpu
 tin  tout           us  ni  sy  id
   0     0            2   0   1  98

device      bps     sps    msps
c0t6d0        0     0.0     1.0
Syntax

ps [-aAcdefHjlP] [-C cmdlist] [-g grplist] [-G gidlist] [-n namelist]
   [-o format] [-R prmgrplist] [-s sidlist] [-t termlist] [-u uidlist]
   [-U uidlist]
Key Metrics
The ps metrics include:

ADDR   The memory address of the process, if resident; otherwise, the disk address.
C      Recent processor utilization, used for CPU scheduling (0-255).
F      Flags associated with the process (octal, additive):
       0   Process is on the swap device
       1   Process is in core memory
       2   Process is a system process
       4   Process is locked in memory
(and many more)

NI     The nice value for the process; used in priority computation.
PPID   The process ID number of the parent process.
PID    The process ID number of this process.
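Because the F flags are additive, a combined value is the sum of the individual bits. A small, purely illustrative decoder (decode_flags is not a real HP-UX command, just a sketch of the arithmetic):

```shell
# Hypothetical decoder for the additive ps F flag bits listed above
decode_flags() {
  f=$1
  out=""
  [ $(( f & 1 )) -ne 0 ] && out="$out in-core"
  [ $(( f & 2 )) -ne 0 ] && out="$out system-process"
  [ $(( f & 4 )) -ne 0 ] && out="$out locked-in-memory"
  [ -z "$out" ] && out=" on-swap-device"
  echo "F=$f:$out"
}
decode_flags 1    # an ordinary resident process
decode_flags 3    # a resident system process (1 + 2)
```

For example, F=3 in the ps -l output further below decodes as "in core" plus "system process".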
PRI    The priority of the process.
S      The state of the process:
       I   Process is being created (very rarely seen)
       S   Process is Sleeping
       R   Process is currently Runnable
       T   Process is Stopped (rare)
       Z   Process is terminated (aka Zombie process)
STIME  Starting time of the process.
SZ     The size of the process in 4-KB memory pages.
TIME   The cumulative execution time of the process.
TTY    The controlling terminal for the process.
WCHAN  The address of a structure representing the event or resource for which the process is waiting or sleeping.
Example
# ps -fu daemon
     UID   PID  PPID  C    STIME TTY      TIME COMMAND
  daemon  1171  1170  0 13:03:42 ?        3:10 /usr/bin/X11/X :0
  daemon  1565  1171  0 17:47:47 ?        0:00 pexd /tmp/to_pexd_1171.2 /dev/ttyp2

# ps -lu daemon
 F S UID   SZ ...
 1 S   1  697 ...
 1 S   1  115 ...
Syntax
sar [-ubdycwaqvmpAMSP] [-o file] t [n]

Metric-related options:
-u     CPU utilization
-q     Run queue and swap queue lengths and utilization
-b     Buffer cache stats
-d     Disk utilization
-y     TTY utilization
-c     System call rates
-w     Swap activity
-v     Kernel table utilization
-m     Semaphore and message queue utilization
-a     File access system routine utilization
-A     Everything!
-M     Per-processor breakdown (used with -u and/or -q)
-P/-p  Per-processor-set breakdown (used with -Mu and/or -Mq)
Key Metrics
The sar command has many metrics. Included below are some sample metrics based on the disk and CPU reports:
CPU Report (-u)
The CPU report displays the utilization of the CPU and the percentage of time spent within the different modes.

%usr    Percentage of time the system spent in user mode
%sys    Percentage of time the system spent in system mode
%wio    Percentage of time processes were waiting for (disk) I/O
%idle   Percentage of time the system was idle
Disk Report (-d)

The disk report displays activity on each block device (i.e. disk drive).

device   Logical name of the device (device file name)
%busy    Percentage of time the device was busy servicing a request
avque    Average number of I/O requests pending for the device
r+w/s    Number of I/O requests per second (includes reads and writes)
blks/s   Number of 512-byte blocks transferred (to and from) per second
avwait   The average amount of time the I/O requests wait in the queue before being serviced
avserv   The average amount of time spent servicing an I/O request (includes seek, rotational latency, and data transfer times)
Examples
# sar -u 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio   %idle
08:32:29      64      36       0       0
08:32:34      61      39       0       0
08:32:39      61      39       0       0
08:32:44      61      39       0       0
Average       61      39       0       0

# sar -d 5 4
HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24   device   %busy   avque   r+w/s   blks/s   avwait   avserv
08:32:29   c0t6d0   19.36    0.55      20     1341     6.37    14.27
08:32:34   c0t6d0   26.40    0.58      27     1687     7.10    15.00
08:32:39   c0t6d0   21.00    0.54      23     1528     5.48    14.09
08:32:44   c0t6d0   21.00    0.54      23     1528     5.48    14.09
Average    c0t6d0   22.44    0.56      23     1552     6.34    14.45
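The response time an application sees for a request is roughly avwait + avserv. A small awk sketch of that calculation, using the sample values above (the seven-column layout with timestamps already stripped is an assumption about how the data was pre-filtered):

```shell
# Per-sample response time (avwait + avserv) from sar -d style data.
# Assumed column layout (timestamp column removed):
#   device %busy avque r+w/s blks/s avwait avserv
printf '%s\n' \
  'c0t6d0 19.36 0.55 20 1341 6.37 14.27' \
  'c0t6d0 26.40 0.58 27 1687 7.10 15.00' \
  'c0t6d0 21.00 0.54 23 1528 5.48 14.09' |
awk 'NF == 7 { printf "%s response time: %.2f ms\n", $1, $6 + $7 }'
# first line -> c0t6d0 response time: 20.64 ms
```

A growing avque and avwait while avserv stays flat points at an overloaded disk rather than a slow one.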
Tool Source:     Standard UNIX (System V)
Documentation:   man page and kernel source
Interval:        Process completion
Data Source:     Kernel registers/counters
Metrics:         Process CPU (user, system, elapsed)
Logging:         Standard output device
Overhead:        Minimal
Unique Feature:  Timing how long a process executes
Full Pathname:   /usr/bin/timex
Pros and Cons:   + minimal overhead
                 - cannot be used on already running processes
Example
timex find / 2>&1 >/dev/null | tee -a perf.data

real    39.49
user     1.47
sys     11.24
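The gap between real and user + sys is time the process spent off the CPU: waiting on disk I/O, the run queue, or page faults. For the sample above, a quick awk check:

```shell
# Off-CPU (waiting) time implied by the timex sample above:
# real - (user + sys), and its share of elapsed time.
real=39.49 user=1.47 sys=11.24
awk -v r="$real" -v u="$user" -v s="$sys" 'BEGIN {
    w = r - (u + s)
    printf "waiting: %.2f s (%.0f%% of elapsed)\n", w, w / r * 100
}'
# -> waiting: 26.78 s (68% of elapsed)
```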
Syntax
top [-s time] [-d count] [-n number] [-q]

-s time    Set the delay between screen updates
-d count   Set the number of screen updates to "count", then exit
-n number  Set the number of processes to be displayed
-q         Run quick: runs the top command with a nice value of zero
Key Metrics
The top metrics include:

SIZE   Total size of the process in KB. This includes text, data, and stack.
RES    Resident size of the process in KB. This includes text, data, and stack.
%WCPU  Average (weighted) CPU usage since top started.
%CPU   Current CPU usage over the current interval.
Example
* Start top with a 10-second update interval
  # top -s 10

* Start top and display only 5 screen updates, then exit
  # top -d 5

* Start top and display only the top 15 processes
  # top -n 15
* Start top and let it run continuously
  # top

System: r3w14                                    Fri Oct 17 10:24:23 1997
Load averages: 0.55, 0.37, 0.25
115 processes: 113 sleeping, 2 running
Cpu states:
 LOAD   USER   NICE    SYS   IDLE  BLOCK  SWAIT   INTR   SSYS
 0.55   9.9%   0.0%   2.0%  88.1%   0.0%   0.0%   0.0%   0.0%

Memory: 24204K (15084K) real, 46308K (33432K) virtual, 2264K free   Page# 1/9

TTY   PID USERNAME PRI NI   SIZE   RES STATE    TIME %WCPU  %CPU COMMAND
?     680 root     154 20  1328K  468K sleep   33:23 12.36 12.34 snmpdm
?     728 root     154 20   340K  136K sleep   18:20  5.82  5.81 mib2agt
?    1141 root     154 20 12784K 3708K sleep   84:06  4.47  4.47 netmon
?    1071 root      80 20  1264K  568K run      0:19  3.00  2.99 pmd
?    3892 root     179 20   308K  296K run      0:00  2.59  0.34 top

* To go to the next/previous page, type "j" or "k" respectively
* To go to the first page, type "t"
NOTE:
The two values preceding real and virtual memory are the memory allocated for all processes, and in parentheses, memory allocated for processes that are currently runnable or that have executed within the last 20 seconds.
NOTE:
swait and block are relevant for SMP systems and will be 0.0% on single processor systems. swait is the time a processor spends spinning while waiting for a spinlock. block is the time a processor spends blocked while waiting for a kernel-level semaphore.
Tool Source:     Standard UNIX (BSD 4.x)
Documentation:   man page
Interval:        on demand
Data Source:     Kernel registers/counters and /etc/utmp
Type of Data:    Global
Metrics:         Load averages, number of logged-on users
Logging:         Standard output device
Overhead:        Varies, depending on number of users logged in
Unique Feature:  Easiest way to see time since last reboot and load averages
Full Pathname:   /usr/bin/uptime
Pros and Cons:   + quick look at load average and how long the system has been up
                 - limited statistics
Example
# uptime
11:23am  up 3 days, 22:22,  7 users,  load average: 0.62, 0.37, 0.30

# uptime -w
11:23am  up 3 days, 22:22,  7 users,  load average: 0.57, 0.37, 0.30
User     tty       login@    idle   JCPU   PCPU  what
root     console   9:26am   94:20                /usr/sbin/getty console
root     pts/0     9:26am       5                /sbin/sh
root     pts/3     9:26am    1:57                /sbin/sh
root     pts/4     10:16am             2      2  vi tools_notes
root     pts/5     9:43am                        script
Syntax

vmstat [-dnS] [interval [count]]
vmstat -f | -s | -z

-d  Include disk I/O information
-n  Print in a format more easily viewed on an 80-column display
-S  Include swapping information
-f  Print the number of processes forked since boot, the number of pages used by all forked processes, and the average pages per forked process
-s  Print virtual memory summary information
-z  Zero the summary registers
Key Metrics
The vmstat metrics include:
Process metrics

r  In run queue
b  Blocked for resource (I/O, paging, and so on)
w  Runnable or short sleeper (< 20 sec.) but swapped
Memory and paging metrics

avm   Active virtual pages
free  Number of pages on the free list
re    Page reclaims
at    Address translation faults
pi    Pages paged in
po    Pages paged out
fr    Pages freed by vhand, per second
sr    Pages surveyed (dereferenced) by vhand, per second

Fault metrics

in  Device interrupts per second
sy  System calls per second
cs  CPU context switch rate (switches/second)

CPU metrics

us  Percentage of CPU time spent in user mode
sy  Percentage of CPU time spent in system mode
id  Percentage of CPU time spent idle
Examples
# vmstat -n 5 2
VM
memory              page                                      faults
   avm   free   re   at   pi   po   fr   de   sr     in     sy    cs
  7589    728    0    0    0    0    0    0    0    140    490    30
  7670    692    0    0    0    0    0    0    0    235   4959   170
CPU
cpu           procs
us sy id      r  b  w
 2  1 97      0 74  0
47 11 42      0 75  0

# vmstat -nS 5 2
VM
memory              page                                      faults
   avm   free   si   so   pi   po   fr   de   sr     in     sy    cs
  7984    584    0    0    0    0    0    0    0    140    490    30
  7972    549    0    0    0    0    0    0    0    203    462    53
CPU
cpu           procs
us sy id      r  b  w
 2  1 97      0 75  0
 1  1 98      0 76  0
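A nonzero po or sr column over successive samples means vhand is actively stealing pages, the classic sign of memory pressure. A sketch of that check (the second sample line is invented to show the failing case; the column positions are assumed to match the -n data rows above):

```shell
# Flag memory pressure in vmstat-style samples: page-outs (po, field 6)
# or a nonzero scan rate (sr, field 9) mean vhand is reclaiming pages.
# Assumed columns: avm free re at pi po fr de sr
printf '%s\n' \
  '7589 728 0 0 0  0  0 0  0' \
  '7670 692 0 0 12 48 30 0 65' |
awk '{
    status = ($6 > 0 || $9 > 0) ? "memory pressure" : "ok"
    printf "avm=%s free=%s po=%s sr=%s -> %s\n", $1, $2, $6, $9, status
}'
# -> avm=7589 free=728 po=0 sr=0 -> ok
#    avm=7670 free=692 po=48 sr=65 -> memory pressure
```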
Student Notes
This slide shows the HP-specific, run-time performance monitoring tools included with HP-UX. Currently, glance and gpm are available for HP-UX. Both are optional products that can be purchased separately; if you are running 11i (any version), both are included with the Enterprise and Mission Critical Operating Environments. The glance and gpm tools provide real-time monitoring capabilities specific to the HP-UX operating system. Both provide access to performance data not available with standard UNIX tools, and both use the midaemon (i.e., the KI interface) to collect performance data, yielding much more accurate performance results. xload is an X Windows application that graphically shows the recent length of the CPU's run queue. It consists of a window that displays vertical lines representing the average number of processes in the run queue over the previous intervals. The default interval size is 8 seconds.
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature:
Syntax
glance [-j interval] [-p [dest]] [-f dest] [-maxpages numpages] [-command] [-nice nicevalue] [-nosort] [-lock] [-adviser_off] [-adviser_only] [-bootup] [-iterations count] [-syntax filename] [-all_trans] [-all_instances] [-disks <n>] [-kernel <path>] [-nfs <n>] [-pids <n>] [-no_fkeys]
Key Metrics
The glance tool includes reports for the following areas:
Hot Key  Report                   Function
a        CPU by Processor         All CPUs performance stats
c        CPU Report               CPU utilization stats
d        Disk Report              Disk I/O stats
g        Process List             Global process stats
h                                 Help
i        I/O by Filesystem        I/O by Filesystem
l        Network by LAN           LAN stats
m        Memory Report            Memory stats
n        NFS Report               NFS stats
s        Process selection        Single process information
t        System Table Report      OS table utilization
u        Disk Report              Disk queue length
v        I/O by Logical Volume    Logical Volume Mgr stats
w        Swap Detail              Swap stats
z                                 Zero all stats
A        Application List
B        Global Waits
D        DCE Activity
F        Process Open Files
G        Process Threads
H        Alarm History
I        Thread Resource
J        Thread Wait
K        DCE Process List
L        Process System Calls
M        Process Memory Regions
N        NFS Global Activity
P        PRM Group List
R        Process Resources
T        Transaction Tracker
W        Process Wait States
Y        Global System Calls
Z        Global Threads
?        Help with options
<CR>     Update screen with new data
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Full Pathname: Pros and Cons:
Syntax

gpm [-nosave] [-rpt [rptname]] [-sharedclr] [-nice nicevalue] [-lock] [-disks <n>] [-kernel <path>] [-lfs <n>] [-nfs <n>] [-pids <n>] [Xoptions]
Glance Advantages
Advantages of using glance include:

  It does not require X-Windows.
  It incurs less overhead.
GPM Advantages
Advantages of using gpm include:

  It has customizable advisor syntax, which generates color-coded alarms.
  It has the ability to kill processes.
  Its reports are customizable.
  More comprehensive online documentation is available.
Syntax

xload [-toolkitoption] [-scale integer] [-update seconds] [-hl|-highlight color] [-jumpscroll pixels] [-label string] [-nolabel] [-lights]
Student Notes
This slide shows the standard UNIX data collection tools included with HP-UX. Data collection tools gather performance data and other system-activity information, and store this data to a file on the system. Few standard UNIX tools perform data collection by default. The two most common are the acct (system accounting) suite of tools and sar, the system activity reporter (via the sadc and sa1 programs).
Description
Tool Source:     Standard UNIX (System V)
Documentation:   man pages
Interval:        on demand
Data Source:     Kernel registers and other kernel routines
Type of Data:    System resources used, on a per-user basis
Metrics:         Connect time, disk space used, others
Logging:         Binary file /var/adm/acct/pacct
Overhead:        Medium to large (up to 33%), depending on number of users and amount of activity
Unique Feature:  Shows the amount of system resources being consumed by each user on the system. Logs every command executed by every user on the system.
Full Pathname:   /usr/sbin/acct/[acct_command]
Pros and Cons:   + provides information to charge users for system use
                 + extensive system utilization information kept
                 - extremely large overhead, especially on an active system
                 - poor documentation
Syntax
/usr/sbin/acct/acctdisk
/usr/sbin/acct/acctdusg [-u file] [-p file]
/usr/sbin/acct/accton [file]
/usr/sbin/acct/acctwtmp reason
/usr/sbin/acct/closewtmp
/usr/sbin/acct/utmp2wtmp
...and many more
CPU time accounting
Disk accounting
Memory accounting
Connect time accounting
User command history
Several more
Syntax
sar [-ubdycwaqvmAMS] [-o file] t [n] sar [-ubdycwaqvmAMS] [-s time] [-e time] [-i sec] [-f file]
1200 3
-s 8:00 -e 18:01 -i 3600 -u
-s 8:00 -e 18:01 -i 3600 -b
-s 8:00 -e 18:01 -i 3600 -q
Create the /var/adm/sa directory:

    mkdir /var/adm/sa

Some systems recommend adding the above entries to adm's cron file instead of root's. On these systems, be sure to give all users write access to the /var/adm/sa directory:

    chmod a+w /var/adm/sa
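Wired into cron, the argument lists above drive the sa1 data collector and the sar reporting passes. The crontab entries below are a sketch: the schedules, the /usr/lbin/sa path, and the sadd log-file naming are assumptions (they vary by release), not part of the original slide.

```shell
# Hypothetical root crontab entries for sar data collection.
# sa1 appends binary samples under /var/adm/sa; sar then produces the
# hourly CPU, buffer cache, and run queue reports from today's file.
0 * * * *    /usr/lbin/sa/sa1 1200 3
15 18 * * *  sar -s 8:00 -e 18:01 -i 3600 -u -f /var/adm/sa/sa`date +%d`
20 18 * * *  sar -s 8:00 -e 18:01 -i 3600 -b -f /var/adm/sa/sa`date +%d`
25 18 * * *  sar -s 8:00 -e 18:01 -i 3600 -q -f /var/adm/sa/sa`date +%d`
```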
Student Notes
This slide shows the HP-specific data collection performance tools, which can be added to an HP-UX system. The MeasureWare/OVPA (OpenView Performance Agent) and PerfView/OVPM (OpenView Performance Manager) tools are available for HP-UX systems. These tools are optional products (separately purchasable). They significantly enhance a customer's ability to track performance trends and review historical performance data about a system. The standard UNIX tools collect little to no per-process information, and have no alarming capabilities. With the MeasureWare/OVPA and PerfView/OVPM tools, global and per-process information is collected. In addition, alarms can be set to notify a user when a collected metric exceeds a defined threshold. Recently, PerfView was renamed OpenView Performance Manager and MeasureWare was renamed OpenView Performance Agent. There were no other significant changes made to the products.
Syntax
mwa [action] [subsystem] [parms]

in which action is:

start    Start all or part of MeasureWare/OpenView Performance Agent (default).
stop     Stop all or part of MeasureWare/OpenView Performance Agent.
restart  Reinitialize all or part of MeasureWare/OpenView Performance Agent. This option causes some processes to be stopped and restarted.
status   Report the status of all or part of MeasureWare/OpenView Performance Agent.
See the course B5136S Performance Management with HP OpenView for a more complete discussion of MeasureWare/OVPA.
Syntax
pv [options]
PerfView/OVPM Notes
There are three components that make up the PerfView/OVPM product:
PerfView/OVPM Analyzer
The PerfView/OVPM Analyzer allows for the performance administrator to easily access data from any MeasureWare/OVPA Agent. By default, the last 8 days of data are pulled in to be analyzed, but any amount of data that has been collected can be retrieved. The PerfView/OVPM Analyzer allows you to compare multiple systems against a specific metric as well for load balancing. The graphs produced by the PerfView/OVPM Analyzer can be stored, or printed out to any Postscript or PCL printer. As with all of the RPM products, the PerfView/OVPM Analyzer is fully integrated with Network Node Manager and IT Operations.
PerfView/OVPM Monitor

The PerfView/OVPM Monitor receives alarms sent by MeasureWare/OVPA agents. It allows you to filter alarms by severity and type. The PerfView/OVPM Monitor is an optional module and may not be required if you are also running Network Node Manager or IT Operations.
PerfView/OVPM Planner
The PerfView/OVPM Planner allows you to use collected MeasureWare/OVPA data to see performance trends. The more data provided to the PerfView/OVPM Planner and the shorter the period you project into the future, the more accurate the reports will be. The PerfView/OVPM Planner is not a true capacity-planning tool in that it does not provide modeling or simulation capability.
See the course B5136S Performance Management with HP OpenView for a more complete discussion of PerfView/OVPM.
Student Notes
This slide shows the standard UNIX networking performance tools included with HP-UX. Networking performance tools monitor performance and errors on the network. The standard UNIX networking tools primarily allow for monitoring of performance. The HP-specific tools will introduce the ability to tune some networking parameters to better meet the needs of a system's networking environment. NOTE: Super user (or root) access is not needed to monitor networking status by default.
Syntax netstat [-aAn] [-f address-family] [system [core]] netstat [-f address-family] [-p protocol] [system [core]] netstat [-gin] [-I interface] [interval] [system [core]]
Examples
Display network connections
# netstat -n
Active Internet connections
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp        0      0  156.153.192.171.1128   156.153.192.171.1129   ESTABLISHED
tcp        0      0  156.153.192.171.1129   156.153.192.171.1128   ESTABLISHED
tcp        0      0  156.153.192.171.947    156.153.192.171.1105   ESTABLISHED
Active UNIX domain sockets
Address Type   Recv-Q Send-Q  Inode   Conn   Refs Nextref  Addr
c6f300  dgram       0      0 844afc       0    0        0  /var/tmp/psb_front_socket
c87e00  dgram       0      0 844c4c       0    0        0  /var/tmp/psb_back_socket
de4f00  stream      0      0      0  f75240    0        0  /var/spool/sockets/X11/0
f71200  stream      0      0      0  f75280    0        0
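Connection tables get long on busy servers, so a per-state tally is usually more useful than the raw listing. A sketch over netstat -n style lines (the sample lines repeat the example above):

```shell
# Tally TCP connections by state from netstat -n style output.
printf '%s\n' \
  'tcp 0 0 156.153.192.171.1128 156.153.192.171.1129 ESTABLISHED' \
  'tcp 0 0 156.153.192.171.1129 156.153.192.171.1128 ESTABLISHED' \
  'tcp 0 0 156.153.192.171.947  156.153.192.171.1105 ESTABLISHED' |
awk '$1 == "tcp" { count[$NF]++ }
     END { for (s in count) printf "%s %d\n", s, count[s] }'
# -> ESTABLISHED 3
```

On a live system the same awk can be fed directly from netstat -n, filtering for the Active Internet connections section.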
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Full Pathname: Pros and Cons:
Syntax
nfsstat [ -cmnrsz ]
Examples
To reset all nfsstat counters to zero:

    # nfsstat -z

To display server/client RPC and NFS statistics:

    # nfsstat       (this defaults to nfsstat -cnrs)
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Server nfs:
calls      badcalls
0          0

Version 2: (0 calls)
null       getattr    setattr    root       lookup     readlink
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
read       wrcache    write      create     remove     rename
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
link       symlink    mkdir      rmdir      readdir    statfs
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%

Version 3: (0 calls)
null       getattr    setattr    lookup     write      create
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
mkdir      symlink    remove     rmdir      rename     link
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
readdir    readdir+   commit
0.0%       0.0%       0.0%

Client rpc:
Connection oriented:
calls      badcalls   badxids    timeouts   newcreds   badverfs
20         0          0          0          0          0
timers     cantconn   nomem      interrupts
17         0          0          0
Connectionless oriented:
calls      badcalls   retrans    badxids    timeouts   newcreds
20         0          0          0          0          0
badverfs   timers     toobig     nomem      cantsend   bufulocks
0          17         0          0          0          0
waits
0

Client nfs:
calls      badcalls   clgets     cltoomany
20         0          20         0

Version 2: (20 calls)
null       getattr    setattr    root       lookup     readlink
0.0%       18.90%     0.0%       0.0%       0.0%       0.0%
read       wrcache    write      create     remove     rename
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
link       symlink    mkdir      rmdir      readdir    statfs
0.0%       0.0%       0.0%       0.0%       1.5%       1.5%

Version 3: (0 calls)
null       getattr    setattr    lookup     write      create
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
mkdir      symlink    remove     rmdir      rename     link
0.0%       0.0%       0.0%       0.0%       0.0%       0.0%
readdir    readdir+   commit
0.0%       0.0%       0.0%
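Two client-side ratios matter most: the retransmission rate (retrans or timeouts versus calls) and how closely badxid tracks timeouts. A hedged sketch of the usual rule of thumb, with invented counter values:

```shell
# NFS client rule of thumb from nfsstat-style counters (sample values
# are illustrative, not from the output above): a retrans rate over ~5%
# suggests network loss; badxid near timeouts suggests a slow server
# (replies do arrive, just late).
calls=20 retrans=1 badxid=0 timeouts=1
awk -v c="$calls" -v r="$retrans" -v b="$badxid" -v t="$timeouts" 'BEGIN {
    printf "retrans rate: %.1f%%\n", (c > 0) ? r / c * 100 : 0
    if (t > 0 && b >= t)            print "server slow (badxid ~ timeouts)"
    else if (c > 0 && r / c > 0.05) print "suspect network loss"
    else                            print "client looks healthy"
}'
# -> retrans rate: 5.0%
#    client looks healthy
```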
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Full Pathname: Pros and Cons:
Syntax
Examples
Send two ICMP echo packets to host star1:

    # ping star1 -n 2
    PING star1: 64 byte packets
    64 bytes from 156.153.193.1: icmp_seq=0. time=1. ms
    64 bytes from 156.153.193.1: icmp_seq=1. time=0. ms

    ----star1 PING Statistics----
    2 packets transmitted, 2 packets received, 0% packet loss
    round-trip (ms) min/avg/max = 0/0/1
Send one ICMP packet and display the IP path taken:

    # ping -o 156.152.16.10 -n 1
    PING 156.152.16.10: 64 byte packets
    64 bytes from 156.152.16.10: icmp_seq=0. time=337. ms

    ----156.152.16.10 PING Statistics----
    1 packets transmitted, 1 packets received, 0% packet loss
    round-trip (ms) min/avg/max = 337/337/337
    1 packets sent via:
      15.63.200.2    - [ name lookup failed ]
      15.68.88.4     - [ name lookup failed ]
      156.152.16.1   - [ name lookup failed ]
      156.152.16.10  - [ name lookup failed ]
      15.68.88.43    - [ name lookup failed ]
      15.63.200.1    - [ name lookup failed ]
Student Notes
This slide shows the HP-specific networking performance tools included with HP-UX. The first three tools listed (lanadmin, lanscan, and ndd/nettune) come standard with the base OS. The NetMetrix product is an additional product. The HP-specific networking tools display additional networking information and allow tuning of various networking parameters.
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Full Pathname: Pros and Cons:
Syntax

/usr/sbin/lanadmin [-e] [-t]
/usr/sbin/lanadmin [-a] [-A station_addr] [-m] [-M mtu_size] [-R] [-s] [-S speed] NetMgmtID

-e  Echo the input commands on the output device.
-t  Suppress the display of the command menu before each command prompt.
Example
# lanadmin

Test Selection mode.
   lan      = LAN Interface Administration
   menu     = Display this menu
   quit     = Terminate the Administration
   verbose  = Display command menu
LAN Interface test mode. LAN Interface Net Mgmt ID = 4
   clear    = Clear statistics registers
   display  = Display LAN Interface status and statistics registers
   end      = End LAN Interface Administration, return to Test Selection
   menu     = Display the menu
   ppa      = PPA Number of the LAN Interface
   quit     = Terminate the Administration, return to shell
   nmid     = Network Management ID of the LAN Interface
   reset    = Reset LAN Interface to execute its selftest
   specific = Go to Driver specific menu
Enter command: display

Network Management ID         = 4
Description                   = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                  = ethernet-csmacd(6)
MTU Size                      = 1500
Speed                         = 10000000
Station Address               = 0x8000935c9bd
Administration Status (value) = up(1)
Operation Status (value)      = up(1)
Last Change                   = 14465
Inbound Octets                = 3606105787
Inbound Unicast Packets       = 2767086
Inbound Non-Unicast Packets   = 88379016
Inbound Discards              = 0
Inbound Errors                = 464396
Inbound Unknown Protocols     = 7114206
Outbound Octets               = 458391388
Outbound Unicast Packets      = 2842387
Outbound Non-Unicast Packets  = 2874
Outbound Discards             = 0
Outbound Errors               = 0
Outbound Queue Length         = 0
Specific                      = 655367

Ethernet-like Statistics Group
Index                         = 4
Alignment Errors              = 0
FCS Errors                    = 0
Single Collision Frames       = 21353
Multiple Collision Frames     = 42774
Deferred Transmissions        = 281589
Late Collisions               = 0
Excessive Collisions          = 0
Internal MAC Transmit Errors  = 0
Carrier Sense Errors          = 0
Frames Too Long               = 0
Internal MAC Receive Errors   = 0
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Full Pathname: Pros and Cons:
Syntax
lanscan [-ainv] [system [core]]

-a  Display station addresses only. No headings.
-i  Display interface names only. No headings.
-n  Display network management IDs only. No headings.
-v  Verbose output. Two lines per interface. Includes displaying of extended station address and supported encapsulation methods.
Examples
Output from a 10.x system:
# lanscan
Hardware  Station        Crd  Hardware  Net-Interface   NM  MAC    HP DLPI  Mjr
Path      Address        In#  State     NameUnit State  ID  Type   Support  Num
2/0/2     0x080009D2C2DE 0    UP        lan0     UP     4   ETHER  Yes      52
Output from an 11.x system:

# lanscan
Hardware  Station        Crd  Hdw    Net-Interface  NM
Path      Address        In#  State  NamePPA        ID
2/0/2     0x08000978BDB0 0    UP     lan0 snap0     1
CAUTION: Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead Unique Feature:
nettune [-w] object [parm...]
nettune -h [-w] [object]
nettune -l [-w] [-b size] [object [parm...]]
nettune -s [-w] object [parm...] value...

-h  (help)  Print all information related to the object. This information provides helpful hints about changing the value of an object.
-l  (list)
-s  (set)   Set object to value. An object may require more than one value.
-w          Display warning messages (for example, 'value truncated'). These are normally discarded when the command is successful.
Examples
To get help information on all defined objects:

    nettune -h

    arp_killcomplete:  The number of seconds that an arp entry can be in the
                       completed state between references. When a completed
                       arp entry is unreferenced for this period of time, it
                       is removed from the arp cache.
    . . .

To get help information on all TCP-related objects:

    nettune -h tcp

    tcp_receive:  The default socket buffer size in bytes for inbound data.
    tcp_send:     The default socket buffer size in bytes for outbound data.
    . . .

To set the value of the ip_forwarding object to 1:

    nettune -s ip_forwarding 1

To get the value of the tcp_send object (socket send buffer size):

    nettune tcp_send
The ndd utility accesses kernel parameters through the use of "pseudo device files". These pseudo device files are referred to as a network device on the ndd command line and selected from the following list:

/dev/arp    For ARP cache-related values
/dev/ip     For IP routing and forwarding parameters
/dev/rawip  Default IP time-to-live header value
/dev/tcp    Transmission Control Protocol (connection-based) parameters
/dev/udp    User Datagram Protocol (connectionless) parameters

Tool Source:     HP
Documentation:   man pages, ndd -h (for help options)
Interval:        on demand
Data Source:     network device pseudo device files (referenced above)
Type of Data:    Global
Metrics:         LAN tunable parameters
Logging:         Standard output device
Overhead:        minimal
Unique Feature:  Change values of network parameters, which cannot otherwise be changed
Full Pathname:   /usr/bin/ndd
Pros and Cons:   + provides ability to modify networking behavior without needing source code
                 + provides access to tunable parameters normally not available
                 - can have a negative impact on performance if used the wrong way
                 - minimal documentation
Syntax

ndd -get network_device parameter
ndd -set network_device parameter value
ndd -h [parameter]
ndd -c
The file /etc/rc.config.d/nddconf contains tunable parameters that will be set automatically each time the system boots.
Examples
To list the contents of the ARP cache:

    ndd -get /dev/arp arp_cache_report

To get the current value of the tunable parameter ip_forwarding:

    ndd -get /dev/ip ip_forwarding

To set the value of the default TTL parameter for UDP to 128:

    ndd -set /dev/udp udp_def_ttl 128

To re-read the configuration file /etc/rc.config.d/nddconf without rebooting the system:

    ndd -c
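To make such settings persistent, they go in /etc/rc.config.d/nddconf. The indexed-variable format below reflects the file's documented style, but treat the exact entries as a sketch rather than a copy of any shipped file:

```shell
# Hypothetical /etc/rc.config.d/nddconf entries, applied at boot (or via
# ndd -c): enable IP forwarding and set the UDP default TTL to 128.
TRANSPORT_NAME[0]=ip
NDD_NAME[0]=ip_forwarding
NDD_VALUE[0]=1

TRANSPORT_NAME[1]=udp
NDD_NAME[1]=udp_def_ttl
NDD_VALUE[1]=128
```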
Tool Source: Documentation: Interval: Data Source: Type of Data: Metrics: Logging: Overhead: Unique Feature: Pros and Cons:
NetMetrix Notes
NetMetrix makes use of highly sophisticated devices (LAN probes) capable of collecting large amounts of detailed network information. NetMetrix is a truly distributed network management product that makes use of "midlevel managers" for data storage and alarming. There are a number of modules available with NetMetrix. NetMetrix's Internet Response Manager (IRM) and Internet Response Agent (IRA) fully integrate with HP OpenView products to provide a complete system and network management solution.
Student Notes
This slide shows the standard UNIX administrative performance tools included with HP-UX. These tools are used to tune or modify system resources to improve the performance of a system. They are typically used to change or tune a system's component, as opposed to viewing or displaying characteristics about the component. Only the root user is allowed to use these commands, as making these modifications affects the performance for all users on the system. NOTE: The ipcs program is really a performance-monitoring command; however, because it is usually run in conjunction with ipcrm, it is covered here to emphasize the relationship between the two commands.
Syntax
ipcs [-mqs] [-abcopt] [-C corefile] [-N namelist]

-m  Display information about active shared memory segments.
-q  Display information about active message queues.
-s  Display information about active semaphore sets.
-b  Display largest-allowable-size information.
-c  Display creator's login name and group name.
-o  Display information on outstanding usage.
-p  Display process number information.
-t  Display time information.
Examples
# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:56:36 1997
T     ID     KEY        MODE       OWNER   GROUP
Semaphores:
s      0     0x2f180002 --ra-ra-ra root    sys
s      3     0x412000a9 --ra-ra-ra root    root
s      4     0x00446f6e --ra-r--r- root    root
s      6     0x01090522 --ra-r--r- root    root
s      7     0x013d8483 --ra-r--r- root    root
s    200     0x4c1c2f79 --ra-r--r- daemon  daemon

# ipcrm -s 7

# ipcs -s
IPC status from /dev/kmem as of Fri Oct 17 12:57:42 1997
T     ID     KEY        MODE       OWNER   GROUP
Semaphores:
s      0     0x2f180002 --ra-ra-ra root    sys
s      3     0x412000a9 --ra-ra-ra root    root
s      4     0x00446f6e --ra-r--r- root    root
s      6     0x01090522 --ra-r--r- root    root
s    200     0x4c1c2f79 --ra-r--r- daemon  daemon
Syntax
nice [-n newoffset_from_default_20] command [command_args]
renice [-n newoffset_from_current_value] [-g|-p|-u] id ...

An unsigned newoffset increases the system nice value for the command or process, causing it to run at a weaker priority. A negative value requires superuser privileges, and assigns a lower system nice value (stronger priority) to the process.
Examples
# ps -l
 F S UID  PID PPID C PRI NI    ADDR SZ  WCHAN TTY   TIME COMD
 1 S   0 6044 6042 1 158 20  ff6680 85 87cec0 ttyp2 0:00 sh
 1 R   0 8286 6044 6 179 20 1003d80 22      - ttyp2 0:00 ps

# nice sh
# ps -l     (abbreviated)
 F S UID NI  SZ COMD
 1 S   0 20  85 sh
 1 S   0 30  85 sh
 1 R   0 30  22 ps
# exit

[Further abbreviated listings from this demo show the offsets stacking as nice'd shells are nested: nice values of 30 and 35 for the nice'd children, and a ceiling of 39 (with PRI 220 for the NI-39 ps) for the most deeply nice'd processes.]
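The stacking above follows simple arithmetic: each nice adds its offset (default +10) to the caller's current value, capped at 39 on the HP-UX 0-39 scale. On systems whose nice prints the current niceness when invoked with no command (GNU coreutils behaves this way; HP-UX's nice may not), the accumulation is easy to demonstrate, though that scale starts at 0 rather than 20:

```shell
# Nested nice invocations accumulate their offsets. With GNU coreutils,
# "nice" with no command prints the current niceness.
nice -n 10 nice             # child runs 10 above the default -> prints 10
nice -n 10 nice -n 5 nice   # offsets accumulate: 10 + 5 -> prints 15
```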
Student Notes
This slide shows the HP-specific administrative performance tools available on HP-UX systems. Many of the tools shown on the slide come standard with the base OS. The only tools that are add-on products are PRM, WLM, WebQoS, and Advanced JFS (getext, setext, and fsadm). These HP-specific tools were developed to allow modifications and performance enhancements to the functionality unique to the HP-UX operating system.
Syntax
getprivgrp [-g|group_name] setprivgrp [-g|groupname] [privileges] -g Specify global privileges that apply to all groups.
Examples
# getprivgrp
global privileges: CHOWN

# setprivgrp class CHOWN SERIALIZE RTPRIO
# getprivgrp
global privileges: CHOWN
class: RTPRIO CHOWN SERIALIZE
Notes
Group privileges which can be modified are:

RTPRIO      Can use the rtprio() call to set real-time priorities.
RTSCHED     Can use the sched_setparam() and sched_setscheduler() calls to set POSIX.4 real-time priorities.
MLOCK       Can use plock() to lock process text and data into memory, and the shmctl() SHM_LOCK function to lock shared memory segments.
CHOWN       Can use chown() to change file ownership.
LOCKRDONLY  Can use lockf() to set locks on files that are open for reading only.
SETRUGID    Can use setuid() and setgid() to change, respectively, the real user ID and real group ID of a process.
SERIALIZE   Can use serialize() to force the target process to run serially with other processes that are also marked by this system call.
MPCTL       Can use mpctl() to lock a process or a thread to a specific processor on SMP systems. If processor sets are available, can be used to lock a process or a thread to a specific processor set.
SPUCTL      Can use spuctl() to enable and disable specific processors on SMP systems. (V-class, T-class, N-class, L-class, and Superdome only)
Syntax

rtprio priority command [arguments]
rtprio priority -pid
rtprio -t command [arguments]
rtprio -t -pid

-t  Execute command with a timeshare (non-real-time) priority, or change the currently executing process pid from a possibly real-time priority to a timeshare priority.
Examples
Execute file a.out at a real-time priority of 100:

    rtprio 100 a.out

Set the currently running process PID 24217 to a real-time priority of 40:

    rtprio 40 24217
-s scheduler  Specifies which scheduler to use: SCHED_FIFO (POSIX real-time), SCHED_RR (POSIX real-time), SCHED_RR2 (POSIX real-time), SCHED_RTPRIO (HP-UX real-time), or SCHED_HPUX (HP-UX timeshare).
Examples
Execute file a.out at a POSIX real-time priority of 4:

    rtsched -s SCHED_FIFO -p 4 a.out

Set the currently running process PID 24217 to a real-time priority of 20:

    rtsched -s SCHED_RR -p 20 -P 24217
Syntax
scsictl [-akq] [-c command]... [-m mode[=value]]... device

-a             Display the status of all mode parameters available.
-m mode        Display the status of the specified mode parameter.
-m mode=value  Set the mode parameter mode to value.

Available modes:

ir           For devices that support immediate reporting, this displays the immediate reporting status.
queue_depth  For devices that support a queue depth greater than the system default, this mode controls how many I/Os the driver will attempt to queue to the device at any one time.
Examples
To display a list of all of the mode parameters, turn immediate_report on, and redisplay the value of immediate_report:

    # scsictl -a -m ir=1 -m ir /dev/rdsk/c0t6d0

will produce the following output:

    immediate_report = 0; queue_depth = 8; immediate_report = 1
Syntax

serialize command [command_args]
serialize [-t] [-p pid]

-t  Indicates the process specified by pid should be returned to timeshare scheduling.
Examples
Use serialize to force a database application to run serially with other processes marked for serialization:

    serialize database_app

Force a currently running process with a PID value of 215 to run serially with other processes marked for serialization:

    serialize -p 215

Return a process previously marked for serialization to normal timeshare scheduling (the PID of the target process in this example is 174):

    serialize -t -p 174
Syntax
/usr/sbin/fsadm [-F vxfs|hfs] [-V] [-o largefiles|nolargefiles] mount_point|special /usr/sbin/fsadm [-F vxfs] [-V] [-b newsize] [-r rawdev] mount_point /usr/sbin/fsadm [-F vxfs] [-V] [-d] [-D] [-s] [-v] [-a days] [-t time] [-p passes] [-r rawdev] mount_point
Examples
HFS Example

Convert a nolargefiles HFS file system to a largefiles HFS file system:

    fsadm -F hfs -o largefiles /dev/vg02/lvol1

Display relevant HFS file system statistics:

    fsadm -F hfs /dev/vg02/lvol1
JFS Example

Increase the size of the /var file system to 100 MB while it is mounted and online:

# lvextend -L 100 /dev/vg00/lvol7
# fsadm -F vxfs -b 102400 /var

Display fragmentation statistics for the /home file system:

# fsadm -D -E /home
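The -b argument above is given in 1 KB sectors, so the conversion from megabytes is a simple multiplication (a quick sketch in plain shell arithmetic; the 100 MB figure comes from the example above):

```shell
# fsadm -F vxfs -b takes the new size in 1 KB sectors,
# so 100 MB corresponds to 100 * 1024 = 102400.
mb=100
echo "$(( mb * 1024 ))"    # prints 102400
```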
Syntax
Example
Display file attributes for the file file1:

# getext file1
file1: Bsize 1024 Reserve 36 Extent Size 3 align noextend

The above output indicates a file with 36 blocks of reservation, a fixed extent size of 3 blocks, and all extents aligned to 3-block boundaries; the file cannot be extended once the current reservation is exhausted.
/usr/sbin/newfs [-F FStype] [-o specific_options] [-V] special
/usr/sbin/tunefs [-A] [-v] [-a maxcontig] [-d rotdelay] [-e maxbpg] [-m minfree] special-device
/usr/sbin/vxtunefs
Notes
The initial file system parameters are set when the file system is first created with newfs. A small set of these parameters can be changed after the file system is created with tunefs. vxtunefs changes the attributes of a JFS file system while the file system is mounted. NOTE: The tunefs command works only for HFS file systems. JFS file systems use other commands (getext, setext, vxtunefs).
Examples
Create an HFS file system on logical volume lvol1 in volume group vg01:

# newfs -F hfs -b 16384 -f 2048 /dev/vg01/rlvol1
mkfs (hfs): Warning - 2 sector(s) in the last cylinder are not allocated.
mkfs (hfs): /dev/vg01/rlvol1 - 20480 sectors in 133 cylinders of 7 tracks, 22 sectors
        21.0Mb in 9 cyl groups (16 c/g, 2.52Mb/g, 384 i/g)
Super block backups (for fsck -b) at:
 16, 2512, 5008, 7504, 10000, 12496, 14992, 17488, 19728
For VxFS file systems use:

# fsdb -F vxfs /dev/vgNN/rlvolN
> 8192B
> p S
See the course U5447S HP-UX Resource Management with PRM & WLM for a more complete discussion of PRM.
See the course U5447S HP-UX Resource Management with PRM & WLM for a more complete discussion of WLM.
Student Notes
This slide shows the standard UNIX tools for displaying system configuration and utilization information on an HP-UX system. System configuration and utilization tools are those which display configurations of LVM disks, file systems, and kernel resources.
Tool Source:     df, Standard UNIX (System V); bdf, Standard UNIX (Berkeley 4.x)
Documentation:   man pages on demand
Data Source:     file system superblocks
Type of Data:    disk space resources
Metrics:         disk space utilization
Logging:         standard output
Overhead:        minimal
Unique Feature:  shows how much disk space is being utilized
Syntax:          /usr/bin/bdf, /usr/bin/df
+ Easy to use
- Minimal tuning statistics
/usr/bin/bdf [-b] [-i] [-l] [-t type | [filesystem|file] ... ] /usr/bin/df [-befgiklnv] [-t|-P] [-o specific_options] [-V] [special|directory]...
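Because bdf output is column-oriented, it scripts easily; a minimal sketch using awk on a canned sample line (the line itself is illustrative, not taken from a real system):

```shell
# Pull the mount point ($6) and %used ($5, with '%' stripped) from a
# bdf-style line: Filesystem kbytes used avail %used Mounted on
echo "/dev/vg00/lvol8  100122  13320  86802  13% /var" |
    awk '{ gsub(/%/, "", $5); print $6, $5 }'
# prints: /var 13
```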
Examples

df Command

# df
/home    (/dev/vg00/lvol4 ):   93062 blocks   12403 i-nodes
/opt     (/dev/vg00/lvol5 ):  177124 blocks   23598 i-nodes
/tmp     (/dev/vg00/lvol6 ):   90010 blocks   11982 i-nodes
/usr     (/dev/vg00/lvol7 ):   52732 blocks    7011 i-nodes
/var     (/dev/vg00/lvol8 ):  100122 blocks   13320 i-nodes
/stand   (/dev/vg00/lvol1 ):   23596 blocks    5358 i-nodes
Syntax
Examples
# mount -p
/dev/root        /       vxfs  log       0 0
/dev/vg00/lvol1  /stand  hfs   defaults  0 0
/dev/vg00/lvol6  /usr    vxfs  delaylog  0 0
/dev/vg00/lvol5  /tmp    vxfs  delaylog  0 0
/dev/vg00/lvol4  /opt    vxfs  delaylog  0 0
/dev/dsk/c0t4d0  /disk   hfs   defaults  0 0
/dev/vg00/lvol7  /var    vxfs  delaylog  0 0
# mount -v
/dev/root on / type vxfs log on Thu Sep 11 12:15:08 1997
/dev/vg00/lvol1 on /stand type hfs defaults on Thu Sep 11 12:15:11 1997
/dev/vg00/lvol6 on /usr type vxfs delaylog on Thu Sep 11 12:17:06 1997
/dev/vg00/lvol5 on /tmp type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/vg00/lvol4 on /opt type vxfs delaylog on Thu Sep 11 12:17:07 1997
/dev/dsk/c0t4d0 on /disk type hfs defaults on Thu Sep 11 12:17:08 1997
/dev/vg00/lvol7 on /var type vxfs delaylog on Thu Sep 11 12:17:23 1997
#
Student Notes
This slide shows the HP-specific commands for displaying system configuration and utilization information. All the commands on the slide come standard with the base OS; none are add-on products. These commands display the configuration and utilization of HP-specific subsystems. Many of these commands have corresponding commands on other UNIX systems that perform similar functions.
Syntax
/usr/sbin/diskinfo [-b|-v] character_devicefile

The diskinfo command displays information about the following characteristics of disk drives:
- vendor name: manufacturer of the drive (SCSI only)
- product id: product identification number or ASCII name
- type: CS/80 or SCSI classification for the device
- size: size of the disk, specified in bytes
- sector size: specified as bytes per sector
Example
# diskinfo /dev/rdsk/c0t6d0 SCSI describe of /dev/rdsk/c0t6d0: vendor: QUANTUM product id: PD425S type: direct access size: 416575 Kbytes bytes per sector: 512
Syntax
/usr/sbin/dmesg [-]

If the - argument is specified, dmesg computes (incrementally) the new messages since the last time it was run and places these on the standard output. This is typically used with cron (see cron(1)) to produce the error log /var/adm/messages by running the command

/usr/sbin/dmesg - >> /var/adm/messages

every 10 minutes.
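Such a cron job might be installed with a crontab entry like the following sketch (the 10-minute schedule and log path follow the text above):

```shell
# Root crontab entry: every 10 minutes, append kernel messages that
# are new since the last run to the message log.
0,10,20,30,40,50 * * * * /usr/sbin/dmesg - >> /var/adm/messages
```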
Example
# dmesg
Oct 17 12:39
vuseg=1815000
inet_clts:ok  inet_cots:ok
    1 graph3
    2 bus_adapter
2/0/1      c720
2/0/1.0    tgt
2/0/1.0.0  stape
2/0/1.2    tgt
2/0/1.2.0  sdisk
2/0/1.3    tgt
2/0/1.3.0  stape
2/0/1.4    tgt
2/0/1.4.0  sdisk
2/0/1.7    tgt
2/0/1.7.0  sctl
2/0/2      lan2
2/0/3      hil
Syntax
/usr/sbin/ioscan [-k|-u] [-d driver|-C class] [-I instance] [-H hw_path] \ [-f[-n]|-F[-n]] [devfile]
Examples
# ioscan -f
Class     I  H/W Path   Driver       S/W State  H/W Type   Description
===========================================================================
bc        0             root         CLAIMED    BUS_NEXUS
graphics  0  0          graph3       CLAIMED    INTERFACE  Graphics
ba        0  2          bus_adapter  CLAIMED    BUS_NEXUS  Core I/O Adapter
ext_bus   0  2/0/1      c720         CLAIMED    INTERFACE  Built-in SCSI
target    0  2/0/1.0    tgt          CLAIMED    DEVICE
disk      0  2/0/1.0.0  sflop        CLAIMED    DEVICE     TEAC FC-1 HF 07
target    1  2/0/1.1    tgt          CLAIMED    DEVICE
tape      0  2/0/1.1.0  stape        CLAIMED    DEVICE     HP HP35470A
target    2  2/0/1.2    tgt          CLAIMED    DEVICE
disk      1  2/0/1.2.0  sdisk        CLAIMED    DEVICE     TOSHIBA CD-ROM XM-3301TA
target    5  2/0/1.5    tgt          CLAIMED    DEVICE
disk      4  2/0/1.5.0  sdisk        CLAIMED    DEVICE     QUANTUM FIREBALL1050S
target    6  2/0/1.6    tgt          CLAIMED    DEVICE
# ioscan -fC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S

# ioscan -fnC disk
Class  I  H/W Path   Driver  S/W State  H/W Type  Description
=========================================================================
disk   5  2/0/1.6.0  sdisk   CLAIMED    DEVICE    QUANTUM PD425S
          /dev/dsk/c0t6d0  /dev/rdsk/c0t6d0
Syntax
/sbin/vgdisplay [-v] [vg_name ...] /sbin/lvdisplay [-k] [-v] lv_path ... /sbin/pvdisplay [-v] [-b BlockList] pv_path ...
Examples
# vgdisplay
--- Volume groups ---
VG Name
VG Write Access
VG Status
Max LV
Cur LV
Max PV
Cur PV
Max PE per PV
VGDA
PE Size (Mbytes)
Total PE
Alloc PE
Free PE                  447
Total PVG                0

# pvdisplay /dev/dsk/c0t5d0
--- Physical volumes ---
PV Name
VG Name
PV Status
Allocatable
VGDA
Cur LV
PE Size (Mbytes)
Total PE
Free PE
Allocated PE
Stale PE
IO Timeout

# lvdisplay /dev/vg00/lvol1
--- Logical volumes ---
LV Name
VG Name
LV Permission
LV Status
Mirror copies
Consistency Recovery
Schedule
LV Size (Mbytes)
Current LE
Allocated PE
Stripes
Stripe Size (Kbytes)
Bad block
Allocation
Syntax
/usr/sbin/swapinfo [-mtadfnrMqw]
Examples
# swapinfo -t
             Kb       Kb       Kb    PCT  START/      Kb
TYPE      AVAIL     USED     FREE   USED   LIMIT RESERVE PRI NAME
dev      159744    19868   139876    12%       0       -   1 /dev/vg00/lvol2
reserve       -    51220   -51220
memory    42112    15300    26812    36%
total    201856    86388   115468    43%       -       0     -
Syntax
Example
# /usr/sbin/sysdef
NAME                  VALUE
acctresume            4
acctsuspend           2
allocate_fs_swapmap   0
bufpages              2841
create_fastlinks      0
dbc_max_pct           50
dbc_min_pct           5
default_disk_ir       1
dskless_node          0
eisa_io_estimate      768
eqmemsize             15
file_pad              10
fs_async              0
hpux_aes_override     0
maxdsiz               16384
maxfiles              60
maxfiles_lim          1024
maxssiz               2048
maxswapchunks         256
maxtsiz               16384
maxuprc               75
maxvgs                10
msgmap                2555904
nbuf                  4788
ncallout              292
NAME     The name of the parameter
VALUE    The current value of the parameter
BOOT     The value of the parameter at boot time
MIN-MAX  The minimum and maximum allowed values of the parameter
UNITS    The units by which the parameter is measured
FLAGS    Flags that further describe the parameter:
         M  Parameter may be modified without rebooting
Tool Source:     HP
Documentation:   man pages on demand
Data Source:     /stand/vmunix and the currently running kernel
Type of Data:    tunable kernel parameters
Metrics:         current configuration of kernel parameters
Logging:         standard output device
Overhead:        minimal
Unique Feature:  works with dynamic and static kernel modules
Syntax:          /usr/sbin/kmtune

/usr/sbin/kmtune [-l] [[-q name]...] [-S system_file]
/usr/sbin/kmtune [[-s name{+|=}value]...] [[-r name]...] [-S system_file]
Examples
# /usr/sbin/kmtune
Student Notes
This slide shows the standard UNIX application profiling performance tools included with HP-UX. Application profiling tools provide in-depth details regarding the execution of a program, including the number of times each subroutine is called and the amount of time spent in each subroutine.
Syntax
prof [-tcan] [-ox] [-g] [-z] [-h] [-s] [-m mdata] [prog] gprof [options] [a.out [gmon.out...]]
Examples
# cc -p prog.c -o program
# ./program
# prof program
Syntax
Student Notes
This slide shows some HP-specific application profiling tools included with HP-UX. Currently, the Transaction Tracker (ttd) and caliper are available for monitoring application behavior and performance. In HP-UX 10.20, there was a tool called puma, which came with all standard programming language compilers (such as C, Pascal, and FORTRAN). The puma tool allowed profiling data to be collected without modifying the application source code or, in many cases, recompiling the application. puma has been excluded from more recent releases of HP-UX. The Transaction Tracker allows a programmer to time how long a program spends within a certain area of code. It requires that the source code be modified to include the starting point and the stopping point. The Transaction Tracker is included as part of the MeasureWare/OVPA product and is HP-UX specific. arm (discussed earlier) is the generic version of the Transaction Tracker. caliper is thread-aware, MP-aware, and features an easy command-line interface.
The four function calls used by Transaction Tracker are:

tt_getid   Names the transaction and returns a unique identifier.
tt_start   Signals the start of a unique transaction.
tt_end     Signals the end of the transaction.
tt_abort   Ends the transaction without recording times for the transaction.
The latest version of HP Caliper is available on the HP Caliper home page. You can find it at the http://www.hp.com/go/hpcaliper/ site.
Overview
HP Caliper helps you dynamically measure and improve the performance of your native Itanium-based applications in three ways:
- Commands to measure the overall performance of your program.
- Commands to drill down to identify performance parameters of specific functions in your program.
- A simple way to optimize the performance of your program based on its specific execution profile.
HP Caliper does not require special compilation of the program being analyzed and does not require any special link options or libraries. HP Caliper selectively measures the processes, threads, and load modules of your application. An application's load modules are the main executable and all shared libraries it uses. HP Caliper uses a combination of dynamic instrumentation of code and the performance monitoring unit (PMU) in the Itanium processor. HP Caliper uses the least-intrusive method available to gather performance data.
HP Caliper supports:
- Both ILP32 (+DD32) and LP64 (+DD64) programs, in both 32-bit and 64-bit ELF formats.
- Archive-, minshared-, or shared-bound executables.
- Both single- and multi-threaded applications, including MxN threads.
- Applications that fork() or vfork() or exec() themselves or other executables.
- Shell scripts and the programs they spawn.
Features
HP Caliper is simple to run because it uses a single command for all measurements. You specify the type of measurement and the target program as command-line arguments. For example, to measure the total number of CPU cycles used by a program named myprog, just type:

caliper total_cpu myprog

HP Caliper features include:
- Multiple performance measurements, each of which can be customized through configuration files.
- All reports are available in text format and comma-delimited (CSV) format, and most reports are also available in HTML format for easier browsing.
- Performance data can be correlated to your source program by line number.
- Easy inclusion and exclusion of specific load modules, such as libc, when measuring performance.
- Both per-thread and aggregated thread reports for most measurements.
- Performance data reported by function, sorted to show hot spots.
- Support for multi-process selection capabilities.
- The ability to save performance data in files that you can use to aggregate data across multiple runs and to generate reports without having to re-run HP Caliper.
- The ability to attach and detach to running processes for certain measurements.
- The ability to restrict PMU measurements to specific regions of your programs.
- Limited support for dynamically generated code.
Summary
- Different categories of performance tools
- Standard UNIX tools versus HP-specific tools
- Separately purchasable tools
- Kernel register-based tools versus midaemon-based tools
Student Notes
To summarize this module: there are many performance tools for many different purposes. The objective of this module was to highlight the performance tools available with HP-UX, to categorize them by function, and to describe how each tool works. In general, you should become most familiar with these tools: sar, vmstat, top, and glance/gpm (if available). These will tend to be your most commonly used tools; the others tend to be useful in more specialized situations. Remember, never rely on just one tool to do everything. No tool will tell you everything, and every tool will mislead you somewhere down the line. No tool is perfect. That's why you need to be familiar with multiple tools.
Lab
Before we continue with a more focused discussion of glance and gpm, let's spend some time exploring the generic UNIX and HP-UX-specific tools discussed so far. As you answer the following questions, try to categorize each tool as to its type and scope.
Student Notes
The goal of this lab is to gain familiarity with performance tools. A secondary goal is to become familiar with the metrics reported by the tools, although these will be explored in depth over the next few days.
Directions
Set up:

Change directories:
# cd /home/h4262/tools

Execute the setup script:
# ./RUN

Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN that will give the requested information. Specific numbers are not the important goal of this lab; the goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion.
1. How many processes are running on the system? Which tools can you use to determine this?
2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?
3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?
4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?
5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?
6. How many system processes are running? What tools can you use to determine this? NOTE: A system process is defined as a process whose data space is the kernel's data space (such as swapper, vhand, statdaemon, unhashdaemon, and supsched). ps reports their size as zero.
There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.
7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?
8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?
9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?
10. What is the size of the kernel's in-core inode table? How much of the inode table is utilized? What tools can you use to determine this?
11. Are there any CPU-bound processes running (processes using a lot of CPU)? If so, what is the name of the process? What steps did you take to determine this?
12. Are there any processes running which are using a lot of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?
13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk bound processes? What files are open by this (these) process(es)? NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.
14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?
15. Is there any paging or swapping occurring? What tools can you use to determine this?
16. What is the system call rate? What tools can you use to determine this?
17. What is the buffer cache hit ratio? What tools can you use to determine this?
18. What is the tty I/O rate? What tools can you use to determine this?
19. Are there any traps (interrupts) occurring? What tools can you use to determine this?
20. What information can you collect about network traffic? What tools can you use to determine this?
21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?
22. What information can be gathered on Logical Volumes? What tools can you use to determine this?
23. What information can be gathered on Disk I/O? What tools can you use to determine this?
Module 3 GlancePlus
Objectives
Upon completion of this module, you will be able to do the following: Compare GlancePlus with other performance monitoring/management tools. Start up the GlancePlus terminal interface (glance) and graphical user interface (gpm).
This is GlancePlus
Features
Motif-based interface that offers exceptional ease-of-learning and ease-of-use State-of-the-art, award-winning on-line Help system. Rules-based diagnostics that use customizable system performance rules to identify system performance problems and bottlenecks. Alarms that are triggered when customizable system performance thresholds are exceeded. Tailor information gathering and display to suit your needs. Integrated into OpenView environments.
Capabilities
Get detailed views of CPU, disk, and memory resource activity View disk I/O rates and queue lengths by disk device to determine if your disk loads are well balanced Monitor virtual memory I/O and paging Measure NFS activity And much more ...
Student Notes
GlancePlus is a performance monitoring and diagnostic tool. GlancePlus software visually gives you the useful, accurate information you need to pinpoint potential or existing problems involving your system's CPU, memory, disk, or network utilization. To help you monitor and interpret your system's performance data, GlancePlus software includes a rules-based adviser. Whenever threshold levels for measurements such as CPU utilization or disk I/O rates are exceeded, the adviser notifies you with on-screen alarms. The adviser also applies rules to key performance measurements and symptoms and then gives you information to help you uncover bottlenecks or other performance problems. NOTE: GlancePlus is integrated into OpenView Windows at the menu bar level.
GlancePlus offers a viewpoint into many of the critical resources that need to be measured in the open system environment.
Benefits
Save time and effort managing your system resources Better understand your computing environment Satisfy your end users system performance needs quickly Leverage from a standard interface across vendor platforms
The features in the product yield a performance monitoring diagnostic solution that offers many benefits to the user. GlancePlus offers a tool that will make your analysis activities easier and quicker to perform. This will save you time. The display of various types of information will also allow you to get a better understanding of your own environment. The same GUI on the Motif version is used on all the supported platforms, which provides a leverage point for a standard user interface across several UNIX platforms. Many times, just by cursory use of the product, people will discover certain things about their systems. You do not have to have a performance problem to use GlancePlus. This simple cursory use of the product has let many people gain a better understanding of their systems. This helps out when a problem does exist. Knowing what is normal can help identify what has become abnormal in your environment.
(Slide diagram: a managed node runs GlancePlus, MeasureWare, and the applications and databases being monitored. MeasureWare provides performance data collection and alarming; GlancePlus provides online performance monitoring and diagnostics; PerfView displays the collected data.)
The view here is from the heights. For our purposes, we will focus our discussion on the capabilities of glance and gpm and the information and reports they can produce from a running HP-UX system. Also understand that GlancePlus may be used in conjunction with MeasureWare/OVPA to enhance and extend its capabilities. Many of you may have purchased glance in the GlancePlus Pak, which includes a license to run glance, gpm and to configure and run the MeasureWare/OVPA Agent (mwa) on your system. The GlancePlus and MeasureWare/OVPA Agent products can be purchased separately or combined in the GlancePlus Pak. The Pak also includes (as of C.03.58.00 June 2002 application release) some event monitoring and graphical configuration components.
The components share a common measurement infrastructure; thus they report the same metrics and application definitions, and they have similar alarming mechanisms.
GlancePlus Pak
GlancePlus
Interfaces include: /opt/perf/bin/gpm /opt/perf/bin/glance
MeasureWare/OVPA
Interfaces include: /opt/perf/bin/extract /opt/perf/bin/utility
PerfView/OVPM
Interfaces include: /opt/perf/bin/pv
Complete information on the configuration and use of MWA/OVPA and PerfView/OVPM is covered in the Hewlett-Packard Education Services course PerfView MeasureWare (catalog number B5136).
Student Notes
GlancePlus provides dual user interfaces:

The gpm GUI
- See a history of system activity, with multiple-window capability
- Monitor your system while doing other work
- Use alarms, symptoms, and color to assist with monitoring

The glance character mode
- Monitor performance remotely over a slow datacom line
- Works when no high-resolution monitor is available
- Creates less load on the system being monitored
Notes on starting the user interfaces: gpm and glance Starting the GUI # gpm [options] Starting the character based interface: # glance [options]
gpm options:

-nosave      Do not save the current configuration at the next exit
-rpt         Specify one or more additional report windows
-sharedclr   Share the color scheme with other applications
-nice        Set the gpm nice value
Xoptions     Use X-Toolkit options such as -display

glance options:

-j interval  Preset the number of seconds between screen refreshes
-p dest      Specify the continuous print option destination
-lock        Allow glance to lock itself into memory
-nice        Set the glance nice value
Student Notes
glance runs on almost any terminal or workstation, over a serial interface and relatively slow data communication links, and with lower resource requirements. The default Process List screen is shown in the above screen capture; it provides general data on system resources and active processes. In addition, the user may drill down to more specific levels of detail in the CPU, memory, disk I/O, network, NFS, system call, swap, and system table screens. Specific details on a per-process level are also available through the individual process screens. For your convenience, the next two pages contain a hot key quick reference guide for the glance character-mode interface.
Glance Hot Key Quick Reference Top Level Screen Hot Keys
Hot Key  Screen Displayed/Description
a        CPU By Processor
c        CPU Report
d        Disk Report
g        Process List
i        I/O By File System
l        Network By Interface
m        Memory Report
n        NFS By System
t        System Tables Report
u        I/O By Disk
v        I/O By Logical Volume
w        Swap Space
A        Application List
B        Global Waits
D        DCE Global Activity
G        Process Threads
H        Alarm History
I        Thread Resources
J        Thread Wait
K        DCE Process List
N        NFS Global Activity
P        PRM Group List
T        Transaction Tracker
Y        Global System Calls
Z        Global Threads
?        Commands Menu
Student Notes
Above is an example of an easy and common performance problem: a runaway looping process. Why is the global CPU utilization less than 100%, although the sum of the individual process CPU utilizations is greater than 100%? Hint: Is this a UP or an MP system? Also note that slashes (/) are used in glance reports to separate current metric values from cumulative averages. NOTE: For the record, there were two CPUs on this system.
On a three-way multiprocessor system with two processes in the same application looping, each process can use nearly 100% of one CPU. Over a 10-second interval, each uses nearly 10 seconds of CPU time, so the application uses nearly 20 seconds of CPU time in 10 seconds of elapsed time. Process CPU utilization is 100% for each of the two looping processes, but global CPU utilization would be 66%. On HP-UX 11.0, processes can have multiple threads, each of which can consume CPU time independently of the others. On a four-way MP system with one process that has three threads looping, the process as a whole uses 300% of a single CPU's capacity; the application and global CPU utilization would be reported as 75%.
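The arithmetic in these two examples can be checked directly; a minimal sketch using awk (a generic tool, not part of glance), where global utilization is busy CPU seconds divided by the number of CPUs times the interval:

```shell
# Global CPU utilization = 100 * busy_cpu_seconds / (ncpu * interval).
awk 'BEGIN {
    # 3-way MP, two looping processes, 10-second interval:
    printf "2 loopers on 3 CPUs: %d%%\n", 100 * (2 * 10) / (3 * 10)
    # 4-way MP, one process with three looping threads:
    printf "3 threads on 4 CPUs: %d%%\n", 100 * (3 * 10) / (4 * 10)
}'
# prints: 2 loopers on 3 CPUs: 66%
#         3 threads on 4 CPUs: 75%
```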
Student Notes
gpm presents the same metrics as character-mode glance, in graphical form. Significant global metrics, as well as bottleneck adviser symptom status and alarms, are shown in the main window. The process list, as well as other reports, is available via menu selections. The process list is very customizable (and customizations are preserved) with filters, sorting, highlights, chosen metrics, and column rearrangement. The online User's Guide is very useful. The ? button on every window is a shortcut into the on-item help, which is especially useful for metric definitions.
Process Information
Process Information: detailed data on each active process
- CPU data
- Disk I/O data
- Memory use
- Wait reasons
- Open files

Process features: access via the Main Reports selection, Process List. Each process has:
- Process Resources
- Open Files
Student Notes
The Process Information screen in gpm presents the user with detailed information on each active process (including CPU utilization, disk I/O data, memory usage, wait state reasons, open() file information, and so on). This screen also allows the user to select a specific process and "drill down" to greater detail via the Reports selection menu.
Customizable GUI
GlancePlus uses the power of Motif and its industry-leading approach to display technology to provide the user with a powerful graphical user interface that can be customized to fit your needs. Fonts, color, window size, and more are configuration options. Additional configuration choices are available in "list" windows to allow easy manipulation of tabular column data for display and sorting. The gpm Process List and GlancePlus - Main screen provide a pull-down menu to access the numerous, detailed Report screens. These reports allow a logical approach to the extensive amount of system-resource and process-specific data.
Report windows include:
- Resource History Window
- CPU Info
- Memory Info
- Disk Info
- Network Info
- System Info
- Global Info
- Swap Space
- Wait States
- Transaction Tracking
- Application List
- PRM Group List
- Process List
- Thread List
Adviser Components
Adviser Windows
- Symptom History
- Symptom Status/Snapshot
- Alarm History
- Adviser Syntax

Button Label Colors
- Alarm Button for Alarm Statements
- Graph Buttons for Symptom Statements

Icon Border Color (in OpenView)
- Changes to Red or Yellow on Alarms
Student Notes
GlancePlus supports performance alarms and a rules-based adviser to help automate the interpretation of performance data. The alarm rules can be customized by the user to reflect local system characteristics. Note: Both interfaces will report alarms, and the same syntax is used for alarms in glance and gpm. Alarms are configured through the /var/opt/perf/advisor.syntax file.
symptom CPU_Bottleneck type=CPU
   rule GBL_CPU_TOTAL_UTIL > 75 prob 25
   rule GBL_CPU_TOTAL_UTIL > 85 prob 25
   rule GBL_CPU_TOTAL_UTIL > 90 prob 25
   rule GBL_PRI_QUEUE      >  3 prob 25

alarm CPU_Bottleneck > 50 for 2 minutes
   start
      if CPU_Bottleneck > 90 then
         red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
      else
         yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
   repeat every 10 minutes
      if CPU_Bottleneck > 90 then
         red alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
      else
         yellow alert "CPU Bottleneck probability= ", CPU_Bottleneck, "%"
   end
      reset alert "End of CPU Bottleneck Alert"
Student Notes
The bottleneck alarms are a little complex. The CPU bottleneck symptom definition and corresponding alarm are shown above. Just because a resource is fully utilized doesn't mean that it is a bottleneck. It is only a bottleneck if there is activity that is hindered waiting for that resource. Therefore, utilization alone is not a good bottleneck indicator. Both utilization and queue lengths are combined to define the symptom probability. Some of the key metrics for performance analysis are the ones used in the default syntax to define bottleneck alarms.
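The additive rule model can be illustrated numerically; a sketch with awk, where the metric values are invented for illustration and each rule contributes prob 25 as in the symptom definition above:

```shell
# Each rule that is true adds its probability to the symptom value.
# With GBL_CPU_TOTAL_UTIL = 92 and GBL_PRI_QUEUE = 4 (made-up values),
# all four CPU_Bottleneck rules fire, so the probability is 100.
awk -v util=92 -v queue=4 'BEGIN {
    prob = 0
    if (util  > 75) prob += 25
    if (util  > 85) prob += 25
    if (util  > 90) prob += 25
    if (queue >  3) prob += 25
    printf "CPU_Bottleneck probability = %d%%\n", prob
}'
# prints: CPU_Bottleneck probability = 100%
```

A probability above 50 sustained for 2 minutes would then trigger the alarm shown earlier.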
application =
user =
file =
priority =
group =
parm file application definitions are used by both GlancePlus and MeasureWare. A .parm in a user's $HOME directory will override the system parm file.
Student Notes
By now you are starting to see the range and scope of the performance metric data that glance and gpm display. While this is invaluable when it comes to understanding the behavior of a single process, many times what we really need is to evaluate and baseline the performance of an entire application suite. This could be achieved by adding up the individual metrics of all processes within the application suite, but that would be a daunting task for all but the simplest of applications. Through the configuration file /var/opt/perf/parm, glance and gpm can collect the metrics from the individual processes within an application suite and present the information in a concise manner for your review. One challenge is defining what constitutes an application. To address this, the parm file supports several different methods for describing which processes belong to which application definition. Application member processes can be defined by their UID, the program file from which they were exec()'d, the priority at which they execute, their GID, or any combination of the above. This provides a very versatile framework for application profiling.
NOTE:
glance and gpm share the same application definitions (via the parm configuration file) as mwa.
# /var/opt/perf/parm for host system garat
id = garat

# Parameters for what data classes scopeux will log:
log global application process dev=disk,lvm transaction

# Parameters to control maximum size of scopeux logfiles:
size global=10, application=5, process=2, device=1, transaction=1.5

# Thresholds which determine what process data scopeux will log:
threshold cpu = 1, disk = 1, nonew, nokilled

# Web server:
application = WWW
user = www
or
file = httpd

# Untrustworthy users:
application = HighRisk
user = fred,barney,root
The order in which applications are defined is very important. Once a process meets the definition of an application, its data will be contributed to that application's metrics. Care must be taken to assure that ambiguity is avoided in the definition of applications.
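As a hypothetical illustration of this first-match-wins behavior (the application and user names here are invented for the sketch):

```
# Defined first, so all of user www's processes are credited here:
application = WWW
user = www

# user www also appears below, but its processes never reach this
# definition, because the WWW definition above already claimed them:
application = HighRisk
user = www,root
```

Defining the more specific application first is the usual way to keep such overlaps unambiguous.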
(Diagram: glance and gpm both read the adviser definitions and the parm file application definitions. Both obtain their metrics from shared memory maintained by the midaemon, which in turn draws on the KI (kernel instrumentation) interface of the HP-UX kernel.)
Student Notes
Without going into a lot of detail, note that both interfaces share a common instrumentation source and common application definitions. Instrumentation comes partly from interfaces also accessed by standard UNIX utilities such as vmstat, and partly from special HP-UX KI trace-based instrumentation. There is no generally available API to these interfaces. They are written specifically for use by GlancePlus and MeasureWare/OVPA.
Significant Directories
/opt/perf                Product files from installation media
/opt/perf/bin            Executables
/opt/perf/ReleaseNotes   Release Notes
/opt/perf/examples       Supplementary configuration examples
/opt/perf/paperdocs      Electronic versions of documentation
/var/opt/perf            Product and configuration files created during and after installation
Always check the ReleaseNotes for version-specific information. (New for C.02.30 and later releases: example configuration files.) Config files are copied from /opt/perf/newconfig if they don't already exist under /var/opt/perf. If you are updating from a previous release, compare the new default parm file with the one on your system. The directory /var/opt/perf contains the status and data files.
?   screen to navigate
h   for help
o   screen for setting thresholds and process list sorting
Edit the adviser alarms to be right for you. Adjust update interval to control CPU overhead. Process details including thread lists, wait states, memory regions, open files, and system call reports can be used to impress your programming staff ! 8^)
Student Notes
Student Notes
It is important to understand the interrelationships among metric classes.
Student Notes
One of the hardest skills is determining what to measure and how to interpret its significance. After all, if the user's response time is satisfactory, then oftentimes there is no problem, even if an operational metric is higher than normal.
Student Notes
CPU utilization and disk I/O rates compare well across different summarization intervals, whereas CPU time and I/O counts always grow as the collection interval grows.

Examples of breakdowns:
- The global disk I/O rate is a sum of the BYDSK_ metrics; each class in turn breaks down activity between reads and writes, and among file system, raw, and system access.
- For disk bottlenecks, it is often useful to correlate between the DSK, FS, and LV classes.
- Memory utilization is frequently nearly 100% with a dynamic buffer cache. If page outs occur, or in raw disk access environments, shrink the buffer cache to avoid paging.
- Programmers frequently don't know they can view specific system-call metrics, as well as memory region and open file information, on a per-process basis.
Summary
Don't try to understand all the capabilities and extensions of the tools, just the ones of most use to you. Start by developing an understanding of what is normal on your systems. Refine and develop alarms customized for your environment. Work from the examples in the documentation, gpm online help, config files, and example directories.
Student Notes
Remember that performance tuning is an art, and the following two rules apply to most engagements:
Rule #1: When answering a question about computer system performance, the initial answer is always, "It depends."
Rule #2: Performance tuning always involves a trade-off.
Suggested reading: HP-UX Tuning and Performance by Robert F. Sauers and Peter S. Weygant, available in the Hewlett-Packard Professional Books series from Prentice Hall (ISBN 0-13-102716-6).
(Diagram: cpu, memory, and process icons)
Topics
  Main Window
  CPU Bottlenecks
  Memory Bottlenecks
  Configuration Information
  Alarms and Symptoms
Student Notes
To take the guided tour of GlancePlus, run the gpm GUI and select Help on the menu bar. Next, select the Guided Tour option. This will introduce you to the product. It features captured windows of the actual product, with annotations to help point out the important features of certain screens or windows. Quick Tip: gpm provides an excellent online Help system. Click the right mouse button for the On-Item Help feature. For help in glance, press the h key.
Start a program from another window:
# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window.
Below each graph within the GlancePlus Main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph.
To view the adviser symptoms from the main window, select:
Adviser -> Edit Adviser Syntax
This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.
View CPU details: Click the CPU button. To view a detailed report regarding the CPU, select:
Reports -> CPU Report
Select:
Reports -> CPU by Processor
This is a useful report, even on a single-processor system.

5. Online Help.
One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a ?. Click on the column heading NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms.
A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected.
From the main window, select:
Adviser -> Symptom History
For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.
Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details.
Close all windows except for the main window. Select:
Reports -> Process List
This shows the interesting processes on the system (interesting in terms of size and/or activity). To customize this listing, select:
Configure -> Choose Metrics
This will display an astonishing number of metrics, which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related
metrics available in GlancePlus. Note that the familiar ? button is also available from this window. Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations.
Most display windows can be customized to sort on any metric and to arrange the metrics in any user-defined order.
To define the sort fields, select
Configure -> Sort Fields
The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.
To sort on Cumulative CPU Percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved so that the next time processes are viewed, this will remain the sort order.
In a similar fashion, the order of the columns can also be arranged. To define the column order, select
Configure -> Arrange Columns
Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted. Click on the column where the column is to be inserted. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved so that the next time processes are viewed, this will remain the display order.

9. More Customizations.
It is possible to modify the definition of interesting processes by selecting:
Configure -> Filters
An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed.
Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:
Change Enable Filter to ON
Change Filter Relation to >=
Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color
Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities.
There are two administrative capabilities in GlancePlus: if working as root, processes in the Process List screen can be killed or reniced.
In the Process List window, select the proc8 process. To access the Admin tools, select:
Admin -> Renice
Use the slider to set the new nice value for this process to +19, then click OK. Note the impact on this process.
Now, select the proc8 process again. Select:
Admin -> Kill
Click OK, and note the process is no longer present.

11. Process Details.
Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the Main window.
There are many other features available in GlancePlus; there are close to 1000 metrics available with it. Notice that when you iconify the GlancePlus Main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus.
From the Main window, select:
File -> Exit GlancePlus
13. Glance, the ASCII Version.
From a terminal window that has not been resized, type glance.
NOTE: Never run glance or gpm in the background.
If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.
# hpterm &
In the new window:
# glance
Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen.
Type g to go to the Main Process Screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:
Current Screen Metrics
Use the cursor keys to select CPU Util.
NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the Page Down key to toggle to the next page of help.
Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e.
From the main Help menu, select:
Screen Summaries
Use the cursor keys to select Global Bars.
From this help description, explain what R, S, U, N, and A mean in the CPU Util bar.
Exit the online help Global Bars description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e.
At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.
15. Modify Interesting Process Definition.
From the main Process List screen (type g), view the interesting processes. What makes these processes interesting?
Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate to sort the processes by CPU usage. While confirming the other options are correct, note that any CPU usage (greater than zero) or any disk I/O will cause a process to be considered interesting.
Run the KILLIT command to stop all lab loads.

16. Glance Reports.
This is the free-form part of the lab. Spend the rest of your lab time going through the various glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . . ?" types of questions.

Glance
COMMAND   FUNCTION
*a        All CPUs Performance Stats
b         Back one screen
*c        CPU Utilization Stats
*d        Disk I/O Stats
e         Exit
f         Forward one screen
*g        Global Process Stats
h         Help
*i        I/O by Filesystem
j         Change update interval
*l        Lan Stats
*m        Memory Stats
*n        NFS Stats
o         Change Threshold Options
p         Print current screen
q         Quit
r         Redraw screen
*s        Single process information
*t        OS Table Utilization
*u        Disk Queue Length
*v        Logical Volume Mgr Stats
*w        Swap Stats
y         Renice process
z         Zero all Stats
!         Shell escape
?         Help with options
<CR>      Update screen data

An asterisk (*) marks screens with a GlancePlus (gpm) window equivalent. The corresponding gpm reports include: CPU by Processor; Process List; I/O by Filesystem; Network by LAN; Memory Report; NFS Report; Process List (double-click a process); System Table Report; Disk Report (double-click a disk); I/O by Logical Volume; Swap Detail; and the Administrative Capabilities.
(Diagram: the system call interface is the gateway between user processes and the kernel subsystems: the File Subsystem, Interprocess Communication, the Scheduler, and Memory Management.)
Student Notes
The main purpose of an operating system is to provide an environment where processes can execute. This includes scheduling processes for time on the CPU, managing the memory which is assigned to processes, allowing processes to read data from disk, and many other things. When processes execute within the HP-UX operating system, there are two modes that they can be in: User mode and Kernel (system) mode.
Kernel mode is also used for background activities, performed by the kernel on behalf of processes. Examples include page faulting the program's text or data in from disk, initializing and growing a process's data space, paging a portion of the process to swap space, performing file system reads and writes, and many other things. In general, when a process spends too much time in kernel mode, it is considered bad for performance. This is because too much time (overhead) is being spent to manage the environment in which the process executes, and not enough time on executing the actual process itself (which is user mode).
Performance Tools
Almost all performance tools that track CPU utilization distinguish between time spent by the CPU in user mode and time spent in kernel mode. On a good, healthy system with plenty of memory resources, a typical ratio of user mode to kernel mode time is 4:1; that is, a process spends 75-80% of its execution in user mode and 20-25% in kernel mode. Another general rule of thumb is that kernel mode CPU time should not exceed 50%. When this happens, it generally means too much time is being spent managing the system (e.g., memory and swap space management, context switching) and not enough executing process code.
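The user/system split can also be observed programmatically. The sketch below is illustrative Python (not part of the HP-UX toolset), using os.times(), a wrapper over the POSIX times() system call; a pure computation loop should accumulate nearly all of its CPU time in user mode.

```python
import os

def cpu_mode_split(work):
    """Return (user, system) CPU seconds consumed by work()."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system

def user_mode_work():
    # Pure computation: no system calls, so time accrues in user mode.
    total = 0
    for i in range(5_000_000):
        total += i
    return total

user, system = cpu_mode_split(user_mode_work)
```

Wrapping a workload that makes many system calls (file reads, for instance) instead would shift the balance toward the system component.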
(Diagram: PA-RISC virtual address space quadrant layouts, showing Text, Data, and Shared Objects quadrants; 1 GB per quadrant for 32-bit processes, 4 TB per quadrant for 64-bit processes.)
Student Notes
Each process views itself as starting at address 0 and ending at the maximum address addressable by 32 or 64 bits. This address space is known as the Virtual Address Space for a process. The virtual address space is a logical addressing scheme used internally by the process to reference related instructions and data variables. The physical memory address locations cannot be used, because a program does not know where in physical memory it will be loaded. In fact, a program could be loaded at different memory locations each time it executes.
The second quadrant holds the program's private data variables. Again, 1 GB of address space is reserved for data variables, and in general only a fraction of this space is used. Since this quadrant is limited to 1 GB of address space, a maximum global data size of approximately 900 MB is imposed. (In HP-UX, changes were made to allow a process to use addresses in other quadrants for private data, thereby increasing the maximum size to 3.9 GB.) The third and fourth quadrants are usually used to address shared memory segments, shared text segments, shared memory-mapped files, and other shared structures, such as the system call interface.
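The quadrant arithmetic itself is simple: the 32-bit, 4-GB address space splits into four 1-GB quadrants, so the top two bits of an address select the quadrant. An illustrative Python sketch (the function name is invented):

```python
QUADRANT_SIZE = 1 << 30  # 1 GB per quadrant in the 32-bit layout

def quadrant(vaddr):
    """Return the 1-based quadrant (1-4) that holds a 32-bit
    PA-RISC virtual address."""
    if not 0 <= vaddr < (1 << 32):
        raise ValueError("not a 32-bit address")
    return vaddr // QUADRANT_SIZE + 1

# Quadrant 1 holds text, quadrant 2 private data, and quadrants
# 3 and 4 shared objects (shared memory, libraries, mapped files).
```

For example, an address just above 0x40000000 lands in the second (private data) quadrant.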
(Diagram: Itanium virtual address space octants, 2 EB each. A PA-RISC 32-bit process uses only 1 GB of each of the first four octants (Text, Data, Shared Objects); a 64-bit process uses full octants for Shared Objects, Text, Data, and more Shared Objects. The last octant holds the Kernel.)
Student Notes
There is no 32-bit kernel running on the IA-64 processor. The virtual address space is always 16 EB in size, although it may not all be used or allocated while a particular process is running. The space is divided into eight equal-sized octants; each octant is 2 EB in size. When executing a PA-RISC 32-bit process, the first four octants are set up just like the PA-RISC 32-bit virtual address space, using only 1 GB out of each octant to simulate the four original quadrants. The last octant holds the kernel and all of its related structures.
64-Bit Processes
With a 64-bit process, the virtual address space changes dramatically. The first two octants become the equivalent of the first PA-RISC quadrant and hold shared objects. The third octant holds the text. The fourth and fifth octants are reserved for process private data, and the sixth and seventh octants contain more shared objects. Only the last octant is laid out exactly the same for both 32-bit and 64-bit processes.
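The same arithmetic applies to octants: the 16-EB (2^64-byte) space splits into eight 2-EB octants, so the top three bits of a 64-bit address select the octant. An illustrative Python sketch:

```python
OCTANT_SIZE = 1 << 61  # 2 EB: the 16-EB (2**64-byte) space divided by 8

def octant(vaddr):
    """Return the 1-based octant (1-8) that holds a 64-bit
    Itanium virtual address."""
    if not 0 <= vaddr < (1 << 64):
        raise ValueError("not a 64-bit address")
    return (vaddr >> 61) + 1
```

The lowest addresses fall in octant 1 (shared objects) and the highest in octant 8, where the kernel lives.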
(Diagram: a process's text and data segments and memory-mapped regions in main memory.)
Student Notes
Each process executing in memory has an entry in the kernel's process table. The proc table entry references the locations of the program's four main components: text, data, stack, and uarea. The text segment contains the program's executable code. The data segment contains the program's global data structures and variables. The stack area contains the program's local data structures and variables. The uarea is an extension of the proc table entry; in a multithreaded process, each thread has its own uarea. Other components that may or may not be associated with a process are shared libraries, shared memory segments, and memory-mapped files. The text and initialized global data segments of the process are taken from the executed program file on disk during process startup. To save on startup time, the uninitialized global data segments and the stack area are zero filled, and no pages of a program are loaded at startup. Copying the entire text and data into memory would generate long startup latency. This latency problem is avoided in HP-UX by demand paging the program's text and data as needed.
Using this demand paging approach, the program is loaded into memory in smaller pieces (pages) on an as-needed basis. One page on HP-UX 10.X is 4 KB. On HP-UX 11.00, the page size is variable (meaning the program could page in sizes greater than 4 KB).
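The bookkeeping is straightforward: with fixed 4-KB pages, the number of pages a segment spans is its size divided by 4096, rounded up. An illustrative Python sketch:

```python
PAGE_SIZE = 4 * 1024  # fixed 4-KB page, as on HP-UX 10.X

def pages_needed(nbytes):
    """Pages required to hold nbytes, rounded up (ceiling division)."""
    return -(-nbytes // PAGE_SIZE)

# A 10-KB text segment spans three pages, but with demand paging
# only the pages the process actually touches are read from disk.
```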
(Diagram: process life cycle. A process starts by paging its text in from the file system on disk; it then cycles among waiting on a resource (the stop sign), the CPU run queue (the triangle), and execution on the CPU (the rectangle), touching the buffer cache, main memory, disk, and swap along the way, until it ends.)
Student Notes
The life cycle of a process is generalized in the slide above. When a process is born (starts), its text must be paged in from the file system on disk (on demand) in order to be executed. (Remember, the operating system only pages in a text page when it determines that the process needs that particular page in order to execute.) In addition, space must be reserved on the swap partition in the event the process needs to page portions of its data area out to swap. Once the swap space is reserved and the process is initialized, the process can begin executing on the CPU. As the process executes, it often performs actions that require it to wait. These actions include reading data from the disk or the network, waiting for a user to enter a response at a terminal window, or waiting on a shared resource (like a semaphore). Once the item the process is waiting on becomes available, the process puts itself in the CPU run queue so it can begin executing again. This is the standard cycle a process goes through: WAIT for a resource, enter the CPU run queue when the resource is available, execute on the CPU. Waiting on a resource is symbolized in the slide by the octagon (stop sign), entering the CPU run queue by the triangle, and execution on the CPU by the CPU in the rectangle.
An advantage of the glance performance tool is that it displays on a per process basis (or system-wide) the various reasons why a process is blocked or waiting on the CPU.
Process States

(Diagram: process state transitions. A process enters IDLE (SIDL) while fork completes, then moves between run and sleep states via context switches and returns to user mode, may be placed in STOP (SSTOP), and finally becomes a ZOMBIE (SZOMB) on exit.)
Student Notes
The process table entry contains the process state. This state information is logically divided into several categories: scheduling, identification, memory management, synchronization, and resource accounting. There are five major process states:

SRUN    The process is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
SSLEEP  The process is waiting for an event, in memory or on the swap device.
SIDL    The process is being set up via fork.
SZOMB   The process has released all system resources except for the process table entry. This is the final process state.
SSTOP   The process has been stopped by job control or by process tracing and is waiting to continue.
Most processes, except the currently executing process, are placed in one of three queues within the process table: a run queue, a sleep queue, or a deactivation queue. Processes that are in a runnable state (ready for CPU) are placed on a run queue, processes that are blocked awaiting an event are located on a sleep queue, and processes that are temporarily out of the scheduling mix are placed on a deactivation queue. Deactivated processes typically occur only during a system memory management crisis.

Processes either terminate voluntarily through an exit system call or involuntarily as a result of a signal. In either case, process termination causes a status code to be returned to the parent of the terminating process. This termination status is returned to the parent process using a version of the wait() system call.

Within the kernel, a process terminates by calling the exit() routine. The exit() routine completes the following tasks: cancels any pending timers, releases virtual memory resources, closes open file descriptors, and handles stopped or traced child processes. Next, the process is taken off the list of active processes and is placed on a list of zombie processes; its state finally changes to "no process". The exit() routine then records the termination status in the proc structure, bundles up the process's accumulated resource usage for accounting purposes, and notifies the deceased process's parent. When a process in the SZOMB state is found, the wait() system call copies the termination status from the deceased process and then reclaims the associated process structure. The process table entry is taken off the zombie list and returned to the freeproc list.

As of HP-UX 10.10, the concept of a thread was introduced into the kernel. Processes became an environment in which one or more threads could execute. Each thread is visible to, and manageable by, the kernel separately.
When this occurred, a process could be in any of the following states:

SINUSE  The process structure is being used to define one or more threads.
SIDL    The process is being set up via fork.
SZOMB   The process has released all system resources except for the process table entry. This is the final process state.
Whereas the threads took on the previous states of the process:

TSRUN    The thread is running or is runnable, in kernel mode or user mode, in memory or on the swap device.
TSSLEEP  The thread is waiting for an event, in memory or on the swap device.
TSIDL    The thread is being set up via fork.
TSZOMB   The thread has released all system resources except for the thread table entry. This is the final thread state.
TSSTOP   The thread has been stopped by job control or by process tracing and is waiting to continue.
The generic UNIX tools have no awareness of threads, and so they continue to report process states and all other metrics from the viewpoint of the process. Only the HP-specific tools (such as glance, gpm, PerfView/OVPM, and MeasureWare/OVPA) can look at individual threads and report their metrics separately from the process. Of course, the vast majority of processes are single-threaded; in those cases, there is no practical difference between the reports of the various tools.
CPU Scheduler

The CPU scheduler handles:
  Context switches
  Interrupts

(Diagram: the CPU scheduler in the kernel selects among processes in memory — Proc A pri=156, Proc B pri=220, Proc C pri=172, Proc D pri=186 — for time on the CPU, using the OS tables.)
Student Notes
Once the required data is available in memory, the process waits for the CPU scheduler to assign it CPU time. CPU scheduling forms the basis for the multitasking, multiuser operating system. By switching the CPU between processes that are waiting for other events, such as I/O, the operating system can function more productively. HP-UX uses a round-robin scheduling mechanism. The CPU lets each process run for a preset maximum amount of time, called a quantum or timeslice (default = 1/10 second), until the process completes or is preempted to let another process run. Of course, a process can always voluntarily surrender the CPU before its timeslice expires when it realizes that it cannot continue. The CPU saves the status of the first process in a context and switches to the next process. When a process is switched out because its timeslice expired, it drops to the bottom of the run queue to wait for its next turn. If it is preempted by a stronger priority process, it is placed back onto the front of the run queue. If it voluntarily gives up the CPU, it goes onto one of the sleep queues until the resource it's waiting for becomes available. When that resource does become available, the process moves to the end of the run queue.
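The round-robin mechanism can be sketched with a small simulation (illustrative Python; the millisecond workloads are invented, and real HP-UX scheduling also involves priorities and sleep queues):

```python
from collections import deque

QUANTUM_MS = 100  # default timeslice: 1/10 second

def round_robin(work_ms):
    """Simulate round-robin scheduling of CPU-bound processes and
    return the order in which they complete."""
    run_queue = deque(work_ms.items())
    completed = []
    while run_queue:
        name, remaining = run_queue.popleft()    # head of run queue gets the CPU
        remaining -= min(remaining, QUANTUM_MS)  # run for at most one quantum
        if remaining == 0:
            completed.append(name)               # process completes
        else:
            run_queue.append((name, remaining))  # timeslice expired: back of queue
    return completed

# Three CPU-bound processes needing 150, 80, and 300 ms of CPU time:
order = round_robin({"A": 150, "B": 80, "C": 300})
```

The short job (B) finishes within its first quantum, while the longer jobs cycle to the back of the queue each time their timeslice expires.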
As a multitasking system, HP-UX requires some way of changing from process to process. It does this by interrupting the CPU to shift to the kernel. The clock interrupt handler is the system software that processes clock interrupts. It performs several functions related to CPU usage including gathering system and accounting statistics and signaling a context switch. System performance is affected by how rapidly and efficiently these activities occur.
Terms
CPU scheduler            Schedules processes for CPU usage
System clock             Maintains the system timing
Clock interrupt handler  Executes the clock interrupt code and gathers system accounting statistics
Context switching        Interrupts the currently running process and saves information about the process so that it can begin to run after the interrupt as if it had never stopped
Context Switching
A context switch occurs when:
- A timeslice expires (a thread accumulates 10 clock ticks) (Forced)
- A preemption occurs (a stronger priority thread is runnable) (Forced)
  - if the stronger thread is real-time, preemption is immediate
  - if the stronger thread is not real-time, preemption occurs at the next convenient time
- A thread becomes non-computable (Voluntary), i.e.:
  - it goes to sleep
  - it is stopped
  - it exits
Student Notes
A context switch is the mechanism by which the kernel stops the execution of one process and begins execution of another. A context switch occurs under the circumstances shown on the slide. There are two types of context switches: forced and voluntary. A forced context switch occurs when the process is forced to give up the CPU before it is ready. These include timeslice expiration or a stronger priority process becoming runnable. A voluntary context switch occurs when the process itself gives up the CPU without using its full timeslice. This happens when the process exits, or puts itself to sleep (waiting on a resource), or puts itself into a stopped state (debugging). The glance tool distinguishes between forced and voluntary context switches on a per process basis.
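The forced/voluntary distinction amounts to a simple classification of the triggering event (illustrative Python; the event names are invented labels for this sketch, not kernel identifiers):

```python
# Events that cost a thread the CPU before it is ready (forced),
# versus events where it gives up the CPU itself (voluntary).
FORCED = {"timeslice expired", "preempted"}
VOLUNTARY = {"went to sleep", "stopped", "exited"}

def switch_type(event):
    """Classify a context switch the way glance reports them,
    as forced or voluntary."""
    if event in FORCED:
        return "forced"
    if event in VOLUNTARY:
        return "voluntary"
    raise ValueError("unknown event: " + event)
```

A high ratio of forced to voluntary switches for a process suggests it is CPU bound or being crowded out by stronger priority work.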
Priority Queues
[Slide figure: HP-UX priority ranges]

  -32 to  -1   POSIX real-time priorities (rtsched)
    0 to 127   HP-UX real-time priorities (rtprio); real-time queues are 1 priority wide
  128 to 177   system-level time-shared priorities (PSWP = 128, PZERO = 153);
               sleeps at priorities stronger than PZERO are nonsignalable,
               sleeps at PZERO and weaker are signalable
  178 to 255   user-level time-shared priorities (PUSER = 178);
               time-shared queues are 4 priorities wide (128-131, 152-155, ... 252-255)
Student Notes
Every process has a priority associated with it at creation time. These priorities determine the order in which processes execute on the CPU. In UNIX, stronger priorities are represented by smaller numbers and weaker priorities by larger numbers, so processes with the strongest (smallest) priority number always execute before processes with weaker (larger) numbers. HP-UX uses adjustable priorities to schedule its time slicing for general timeshare processes generated by all users (priorities 128-255). By that we mean that a process's priority can be adjusted, up or down, by the kernel, according to how favored the process should be. In general, the more a process executes, the less favorably the kernel treats it. However, since HP-UX also supports real-time processing, it must include priority-based scheduling for those processes (priorities 0-127). As of HP-UX 10.X, support is also provided for POSIX real-time processes (priorities -32 through -1). The /usr/include/sys/param.h file contains some extra information on the priorities used in the system. Each processor in an HP system has its own run queue. Each run queue is further broken down into multiple priority queues, to make it easier for that processor to select the most deserving process to run.
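On HP-UX the real-time ranges are managed with rtprio(1) and rtsched(1). The POSIX real-time scheduling interface behind rtsched can be queried on any POSIX system; a small sketch (Python for illustration; the numeric range reported is the host OS's own scale, not the HP-UX -32..255 internal scale):

```python
import os

# Ask the kernel for the priority range of the POSIX real-time FIFO
# scheduling class (the kind of policy rtsched selects on HP-UX).
lo = os.sched_get_priority_min(os.SCHED_FIFO)
hi = os.sched_get_priority_max(os.SCHED_FIFO)
print(f"SCHED_FIFO priority range: {lo}..{hi}")
assert lo <= hi
```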
Nice Values
[Slide figure: priority scale from 177 down to 255, showing ProcB (nice = 20) regaining priority faster than ProcA (nice = 39)]
Student Notes
Time shared processes are all initially assigned the priority of the parent when they are spawned. The user can make modifications to how much the kernel favors a process with the nice value. Timeshare processes lose priority as they execute, and regain priority as they wait their turns. The rate at which a process loses priority is linear, but the rate at which it regains priority is exponential. A process's nice value is used as a factor in calculating how fast a process regains priority. The nice value is the only control a user has to give greater or less favor to a time share process. The default nice value is 20. Therefore, to make a process run at a weaker priority, it should be assigned a higher nice value (maximum value 39). The superuser can assign a lower nice value to a process (minimum value 0), effectively giving it a stronger priority.
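A process can also weaken itself with the nice() system call, which is what renice(1M) does from the command line. A small sketch (Python; note that Python reports the nice value on Linux's 0-centered scale, whereas HP-UX centers on 20):

```python
import os

# os.nice(increment) adds to the process's nice value and returns the result.
# Any user may raise (weaken) the value; only the superuser may lower it.
before = os.nice(0)        # an increment of 0 just reads the current value
after = os.nice(5)         # weaken this process by 5
print("nice value went from", before, "to", after)
assert after >= before     # without privilege, nice can only drift weaker
```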
Kernel OS Tables
[Slide figure: eight related processes resident in memory: ksh, sam, ksh, su, csh, sh, glance, sh]
Student Notes
One item to keep in mind related to process management is the relationship between parent and child processes. Every process started from a terminal window on the system has a parent process that spawns it. The parent process does not terminate once a child is spawned. Instead, it goes to sleep waiting for the child to terminate from its execution. If a child process does not exit properly, for example, if it spawns a new process rather than exiting to its parent, then the system could end up with many processes sleeping in memory and using proc table entries unnecessarily. The example in the slide shows a ksh shell that spawns a sam process. Within sam, the system administrator shells out to su to a regular user. Once in the login shell, the user starts glance. From within glance, they shell out, and now decide they'd rather be in a csh shell. This string of events caused eight different processes to be started. If the user decides he wants to return to sam by typing sam, would the previous sam process be reactivated, or would a new sam process be spawned? (Answer: A new sam process is spawned).
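The parent-sleeps-until-the-child-exits behavior is easy to demonstrate with fork() and wait(); a minimal sketch in Python (os.fork and os.waitpid wrap the same system calls the shells above use):

```python
import os
import time

pid = os.fork()
if pid == 0:
    # Child: do some work, then exit. Until this happens, the parent
    # stays asleep in waitpid(), still occupying a proc table entry.
    time.sleep(0.1)
    os._exit(7)

# Parent: blocks here (a voluntary context switch - it puts itself to sleep).
_, status = os.waitpid(pid, 0)
assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 7
print("child", pid, "exited; parent resumes")
```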
B3692A GlancePlus B.10.12      14:52:27   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N                                         | 22%    29%    51%
Disk Util  F                                               |  1%     7%    13%
Mem  Util  S S U U B B                                     | 91%    91%    91%
Swap Util  U U R R                                         | 25%    24%    35%
--------------------------------------------------------------------------------
PROCESS LIST                                               Users=   11
                                    User     CPU Util    Cum    Disk         Thread
Process Name    PID   PPID  Pri  Name      ( 100 max)    CPU   IO Rate   RSS  Count
--------------------------------------------------------------------------------
netscape      16013  12988  154  sohrab   12.9/14.0     64.9  0.0/ 0.6  14.7mb   1
supsched         18      0  100  root      2.9/ 2.1    942.6  0.0/ 0.0    16kb   1
lmx.srv        1219   1121  154  root      1.6/ 0.9    389.4  0.5/ 0.0   2.7mb   1
glance        15726  15396  156  root      0.6/ 0.9      2.0  0.0/ 0.2   4.0mb   1
statdaemon        3      0  128  root      0.6/ 0.7    302.1  0.0/ 0.0    16kb   1
midaemon       1051   1050   50  root      0.4/ 0.4    201.4  0.0/ 0.0   1.3mb   2
ttisr             7      0  -32  root      0.4/ 0.3    121.0  0.0/ 0.0    16kb   1
dtterm        15559  15558  154  roc       0.4/ 0.4      1.6  0.0/ 0.0   6.2mb   1
rep_server     1098   1084  154  root      0.2/ 0.1     23.7  0.0/ 0.0   2.0mb   1
syncer          325      1  154  root      0.2/ 0.0     20.2  0.1/ 0.0   1.0mb   1
xload         13569  13531  154  al        0.2/ 0.0      2.4  0.0/ 0.0   2.6mb   1
                                                                    Page 1 of 13
Student Notes
The next four slides are designed to illustrate how the management of processes can be monitored through glance. Topics just covered (like kernel versus user CPU time, process components, process wait states, nice values, and process priorities) can all be viewed through glance. The first global bar graph, which displays on every glance screen, is CPU Util. This displays how the CPU is being distributed:

S = System (kernel) time
N = User time for processes with a nice value greater than 20 (21-39)
U = User time for processes with a nice value of 20
A = User time for processes with a nice value less than 20 (0-19); in other words, anti-nice
R = Real time (processes with priorities 127 and less)
The Process List screen (g key), as shown on the slide, can be used to see process priorities. The order in which the processes are displayed can be configured (o key) to sort by CPU usage, memory usage, or disk I/O activity. In HP-UX version 10.X, the thread count column was the blocked on column. The blocked on information can still be obtained by looking at an individual process's resource summary screen.
B3692A GlancePlus B.10.12      15:17:52   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 22%    29%    51%
Disk Util  F                  |  1%     7%    13%
Mem  Util  S S U U B B        | 91%    91%    91%
Swap Util  U U R R            | 25%    24%    35%
--------------------------------------------------------------------------------
Resource Usage for PID: 16013, netscape   PPID: 12988   euid: 520   User:sohrab
--------------------------------------------------------------------------------
CPU Usage (sec) :   3.38   Log Reads :   166   Wait Reason    :  SLEEP
User/Nice/RT CPU:   2.43   Log Writes:    75   Total RSS/VSS  : 22.4mb/ 28.3mb
System CPU      :   0.73   Phy Reads :     4   Traps / Vfaults:    414/      8
Interrupt CPU   :   0.14   Phy Writes:    61   Faults Mem/Disk:      0/      0
Cont Switch CPU :   0.08   FS Reads  :     4   Deactivations  :      0
Scheduler       :   HPUX   FS Writes :    29   Forks & Vforks :      0
Priority        :    154   VM Reads  :     0   Signals Recd   :    339
Nice Value      :     24   VM Writes :     0   Mesg Sent/Recd :    775/   1358
Dispatches      :   1307   Sys Reads :     0   Other Log Rd/Wt:   3924/    957
Forced CSwitch  :    460   Sys Writes:    32   Other Phy Rd/Wt:      0/      0
VoluntaryCSwitch:    814   Raw Reads :     0   Proc Start Time
Running CPU     :      0   Raw Writes:     0   Fri Feb  6 15:14:45 1998
CPU Switches    :      0   Bytes Xfer:  410kb
Student Notes
From the Process List screen, an individual process can be selected for further analysis (s key). The above slide shows some of the additional details available when analyzing a process further. Items of interest from the Individual Process screen include the process's nice value, the number of Forced versus Voluntary context switches, the current Wait reason, and the Parent PID.
B3692A GlancePlus B.10.12      10:17:41   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 22%    29%    51%
Disk Util  F                  |  1%     7%    13%
Mem  Util  S S U U B B        | 91%    91%    91%
Swap Util  U U R R            | 25%    24%    35%
--------------------------------------------------------------------------------
Memory Regions for PID: 16013, netscape   PPID: 14061   euid: 520   User:sohrab

Type           RefCt    RSS     VSS   Locked  File Name
--------------------------------------------------------------------------------
NULLDR/Shared     64     4kb     4kb     0kb  <nulldref>
TEXT  /Shared      3   4.3mb   9.5mb     0kb  /opt//netscape-bin
DATA  /Priv        1   5.8mb   8.6mb     0kb  /opt//netscape-bin
MEMMAP/Priv        1     4kb    20kb     0kb  /opt//netscape-bin
MEMMAP/Priv        1    36kb    36kb     0kb  /opt//netscape-bin
MEMMAP/Priv        1    12kb    12kb     0kb  <memmap>
STACK /Priv        1    28kb    28kb     0kb  <stack>
UAREA /Priv        1    16kb    16kb     0kb  <uarea>
LIBTXT/Shared     85    56kb    60kb     0kb  /usr/lib/dld/sl

Text  RSS/VSS: 4.3mb/9.5mb    Shmem RSS/VSS:   0kb/  0kb
Data  RSS/VSS: 5.8mb/8.6mb    Other RSS/VSS: 4.1mb/5.7mb
Stack RSS/VSS:  28kb/ 28kb
Student Notes
From the Individual Process screen, the memory regions (i.e. process components) corresponding to that process can be viewed (M key). The above slide shows the memory regions for the currently selected process. Items of interest from the Memory Region screen include the location of the process's Text, Data, Stack, and U-Area, along with its Shared/Private flag, its Resident Set Size and Virtual Set Size, and its reference count. If the process is associated with Memory Map files (MEMMAP), Shared Libraries (LIBTXT), or Shared Memory Segments (SHMEM), these will be displayed. In HP-UX version 11.X, glance no longer displays the addresses of each memory region. However, gpm still does.
B3692A GlancePlus B.10.12      10:23:03   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 22%    29%    51%
Disk Util  F                  |  1%     7%    13%
Mem  Util  S S U U B B        | 91%    91%    91%
Swap Util  U U R R            | 25%    24%    35%
--------------------------------------------------------------------------------
Wait States for PID: 14205, netscape   PPID: 14061   euid: 520   User:sohrab

Event         %     Blocked On    %
--------------------------------------------------------------------------------
IPC        :  0.0   Cache      :  0.0        CPU Util   : 13.7
Job Control:  0.0   CDROM IO   :  0.0        Wait Reason: SLEEP
Message    :  0.0   Disk IO    :  0.0
Pipe       :  0.0   Graphics   :  0.0
RPC        :  0.0   Inode      :  0.0
Semaphore  :  0.0   IO         :  0.0
Sleep      : 77.2   LAN        :  0.0
Socket     :  0.0   NFS        :  0.0
Stream     :  0.0   Priority   :  9.1
Terminal   :  0.0   System     :  0.0
Other      :  0.0   Virtual Mem:  0.0

C - cum/interval toggle        % - pct/absolute toggle             Page 1 of 1
Student Notes
From the Process List screen, the process wait states can be viewed (W key). The above slide shows the categories of wait states and where/what the selected process has waited on. Items of interest from the Process Wait State screen include the percentage of time the process has spent in each of the possible wait state categories.
5. Select another long process and set the nice value to 30. # renice -n 10 <PID of another selected process> What effect did that have on that process? ___________________________________ ______________________________________________________________________ 6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with: # kill $(ps -el | grep long | cut -c18-22)
Processor Module
[Slide figure: a processor module containing a CPU, TLB, cache, and coprocessor, connected to the system bus]
Student Notes
A typical HP processor module consists of a central processing unit (CPU), a cache, a translation lookaside buffer (TLB), and a coprocessor. These components are connected via internal processor busses, with the entire processor module being connected to the system bus. The cache is made up of very high-speed memory chips. Cache can be accessed in one CPU cycle. Its contents are instructions and data that recently have been or are anticipated to be used soon by the CPU. Cache size varies between processors. The size of the cache can have a big effect on system performance. The translation lookaside buffer (TLB) is used to translate virtual addresses into physical addresses. It is a high-speed cache whose entries consist of pairs of recently accessed virtual addresses and their associated physical addresses, along with access rights and an access ID. The TLB is a subset of a system-wide translation table (page directory) that is held in memory. TLB size also affects system performance, and different HP 9000 processors have different TLB sizes.
The address translations kept in the TLB enable us to locate the appropriate data and instructions in memory. Memory is accessed via the physical address; without the translation in the TLB, we would not be able to find the information in memory. Note these other points regarding the TLB:

- Each process has a unique virtual address space.
- Each TLB entry refers to a page of memory, not a single location.
- In all 64-bit architectures used by HP, pages are fundamentally 4 KB in size, but larger page sizes can be used under various circumstances to reduce the number of entries needed in the TLB.
Symmetric Multiprocessing
[Slide figure: two identical processor modules, each with its own CPU, TLB, cache, and coprocessor, sharing the system bus]
Student Notes
Symmetric Multiprocessing (SMP) refers to systems containing two or more processor units. SMP is implemented on all Hewlett-Packard workstations and servers capable of supporting more than one CPU. Each processor on an SMP system has exactly the same characteristics, including the same processing unit, the same CPU cache design, and the same size translation lookaside buffer (TLB).
Cell Module
[Slide figure: a cell containing four processors and local memory]
Student Notes
A more recent design of HP systems is based on the cell architecture. In a cell there are multiple processors, some memory, and some I/O buses. Each cell can act as an independent SMP system, or as part of a collection of cells forming a larger SMP system. Each processor in a cell has the same access speed (or latency) to the memory within that cell. However, if one of those processors has to access a location in the memory of a different cell, the latency is greater. Each processor within the cell has its own cache memory and TLB. Each processor has equal access to the I/O buses that are part of the same cell. It may also have access (with somewhat greater delays) to the I/O of other cells in the same system.
Multi-Cell Processing
[Slide figure: four cells, each with four processors, local memory, and I/O, joined by a high-speed memory interconnect]
Student Notes
The best example HP currently has of an SMP system using the cell architecture is the Superdome. Here we find 4 cells, each with four processors, some memory, and some I/O buses. Each cell could be configured (using Node Partitioning, or NPars) into a separate and individual system capable of booting its own operating system. It would be functionally apart from the other cells; the only way the operating system on that cell could communicate with software running on any other cell would be through a network interface. On the other hand, multiple cells could be configured to act as a unit. They would pool their resources and boot a single operating system, seamlessly acting as one SMP system. This architecture gives the customer and the system administrator tremendous flexibility in how to set up their hardware. They could even change it relatively easily from one configuration to another as their needs changed. On a wider range of systems, you may be using Virtual Partitioning (VPars). These are similar to NPars, but are not limited to cell boundaries and are handled entirely by software. A system could use both NPars and VPars at the same time. Using software, processors can be moved from one VPar to another.
Finally, on an even wider range of systems, we have the concept of processor sets (psets). Multiple psets can exist within the same partition (either NPar or VPar). Each pset would be set aside for use by a particular application or group of applications. Using software, psets can be created and removed, and processors can be moved from one pset to another.
CPU Processor
[Slide figure: processor registers - shadow registers, general registers, control registers, space registers, coprocessor registers, the process status word, and the instruction address queues - alongside the TLB, cache, and coprocessor]
Student Notes
The CPU ultimately is responsible for your system speed. The kernel loads the process text for the CPU to execute. The processor module has many registers, which assist in the execution of instructions; defining all of these registers is beyond the scope of this course. The primary objective of this module is to focus on CPU clock speed, the size of the CPU cache, and the effects of the TLB on overall system performance. Each HP 9000 server and workstation has a chip at its heart. The latest PA-RISC chips are the 64-bit PA-8xxx series, and HP has also introduced systems using the 64-bit Itanium (IA-64) chip. A selection of the range of current systems is listed on the following pages. Note the difference not only in clock speeds, but also in cache sizes. The following tables list the specifics of several HP-UX servers and workstations. It is very difficult to keep a list of this nature up to date in training materials; it has been included merely to demonstrate the wide variety of system characteristics present in the HP computing products family.
Business Servers Model rp3410-2 (PA-8800) rp3440-4 (PA-8800) rp4440-8 (PA-8800) rp7420-16 (PA-8800) rp8420-32 (PA-8800) Superdome (PA-8800) rx1600 (Itanium 2) rx2600 (Itanium 2) rx4640 (Itanium 2) rx5670 (Itanium 2) rx7620 (Itanium 2) rx8620 (Itanium 2) Superdome (Itanium 2) 8 (2 cells) 16 (4 cells) 64 (16 cells) 1.5 GHz 512 6MB(L3) 1.5 GHz 128 6MB(L3) 1.5 GHz 64 6MB(L3) 15 PCI (128-bit) 16+16 PCI (128-bit) 0/128/64 PCI * 4 1.5 GHz 96 6MB(L3) 0/6/3 PCI * 4 1.5 GHz 64 6MB(L3) 0/4/2 PCI * 2 1.5 GHz 24 6MB(L3) 0/4/0 PCI * 16 (2 cells) 32 (4 cells) 128 (16 cells) 2 1 GHz 16 1 GHz 1024 1 GHz 128 1 GHz 64 8 1 GHz 64 4 1 GHz 24 No. of CPUs 2 Clock Speed 800 MHz Max. RAM (GB) 6 1.5MB(L1) 32MB(L2) 1.5MB(L1) 32MB(L2) 1.5MB(L1) 32MB(L2) 1.5MB(L1) 32MB(L2) 1.5MB(L1) 32MB(L2) 1.5MB(L1) 32MB(L2) 1.5MB(L3) 0/1/1 PCI * 192 PCI 16 PCI 15 PCI 6 PCI 4 PCI (64-bit) 2-PCI (64-bit) Cache (KB) I/O Slots
Workstations

Model               No. of CPUs  Clock Speed  Max. RAM (GB)  Cache (KB)  I/O Slots
B2600 (PA-8600)     1            500 MHz       4             512/1024    2/2/0 PCI *
B3700 (PA-8700)     1            750 MHz       8             768/1536    2/3/1 PCI *
C3750 (PA-8700+)    1            875 MHz       8             768/1536    2/3/1 PCI *
J6750 (PA-8700+)    2            875 MHz      16             768/1536    0/0/3 PCI *
zx2000 (Itanium 2)  1            1.4 GHz       8             1536 (L3)   5 PCI, 1 AGP
zx6000 (Itanium 2)  2            1.5 GHz      24             6144 (L3)   3 PCI, 1 AGP

* 2/3/1 means 2 32-bit PCIs, 3 64-bit PCIs and 1 128-bit PCI.

All Itanium 2 processors include 32KB of L1 cache and 256KB of L2 cache. To determine the specifics of your system, refer on-line to http://www.hp.com/go/enterprise, select "Products Index" and scroll down to select your system platform name [i.e. J-Class (HP 9000)]. This will display the "Product Information" screen for the selected hardware.
CPU Cache
[Slide figure: the CPU fetching the next instruction; the cache sits between the CPU and memory, which is reached over the system bus, with the TLB and coprocessor alongside]
Student Notes
The CPU loads instructions from memory and runs multiple instructions per cycle. To minimize the time that the CPU spends waiting for instructions and data, the CPU uses a cache. The cache is a very high-speed memory that can be accessed in one CPU cycle with the contents being a subset of the contents of main memory. As the CPU requires instructions and data, they are loaded into the cache. The size of the cache has a large bearing on how busy the CPU is kept. The larger the cache, the more likely it is that it will contain the instructions and data to be executed. Most current processors support multi-level caches. The Level 1 cache (L1) is the fastest operating at the same speed as the CPU. It is relatively small. The Level 2 cache (L2) operates at one-half the speed of the CPU. It is somewhat larger. The IA-64 has a Level 3 cache (L3) that is even larger and slower.
TLB

[Slide figure: the CPU takes a virtual address from the instruction address queues, translates VA to PA through the TLB, then fetches the instruction from cache or, on a miss, from memory over the system bus]
Student Notes
All 32-bit programs view their address space as starting at address 0, and ending at address 4 GB. All addresses referenced by the program are referenced relative to this address space. This is referred to as the program's virtual address space. A program's physical address is the address location in physical memory where the program is loaded at execution time. When the CPU executes a program, it is presented with the virtual address containing the instruction to be executed. In order to fetch this instruction from physical memory, the CPU must convert the virtual address (VA) into the corresponding physical address (PA). To do this, the CPU checks the TLB. If the VA->PA is present, it then knows the PA in memory of the instruction. If the VA is not present, it then needs to fetch the information from the PDIR (Page DIRectory) table in memory. This memory fetch of the PDIR table is relatively expensive from a performance standpoint. Once the PA is known, the CPU then checks the Instruction Cache on the CPU for the PA. If the PA is present, it then loads the instruction straight from Instruction Cache. If not present, it then needs to fetch the instruction from memory, which is relatively expensive (performance-wise).
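The lookup order described above, TLB first and PDIR only on a miss, can be sketched as a toy model (hypothetical addresses; the real TLB and PDIR are handled by hardware and the kernel, not application code):

```python
PAGE = 4096                                # 4 KB base page size
pdir = {0x0000: 0x9000, 0x1000: 0x4000}    # full VA-page -> PA-page table (toy)
tlb = {}                                   # small cache of recent translations

def translate(va):
    page, offset = va & ~(PAGE - 1), va & (PAGE - 1)
    if page in tlb:                # TLB hit: the fast, 1-cycle case
        return tlb[page] | offset
    pa_page = pdir[page]           # TLB miss: expensive PDIR fetch from memory
    tlb[page] = pa_page            # load the translation into the TLB
    return pa_page | offset

assert translate(0x1234) == 0x4234   # first access: PDIR walk, TLB loaded
assert translate(0x1238) == 0x4238   # same page again: TLB hit this time
```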
The size of the TLB ranges from 96 to 160 entries (each entry points to a variable-sized memory page) on current PA-RISC and IA-64 processors.
TLB    CPU Cache   Page in Memory   Consequence
----   ---------   --------------   -----------------------------
hit    hit         yes              1 CPU cycle fetch
hit    miss        yes              data/instruction memory fetch
miss   X           yes              PDIR memory fetch
X      X           no               page fault

X = don't care
Student Notes
The slide shows some of the permutations of hits and misses on memory, cache, and the TLB, as well as the consequences of each. The best situation is when the VA has an entry in the TLB, and the corresponding PA has an entry in the CPU cache. This allows the instruction or data to be presented to the CPU in one clock cycle. The next-best scenario is a hit on the TLB but a miss on the CPU cache. A representative cost to fetch a PA from memory into the CPU cache is 50 clock cycles. Another scenario is a miss on the TLB but a hit on the CPU cache. The miss on the TLB requires the PDIR table in memory to be searched and an appropriate entry to be loaded into the TLB. This takes a variable number of cycles; on one model the average was 131 clock cycles. Therefore, a miss on the TLB is more expensive than a miss on the CPU cache. A miss on both the TLB and the CPU cache translates into 131 + 50, or 181, clock cycles on average to access the instruction or data the CPU needs. The same access would have taken 1 clock cycle had the VA been in the TLB and the PA been in the CPU cache.
The worst scenario, performance-wise, is not having the instruction or data loaded in memory at all. In this case, a page fault would occur to retrieve the information from disk. Assuming a 1-GHz clock, a 10-ms disk transfer rate, and an idle disk drive, this would correspond to 10,000,000 clock cycles to access the data or instruction.
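Putting the example numbers from the last two pages together (1 cycle for a TLB + cache hit, 50 cycles for a cache miss, 131 cycles for a TLB miss, and a 10-ms disk access on a 1-GHz clock):

```python
CLOCK_HZ = 1_000_000_000             # 1 GHz example clock
cache_hit  = 1                       # TLB hit + cache hit
cache_miss = 50                      # memory fetch into the CPU cache
tlb_miss   = 131                     # average PDIR walk on one model
page_fault = int(CLOCK_HZ * 0.010)   # 10 ms disk access, expressed in cycles

assert tlb_miss + cache_miss == 181  # miss on both TLB and cache
assert page_fault == 10_000_000      # page fault: 10 million cycles
print(f"a page fault costs {page_fault // cache_hit:,}x a cache hit")
```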
[Slide figure: mapping a 512 KB file from the file system into memory; with 4 KB pages the file requires a VA-to-PA translation for every page (VA 0, 4096, 8192, 12288, 16384, ...), while with larger performance-optimized pages the same file maps with far fewer translations (e.g. VA 0, 8192, 16384, ...)]
Student Notes
HP-UX 11.00 is the first release of the operating system to have general support for performance optimized page sizes (POPS), also known as variable page sizes. Partial support for variable memory page sizes has existed since HP-UX 10.20. HP-UX 11.00 allows customers to configure executables to use specific performance optimized page sizes, based on the program's text and data sizes. Page sizes can be selected from a range of 4 KB to 4 GB. The use of performance optimized page sizing can significantly increase performance of applications that have very large data or instruction sets. NOTE: Performance-optimized page sizing works on PA-8000-based and IA-64-based systems.
Earlier PA-RISC processors had a few Block TLB entries, which could map multiple pages into a single entry, if the pages were contiguous in both virtual and physical address spaces. These entries were reserved for mapping the kernel, the I/O pages, and other segments that were locked into memory. At some point, the TLB would become full, and the virtual-to-physical address mapping would only be stored in the PDIR table in memory, not in the TLB on the CPU. This meant that if a virtual address needed to be translated, there would be a chance that the address would not have an entry in the TLB, and time would have to be spent to look up the address within the PDIR table in memory. This handling of the TLB miss was expensive in terms of performance.
Supported page sizes:

PA-RISC:  4K  16K  64K  256K  1M  4M  16M  64M  256M  1G   -
IA-64:    4K   8K  16K   64K  256K  1M  4M  16M  64M  256M  4G
A page size hint set with the chatr command is stored in the header of the executable file and is visible to the kernel whenever the program is invoked. The kernel does its best to see that the hint is followed. However, if memory pressure exists, the kernel may not be able to honor the request and may end up demoting the size of the page to be able to manage it in memory. There is a third tunable parameter, vps_chatr_ceiling, that determines the maximum value a chatr command can assign to an executable file.
Student Notes
The load on the CPU can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor CPU performance.
Nice/Anti-Nice Utilization
This is the percentage of time the CPU spent running user processes with nice values of 21-39 (Nice) or 0-19 (Anti-Nice). This is typically included in USER CPU utilization, but some tools, like glance, track this separately to see how much CPU time is being spent on weaker or stronger priority processes.
Idle CPU
This is the percentage of time the CPU spent doing nothing (i.e. it did not execute any user or kernel code). It is good to see some, even lots, of idle CPU time. A CPU that is never idle means the CPU run queue is never exhausted (emptied), so processes always have to wait before reaching the CPU. The length of the line (the CPU run queue) grows as idle CPU time approaches 0.
Student Notes
Individual processes vary greatly in terms of the load they place on the CPU. Metrics to monitor on an individual process include the following.
Process Priority
This is the priority of the process. If the priority is 127 or less, we know it is a real time process. If the priority is 128-177, either it is a system process, or it is a user process that is sleeping. If the priority is 178-255, then we know the process is executing in USER mode.
Student Notes
Examples of activities that place a load on the CPU include the following.
System Activities
System activities are those activities which execute in kernel mode. Examples of system activities include system processes and user processes executing system calls:

- Process startup
- Process scheduling
- File system and raw I/O
- Memory management
- Handling of system calls
User Activities
User activities are those activities that execute in user mode:

- CAD/CAM applications
- Database processing
- Client/server applications
- Compute-bound applications
- Background jobs (i.e. batch jobs)
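The user/system split can be observed per process with the times() system call; a quick sketch (Python's os.times wraps times()):

```python
import os

t0 = os.times()
sum(i * i for i in range(200_000))   # user mode: pure computation
for _ in range(500):
    os.getpid()                      # each system call crosses into kernel mode
t1 = os.times()

# The user and system fields accumulate CPU seconds spent in each mode.
print(f"user: {t1.user - t0.user:.3f}s  system: {t1.system - t0.system:.3f}s")
assert t1.user >= t0.user and t1.system >= t0.system
```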
B3692A GlancePlus B.10.12      05:00:42   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 25%    20%    47%
Disk Util  F                  | 12%     6%    23%
Mem  Util  S S U U B B        | 85%    83%    85%
Swap Util  U U R R            | 18%    18%    18%
--------------------------------------------------------------------------------
CPU REPORT                                                 Users=    4
State           Current  Average   High    Time   Cum Time
--------------------------------------------------------------------------------
User               18.9      6.0   32.3    0.96       3.61
Nice                0.0      2.4    5.7    0.00       1.47
Negative Nice       0.4      0.8   16.2    0.02       0.51
RealTime            0.4      0.4    0.7    0.02       0.22
System              3.3      7.0   16.2    0.17       4.21
Interrupt           1.8      1.7    2.7    0.09       1.02
ContextSwitch       0.6      0.7    1.4    0.03       0.40
Traps               0.0      0.0    0.0    0.00       0.00
Vfaults             0.0      0.7    3.6    0.00       0.45
Idle               74.6     80.2   91.2    3.79      48.18

Top CPU user: PID 2097, dthelpview  19.5% cpu util     Active CPUs: 1
                                                                     Page 1 of 2
Student Notes
The glance CPU report (c key) provides details on where the CPU is spending its time from a global perspective. User mode: This is time spent by the CPU in user mode for all processes on the system. This includes processes with a nice value of 20 (user), processes with nice values between 21-39 (nice), processes with nice values between 0-19 (negative nice), and realtime priority processes. System mode: This is time spent by the CPU in system mode for all processes on the system. It includes time spent handling general system calls (system), and time spent handling interrupts, context switches, traps, and Vfaults (virtual faults). Load Average: This is the number of jobs in the CPU run queue averaged over three time intervals. It includes the average length of the run queue over the last 1 minute, the last 5 minutes, and the last 15 minutes. The CPU load average data is viewable on page 2 of this glance report. Also on page two are the System Call Rate, the Interrupt Rate, and the Context Switch Rate.
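The same three run-queue averages are available outside glance as well: uptime(1) prints them, and on many UNIX systems getloadavg(3) returns them programmatically. A one-line sketch in Python:

```python
import os

# os.getloadavg() wraps getloadavg(3): average run-queue length over the
# last 1, 5, and 15 minutes - the figures glance shows as Load Average.
one, five, fifteen = os.getloadavg()
print(f"load average: {one:.2f} (1 min)  {five:.2f} (5 min)  {fifteen:.2f} (15 min)")
assert one >= 0.0 and five >= 0.0 and fifteen >= 0.0
```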
B3692A GlancePlus B.10.12      05:13:18   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 25%    20%    47%
Disk Util  F                  | 12%     6%    23%
Mem  Util  S S U U B B        | 85%    83%    85%
Swap Util  U U R R            | 18%    18%    18%
--------------------------------------------------------------------------------
CPU BY PROCESSOR                                           Users=    4
CPU  State    Util   LoadAvg(1/5/15 min)   CSwitch   Last Pid
--------------------------------------------------------------------------------
 0   Enable   25.4   0.6/ 0.4/ 0.3           72187       1061
                                                                     Page 1 of 2

CPU  Util  User  Nice  NNice  RealTm  Sys  Intrpt  CSwitch  Trap  Vfault
--------------------------------------------------------------------------------
 0   25.4  20.7   0.0    0.0     0.0  4.7     0.0      0.0   0.0     0.0
                                                                     Page 2 of 2
Student Notes
The glance CPU-by-processor report (a key) provides details on a per-CPU basis. CPU Utilization: This is the CPU utilization for the specific processor. If two or more processors exist on the system, the Global CPU Util bar graph shows an average CPU utilization; that is, a CPU that is 100% utilized and a second CPU that is 0% utilized will display as 50% CPU utilization. This report displays utilization on a per-processor basis. Load Average: This is the number of processes, on average, in the CPU run queue over the last 1 minute, 5 minutes, and 15 minutes. This report displays CPU run queue information on a per-processor basis. Page two of this display shows the utilization broken down into User mode, Nice, Negative Nice, Realtime, System, Interrupts, Context Switches, Traps, and Virtual Faults on a per-processor basis.
B3692A GlancePlus B.10.12      15:17:52   e2403roc    9000/856   Current Avg High
--------------------------------------------------------------------------------
CPU  Util  S S N N            | 22%    29%    51%
Disk Util  F                  |  1%     7%    13%
Mem  Util  S S U U B B        | 91%    91%    91%
Swap Util  U U R R            | 25%    24%    35%
--------------------------------------------------------------------------------
Resource Usage for PID: 16013, netscape   PPID: 12988   euid: 520   User:sohrab
--------------------------------------------------------------------------------
CPU Usage (sec) :   3.38   Log Reads :   166   Wait Reason    :  SLEEP
User/Nice/RT CPU:   2.43   Log Writes:    75   Total RSS/VSS  : 22.4mb/ 28.3mb
System CPU      :   0.73   Phy Reads :     4   Traps / Vfaults:    414/      8
Interrupt CPU   :   0.14   Phy Writes:    61   Faults Mem/Disk:      0/      0
Cont Switch CPU :   0.08   FS Reads  :     4   Deactivations  :      0
Scheduler       :   HPUX   FS Writes :    29   Forks & Vforks :      0
Priority        :    154   VM Reads  :     0   Signals Recd   :    339
Nice Value      :     24   VM Writes :     0   Mesg Sent/Recd :    775/   1358
Dispatches      :   1307   Sys Reads :     0   Other Log Rd/Wt:   3924/    957
Forced CSwitch  :    460   Sys Writes:    32   Other Phy Rd/Wt:      0/      0
VoluntaryCSwitch:    814   Raw Reads :     0   Proc Start Time
Running CPU     :      0   Raw Writes:     0   Fri Feb  6 15:14:45 1998
CPU Switches    :      0   Bytes Xfer:  410kb
Student Notes
The glance individual process report (s key followed by the PID) displays CPU usage for an individual process, and the distribution of CPU time when executing the process (user, system, interrupt, context switch). Ideally, a process should spend more time in User/Nice/RT mode than in any of the other three modes. Also displayed on a per-process basis is the Priority and Nice values for the selected process. In addition, the total number of forced context switches (time slice expiration or process preemptions) and voluntary context switches (process putting itself to sleep) are displayed.
B3692A GlancePlus B.10.12      05:17:52   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                           |    25%     20%     47%
Disk Util                                           |    12%      6%     23%
Mem  Util                                           |    85%     83%     85%
Swap Util                                           |    18%     18%     18%
--------------------------------------------------------------------------------
GLOBAL SYSTEM CALLS                                              Users=    4
System Call Name          ID      Count     Rate     CPU Time      Cum CPU
--------------------------------------------------------------------------------
syscall-0                  0         16      3.1      0.05921      2.19037
fork                       2          0      0.0      0.00000      0.01398
read                       3        105     20.5      0.00210      0.07625
write                      4         47      9.2      0.00208      0.13624
open                       5         16      3.1      0.00143      0.03146
close                      6         16      3.1      0.00040      0.00848
wait                       7          1      0.1      0.00011      0.00031
time                      13         46      9.0      0.00023      0.00446
chmod                     15          0      0.0      0.00000      0.00009
ioctl                     54        503     57.8      0.00900      0.79813
poll                     269        277     48.5      0.00983      1.83466

Cumulative Interval: 87 secs                                     Page 1 of 7
Student Notes
The glance global system calls report (Y key) displays all the system calls that have been executed system-wide. When system CPU utilization is high, this report can be used to identify on which system calls the CPU is spending most of its time.
B3692A GlancePlus B.10.12      05:39:20   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                           |    22%     29%     51%
Disk Util                                           |     1%      7%     13%
Mem  Util                                           |    91%     91%     91%
Swap Util                                           |    25%     24%     35%
--------------------------------------------------------------------------------
System Calls for PID: 1822, netscape   PPID: 1775   euid: 503   User: roc
                                      Elapsed                        Elapsed
System Call Name    ID   Count   Rate    Time   Cum Ct   CumRate     CumTime
--------------------------------------------------------------------------------
read                 3     477   93.5  0.16884     742      49.1     0.24275
write                4     219   42.9  0.02831     352      23.3     0.06787
open                 5      63   12.3  0.01396      99       6.5     0.02491
close                6       9    1.7  0.00046      20       1.3     0.00104
time                13      34    6.6  0.00031      89       5.8     0.00083
brk                 17      27    5.2  0.00171      45       2.9     0.00264
lseek               19      69   13.5  0.00150     135       8.9     0.00304
stat                38       4    0.7  0.00131      13       0.8     0.00415
ioctl               54     636  124.7  0.01463    1167      77.2     0.02813
utssys              57       0    0.0  0.00000       3       0.1     0.00013

Cumulative Interval: 15 secs                                     Page 1 of 3
Student Notes
While examining an individual process, the system calls generated by that particular process can be viewed using the L key. When the system time utilization is high for an individual process, this report can be used to view the specific system calls the process is performing, how many times the system calls are being invoked, and (most importantly) how much time is being spent by the CPU to execute the system calls. The read() and write() system calls often take the most time, as they require physical I/O to the disk drives.
sar Command
$ sar <option> <interval size> <number of intervals>

Options:
   -u    CPU utilization (%usr, %sys, %wio, %idle)
   -q    Queue lengths and utilization (run, swap)
   -M    The above information in per-processor format
   -c    System calls
Student Notes
The sar command can be used to display global statistics on several important CPU operations. With the -u option, sar reports the time the system spent in user mode, system mode, waiting for (disk) I/O, and idle. The waiting-for-(disk)-I/O category is not reported by any other tool; other tools simply lump it in with idle time. An example of sar output with the -u option is shown below:
# sar -u 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:32:24    %usr    %sys    %wio    %idle
08:32:29      64      36       0        0
08:32:34      61      39       0        0
08:32:39      61      39       0        0
08:32:44      61      39       0        0
Average       61      39       0        0
With the -q option, sar reports the length and utilization of the run queue and the swap queue. We are most interested at this time in the run queue. An example of sar output with the -q option is shown below:
# sar -q 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24
08:33:29
08:33:34
08:33:39
08:33:44
Average
The -M option is always used in conjunction with -u and/or -q. It causes the metrics to be broken down by processor, so you can see how each processor is being utilized. The -c option shows the total number of system calls executed per second and singles out four specific system calls for further detail: read(), write(), fork(), and exec(). Also reported on this display is the average number of characters transferred in and out each second. An example of this output follows:
# sar -c 5 4

HP-UX r3w14 B.10.20 C 9000/712    10/14/97

08:33:24  scalls/s  sread/s  swrit/s  fork/s  exec/s  rchar/s  wchar/s
08:33:29       332        3        9    0.00    0.00    38630     2657
08:33:34       435        4       24    0.00    0.00    30310     2662
08:33:39       270        3       14    0.00    0.00     6758        0
08:33:44       524       20       15    0.20    0.20    73523        0
Average        390        7       15    0.05    0.05    37187     1331
timex Command
$ timex prime_med

real       25.65
user       20.71
sys         3.43
Student Notes
The timex command can be used to benchmark how long the execution of a particular process takes, in seconds. The command measures:

real    the amount of elapsed time from when the program started to when it completed (sometimes referred to as wall clock time).
user    the amount of time the program spent executing in user mode.
sys     the amount of time the program spent executing in kernel mode.
The example on the slide shows a total of 25.65 seconds elapsed from when the program prime_med started to when it completed. The execution spent 20.71 seconds executing in user mode and 3.43 seconds executing in kernel mode. The difference between user + system and real time is attributed to time the process spent not running on the CPU. The process may not get CPU time either because it was waiting on some resource (like disk or CPU) or because it was in a sleep state waiting for an event (like a child process waiting to finish executing).
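The off-CPU time can be checked directly. A minimal POSIX shell sketch, using the figures from the slide:

```shell
# Off-CPU time = real - (user + sys); values from the prime_med example.
real=25.65 user=20.71 sys=3.43

off_cpu=$(awk -v r="$real" -v u="$user" -v s="$sys" \
    'BEGIN { printf "%.2f", r - (u + s) }')

echo "time not on the CPU: $off_cpu seconds"   # 1.51 seconds
```

Those 1.51 seconds are time prime_med spent waiting in the run queue or sleeping, not executing.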
Student Notes
Practically speaking, the easiest performance gains are usually achieved by adding more and faster hardware. This could be upgrading to a faster processor, upgrading to a processor with more cache, adding another processor, or buying another system and off-loading some applications to the second system. Upgrading to a faster processor may be possible with a simple module swap, but, more than likely, it will involve upgrading your entire system to a newer model. Some systems support two or three possible processor types, and yours may not have the fastest available; if so, you may be able to upgrade the system's processors to faster versions without touching the rest of the system. Nowadays, it's unlikely that you'll be able to upgrade the cache memory or TLB to larger sizes: each processor chip comes with a predetermined amount of cache and a fixed-size TLB, so only by moving to a different processor chip (and thus a different model) can you change the cache memory and TLB sizes. If your system is not yet at its full complement of processors, adding more processors may relieve your workload. If you have a cell-based architecture, you may be able to add more processors to each cell, or even add more cells. Some servers come with extra processors
installed, but not enabled. These systems have a feature called ICOD (Instant Capacity On Demand): by simply contacting HP, you can have these disabled processors enabled, giving you more processing power with a minimum of time. If, at a later date, those processors are no longer needed, they can be disabled in a similar fashion. Finally, if you have one system that is heavily loaded and another that is lightly loaded, it may be possible to transfer some of the tasks from the busy system to the less busy one. The disadvantage of these solutions is that most of them cost money.
Student Notes
If the easiest performance gains come from upgrading the hardware, then the greatest gains are likely to come from improving the software. A system with the fastest and most current hardware can still run slowly if the software is not configured properly. One way to improve the performance of specific processes is to raise their priority. You can do this by lowering a process's nice value or by making it a real-time process; alternatively, you can raise the nice value of competing processes. Be careful when promoting a process to real time: if the process is not well-behaved, it can take over your entire system. By well-behaved, we mean that it is not compute bound and is free of serious bugs. Running batch jobs at non-peak hours has been a standard performance solution for many years on many systems. Other software performance improvements can be realized by using PRM (Process Resource Manager), WLM (Workload Manager), or the mpctl() system call.
[Slide diagram: two processors, each with its own CPU, TLB, cache, and coprocessor, connected via the system bus to memory; an mpctl(proc2) call binds a process to processor 2.]
Student Notes
The sar command can be utilized to report CPU utilization for the overall system on a per-processor basis (when the -u and -M options are specified). In addition, the -q option reports the average run queue length while occupied and the percentage of time the queue was occupied. Both of these metrics can assist in evaluating CPU loading and should be considered before making processor affinity calls. top can also show how your CPU resource is being distributed over the system; it automatically breaks down the load and utilization percentages on a per-processor basis when invoked. Remember, when you are running a system that supports partitions (nPars or vPars), these tools only show what is happening within a partition, as each partition has booted its own copy of the operating system and acts as an independent system.
Processor Affinity
[Slide diagram: two processors, each with its own CPU, TLB, cache, and coprocessor, connected via the system bus to memory; an mpctl(proc2) call binds a process to processor 2.]
The mpctl() system call assigns the calling process to a specific processor.
Student Notes
The mpctl() system call provides a means for determining how many processors are installed in the system (or partition) and how many processors are in a given pset, for assigning processes or threads to run on specific processors (also known as processor affinity) or within specific psets, and much, much more. Refer to the man page for mpctl() on your system. Much of this functionality is highly dependent on the underlying hardware, so an application that uses this system call should not be expected to be portable across architectures or implementations. Processor sets are supported by the pset() system calls. If your version of the operating system supports psets, refer to the man page for pset() for full details.
5-24. LAB: CPU Utilization, System Calls, and Context Switches

Directions
General Setup
Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name    (note which disks are already in use by LVM)
# ioscan -fnC disk            (note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>
In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>
The lab programs are under /home/h4262/cpu/lab0:

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all
# timex dd if=/stand/vmunix of=/dev/null bs=2k
real _______   user _______   sys _______

# timex dd if=/stand/vmunix of=/dev/null bs=64
real _______   user _______   sys _______
1. What is the system call rate when your system is "idle"? ________________

2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.

   # sar -c 10 4

3. Terminate the filestress process by entering the following commands:

   # kill $(ps -el | grep find | cut -c24-28)
   # kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?
   _____________________________________________________________________

   Kill the syscall program before proceeding:

   # kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a loaded system.
   Idle ________ Loaded ______________
6. Kill the cs program, remove /vxfs/file, and unmount the /vxfs file system.

   # kill $(ps -el | grep cs | cut -c18-22)
   # rm -f /vxfs/file
   # umount /vxfs
Lab 1
1. Change directory to /home/h4262/cpu/lab1.

   # cd /home/h4262/cpu/lab1

2. Start the processes running in the background.

   # ./RUN

3. Start a glance session and answer the following questions.
   What is the CPU utilization? _______
   What are the nice values of the processes receiving the most CPU time? _______
   What is the average number of jobs in the CPU run queue? ______

4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs, etc.? Identify the processes that you think are in pairs.
   ________________________________________________________________________
   ________________________________________________________________________
   ________________________________________________________________________
   ________________________________________________________________________

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

6. Compare your results to the baseline established in the lab exercise in module 1, step 5.

7. End the CPU load by executing the KILLIT script.

   # ./KILLIT
Lab 2
1. Change directory to /home/h4262/cpu/lab2.

   # cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

   # ./RUN

3. In one terminal window, start glance. In a second terminal window, run:

   # sar -u 5 200

   Answer the following questions:
   What does glance report for CPU utilization? _______
   What does sar report for CPU utilization? ________
   What is the priority of the process receiving the most CPU time? _______
   How much time is the process spending in the sigpause system call? ______
   How is the process being context switched (forced or voluntary)? ______

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

   # timex /home/h4262/baseline/short &

   How long did the program take to execute? _______

5. End the CPU load by executing the KILLIT script.

   # ./KILLIT
Memory Management
Memory
Student Notes
Memory management refers to the subsystem within the kernel that is responsible for managing the main memory (also known as RAM) of the computer. When managing main memory, the kernel allocates memory pages (default size is 4 KB) to processes as they need space. When main memory runs low on free space, the kernel will try to free up some pages in memory by copying those pages out to swap space on disk. The swap space can be thought of as an extension of main memory (like an overflow area) that is used when main memory becomes full. Processes paged out to the swap area cannot be referenced again until they are paged back in to main memory. The term virtual memory refers to how much memory the kernel perceives as being available for allocation to processes. When the kernel allocates space to a process, it must track that page for the life of the process. Virtual memory includes main memory and swap space, as pages allocated to processes may be moved to swap space.
Example
In the slide, there are three different processes being tracked: a one-page process, a two-page process, and a three-page process. The one-page process started in main memory and was
subsequently paged out to swap space. The two-page process is entirely resident in main memory. And the three-page process has been partially paged to swap space (two of its three pages are on swap). From a virtual memory standpoint, the three processes are taking up six pages of memory: three pages in main memory and three pages on swap.

The preceding example is fairly simple; reality is a little more complex. Processes actually consist of two basic types of pages: text and data. Data pages are writable, so their contents must be preserved when they are moved out of memory (to swap space). Text pages cannot be modified by the executing program; they are initially read in from the file system. If the memory manager wants to release the space a text page is occupying, it does not have to copy the page out to swap, or even back to the file system: an identical copy already exists in the program file on disk, so the page can simply be re-read if it is needed again.
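As a quick worked example (POSIX shell; the 64 MB process size is hypothetical), the number of pages the kernel must track for a process is simply its virtual size divided by the page size:

```shell
# Pages needed for a process of a given virtual size, assuming the
# default 4 KB page size.
page_kb=4        # default page size in KB
vss_mb=64        # hypothetical process virtual set size

pages=$(( vss_mb * 1024 / page_kb ))
echo "$vss_mb MB = $pages pages"    # 64 MB = 16384 pages
```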
[Slide diagram: the vhand process's two-handed clock sweeping a grid of memory pages.
  1 = page is being referenced
  0 = page is NOT being referenced
  F = memory page freed by the vhand process
The reference hand clears reference bits; the free hand, following behind, frees any page whose bit is still clear.]

Memory
Student Notes
The vhand daemon is responsible for keeping a minimum amount of memory free on the system at all times. It does this by monitoring free pages and trying to keep their number above a threshold, to ensure sufficient memory for efficient demand paging. The vhand daemon utilizes a "two-handed" clock algorithm, as seen on the slide. The first hand (also known as the reference hand or age hand) clears the reference bits on a group of pages in an active part of memory. If the bits are still clear by the time the second hand (also known as the free hand or steal hand) reaches them, the pages are paged out. The kernel automatically keeps an appropriate distance between the hands, based on the available paging bandwidth, the number of pages that need to be stolen, the number of pages already scheduled to be freed, and the frequency with which vhand runs. In essence, the distance between the hands determines how aggressively vhand behaves: it becomes more aggressive as memory pressure increases.
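The hand interaction can be illustrated with a toy model (an awk sketch of the idea, not the real vhand code; the page count, hand gap, and re-reference pattern are all invented): the age hand clears each page's reference bit, and any page whose bit is still clear when the steal hand arrives a few positions later is freed.

```shell
freed=$(awk 'BEGIN {
    n = 8; gap = 3                      # 8 pages; steal hand trails by 3
    for (i = 1; i <= n; i++) ref[i] = 1 # all pages recently referenced
    touch[2] = 1; touch[5] = 1          # pages re-referenced between the hands
    for (pos = 1; pos <= n + gap; pos++) {
        if (pos <= n) ref[pos] = 0          # age hand clears the bit
        steal = pos - gap
        if (steal >= 1 && steal <= n) {
            if (touch[steal]) ref[steal] = 1    # touched again: keep it
            if (ref[steal] == 0) freed++        # still clear: page freed
        }
    }
    print freed
}')
echo "$freed of 8 pages freed"      # 6 of 8 pages freed
```

Shrinking the gap gives pages less time to prove they are in use, so more get stolen; that is the sense in which a smaller hand distance makes vhand more aggressive.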
[Slide diagram: paging activity versus free non-kernel memory. Below LOTSFREE, paging begins with the possibility of stabilization; below DESFREE, paging continues at the maximum rate, with no possibility of stabilization.]
Student Notes
The system uses a combination of paging and deactivation to manage the amount of free memory. A minimum amount of free memory is needed to allow the demand paging system to work properly. No paging occurs until free memory falls below a threshold called LOTSFREE. Upon falling below LOTSFREE, paging occurs at a minimum level, becoming more aggressive as the number of free pages decreases. If the demand for memory continues, then paging will continue. However, if the demand for memory subsides, there is a possibility that the amount of free memory will stabilize below the LOTSFREE threshold. If free memory falls below a second threshold called DESFREE, then there is no possibility of stabilization (until free memory goes back above DESFREE), and the paging rate becomes much more aggressive compared to the initial paging rate. Finally, if free memory falls below MINFREE, process deactivation begins. A process is chosen by the kernel to be deactivated and is placed on the deactivation queue. Because the process is deactivated (and therefore its pages are not being referenced), vhand will be able to page all of its pages (including the uarea) out to the swap partition. The process will be
reactivated automatically once free memory rises above MINFREE. When a process is reactivated, only the uarea is immediately paged in; other pages are faulted in as needed.

Below are the default formulae for LOTSFREE, DESFREE, and MINFREE (NKM = non-kernel memory):

              <= 32 MB          > 32 MB, <= 2 GB      > 2 GB
LOTSFREE      1/8 of NKM        1/16 of NKM           64 MB
DESFREE       1/16 of NKM       1/64 of NKM           12 MB
MINFREE       1/2 of DESFREE    1/4 of DESFREE        5 MB
NOTE
The values of LOTSFREE, DESFREE, and MINFREE were made tunable kernel parameters in HP-UX 11.00. Prior to the 11.00 release, these values were fixed and could not be changed. It is recommended by HP, however, that the parameters not be tuned manually.
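The default formulae can be sketched as a small shell function (sizes in MB of non-kernel memory; this mirrors the table above, not kernel source):

```shell
# Print "LOTSFREE DESFREE MINFREE" in MB for a given amount of
# non-kernel memory (NKM), per the default formulae.
thresholds() {
    nkm=$1
    if [ "$nkm" -le 32 ]; then
        lotsfree=$(( nkm / 8 ));  desfree=$(( nkm / 16 )); minfree=$(( desfree / 2 ))
    elif [ "$nkm" -le 2048 ]; then
        lotsfree=$(( nkm / 16 )); desfree=$(( nkm / 64 )); minfree=$(( desfree / 4 ))
    else
        lotsfree=64; desfree=12; minfree=5      # fixed values above 2 GB
    fi
    echo "$lotsfree $desfree $minfree"
}

thresholds 1024     # 1 GB of NKM -> "64 16 4"
```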
Buffer Cache
[Slide diagram: a process accesses a file through the buffer cache in memory, rather than going directly to the file system on disk.]
Student Notes
The buffer cache exists to speed up file system I/O. The system tries to go to disk as infrequently as possible, because disk access is often a bottleneck; therefore, the most recently or commonly accessed file data persists in the portion of memory called the buffer cache. It is called a dynamic buffer cache because its size grows or shrinks dynamically, depending on competing requests for system memory. Its minimum size is governed by the tunable parameter dbc_min_pct, and it cannot grow larger than the size specified in dbc_max_pct; these two parameters are expressed as percentages of total physical memory on the system.

Let's say dbc_min_pct is set to 10, while dbc_max_pct is 50. This means that initially 10% of physical memory is allocated to the buffer cache. As the system needs more space to buffer files read in from disk, the buffer cache allocates more memory, and this continues until it occupies 50% of memory, its maximum size. Later, when the system requires more memory for another use, say processes, the buffer cache can shrink an appropriate amount, but it will never drop below the 10% minimum. A larger buffer cache holds more file data and minimizes access time, but leaves less memory available for other uses.
NOTE:
The buffer cache is dynamic in nature only when two other tunable parameters, bufpages and nbuf, are both set to their default values of 0.
Another example: if dbc_min_pct and dbc_max_pct are both set to the same value, say 20, the kernel will always use exactly that percentage of physical memory for the buffer cache.
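Continuing the 10%/50% example (POSIX shell; the 512 MB system size is hypothetical), the cache's floor and ceiling work out as:

```shell
# Buffer cache bounds implied by dbc_min_pct / dbc_max_pct.
phys_mb=512
dbc_min_pct=10
dbc_max_pct=50

min_mb=$(( phys_mb * dbc_min_pct / 100 ))
max_mb=$(( phys_mb * dbc_max_pct / 100 ))
# min_mb = 51 (integer arithmetic truncates 51.2), max_mb = 256
echo "buffer cache floats between ${min_mb} MB and ${max_mb} MB"
```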
Buffer Cache
[Slide diagram: the syncer process flushes modified buffer cache data from memory to files on disk.]
Student Notes
For disk writes, data flows from the buffer cache to disk. How does it get to the buffer cache? The kernel writes data to it. The syncer process takes care of flushing data in the buffer cache to the files on the disk. When a user edits a file, makes changes to that file, and saves the changes, those changes do not go to disk right away. The kernel writes the data to the buffer cache, and some time later (within 60 seconds) the data finally arrives at the disk. This time period is chosen as a balance between ensuring that the file system is fairly up-to-date in case of a crash and efficiently performing disk I/O. There are many applications that do not rely on the operating system's built-in processes to flush data to disk, but instead take over that operation themselves. In other words, they create their own buffers and manage the flushing at appropriate intervals. A common example is a database application that needs to guarantee the completion of a transaction within a specified time interval.
[Slide diagram: two processes, each composed of Text, Data, Shared Memory, and Shared Library segments; both shared memory segments attach to the same region of physical memory (the ipcs NATTCH column shows the number of attached processes).]
Student Notes
UNIX implements interprocess communications using different mechanisms. Three mechanisms that require additional system memory are semaphores, shared memory, and message queues. Semaphores are used to synchronize memory resources between competing processes. Shared memory segments are resources capable of holding (in memory) large amounts of data that can be shared between processes. Message queues hold strings of information (messages) that can be transferred between processes. Two types of processes that utilize message queues are networking and database processes.
Shared memory provides a mechanism to reduce interprocess communication costs significantly. Two processes that wish to share data map the same portion of shared memory into their address spaces. Changes made to the shared memory are seen immediately by all attached processes and do not require kernel services. So, from a kernel perspective, other than initially setting up the shared memory, there is very little cost in using shared memory.
On the slide, each process has a shared memory segment that references one and the same shared memory area. The more processes that allocate shared memory segments, the higher the memory usage. The shared memory segments in physical memory can be viewed with the ipcs -mob command or a reporting tool like glance. From time to time, segments might have to be cleaned up or removed manually if an application terminates ungracefully; this is done by the superuser with the ipcrm command. A worthwhile baseline measurement for a system administrator is to run the ipcs -mob command during a quiet period. It is also eye-opening to repeat this command when the system is at its busiest.
Processes deactivated (SO)

Amount of free memory relative to:
- lotsfree
- desfree
- minfree
Student Notes
The utilization of memory can be monitored in a number of different ways. There are multiple tools and multiple metrics that monitor memory usage. The first metrics you want to look at are those that will tell you whether vhand is active.
Amount of Paging
This indicates the level of disk activity to the swap partition. If a consistent amount of paging to swap space is occurring, then performance is impacted (most likely significantly). Next, check to see if the swapper is active.
Process Deactivations
This indicates that processes are being deactivated, meaning free memory has fallen below the MINFREE threshold. There is severe memory pressure.
Student Notes
Individual processes vary greatly in terms of the amount of memory they use. Metrics to monitor memory utilization on a per-process basis include the following:
Size of RSS/VSS
The Resident Set Size (RSS) for a process is the portion of the process (in KB) that is currently resident in physical memory. Since the entire process does not have to be in memory in order to execute, this shows how much of the process actually is resident. The Virtual Set Size (VSS) for a process is the total size of the process (in KB): if the entire process were loaded, this is how much memory it would consume. Very rarely is the entire process resident in memory; if it were, the RSS value would equal the VSS value.
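The relationship can be expressed as a percent-resident figure. An awk sketch, using the 22.4 MB RSS / 28.3 MB VSS netscape figures that appear elsewhere in this module:

```shell
# Percent of a process currently resident = RSS / VSS * 100.
rss_mb=22.4
vss_mb=28.3

pct=$(awk -v r="$rss_mb" -v v="$vss_mb" 'BEGIN { printf "%.0f", r / v * 100 }')
echo "process is ${pct}% resident"      # 79% resident
```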
Each of these three segments has a maximum size to which they can grow limited by tunable kernel parameters. They are maxtsiz, maxdsiz, and maxssiz for a 32-bit process. They are maxtsiz_64bit, maxdsiz_64bit, and maxssiz_64bit for a 64-bit process. If a process tries to grow one of these segments beyond its maximum size, then the process terminates (and in some cases core dumps).
#=> vmstat -n 5
VM
memory              page                                       faults
   avm    free     at    pi    po    fr    de    sr             in
  9140    3824      4     0     0     0     0     0            675
  9017    3500
 10292    2255
 10227     976
 10958     400
 10759     454
 13448     404
CPU
cpu           procs
us sy id      r  w
 9  5 86      1  0
24 17 60      0
67 24  9      5
67 33  0      7
67 31  3      8
62 20 18      6
32 15 53      0
Student Notes
A useful command to view virtual memory statistics is vmstat. The slide shows vmstat's output being updated every 5 seconds. When viewing vmstat output, always keep an eye on the po (pages paged out per second) column; ideally this should be zero, indicating that no page-outs are occurring. The fr (pages freed per second) and sr (pages scanned by the clock algorithm per second) columns show the actual behavior of the vhand algorithm.
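Scanning a capture of vmstat output for these columns is easy to script. A sketch (the sample lines are invented, and the field positions assume the plain avm/free/re/at/pi/po/fr/de/sr column order; adjust for your output format):

```shell
# Count intervals showing memory pressure: po > 0 (page-outs occurring)
# or sr > 0 (clock hands scanning). Fields: $6 = po, $9 = sr here.
vmstat_sample='
 9140 3824 0 4 0  0  0 0  0
10227  976 0 9 0 48 26 0 91
10958  400 0 9 0 51 24 0 98
'
alerts=$(echo "$vmstat_sample" |
    awk 'NF >= 9 && ($6 > 0 || $9 > 0) { c++ } END { print c + 0 }')
echo "$alerts intervals showed paging pressure"     # 2 intervals
```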
Output Headings

procs
  r     In run queue
  b     Blocked for resources (I/O, paging, and so on)
  w     Runnable or short sleeper (less than 20 seconds) but deactivated

memory
  avm   Active virtual pages (run during the last 20 seconds)
  free  Size of the free list (in 4K pages)

page
  re    Page reclaims per second
  at    Address translation faults per second (page faults)
  pi    Pages paged in per second
  po    Pages paged out per second
  fr    Pages freed per second
  de    Anticipated short-term memory shortfall
  sr    Pages scanned by the clock algorithm per second

faults
  in    Non-clock device interrupts per second
  sy    System calls per second
  cs    CPU context switches per second

cpu
  us    Percentage of time CPU spent in user mode
  sy    Percentage of time CPU spent in system mode
  id    Percentage of time CPU is idle

with -S option
  si    Processes swapped in per second
  so    Processes swapped out per second
B3692A GlancePlus B.10.12      17:33:59   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                           |    22%     29%     51%
Disk Util                                           |     1%      7%     13%
Mem  Util                                           |    91%     91%     91%
Swap Util                                           |    25%     24%     35%
--------------------------------------------------------------------------------
MEMORY REPORT                                                    Users=   19
Event             Current   Cumulative   Current Rate   Cum Rate   High Rate
--------------------------------------------------------------------------------
Page Faults            78          287            7.5       24.3       139.3
Paging Requests         3           21            0.2        1.7        12.0
KB Paged In          52kb        336kb            5.0       28.4       189.3
KB Paged Out          0kb          0kb            0.0        0.0         0.0
Reactivations           0            0            0.0        0.0         0.0
Deactivations           0            0            0.0        0.0         0.0
KB Reactivated        0kb          0kb            0.0        0.0         0.0
KB Deactivated        0kb          0kb            0.0        0.0         0.0
VM Reads                3            6            0.2        0.5         2.0
VM Writes               0            0            0.0        0.0         0.0

Total VM : 78.9mb   Sys Mem  : 10.6mb   User Mem: 78.0mb   Phys Mem: 128.0mb
Active VM: 23.4mb   Buf Cache: 19.1mb   Free Mem: 20.3mb         Page 1 of 1
Student Notes
glance has extensive memory monitoring abilities. Like vmstat, it can give paging statistics, in addition to showing if any processes are being deactivated. Remember, this is an indication of severe memory shortage. There is other valuable information on this report, such as the statistics at the bottom showing the current Dynamic Buffer Cache size, the current amount of Free Memory, and the total Physical Memory in the system.
B3692A GlancePlus B.10.12      14:52:27   e2403roc   9000/856   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                           |    22%     29%     51%
Disk Util                                           |     1%      7%     13%
Mem  Util                                           |    91%     91%     91%
Swap Util                                           |    25%     24%     35%
--------------------------------------------------------------------------------
PROCESS LIST                                                     Users=   11
                                   User      CPU Util    Cum    Disk          Thd
Process Name    PID   PPID  Pri  Name      ( 100 max)    CPU  IO Rate   RSS   Cnt
--------------------------------------------------------------------------------
netscape      16013  12988  154  sohrab    12.9/14.0    64.9  0.0/ 0.6  14.7mb  1
supsched         18      0  100  root       2.9/ 2.1   942.6  0.0/ 0.0    16kb  1
lmx.srv        1219   1121  154  root       1.6/ 0.9   389.4  0.5/ 0.0   2.7mb  1
glance        15726  15396  156  root       0.6/ 0.9     2.0  0.0/ 0.2   4.0mb  1
statdaemon        3      0  128  root       0.6/ 0.7   302.1  0.0/ 0.0    16kb  1
midaemon       1051   1050   50  root       0.4/ 0.4   201.4  0.0/ 0.0   1.3mb  2
ttisr             7      0  -32  root       0.4/ 0.3   121.0  0.0/ 0.0    16kb  1
dtterm        15559  15558  154  roc        0.4/ 0.4     1.6  0.0/ 0.0   6.2mb  1
rep_server     1098   1084  154  root       0.2/ 0.1    23.7  0.0/ 0.0   2.0mb  1
syncer          325      1  154  root       0.2/ 0.0    20.2  0.1/ 0.0   1.0mb  1
xload         13569  13531  154  al         0.2/ 0.0     2.4  0.0/ 0.0   2.6mb  1
                                                                   Page 1 of 13
Student Notes
The glance Process List report can be used to monitor process statistics, including how much memory processes are currently consuming. The highlighted column, RSS (Resident Set Size), shows memory being used on a per-process basis. Very simply put, this helps to identify the "memory hogs" on the system. For example, the process called netscape has an RSS of 14.7 MB, while statdaemon is minimal. Other large processes include glance, xload, and dtterm. What do all these processes have in common? They are all GUI (graphical user interface) programs running as windows in a graphical window environment. Moral: programs that open their own windows are relatively memory-intensive and should be minimized. Users should be encouraged not to leave several windows open on their screens if they do not have a continuing need for them.
B3692A GlancePlus C.03.70.00   15:52:03  r206c42   9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                                 15%    15%    15%
Disk Util                                                  1%     0%     2%
Mem  Util                                                 96%    96%    96%
Swap Util                                                 15%    15%    15%
--------------------------------------------------------------------------------
Resources PID: 28030, glance       PPID: 27993      euid:   0    User: root
--------------------------------------------------------------------------------
CPU Usage  (util):   0.1    Log Reads :     1    Wait Reason    :   STRMS
User/Nice/RT CPU :   0.1    Log Writes:     0    Total RSS/VSS  :  3.6mb/ 5.6mb
System CPU       :   0.0    Phy Reads :     0    Traps / Vfaults:   1/ 10
Interrupt CPU    :   0.0    Phy Writes:     0    Faults Mem/Disk:   6/ 0
Cont Switch CPU  :   0.0    FS Reads  :     0    Deactivations  :   0
Scheduler        :  HPUX    FS Writes :     0    Forks & Vforks :   0
Priority         :   154    VM Reads  :     0    Signals Recd   :   0
Nice Value       :    10    VM Writes :     0    Mesg Sent/Recd :   0/ 0
Dispatches       :     6    Sys Reads :     0    Other Log Rd/Wt:  38/ 172
Forced CSwitch   :     0    Sys Writes:     0    Other Phy Rd/Wt:   0/ 0
VoluntaryCSwitch :     4    Raw Reads :     0    Proc Start Time
Running CPU      :     0    Raw Writes:     0    Tue Mar 16 15:49:14 2004
CPU Switches     :     0    Bytes Xfer:   0kb
C - cum/interval toggle    % - pct/absolute toggle                  Page 1 of 1
Student Notes
The glance Individual Process report displays memory usage for an individual process, including the RSS and VSS sizes for the process. Also displayed, on a per-process basis, are the VM reads and VM writes performed by the process. These indicate how much paging from and to the swap device the individual process is performing. If performance is poor for an individual process, this is a good field to check.
B3692A GlancePlus C.03.70.00   15:58:40  r206c42   9000/800   Current  Avg  High
--------------------------------------------------------------------------------
CPU  Util                                                 15%    15%    15%
Disk Util                                                  0%     0%     4%
Mem  Util                                                 96%    96%    96%
Swap Util                                                 15%    21%    45%
--------------------------------------------------------------------------------
SYSTEM TABLES REPORT                                                 Users=    1
System Table             Available   Requested       Used       High
--------------------------------------------------------------------------------
Inode Cache (ninode)          2884          na        645        645
Shared Memory               12.5gb                  11.1mb
Message Buffers              800kb          na        0kb        0kb
Buffer Cache               314.4mb          na    314.4mb         na
Buffer Cache Min            32.0mb
Buffer Cache Max           320.0mb
DNLC Cache                    8004

Model : 9000/800/A400-6X     OS Name : HP-UX      OS Release: B.11.11
OS Kernel Type: 64 bits      Phys Memory : 640.0mb
Number CPUs : 1              Network Interfaces : 2
Number Disks: 2              Number Swap Areas  : 1
Mem Region Max Page Size: 1024mb    Avail Volume Groups: 2
                                                                     Page 2 of 2
Student Notes
The glance System Table report displays the size of kernel tables in memory, and the current utilization of these tables. It is important not to set the size of these tables too large, as the tables are memory resident (and the bigger the table, the more memory it consumes). Yet, it is even more important that enough resources be allocated so that the kernel does not have to wait for a resource to become free (or even error out) when a particular resource is requested. The Available column displays the total size of the particular table, and the Used column shows how many entries within the table are currently being used. In general, the Used value should not be close to the Available value. If it is, then the kernel is close to running out of that particular resource. The High column shows the high-water mark for the resource since glance has been running. Also of interest in this report are the buffer cache statistics, especially the Buffer Cache line, which shows the current size of the buffer cache.
NOTE:
There are two pages to this report. Shown here is the second page of this report. More system tables are shown on the first page.
Student Notes
An obvious hardware solution to a memory bottleneck is to add more physical memory. While this solution requires an outlay of money, it may pay for itself quickly by saving the system administrator hours of time looking for ways to reduce memory consumption. If adding more memory is not an option, then a second hardware suggestion is to look at the use of X terminals on the system. An X terminal typically consumes a large portion of memory: 3-4 MB of memory for light application usage, and as much as 10-20+ MB for heavy application usage. These figures do not take into account any additional RAM that the system will use for window managers or any other X-related overhead.
Reduce dbc_max_pct (max size of dynamic buffer cache).
Identify programs with memory leaks.
Check for unreferenced shared memory segments.
Use the serialize command to reduce process thrashing.
Use PRM to prioritize memory allocation.
Student Notes
Quite often, users will run X Windows-type programs to enhance the look of their desktop. Examples include an X-eyes program, a bouncing ball program, or fancy screen savers. All of these graphical programs consume system resources, including memory. The biggest consumer of memory will most likely be the buffer cache. We saw earlier that if the buffer cache is dynamic, it will grow to its maximum size, as long as memory is available. The problem arises when a process needs additional memory and free memory is below LOTSFREE: the buffer cache is slow to shrink (if it shrinks at all!), causing paging to occur among the processes. To prevent this situation, the tunable parameter dbc_max_pct should be tuned to limit the maximum size to which the buffer cache can grow. A recommendation for dbc_max_pct is 25 or less. Programs with memory leaks will allocate memory and then stop using it without returning it to the system for use elsewhere. These programs may require you to shut them down periodically to release the memory. They may even require you to reboot the system occasionally to reclaim the memory. There are a number of third-party tools, such as Purify, that will help you locate memory leaks in applications.
Unreferenced shared memory segments can also be a problem. An application sets one up and then forgets to deallocate it when the application exits. Here is a possible procedure for locating abandoned shared memory segments. First, look for any shared memory segments that have no processes attached to them:
# ipcs -ma
Note which shared memory segments have a 0 in the NATTCH column. If they are owned by root, let them stay. Otherwise, write down their ID numbers and their CPID numbers. Second, one at a time, find out whether the creating process still exists:
# ps -el | grep <CPID number>
If it does, it's probably just a quiescent segment. But if not, the segment is probably abandoned. Finally, remove the segment:
# ipcrm -m <ID number>
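The scan described above can be partially automated. The sketch below is illustrative only: the awk column positions are assumptions (check the header of your own ipcs -ma output before relying on them), and the demo feeds canned sample data rather than live ipcs output.

```shell
# Filter ipcs -ma style rows for segments with NATTCH == 0 that are not
# owned by root, printing the segment ID and creator PID for follow-up.
# Assumed column order (hypothetical): ID  OWNER  NATTCH  CPID
find_candidates() {
    awk '$3 == 0 && $2 != "root" { print $1, $4 }'
}

# Demo with canned data; on a real system you would pipe in a projection
# of the ID, OWNER, NATTCH, and CPID columns from ipcs -ma.
find_candidates <<'EOF'
4099 root 0 1234
4100 appuser 0 5678
4101 appuser 2 5679
EOF
```

For each ID/CPID pair printed, check the creator with ps -el | grep on the CPID and, if the process is gone, remove the segment with ipcrm -m on the ID.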
The serialize command will be discussed later in this chapter. You may wish to use PRM to control your memory resource and its allocation.
Student Notes
Since we are discussing system memory and performance, there is one other topic we should consider: hardware-based memory page access control. The processor architecture has several features to ensure that a process thread cannot access areas of physical memory that are not part of its process space. An in-depth discussion of page access control is presented in the HP-UX training course "Inside HP-UX" (course number H5081S), and we won't attempt to recreate it here. There is one particular aspect of this hardware feature that we will spend some time discussing, though, and that is "Protection IDs". Every discrete region of virtual memory assigned to a process (text space, private data space, shared memory space, shared library data space, etc.) is assigned a unique ID "key", called an Access Key. Any process attempting to access that memory space must have a copy of a matching ID "key", called a Protection Key. To speed things up, the most frequently or likely used Protection Keys are kept in processor registers. (These registers are part of a process thread's "context" and are preserved across switches and interrupts.) The hardware performs the protection check as part of the actual memory access instruction.
Now here is the catch: there is only room in the control registers for a limited number of frequently used Protection Keys. The rest are stored in kernel space in memory management tables, which are accessed when a protection ID fault occurs. The fault handler will search for and find these other "keys" when they are needed, but at the cost of CPU cycles! To better understand the dynamics of this process, consider the following analogy:
[Slide diagram: processes I, J, K, and L in memory alongside the kernel OS tables, with swap space on disk]
Student Notes
The serialize(1) command can help if a system has a number of large processes and is experiencing memory pressure. The serialize command allows these big processes to run one after another, instead of all at the same time. By running the processes sequentially, rather than in parallel, the CPU can spend more time executing the process code (i.e., user mode) and less time managing the competing processes (i.e., kernel mode).
Thrashing
On systems with very demanding memory needs (for example, systems that run many large processes), the paging daemons can become so busy moving pages in and out that the system spends too much time paging and not enough time running processes. When this happens, system performance degrades rapidly, sometimes to such a degree that nothing seems to be happening. At this point, the system is said to be thrashing, meaning it is doing more overhead work than productive work.
2. Before starting the background processes, look up the current value for maxdsiz using the kmtune command on 11i v1 and the kctune command on 11i v2. On the rp2430:
# kmtune -lq maxdsiz
On the rx2600:
# kctune maxdsiz
The default maxdsiz on 11i v2 is 1 GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise by:
# kctune maxdsiz=0x10000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? n
   NOTE: The backup will not be updated.
 * The requested changes have been applied to the currently running system.
Tunable    Value                 Expression  Changes
maxdsiz    (before) 1073741824   Default     Immed
           (now)    0x10000000   0x10000000
Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have? # vmstat 2 2
4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions fairly quickly, before the memory leaks get too large. What is the current amount of free memory? What is the size of the buffer cache? Is there any paging to the swap space? How much swap space is currently reserved? Which process has the largest Resident Set Size (RSS)? What is the data segment size of the process with the largest RSS?
5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while. Please be patient. Observe the behavior of the system when this occurs. What happens when the process reaches its maximum data size? Why does disk utilization become so high at this point?
6. As the other processes grow towards their maximum data segment size, continue to monitor the following: Free memory Swap space reserved The size of the processes' data segments The RSS of the processes The number of page-outs/page-ins to the swap space
7. Run the two baseline programs, short and diskread. # timex /home/h4262/baseline/short # timex /home/h4262/baseline/diskread How does the performance of these programs compare to their earlier runs?
8. When finished monitoring the behavior of processes with memory leaks, clean up the processes. Exit glance. Execute the KILLIT script: # ./KILLIT If you changed maxdsiz, change it back:
# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? n
   NOTE: The backup will not be updated.
 * The requested changes have been applied to the currently running system.
Tunable    Value                 Expression  Changes
maxdsiz    (before) 0x10000000   0x10000000  Immed
           (now)    0x40000000   0x40000000
Reserved: 20 MB Used : 0 MB
[Slide diagram: processes filling memory; a new program waiting on disk]
New program wants to execute; not enough space for program to fit into memory.
Student Notes
The purpose of swap space is to relieve the pressure on memory when memory becomes too full. When free memory falls below a certain threshold, processes (or parts of processes) will be written out to the swap partition on disk in order to free up space in memory for other processes. For simplicity, the above slide assumes each process is 1 MB in size, and the amount of available memory for process execution is 20 MB. The slide also assumes (for simplicity) that each process reserves 1 MB on the swap partition each time it executes. Therefore, since 20 processes are currently present in memory (as shown on the slide), 20 MB of swap space has been reserved: 1 MB for each process. The HP-UX operating system reserves swap space for each process that executes on the system. The reservation of swap space is done so that the operating system knows how much swap space potentially may be needed for all the processes currently running on the system. For example, if all the processes in memory were to be swapped out, the operating system would know it had enough swap space to perform that function.
Analogy
A good analogy for swap space reservation is a hotel that takes room reservations. When a hotel takes a reservation, it subtracts one from the count of available rooms. If a hotel had 55 rooms, and it took 20 reservations, then it would only have 35 rooms still available, even though none of the 55 rooms were currently occupied. The same holds true for swap space. In the above example, a total of 55 MB of swap space exists; 20 MB of the space is reserved by processes currently running in memory, even though none of the processes are currently using the swap space they have reserved. To take the analogy even further, the hotel does not earmark a particular room to satisfy a reservation. Room assignments are made when the occupant shows up at the front desk. Likewise, a swap reservation is not associated with a particular block out on the swap device. Only when the kernel actually wants to move a page in memory out to the swap device does it select a block. It knows it has the swap space available. It just doesn't know where it is until it needs to use it.
Current Situation
In the above slide, all the memory is in use by the 20 processes. Now assume a new program from disk wants to execute. What happens? How does it fit in memory if all the memory is in use?
Reserved: 20 MB Used : 1 MB
[Slide diagram: process 3 written out to swap; the new program loaded into memory from disk]
Student Notes
Below is the basic sequence of steps that occurs when a new process wants to execute and there is not enough memory available: 1. The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. The process selected is one that is not expected to execute in the near future. 2. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly and the amount of swap space reserved is decremented by the same amount. 3. The new program which wants to execute reserves swap space for itself. The amount of swap space reserved is incremented accordingly. 4. The new program is copied into memory and the operating system initializes the process. The new process uses the physical memory that was just freed.
TYPE      Mb AVAIL   Mb USED   Mb FREE   Mb RESERVE   PRI
dev             32         1        31            -     1
fs              23         0        23            0     1
reserve          -        20       -20
total           55        21        34            0     -
Student Notes
The swapinfo command displays important swap-related information, including how much swap space is used and how much swap space is reserved. With today's systems, we recommend that you always use the -m option to display all spaces in MB rather than the default KB. The swapinfo -mt command shows information related to device (raw) swap partitions and file system swap space and their totals, including:

Mb AVAIL    The total amount of swap space available. For file system swap, this value may vary, as more swap space is needed.
Mb USED     The current amount of swap space being used.
Mb FREE     The current amount of swap space free. Mb FREE plus Mb USED is equal to Mb AVAIL.
PCT USED    The percentage of swap space in use on that device.
START/LIMIT    Applies only to file system swap. START specifies the starting block within the file system of the paging file. LIMIT specifies the maximum size to which the paging file can grow.

Mb RESERVED    Applies only to file system swap, and is only applicable when no limit is given to the maximum size of the paging file. In these situations, this value specifies how much file system space to reserve for user files on the file system.

PRI            The priority of the swap area. The highest priority swap areas are used first. The swap priorities range from 0 to 10. (Note: stronger priority swap areas have smaller priority numbers.)
The swapinfo command also shows how much swap space all the processes on the system are reserving currently. This is indicated by the reserve entry. The columns described above for device and file system swap do not apply to the reserve entry in the output of the swapinfo command. In the example, there are 32 MB of device swap on a raw disk, and 23 MB of swap in the /home file system, making a total of 55 MB. 1 MB is in use on the device swap and 20 MB are reserved, leaving 34 MB available.
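The column relationships above can be checked with a line of shell arithmetic. This is a sanity check of the example's numbers, not a real swapinfo parser; the row values are the ones from the slide.

```shell
# Verify the identity "Mb USED + Mb FREE = Mb AVAIL" for a swapinfo row.
check_row() {   # args: avail used free
    [ "$1" -eq "$(( $2 + $3 ))" ] && echo OK || echo MISMATCH
}

check_row 32  1 31   # dev row:   32 MB device swap, 1 MB used
check_row 23  0 23   # fs row:    23 MB file system swap, none used
check_row 55 21 34   # total row: 21 MB used (1 in use + 20 reserved), 34 free
```

All three calls print OK, confirming that the 34 MB shown as available is exactly the 55 MB total minus the 1 MB in use and the 20 MB reserved.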
Initial Allocation
[Slide diagram: CPU, kernel and OS tables, and processes in memory; swap Reserved: 20 MB, Used: 0 MB]
Current Allocation
Swap Avail : 35 MB
Disk
New program wants to execute; not enough memory for program to fit.
Student Notes
An earlier slide implied that specific space was allocated on a swap device for each process running in memory. The analogy was of a hotel subtracting one from the count of available rooms when a customer phoned in for a reservation. As mentioned earlier, specific space is not allocated on a swap device for a reservation. Instead, a variable is maintained, called SWAP_AVAIL. The SWAP_AVAIL variable is initialized when the system boots to equal the total amount of swap space available. As each new process begins executing, this variable is decremented according to the amount of swap space the process would need if its entire contents were to be swapped out. When a process terminates, it returns the amount of swap space it reserved back to the SWAP_AVAIL variable. The slide above shows what the SWAP_AVAIL variable would contain when 20 MB worth of processes is executing on the system. Each process has caused the SWAP_AVAIL variable to be decremented, but no specific space has been allocated on the swap partition. No specific swap space is allocated until processes need to be paged out, as shown on the next slide.
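The SWAP_AVAIL bookkeeping can be sketched as a toy shell model. The variable and function names here are illustrative only (the kernel does not expose such an interface); sizes are in MB and match the slide's example.

```shell
# Toy model of swap reservation: a single counter, decremented on process
# start and incremented on process exit. A start fails if the reservation
# would drive the counter negative.
SWAP_AVAIL=55          # total configured swap, as in the slide

reserve() {            # process start: reserve $1 MB or refuse
    if [ $(( SWAP_AVAIL - $1 )) -lt 0 ]; then
        echo "ERROR: no more swap space" >&2
        return 1
    fi
    SWAP_AVAIL=$(( SWAP_AVAIL - $1 ))
}
release() {            # process exit: give the reservation back
    SWAP_AVAIL=$(( SWAP_AVAIL + $1 ))
}

reserve 20                                # 20 MB of processes start
echo "SWAP_AVAIL = ${SWAP_AVAIL} MB"      # 55 - 20 = 35
release 1; reserve 1                      # one exits, another starts
echo "SWAP_AVAIL = ${SWAP_AVAIL} MB"      # back to 35
reserve 40 || echo "reservation refused"  # 35 - 40 would go negative
```

Note that a refused reservation leaves the counter untouched: the process that could not reserve swap simply never starts.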
Current Allocation
[Slide diagram: CPU, kernel and OS tables, and processes in memory; Swap (55 MB): Reserved: 20 MB, Used: 1 MB]
Disk
Student Notes
This is an updated description of the sequence of events that occurs when a program is being executed and not enough memory is available: The operating system selects a process (or portion of a process) to be written out to the swap partition on disk. Since no specific swap space has been reserved, swap space is allocated from the strongest priority swap device, first available block. Once the process is written to the swap partition, the amount of swap space used is incremented accordingly, and the old program unreserves its swap space by incrementing the SWAP_AVAIL variable. Then the new program decrements SWAP_AVAIL to reserve its swap space. In effect, the amount of swap space reserved is decremented by the amount of space being moved out to swap space and then incremented by the new reservation amount. In the slide, the process being swapped out causes the USED swap to become 1 MB, causing the SWAP_AVAIL to become 34 MB. Then the old process releases its 1 MB reservation, causing the SWAP_AVAIL to increase back to 35 MB. Finally, the new process starts up and causes the SWAP_AVAIL to decrease from 35 to 34 MB.
The new program is copied into memory, and the operating system initializes the process after it has confirmed that it can successfully reserve the needed swap for the new process (SWAP_AVAIL does not go negative when the swap reservation is made).
Current Allocation
[Slide diagram: CPU, kernel and OS tables, and 20 MB of available memory; Swap (55 MB): Reserved: 20 MB, Used: 20 MB]
Disk
Student Notes
The above slide shows the state of the system and the current swap space allocations when 20 MB (or all of available memory) has been paged out to the swap partition. The swap partition contains 20 MB worth of processes, which is the size of available memory. The initial 20 MB of processes is shaded in gray, to distinguish them from the second 20 MB of processes, which are filled with black. With this color code, we can see only 4 MB of the original processes are still loaded in memory; everything else (including 4 MB of the 21st to 40th processes) has been paged to the swap partition. The swap space allocation reflects 20 MB worth of processes that have reserved swap space, and 20 MB that is currently in use. This would be analogous to stating that a hotel received 40 room reservations, and 20 of those reservations are currently being used. The SWAP_AVAIL variable is down to 15 MB, because the total amount of swap space is 55 MB and 40 MB of that space is reserved or in use.
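The closing figure is simple arithmetic, shown here as a one-liner (values from the slide):

```shell
# SWAP_AVAIL = total swap - reserved - used
TOTAL=55; RESERVED=20; USED=20
echo "SWAP_AVAIL = $(( TOTAL - RESERVED - USED )) MB"
# prints: SWAP_AVAIL = 15 MB
```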
Current Allocation
[Slide diagram: Swap (55 MB): Reserved: 20 MB, Used: 35 MB; Swap Avail: 0 MB; the 20 MB of available memory is full, and a new program cannot start: ERROR: no more swap space]
Program
Q: Could this error have been prevented? A: YES!! Use pseudo swap.
Disk
Student Notes
The above slide shows the situation when SWAP_AVAIL equals 0 MB. In this situation, the error message "ERROR: no swap space available" is displayed, even though there is swap space available to page an existing process out to the swap partition and thus free up memory for a new program to load. The reason the system reports no swap space is that 35 MB of memory has been paged out, and the remaining 20 MB of swap space is reserved by the existing processes currently executing in memory.
Pseudo Swap
Definition: Pseudo swap is fictitious, make-believe swap space. It does NOT exist physically, but logically the operating system recognizes it.
Purpose: Pseudo swap allows more swap space to be made available than physically exists.
Benefit: Pseudo swap adds 75% of physical memory to the amount of swap space that the operating system thinks is available. This lessens swap space requirements (especially helpful on large memory systems).
**NOTE: Pseudo swap is NOT allocated in memory!
Student Notes
Pseudo swap is HP's solution for large memory customers who do not wish to purchase a large amount of disks to use for swap space. The justification for purchasing large memory systems is to prevent paging and swapping; therefore, the argument becomes, "Why purchase a lot of device swap space if the system is not expected to page or swap?" Pseudo swap is swap space that the operating system recognizes, but in reality it does not exist. Pseudo swap is make-believe swap space. It does not exist in memory; it does not exist on disk; it does not exist anywhere. However, the operating system does recognize it, which means more swap space can be reserved than physically exists. The purpose of pseudo swap is to allow more processes to run in memory than could be supported by the swap device(s). It allows the operating system (specifically the SWAP_AVAIL variable) to recognize more swap space, thereby allowing additional processes to start when all physical swap has been reserved. By having the operating system recognize more swap space than physically exists, large memory customers can now operate without having to purchase large amounts of swap space, which they will most likely never use. The size of pseudo swap is dependent on the amount of memory in the system. Specifically, the size is (approximately) 75% of physical memory. This means the SWAP_AVAIL variable
will have an additional amount (75% of physical memory) added to its content. This additional amount allows more processes to start when the physical swap has been completely reserved. NOTE: Pseudo swap is enabled through a tunable OS parameter called swapmem_on. If the value of swapmem_on is 1, then pseudo swap will be enabled (turned on). If the value of swapmem_on is 0, then pseudo swap will be disabled (turned off).
Analogy
A good analogy for pseudo swap is an airline overbooking a flight. Airlines know that customers sometimes don't show up for their flights. If they reserved only enough seats to fill the plane, they would likely depart with a plane that wasn't full, losing revenue. So they reserve more seats than actually exist on the plane, betting that a certain percentage of customers won't show. That way they can fly a plane that is much closer to full and earn more revenue. Of course, they are occasionally wrong.
Physical Memory                 = 32 MB
Pseudo Swap   (32 MB x 0.75)    = 24 MB
Physical Swap                   = 55 MB
Total Available Swap            = 79 MB
Student Notes
The above slide shows how Total Available Swap Space (also known as SWAP_AVAIL) is calculated with pseudo swap turned on. The SWAP_AVAIL variable is calculated as all of the configured physical swap space (device and file system swap) PLUS 75% of physical memory (pseudo swap). (The calculation of the size of pseudo swap is actually more complex than given here. The resultant value of pseudo swap can vary anywhere from 67% to 88% of physical memory. But we'll use 75% as a pretty typical figure.) In our example, the total amount of physical swap was 55 MB, and the amount of physical memory was 32 MB. Since the size of pseudo swap is estimated at 75% of physical memory, the pseudo swap size in our example is 24 MB.
This means the Total Available Swap Space (SWAP_AVAIL) is:
   55 MB (Physical Swap)
 + 24 MB (Pseudo Swap)
 ------
   79 MB (Total Avail Swap)
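The same sum in shell arithmetic, using the example's figures (75% is the textbook estimate; as noted above, the real figure varies from roughly 67% to 88% of physical memory):

```shell
# Estimate total available swap with pseudo swap enabled (sizes in MB).
phys_mem=32
phys_swap=55
pseudo=$(( phys_mem * 75 / 100 ))     # pseudo swap ~= 75% of physical memory
total=$(( phys_swap + pseudo ))

echo "pseudo swap:      ${pseudo} MB"   # 24 MB
echo "total avail swap: ${total} MB"    # 79 MB
```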
[Slide diagram: with pseudo swap OFF, Swap Avail: 0 MB; with pseudo swap ON, Swap Avail: 24 MB]
New program wants to execute; not enough memory for program to fit. With pseudo swap turned ON, program can now execute!
Student Notes
The above slide revisits our previous situation with pseudo swap turned ON. In our previous situation, we had swap space of 55 MB, of which 35 MB was in use and the remaining 20 MB was reserved. With pseudo swap turned OFF, we saw that no new processes could start because no physical swap space was available for reservation purposes. With pseudo swap turned ON, the total available swap space is 79 MB (not 55 MB). Therefore, when the system runs out of physical swap, it still has 24 MB (due to pseudo swap), which it thinks it can allocate and therefore can reserve. Consequently, the operating system is able to support more processes without having to allocate more physical swap space. This is important for large memory customers who do not want to purchase a lot of swap space on disk in order to support the large memory.
Swap Priorities
Equal Priorities:
  1st chunk of swap - disk 1, chunk 1
  2nd chunk of swap - disk 2, chunk 1
  3rd chunk of swap - disk 1, chunk 2
  4th chunk of swap - disk 2, chunk 2
  (5th chunk will be allocated on disk 1)

Unequal Priorities:
  1st chunk of swap - disk 1, chunk 1
  2nd chunk of swap - disk 1, chunk 2
  3rd chunk of swap - disk 1, chunk 3
  4th chunk of swap - disk 1, chunk 4
[Slide diagram: with equal priorities, chunks 1 and 3 sit on the first priority-1 device and chunks 2 and 4 on the second; with unequal priorities, chunks 1-4 all sit on the priority-1 device while the priority-2 device stays empty]
Student Notes
When the HP-UX operating system needs to page something from memory to a swap device, it selects the smallest-numbered, strongest-priority swap device. A system administrator can define a priority number for each swap device on the system. The priority numbers range from 0 to 10, with 0 being the strongest priority, and 10 being the weakest priority. If multiple swap devices are available when the system needs to page out to swap, the strongest priority swap device is used. The slide shows two examples. The first example illustrates how the system behaves when two equal priority swap devices are available. In this situation, the system alternates between the two swap devices, with the first chunk of swap being allocated on swap device #1, and the second chunk of swap being allocated on swap device #2. The second example illustrates how the system behaves when two unequal priority swap devices are available. In this situation, the system will continue to allocate chunks of swap from the lowest-numbered (strongest priority) swap device. Only when that device is 100% full will the system begin allocating chunks from the second swap device.
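The two allocation patterns on the slide can be sketched as a toy allocator. This is a simplification, not kernel code: it assumes two devices, that the stronger device never fills (the real kernel spills to the weaker device only when the stronger one is 100% full), and alternates on ties.

```shell
# Toy chunk allocator for two swap devices with given priorities.
# Smaller priority number = stronger priority, as in HP-UX (0..10).
allocate() {   # args: number-of-chunks prio-of-disk1 prio-of-disk2
    chunks=$1; p1=$2; p2=$3; i=1
    while [ "$i" -le "$chunks" ]; do
        if   [ "$p1" -lt "$p2" ]; then dev=1       # disk 1 is stronger
        elif [ "$p2" -lt "$p1" ]; then dev=2       # disk 2 is stronger
        else dev=$(( (i - 1) % 2 + 1 ))            # equal: alternate
        fi
        echo "chunk $i -> disk $dev"
        i=$(( i + 1 ))
    done
}

allocate 4 1 1   # equal priorities: disks 1, 2, 1, 2
allocate 4 1 2   # unequal priorities: all four chunks on disk 1
```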
Swap Chunks
[Slide diagram: chunks 1 and 3 on the first priority-1 swap device; chunks 2 and 4 on the second priority-1 swap device]
Space on the swap device is allocated to the kernel in increments called swapchunks. The default swapchunk size is 2 MB.
Student Notes
A swap chunk is the unit of space that the operating system allocates on swap devices. The default swap chunk size is 2 MB. In the above example, two equal-priority swap devices are available to the system. The system will allocate the first swap chunk on swap device #1, and its size will be 2 MB by default. Once this swap chunk has been filled by 512 pages (page size = 4 KB), the system will allocate a second swap chunk on swap device #2. The system continues alternating between the two devices in swap chunk increments. Swap chunks are also the unit in which swap space is allocated on file system swap devices. With file system swap devices, the operating system allocates swap space on the file system only if the space is needed; if it does not need the swap space, then it does not allocate space. When it does need swap space, it allocates the file system swap space in swap chunk sizes. Files are created, each of a size equal to a swap chunk, and named hostname.N, where N is a number from 0 on up.
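The 512-pages-per-chunk figure quoted above follows directly from the default sizes:

```shell
# Pages per swap chunk = chunk size / page size (default sizes shown).
CHUNK_KB=2048   # default swap chunk: 2 MB
PAGE_KB=4       # base page size: 4 KB
echo "$(( CHUNK_KB / PAGE_KB )) pages per swap chunk"
# prints: 512 pages per swap chunk
```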
DEV_BSIZE       Device block size. This is the size (in bytes) of a block on the disk. The default size is 1024 bytes.
swchunk         The number of blocks to allocate to the kernel when it needs swap space. The default is to allocate swap space to the kernel in 2-MB increments; the default value is 2048. The maximum value is 65,536.
maxswapchunks   The maximum number of swap chunks which can be allocated to the kernel. The default value is 256. The maximum value is 16,384.

Total swap space recognized by the kernel = maxswapchunks x swchunk x DEV_BSIZE
Defaults: 256 x 2048 x 1024 = 512 MB
Student Notes
There are two configurable parameters and one fixed, non-configurable parameter that affect swap space configuration and allocation:

DEV_BSIZE       The size in bytes of a block of disk space. The default size is 1 KB. It is not configurable.

swchunk         The number of blocks (of size DEV_BSIZE) to associate with a chunk of swap space, referred to as a swap chunk. The default value is 2048 blocks, or 2 MB. The maximum value is 65,536, or 64 MB.

maxswapchunks   The maximum number of swap chunks that will be recognized systemwide. The default value is 256. The maximum value is 16,384.
Using these defaults, the maximum amount of swap space that the operating system recognizes is 512 MB. This means if a system is configured physically for 1 GB of swap space, only 512 MB of the 1 GB will be used by the system. In order for the system to use the other 512 MB, the tunable OS parameter maxswapchunks needs to be increased to 512.
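Both figures can be checked with POSIX shell arithmetic, using the default parameter values given above:

```shell
DEV_BSIZE=1024      # bytes per block (fixed)
swchunk=2048        # blocks per swap chunk (default)
maxswapchunks=256   # swap chunks recognized systemwide (default)

# Default maximum swap recognized by the kernel, in MB; prints 512
echo $((maxswapchunks * swchunk * DEV_BSIZE / 1024 / 1024))

# maxswapchunks needed to recognize 1 GB of swap; prints 512
echo $((1024 * 1024 * 1024 / (swchunk * DEV_BSIZE)))
```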
If you were to install HP-UX on a system that had 2 GB of physical memory, the installation process would automatically increase maxswapchunks to accommodate the larger memory; in this example, it would set maxswapchunks to 1024. However, if you were to add more memory at a later date (without reinstalling the kernel), you would have to manually tune maxswapchunks to be able to allocate enough swap space and use all of your available memory, or use pseudo swap. In 11.23 (11i v2), maxswapchunks has been eliminated and is no longer an issue.
Summary
Swap space reservation
Pseudo swap
Swap priorities
Swap chunks
Swap space parameters
Student Notes
To summarize this module, all processes must reserve swap space by decrementing a variable called SWAP_AVAIL when they initialize. If this variable cannot be decremented, the process will not be able to start. To allow this variable to recognize more swap space than physically exists, setting the tunable parameter swapmem_on to 1 will turn on pseudo swap. This allows more processes to execute than the amount of swap space can support. This is not considered a problem on large memory systems, because these machines are not expected to swap.

If a system does need to swap, it will swap to the lowest-numbered (strongest) priority swap device first. The priority of a swap device is specified when the device is activated. If two swap devices have the same priority, the system will alternate between the two devices.

Swap chunks are the unit of disk space by which swap space is allocated. By default, the size of a swap chunk is 2 MB, and the system recognizes a maximum of 512 MB of swap space. If more swap space exists, the tunable parameter maxswapchunks must be increased in order for the additional swap space to be recognized. If maxswapchunks is already set to the maximum value, then increase the value of swchunk.
Directions
The following lab illustrates swap reservation, configures and de-configures pseudo swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo command to display the current swap space statistics on the system. List the MB Avail and MB Used for the following three items:

                MB Avail        MB Used
2. To see total swap space available and total swap space reserved, enter:

   # swapinfo -mt

   What is the total swap space available (including pseudo swap)?
   What is the total space reserved?
3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell. Is the swap space returned upon exiting the shell process?
4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Utilization percentage increases in glance. Type:

   # /home/h4262/memory/paging/mem256 &

   Use the process that most closely matches your physical memory size. This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete by observing the incremental increases in Current Swap Utilization in glance. The system will get slower and slower as you start more mem256 processes.

   What was the maximum number of mem256 processes that could be started?
   What prevented an additional mem256 process from being started?
   Kill all mem256 processes to restore performance.
5. Recompile the kernel, disabling pseudo swap. Use the following procedure:

   11i v1 or earlier:

   # cd /stand/build
   # /usr/lbin/sysadm/system_prep -s system
   # echo "swapmem_on 0" >> system
   # mk_kernel -s ./system
   # cd /
   # shutdown -ry 0
6. Reboot from the new kernel.

   Press any key to interrupt the boot process
   Main menu> boot pri isl
   Interact with IPL> y
   ISL> hpux (;0)/stand/build/vmunix_test

7. Once the system reboots, log in and execute swapinfo. Is there a memory entry? Why or why not? Will the same number of mem256 processes be able to execute as earlier? How many mem256 processes can be started now? Kill all mem256 processes to restore performance.
8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.

   If you did not add the second disk earlier:
# vgdisplay -v | grep Name    (Note the physical disks used by vg00)
# ioscan -fnC disk            (Note which disks are unused)
# pvcreate -f <raw_dev_file_of_unused_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>
To create the new swap device on the second disk:

# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>

Note that in our case the primary swap is 512 MB. Check swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap. Check your work.

   # swapon -p 1 /dev/vg00/swap1
   swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page
   after the end of the file system, or -f to overwrite the file system
   with paging.

   Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override:

   # swapon -p 1 -f /dev/vg00/swap1
   swapon: The kernel tunable parameter "maxswapchunks" needs to be
   increased to add paging on device /dev/vg00/swap1.

   Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space. We need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks. In 11i v2, maxswapchunks has been obsoleted and will not have to be modified.

   Recompile the kernel, increasing maxswapchunks. Use the following procedure:

   # cd /stand/build
   # echo "maxswapchunks 512" >> system
   # mk_kernel -s ./system
   # cd /
   # shutdown -ry 0
10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

    Press any key to interrupt the boot process
    Main menu> boot pri isl
    Interact with IPL> y
    ISL> hpux (;0)/stand/build/vmunix_test
And now add the new swap device:

# swapon -p 1 -f /dev/vg00/swap1

Verify that the new swap space has been recognized by the kernel:

# swapinfo -mt

Done!

11. Start enough mem256 processes to make the system start paging.
12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished.
13. If you have a single-disk system, create three additional swap devices with sizes of 20 MB:

    # lvcreate -L 20 -n swap1 vg00
    # lvcreate -L 20 -n swap2 vg00
    # lvcreate -L 20 -n swap3 vg00

    List the current amount of swap space in use. If 10 MB is currently in use on a single swap device, and we activate an equal priority swap device, what is the distribution if an additional 10 MB is paged out?

    A) The distribution would be 10 MB and 10 MB.
    B) The distribution would be 15 MB and 5 MB.

    Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.
14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.

    # swapon -p 1 /dev/vg00/swap1
    # swapon -p 2 /dev/vg00/swap2
    # swapon -p 1 /dev/vg00/swap3

    Start enough mem256 processes to make the system start paging. Is the new paging activity being distributed evenly across the paging devices?
15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo swap and remove the additional swap devices.

    For 11i v1 and earlier, follow this procedure:

    # cd /
    # shutdown -ry 0

    For 11i v2 and later, follow this procedure:

    # cd /
    # kctune swapmem_on=1
    # shutdown -ry 0
Disk Overview
Tracks
Data Blocks
Physical View
Logical View
Student Notes
Disks are used to store data for the operating system and the applications. A disk can be used in several different ways, but they boil down to just two: file system and raw. If a disk holds a file system, there are several structures built on the disk (using the data blocks of the disk) to help support the software in the kernel, which needs to access and manage the file system files and their contents. If the disks are to be used raw (such as a device swap space or an application database), no kernel structures are built out on the disk. The related code simply reads, manages, and organizes the data blocks as it sees fit.

There are several types of file systems available with the HP-UX 10.x and 11.x releases. The two primary types of local file systems are HFS (High Performance File System), which was the original file system for HP-UX and has been continually enhanced since, and JFS (Journaled File System), which was introduced with the HP-UX 10.01 release and continues to grow in popularity and functionality. In the near future, you should see another type of file system become available for HP-UX: the Advanced File System (AdvFS), ported over from Tru64 UNIX. In later modules, we will discuss the performance issues that pertain to each of the available file systems. In this module, we'll address the issues pertaining to all disks.
Physical View
From a physical disk perspective, the disk drives upon which a file system is placed contain sectors, tracks, platters, and read/write heads. A key behavior of almost all disk drives is that the read/write heads move in parallel across the platters in such a way that each read/write head is over the same track within each platter at the same time. To maximize the I/O throughput of the disk, it is desirable to minimize the amount of head movement. To help achieve this goal, all the sectors in a cylinder are addressed in sequential order.
Cylinder Analogy
Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder would be all the same lanes from each floor's jogging track. In other words, all lane 1 tracks would make up cylinder 1; all lane 2 tracks would make up cylinder 2, etc. By organizing space on disks in cylinders, the software can logically distribute its sectors across all platters of the disk evenly and uniformly. For example, in the slide above, the first 6 sectors would be allocated as follows:

block #1: Platter #1, Track #1, Sector #1
block #2: Platter #1, Track #1, Sector #2
block #3: Platter #1, Track #1, Sector #3
block #4: Platter #1, Track #1, Sector #4
block #5: Platter #1, Track #1, Sector #5
block #6: Platter #1, Track #1, Sector #6
By allocating disk space in this manner, a multiple block read (say 6 blocks) could be read in one operation.
Logical View
From a logical view, each cylinder is simply a repository for a certain amount of data, which can be read or written without having to move the heads. This data area is further broken down into blocks. The block is the most fundamental unit of data that can be read from or written to the disk. We mentioned in an earlier chapter a value in the kernel called DEV_BSIZE. It is equal to 1024 bytes. This is the block size from the kernel's perspective. The disk can be viewed as simply a series of blocks running from block 0 to block N-1, where N is the total number of blocks on the disk. The closer two blocks are to each other, the more likely they will be in the same cylinder. If they are in the same cylinder, a minimum amount of time is needed to read or write both blocks.
Filesystem
Process
File
Memory
Student Notes
Up to this point, we have looked at I/O from the standpoint of the disk. The following slide illustrates disk I/O activities from the standpoint of memory and the process initiating the I/O. The assumption here is that we are dealing with a disk that has a file system on it, so the buffer cache becomes a factor in the operation. If this were a raw disk, the buffer cache would be bypassed by all I/O operations.
3. The physical read is performed because the data was not in the buffer cache. Because physical I/O involves movement of the disk head (seek time), waiting for the data on the platter to rotate under the disk head (latency time), and moving the data from the platter into memory (transfer time), the cost of a physical I/O is high from a performance standpoint. Physical I/Os are the most time-consuming operations that the kernel performs. If the disk I/O queue is long (3 or more requests), the time spent waiting to be serviced can be longer than the time to actually service the I/O request. 4. Once the physical I/O request returns, the data is stored in the buffer cache so that future I/O requests for the same file system block can be satisfied without having to perform another physical I/O. This step completes the physical I/O initiated by the kernel. 5. The final step is to return the data to the original calling process that issued the read(). The sleeping process is awakened and transfers the desired data from the buffer (in buffer cache) to the data area of the process. Then the process returns from the read() system call. This step completes the logical I/O initiated by the process.
http://education.hp.com
Disk Controller Cache
Process Memory
Student Notes
As with reads, there are two methods for performing write() system calls: asynchronous and synchronous. Although the default write operation is asynchronous (the writing process does not sleep waiting for the write to complete), it is quite simple for a program to choose synchronous writes. It can be done by simply setting a flag on the open file before issuing the write. This can be done when the file is opened or at some later time.
Synchronous Writes
The slide shows the data flow of a synchronous write, from the time the write() system call is issued to when the write call returns to the process.

1. The process issues a synchronous write() system call.

2. Assuming the process is writing to a new file data block, a new file system block is allocated on disk and an image of that block is allocated in the buffer cache.

3. Once the data is copied from the data area of the process to the buffer cache, an I/O request is placed in the disk I/O queue for that particular disk. The calling process goes to sleep until the write is reported to be complete.
4. When the physical write is performed, the data is first copied from the buffer cache to the firmware cache on the disk drive controller. NOTE Most SCSI disk drive controllers can be configured to return an I/O complete acknowledgment at this point, rather than waiting for the data to be transferred to the physical platters. This condition is called immediate reporting.
5. The data is transferred from the disk controller cache to the platter. This operation is often the most time-consuming part of the write, as it involves seek, latency, and data transfer operations.

6. Once the data has been successfully transferred to the platters, the disk drive controller returns an I/O complete acknowledgment to the kernel (assuming this was not done in step 4 with immediate reporting).

7. The kernel, upon receiving the I/O complete acknowledgment, wakes the sleeping process, which then returns from the write call.
Asynchronous Writes
An asynchronous write does not wait for the data to get to the disk. An asynchronous write system call returns immediately upon the data being written to the buffer cache. In the diagram on the slide, the write call would return following step 2.

The advantage of asynchronous writes is performance: the process does not have to wait for the physical I/O. The disadvantage is lack of data integrity. Because the process continues executing before the data is written to disk, it can perform additional actions that are dependent upon the data being written successfully. If for some reason the data does not get written (a disk goes offline or a disk head crashes), the additional actions can leave the system in an inconsistent state.

For example, assume a database record is written asynchronously. Because it is written asynchronously, the database process continues its execution. A subsequent action is to update a corresponding entry in another table of the database located on another disk. Assume the first asynchronous write is posted to a busy disk with a long queue, and the subsequent write is posted to a disk with an empty queue. The second write finishes before the first write begins! If the system were to crash after the second write, but before the first write, the database would be out-of-sync and corrupted, because the second write assumed that the first write succeeded. There is no signaling to the writing process to let it know that a write has completed. For that, the process would have to do synchronous writes.
Student Notes
When monitoring disk I/O activity, the main metrics to monitor are:

Percent utilization of the disk drives: As utilization of the disk drives increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the disk is 50% busy as it does when the disk is idle. Therefore, we consider that a disk may be experiencing a bottleneck if the disk is 50% busy or more.

Requests in the disk I/O queue: The number of requests in the disk I/O queue is one of the best indicators of a disk performance problem. If the average number of requests is above two, then requests are forced to wait in the queue longer than the amount of time needed to service their own requests. If the average number of requests is three or greater, you should also see that the average wait time for a request is greater than the average service time.

Amount of physical I/O: If the amount of disk activity is high, it is important to investigate which disk, which logical volume, and which file system the activity is occurring on.
Buffer cache hit ratio: One reason disk activity could be high is that read or write requests are not finding corresponding disk blocks in the buffer cache. As a result, physical I/O requests are being generated to the disk.

The read cache hit ratio on the buffer cache indicates how frequently read data is found in the buffer cache. The minimum read hit ratio should be 90% or higher for optimal performance. Less than 90% indicates the buffer cache may be too small, causing (potential) excess disk activity. It may also indicate that the application is not using the buffer cache in an efficient manner, e.g., doing a lot of random I/O or very large I/O.

The write cache hit ratio on the buffer cache indicates how frequently a write to a buffer does not trigger a physical read or write to the disk. (If only a portion of a block is being written, and the image of that block is not already in a buffer, it may be necessary to read the original contents of the block into the buffer cache before modifying it with the new write data.) The minimum write cache hit ratio should be 70% or higher for optimal performance. Less than 70% indicates the buffer cache may be too small, causing (potential) excess disk activity. Again, the fault may lie with the application's use of the buffer cache.
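The 50%-busy rule of thumb for disk utilization follows from the queuing theory covered in module 1: under the usual single-server approximation, time in the system scales as 1/(1-U) for utilization U, so a 50% busy disk takes twice as long per I/O as an idle one. A quick check with awk:

```shell
# Service-time multiplier 1/(1-U) at several disk utilizations
for U in 0.50 0.75 0.90; do
    awk -v u="$U" 'BEGIN { printf "util %.0f%% -> %.1fx\n", u * 100, 1 / (1 - u) }'
done
# prints:
# util 50% -> 2.0x
# util 75% -> 4.0x
# util 90% -> 10.0x
```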
Student Notes
On a per-process basis, it is important to identify which processes are generating large amounts of disk I/O. Metrics that help to identify I/O activity on a per-process basis are:

Amount of physical and logical I/O: This indicates how much I/O the process is performing. For processes performing large amounts of I/O, the additional three metrics shown below should be investigated.

Type and amount of I/O-related system calls being generated: For each process performing high I/O, the number of read(), write(), and other I/O-related calls should be inspected.

Amount of VM reads and VM writes: If the I/O activity being generated is due to paging (VM reads and VM writes), then the problem is probably not a disk I/O problem, but more likely a memory problem.

Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files to which they are reading or writing should be inspected. For files receiving high I/O activity, consider relocating these files to other disks that are less busy. To determine how random the I/O requests are, hit <CR> frequently while looking at the list of open files for that process (in glance), then inspect how quickly the offset to each file changes and whether it is monotonically increasing or varies up and down.
Student Notes
Common causes of disk-related performance problems are shown on the slide.

Buffer cache misses cause physical I/O to occur. When the appropriate buffer is not found in the buffer cache, a physical I/O is triggered. By the way, a buffer cache can be too large as well: a very large buffer cache takes more time to search to see if the appropriate buffer exists! More on how to properly size a buffer cache will be given later in this module.

Synchronous I/O forces the write system calls to wait until the I/O physically completes. Very good for data integrity, very poor for performance.

Sequential access, with a small block size, causes excessive amounts of physical I/O.

Accessing lots of files on one disk, versus many disks, creates an imbalance of disk drive utilization. This leads to performance problems with the busy disks and underutilization of the less busy disks.

Accessing lots of disks on the same disk controller creates contention problems on the SCSI bus. You can determine this by noticing that multiple disks on the same controller have request queues that are consistently three or greater in length, and that the average time a request waits to be serviced is greater than the average time it takes to actually service the request. The individual disks may not show a disk utilization of 50% or greater! If this situation occurs, it would be best to split up the busiest disks onto separate controllers.
%busy    avque    r+w/s   blks/s   avwait   avserv
 0.60     0.50        2       35     1.55     5.07
62.40    10.51       46     2783   127.97   152.92
33.20     2.76       16     1226    42.89   143.96
54.80     8.10       31     2166   242.52   193.15
 1.20     0.50        3       39     1.97     6.72
63.80    10.84       48     2943   129.23   159.47
39.20     2.94       19     1427    38.85   154.55
61.80    19.60       36     2371   331.15   208.49
 2.20     0.50        3       45     3.85    13.04
56.40    18.40       39     2392   234.33   163.10
35.60     2.69       17     1258    39.96   138.81
62.80    18.41       36     2643   192.28   178.66
 0.20     0.50        2       35     1.01     4.86
68.60    13.00       51     3118   154.68   159.02
33.80     3.25       16     1226    47.82   147.32
60.00     5.72       33     2301   238.43   203.88
24.40     4.25       15      823    60.83   180.68
23.00     3.46       14      851    43.33   118.87
50.60    18.77       28     1846   306.13   233.36
 0.60     0.50        0        2     4.63    11.53
 1.40     1.17        2       23     9.85    21.50
Student Notes
The sar -d report shows disk activity on a per-disk-drive (spindle) basis. The key fields within this report are:

%busy    Indicates the average percent utilization of the disk over the interval (5 seconds in the slide).
avque    Indicates the average number of requests in the disk I/O queue.
avwait   Indicates the average amount of time a request spends waiting in the disk I/O queue.
avserv   Indicates the average amount of time to service a disk I/O request.
The sar -d report on the slide shows that when the disk had the most requests in the queue (19.60 and 18.77), the average wait time was at its highest. The slide also shows that there are five disk drives spread across two disk controllers. One disk controller (c0) appears to have two busy drives (t4 and t6) and a relatively low-usage drive (t5). Disk controller (c1) has two disks that are mainly idle. One performance solution here would be to balance the disk activity across the two controllers by moving one disk (say c0t4) over to the less busy disk controller (c1).
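The avque/avwait/avserv test described above is easy to automate with awk. This is only a sketch: the column layout is simplified, and the device names are hypothetical (stand-ins for the c0t4/c0t5/c0t6 drives discussed above), with avque/avwait/avserv values taken from the sample report.

```shell
# Flag devices whose queue is 3+ deep and whose average wait time
# exceeds average service time -- the bottleneck signature above.
# Columns (simplified, hypothetical layout): device avque avwait avserv
printf '%s\n' \
    "device avque avwait avserv" \
    "c0t4d0 19.60 331.15 208.49" \
    "c0t5d0  0.50   1.55   5.07" \
    "c0t6d0 18.77 306.13 233.36" |
awk 'NR > 1 && $2 >= 3 && $3 > $4 { print $1, "likely bottlenecked" }'
# prints:
# c0t4d0 likely bottlenecked
# c0t6d0 likely bottlenecked
```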
05:51:04  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
05:51:14        0        0        0        1        1       25        0        0
05:52:04        0        0        0        0        1       85        0        0
05:52:14        0        0        0        1        8       87        0        0
05:52:24        0        0        0        0        4      100        0        0
05:52:34        0        0        0        0        1      100        0        0
05:52:54        1       68       99        0        0       33        0        0
05:53:04        7    11936      100        1        2       13        0        0
05:53:14        6    19506      100        1        1        0        0        0
05:53:24       28    24147      100        1        2       65        0        0
05:53:34       64    16659      100        0       14       99        0        0
05:53:44      118      118        0        2        3       46        0        0
05:53:54        0        0        0        3        3        0        0        0
05:54:04        0        0        0       18       19        4        0        0
05:54:14      179      179        0       18       18        3        0        0
05:54:24      179      179        0       13       14        4        0        0
Average        29     3639       99        3        5       39        0        0
Student Notes
The sar -b report shows disk activity related to the buffer cache. The key fields within this report are:

bread/s   Indicates the average number of physical I/O reads per second over the interval. The term bread refers to block reads.
lread/s   Indicates the average number of logical I/O reads per second over the interval.
%rcache   Indicates the average percent read cache hit rate. This shows what percentage of read requests were satisfied through the buffer cache. Ideally, this value should be consistently 90% or greater.
bwrit/s   Indicates the average number of physical I/O writes per second over the interval. The term bwrit refers to block writes.
lwrit/s   Indicates the average number of logical I/O writes per second over the interval.
%wcache   Indicates the average percent write cache hit rate. This shows what percentage of write requests were satisfied through the buffer cache. Ideally, this value should be consistently 70% or greater.
The sar -b report on the slide shows the two extreme situations. The first extreme is a 100% cache hit rate, which occurs when there are lots of logical I/O requests and all requests are satisfied through the buffer cache, rather than having to go to disk. This is a very desirable condition. The other extreme is a 0% cache hit ratio. This occurs when every logical I/O request required a physical I/O from disk. In this case, the number of physical reads or writes is equal to the number of logical reads or writes. This is most undesirable.
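The read hit rate sar prints can be recomputed from the physical and logical read counts. Using the Average row of the report above (bread/s = 29, lread/s = 3639):

```shell
# %rcache = (logical reads - physical reads) / logical reads * 100
awk 'BEGIN { bread = 29; lread = 3639
             printf "%.0f\n", (lread - bread) / lread * 100 }'
# prints 99, matching the Average %rcache in the report
```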
B3692A GlancePlus B.10.12 06:16:25 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S R U U |100% 100% 100% Cpu Util S F Disk Util F | 83% 22% 84% Mem Util S S U | 94% 95% 96% U B B Swap Util U | 21% 21% 22% U R R -------------------------------------------------------------------------------DISK REPORT Users= 4 Req Type Requests % Rate Bytes Cum Req % Cum Rate Cum Byte -------------------------------------------------------------------------------Local Logl Rds 68 2.7 13.6 5kb 1260 7.8 9.6 3.2mb Logl Wts 2455 97.3 491.0 19.2mb 14798 92.2 112.9 114.8mb Phys Rds 10 1.7 2.0 80kb 189 5.1 1.4 1.8mb Phys Wts 565 98.3 113.0 18.9mb 3520 94.9 26.8 112.4mb User 571 99.3 114.2 18.9mb 3448 93.0 26.3 112.2mb Virt Mem 0 0.0 0.0 0kb 66 1.8 0.5 968kb System 4 0.7 0.8 32kb 195 5.3 1.4 1.2mb Raw 0 0.0 0.0 0kb 0 0.0 0.0 0kb Remote Logl Rds 0 0.0 0.0 0kb 0 0.0 0.0 0kb Logl Wts 0 0.0 0.0 0kb 0 0.0 0.0 0kb Phys Rds 0 0.0 0.0 0kb 1 100.0 0.0 0kb Phys Wts 0 0.0 0.0 0kb 0 0.0 0.0 0kb
Student Notes
The glance disk report (d key) shows local and remote I/O activity. The I/O distribution can be viewed from the following:

Logical Perspective (logical reads and logical writes)
Physical Perspective (physical reads and physical writes)
I/O Type Perspective (User, Virtual Mem, System, Raw)
Items of interest in this report include the number of logical I/O requests (read and writes), the number of physical I/O requests (reads and writes), and the ratio between the two. In the slide, disk utilization is 94% (very high), with the majority of the I/Os being writes (92%) as opposed to reads. It is also interesting to note the logical to physical write ratio is 14,798 / 3,520 or approximately 4:1, which is an acceptable write performance ratio.
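The write ratio quoted above can be verified directly from the cumulative counts in the report (14,798 logical writes versus 3,520 physical writes):

```shell
# Logical-to-physical write ratio from the glance cumulative counters
awk 'BEGIN { printf "%.1f\n", 14798 / 3520 }'
# prints 4.2, i.e. approximately 4:1
```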
B3692A GlancePlus B.10.12 06:31:12 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S S R U U |100% 100% 100% Cpu Util F Disk Util F | 83% 22% 84% Mem Util | 94% 95% 96% S S U U B B Swap Util | 21% 21% 22% U U R R -------------------------------------------------------------------------------IO BY DISK Users= 4 Idx Device Util Qlen KB/Sec Logl IO Phys IO -------------------------------------------------------------------------------1 56/52.6.0 0/ 0 0.0 0.0/ 1.8 na/ na 0.0/ 0.2 2 56/52.5.0 1/ 1 0.0 16.0/ 5.1 na/ na 2.0/ 0.7 3 56/36.4.0 78/ 9 18.2 1584.8/ 178.4 na/ na 48.0/ 5.6 4 56/36.5.0 52/ 6 3.8 932.8/ 120.5 na/ na 24.0/ 3.0 5 56/36.6.0 68/ 9 10.6 1172.8/ 154.9 na/ na 35.8/ 4.6 6 56/52.2.0 0/ 0 0.0 0.0/ 0.0 0.0/ 0.0 0.0/ 0.0
3280, disc
106.4 IOs/sec
S - Select a Disk
Student Notes
The glance disk device report (u key) shows current and average utilization of each disk drive on the system. The report also shows the current I/O queue length for each disk. This display shows basically the same information as sar -d. In the slide, three disks show utilization greater than 50% and queue lengths greater than 3. This is normally a valid reason for further investigation. The 10.6 and 18.2 queue lengths are high, but, because the average utilization of both drives is 9%, this may just be a spike in disk activity. In this case, monitor the situation further to see if the high queue lengths persist or if they were just spikes in disk usage.
B3692A GlancePlus B.10.12 06:34:41 e2403roc 9000/856 Current Avg High -------------------------------------------------------------------------------S R U U |100% 100% 100% Cpu Util S F Disk Util F | 83% 22% 84% Mem Util S S U | 94% 95% 96% U B B Swap Util U | 21% 21% 22% U R R -------------------------------------------------------------------------------IO BY LOGICAL VOLUME Users= 4 Idx Vol Group/Log Volume Open LVs LV Reads LV Writes -------------------------------------------------------------------------------1 /dev/vg00 10 0.0/ 0.0 0.0/ 0.0 2 /dev/vg00/group 0.0/ 0.0 0.0/ 0.0 3 /dev/vg00/lvol3 0.0/ 0.0 0.2/ 0.0 4 /dev/vg00/lvol2 0.0/ 0.0 0.0/ 0.0 5 /dev/vg00/lvol1 0.0/ 0.0 0.0/ 0.0 9 /dev/vg00/lvol7 0.0/ 0.0 0.0/ 0.0 10 /dev/vg00/lvol4 0.0/ 0.0 0.0/ 0.0 12 /dev/vg01 2 0.0/ 0.0 0.0/ 0.0 13 /dev/vg01/lvol1 0.0/ 0.0 105.6/ 19.2 Open Volume Groups: 2 S - Select a Volume
Student Notes
The glance logical volume report (v key) shows disk activity on a per logical volume basis. Only physical I/O activity (not logical I/O activity) is shown with this report. In the previous slide, we saw high activity across three disk drives (drives 4, 5, and 6). The logical volume report on the slide shows all this activity is being performed against one logical volume (/dev/vg01/lvol1), which implies that the logical volume is spread across three disks (a good idea, since the I/O to the logical volume is so high).
8-12. SLIDE: Disk I/O Monitoring glance System Calls per Process
B3692A GlancePlus B.10.12      06:48:15  e2403roc   9000/856  Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util                                                       |100%  100%  100%
Disk Util                                                      | 83%   22%   84%
Mem Util                                                       | 94%   95%   96%
Swap Util                                                      | 21%   21%   22%
--------------------------------------------------------------------------------
System Calls for PID: 4055, disc         PPID: 2410   euid:    0   User:root
                                        Elapsed                       Elapsed
System Call Name      ID   Count  Rate     Time     Cum Ct  CumRate   CumTime
--------------------------------------------------------------------------------
write                  4     377 754.0   0.10650     12851    477.7   4.10153
open                   5       3   6.0   0.05910       100      3.7   0.61923
close                  6       3   6.0   0.00006       100      3.7   0.00225
lseek                 19       0   0.0   0.00000        75      2.7   0.00204
ioctl                 54       3   6.0   0.00007       100      3.7   0.00259
vfork                 66       0   0.0   0.00000        25      0.9   0.34908
sigprocmask          185       0   0.0   0.00000        50      1.8   0.00088
sigaction            188       0   0.0   0.00000       150      5.5   0.01340
waitpid              200       0   0.0   0.00000        25      0.9   1.47745
Cumulative Interval:   27 secs
Student Notes
The glance system calls report (L key), available only from the select process report (s key), shows the names of the system calls being generated by the selected process. The system calls report can be viewed for individual processes (as shown on the slide), or globally for all processes on the system (Y key). Significant system calls, which typically consume a lot of time, are the file I/O related calls, such as read(), write(), open(), and close(). In the slide, the write() system call is being invoked heavily by the selected process (754 times/second) and has accounted for 4.1 seconds of the CPU's time over a 27-second period (approximately 15%).
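The "approximately 15%" figure quoted above can be checked directly from the report's numbers. A quick sketch (the values are taken from the glance screen on the slide):

```shell
# Cumulative time spent in write() as a share of the 27-second cumulative
# interval, using the CumTime value from the glance system calls report.
awk 'BEGIN {
    cum_time = 4.10153   # CumTime for write(), in seconds
    interval = 27        # cumulative interval, in seconds
    printf "write() used %.0f%% of the interval\n", cum_time / interval * 100
}'
# prints: write() used 15% of the interval
```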
Student Notes
The hardware solutions on the above slide will help to lessen the performance impact of high disk I/O on a system.

Add more disk drives and load balance across them. This spreads the I/O over more drives, decreasing the average number of I/O requests for each disk. Many smaller disks are better than a few large disks.

Add more disk controllers and load balance across them. This spreads the I/O over more controllers, decreasing the likelihood that any one disk controller will become overloaded with I/O requests.

Add faster disk drives. This decreases the amount of time it takes to service an I/O request, which decreases the amount of time requests spend waiting in the disk I/O queue.

Implement disk striping. This increases the number of disk heads having access to the striped data (the more disks striped across, the more heads accessing the data simultaneously). It also allows for overlapping seeks, meaning that one disk head can be seeking the next block while a second disk head is reading the current data block.

Implement disk mirroring. This can increase read performance, as either the primary or mirrored copy of the data can be read. In fact, the data will be read from whichever disk has the fewest I/Os pending against it. However, mirroring will negatively impact write performance: to maintain the integrity of the mirrors, duplicate writes must be done to each copy of the mirrored volume/disk. Mirroring is primarily a data integrity feature, but under the right circumstances (read-intensive data) it can improve performance as well.
8-14. SLIDE: Tuning a Disk I/O-Bound System Perform Asynchronous Meta-data I/O
Student Notes
Asynchronous I/O significantly improves write performance over synchronous I/O because the write requests (and thus the requesting processes) do not have to wait for the data to be written to the disk platters.
To view the device settings for the controller at SCSI adapter address "0" and SCSI target address 6:

   # /usr/sbin/scsictl -m ir /dev/rdsk/c0t6d0
   immediate_report = 0

To change the value of immediate reporting to ON:

   # /usr/sbin/scsictl -m ir=1 /dev/rdsk/c0t6d0

To view the changes in the device settings:

   # /usr/sbin/scsictl -a /dev/rdsk/c0t6d0
   immediate_report = 1; queue_depth = 8
8-15. SLIDE: Tuning a Disk I/O-Bound System Load Balance across Disk Controllers
(Slide: two sets of disks attached to the system, PVG1 on controller C0 and PVG2 on controller C1.)
Student Notes
Another potential solution to a disk I/O performance problem is to spread the write requests across the disk controllers as evenly as possible. This helps ensure no one controller becomes overloaded with I/O requests.
Physical volume groups (PVGs) allow disk drives to be grouped based on the disk controller to which they are attached. Used in conjunction with LVM mirroring, PVGs ensure the mirrored data not only goes to a different disk, but also goes to a different PVG (that is, a different disk controller).
The PVGs are defined in the /etc/lvmpvg file. This file can be manually edited or updated with the -g option to the vgcreate and vgextend commands. A sample /etc/lvmpvg file, based on the four disks on the slide, is:

   VG /dev/vg01
   PVG PV_group0
   /dev/dsk/c0t6d0
   /dev/dsk/c0t5d0
   PVG PV_group1
   /dev/dsk/c2t5d0
   /dev/dsk/c2t4d0
Configuring LVM to Mirror to Different PVGs
The command to configure LVM mirroring across different PVGs is lvchange. The strict option to this command, -s, accepts the following three arguments:

   y   All mirrored copies must reside on different disks.
   n   Mirrored copies can reside on the same disk as the primary copy.
   g   All mirrored copies must reside within different PVGs.
8-16. SLIDE: Tuning a Disk I/O-Bound System Load Balance across Disk Drives
(Slide: without striping, the disk holding blocks 1-6 runs at 100% utilization while another disk sits at only 5%; with striping, stripes 1, 3, 5, 7, 9, 11 go to one disk and stripes 2, 4, 6, 8, 10, 12 to the other, and utilization evens out at about 52% per disk.)
Student Notes
Balancing the disk activity so that the utilization across drives is approximately the same helps to ensure that no one disk becomes overloaded with I/O requests (that is, 50% or greater utilization, with three or more requests in the disk queue). The slide illustrates a situation in which one disk is heavily utilized (100%) while another disk is only 5% utilized. One potential solution is to stripe the heavily utilized logical volume on the first disk to both disks.
LVM Striping
The ability to stripe a logical volume across multiple disks (at a file system block level) was introduced into LVM at the HP-UX 10.01 release. A logical volume must be configured for striping at the time of creation. Once a logical volume is created, it cannot be striped without recreating the logical volume.
The command to create a striped logical volume is lvcreate. The syntax, related to striping, for this command is:
lvcreate -i [number of disks] -I [stripe size] -L [size in MB] vg_name
Example:
   # lvcreate -i 2 -I 8 /dev/vg01
   # lvextend -L 50 /dev/vg01/lvol2 /dev/dsk/c0t5d0 /dev/dsk/c0t4d0
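As an illustration (a sketch, not lvcreate output) of how a two-disk, 8 KB stripe lays data out: consecutive 8 KB stripes alternate between the two disks, which is what evens out the utilization.

```shell
# For a logical volume striped across 2 disks with an 8 KB stripe size,
# stripe n (counting from 0) lands on disk (n mod 2).
stripe_kb=8
disks=2
for offset_kb in 0 8 16 24 32 40; do
    stripe=$(( offset_kb / stripe_kb ))
    echo "offset ${offset_kb} KB -> disk $(( stripe % disks ))"
done
# the disks alternate: 0, 1, 0, 1, 0, 1
```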
(Slide: the dynamic buffer cache grows and shrinks in memory between the defaults dbc_min_pct = 5% and dbc_max_pct = 50%.)
Student Notes
With the introduction of HP-UX 10.0, the buffer cache becomes dynamic, growing and shrinking between a minimum size and a maximum size. NOTE: Space for the buffer cache is allocated in two different areas of memory: the minimum size is created in the O/S area of memory, and anything above the minimum size is allocated from the User Process area.
The main point is that if there is available memory, the buffer cache will grow into this memory until there is no memory left (or until the buffer cache reaches its maximum size).
If, however, your buffer cache requirements change rapidly over time, you probably would be better served with a fixed-size buffer cache, properly sized to give you adequate buffers most of the time. Only on relatively rare occasions, would buffer cache be a bottleneck and only for short periods. In the long run, your performance would be better than trying to deal with the rapidly changing needs using a dynamic buffer cache.
Next, execute the make_files program to create five 4-MB ASCII files.

   # cd /vxfs
   # ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

   # cd /
   # umount /vxfs
   # mount /dev/vg00/vxfs /vxfs
   # cd /vxfs
4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:

   # timex cat file* > /dev/null

   real:                glance Disk Report
   user:                Logl Rds:
   sys:                 Phys Rds:
5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:

   # timex cat file* > /dev/null

   real:                glance Disk Report
   user:                Logl Rds:
   sys:                 Phys Rds:

NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.
6. The sar -d report. Exit glance, and in the second window start:

   # sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

   # timex ./disk_long

   How busy did the disk get?
   What was the average number of requests in the I/O queue?
   What was the average wait time in the I/O queue?
   How much real time did the task take?
7. The glance I/O by Disk report. Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long, timing the execution. Record results below:

   # ./disk_long

   glance I/O by Disk Report
   Util:
   Qlen:
8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long, timing the execution. Record results below:

   # ./disk_long

   glance I/O by File System Report
   Logl I/O:
   Phys I/O:
9. Performance tuning immediate reporting. Ensure the immediate reporting options are set for the disk that the file system is located on. If immediate reporting is not set, set it.

   # scsictl -m ir /dev/rdsk/cXtXdX     (to report current "ir" status)
   # scsictl -m ir=1 /dev/rdsk/cXtXdX   (ir=1 to set, ir=0 to clear)

Purge the contents of buffer cache.

   # cd /
   # umount /vxfs
   # mount /dev/vg00/vxfs /vxfs
   # cd /vxfs
10. The sar -d report. Exit glance, and in the second window start:

   # sar -d 5 200

From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files).

   # timex ./disk_long

   How busy did the disk get?
   What was the average number of requests in the I/O queue?
   What was the average wait time in the I/O queue?
   How much real time did the task take?
(Slide: HFS disk layout. Physical view: the tracks of several adjacent cylinders, across all platters, form a cylinder group. Logical view: the primary superblock, followed by cylinder groups 1 through N; each cylinder group contains a redundant superblock, a cylinder group header, an inode table, and data blocks.)
Student Notes
The HFS model is a foundation for all other file system variants. We will begin our discussion of File System performance using the HFS file system model.
Physical View
From a physical disk perspective, the disk drive upon which a file system is placed contains sectors, tracks, platters, and disk heads. A key behavior of most disk drives is that the disk heads move in parallel across the platters in such a way that each disk head is over the same track within each platter at the same time. To maximize the file system I/O throughput of the disk, it is desirable to have as many file blocks close to each other as possible, to minimize the time it takes to read or write the various blocks of a file. To help achieve this goal, the blocks on the disk are allocated to the HFS file system in units called cylinder groups. A cylinder group is all the tracks, from every platter, of several adjacent cylinders, grouped together.
Cylinder Group Analogy
Consider a health spa or gym with three floors. Each floor contains a jogging track, and the three jogging tracks are located directly above or beneath one another from floor to floor. From this point of view, a cylinder group would be the same group of lanes from each floor's jogging track. In other words, all lane 1, 2, and 3 tracks would make up cylinder group 1; all lane 4, 5, and 6 tracks would make up cylinder group 2; and so on. By organizing space on disks in cylinder group units, the HFS file system can logically keep all the blocks of a given file close to each other. For example, in the slide above, the first six blocks of a file might be allocated as follows:

   File block #1:  Platter #1, Track #1, Sector #1
   File block #2:  Platter #1, Track #1, Sector #2
   File block #3:  Platter #1, Track #3, Sector #5
   File block #4:  Platter #1, Track #3, Sector #6
   File block #5:  Platter #2, Track #7, Sector #10
   File block #6:  Platter #3, Track #9, Sector #7
By allocating file system space in this manner, a multiple block read (say 6 blocks) could be read with less than six separate reads. In the example above, file blocks 1 and 2 could be read with one read operation, followed by a head switch (no carriage movement) to track 3, another read for file blocks 3 and 4, a short seek to the next cylinder and a head switch to read file block 5, and repeat for file block 6. Four reads could then read the six blocks. The more contiguous the blocks that make up the file, the more efficient the reads and writes can be.
Logical View
From a logical perspective, an HFS file system contains a series of cylinder groups. Even though the physical cylinder groups are laid out from top to bottom, spanning all the platters, logically we view the cylinder groups as horizontal units going from left to right. The HFS file system is made up of multiple cylinder groups, where the number of cylinder groups depends on the size of the file system. In the slide, we assume the HFS file system takes the whole disk; therefore, there are N cylinder groups in the sample file system. Typically, they are numbered from 0 to N-1. A critical data structure contained within every HFS file system is the primary superblock. The primary superblock is located at the start of every HFS file system, at the start of the first cylinder group, and contains the critical header information for the HFS file system. Data structures contained within the superblock include the free block list, the mount flag, the starting address of each cylinder group, and much more.
Redundant Superblock
Inode Table
Inode Structure
Student Notes
An inode contains all the header information for a particular file. Every file has a corresponding inode, usually located within the same cylinder group as the file. Fields contained within the inode include:

   File type
   File access permissions
   Number of hard links to the file
   Owner and group of the file
   Size of the file in bytes
   Time stamps (file access, file modification, inode changes)
   Data block pointers (direct and indirect)

NOTE: Although the size of the inode differs from one type of file system to another, the basic types of data contained are virtually the same; the main differences are in the data pointer structures.
Inode Extension
Student Notes
One of the structures within each HFS inode is the array of data block pointers that reference the data blocks within the file. The size of the data block pointer array is 15 entries, meaning there are a maximum of 15 file system block addresses within the array. The first 12 addresses within the data block pointer array are direct access addresses. The thirteenth entry is a single indirection block address, the fourteenth is a double indirection block address, and the fifteenth (and last) entry is a triple indirection block address.
Direct Access
A direct access address points directly to a file's data block. When accessing a file using a direct access address, a minimum of two logical I/Os are needed: one I/O to access the file's inode (containing the direct access address), and one I/O to access the file's corresponding data block.
Single Indirection
Single indirection implies the address within the inode references a block on disk that acts as an inode extension block. The inode extension block, in turn, contains addresses that point to the file's corresponding data blocks. It should be noted that three logical I/Os are needed to access a file's data blocks using single indirection: one I/O for the file's inode, one I/O for the inode extension block, and one I/O for the data block itself.
Double Indirection
Double indirection means access to a file's data blocks requires going through two inode extension blocks. The first inode extension block references the address of a second inode extension block, which contains addresses referencing the file's data blocks. Double indirection is needed only for files above 16 MB (with a default block size of 8 KB). When accessing files requiring double indirection, a total of four logical I/Os is required: an I/O for the file's inode, an I/O for each of the two inode extension blocks, and an I/O for the file's data block.
Triple Indirection
Triple indirection (not shown on the slide) adds one more level of indirection when accessing a file's data blocks. Triple indirection is only needed to access files larger than 32 GB (with a default block size of 8KB). NOTE: Every level of indirection adds an additional logical I/O when accessing the file's data. In the case of triple indirection, five logical I/Os are needed compared to two I/Os for direct access data blocks.
As you can see, the performance of an HFS file system tends to favor small files (12 blocks or less), and tends to penalize large files that have to use single, double, or even triple indirection. You can delay this performance degradation somewhat, by building the file systems with larger block sizes. (More on that later in the module.)
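The 16 MB and 32 GB thresholds quoted above follow directly from the pointer arithmetic. A quick sketch, assuming the default 8 KB block size and 4-byte block addresses (2048 addresses per indirect block):

```shell
# Largest file size reachable at each level of the HFS inode pointer array.
block_kb=8
addrs=2048   # 8192-byte indirect block / 4-byte block address

direct_kb=$(( 12 * block_kb ))                        # 12 direct pointers
single_kb=$(( direct_kb + addrs * block_kb ))         # + one indirect block
double_kb=$(( single_kb + addrs * addrs * block_kb )) # + a block of indirect blocks

echo "direct pointers reach:      ${direct_kb} KB"
echo "single indirection reaches: $(( single_kb / 1024 )) MB"
echo "double indirection reaches: $(( double_kb / 1024 / 1024 )) GB"
# prints 96 KB, 16 MB, and 32 GB respectively
```

Files larger than roughly 96 KB need single indirection, larger than roughly 16 MB need double indirection, and larger than roughly 32 GB would need triple indirection.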
9-4. SLIDE: How Many Logical I/Os Does It Take to Access /etc/passwd?
(Slide: the lookup chain for /etc/passwd, starting from the root directory's inode 2, through the root directory's data block 74, the etc directory's inode and data block 717, and finally the passwd file's inode and data block 2240.)
Student Notes
The above slide illustrates how a file within the HFS file system is accessed. It may surprise some people when they find out how many logical I/Os are needed to access the /etc/passwd file.
From this information, the kernel discovers the inode for the etc directory (in /) is 504. Inode 504 is then read (third logical I/O), and from that the kernel learns the etc directory is located at file system block 717. Block 717 is read (fourth logical I/O), and the file names and inode numbers contained within that directory are now known. One of the entries within block 717 is the passwd file and its corresponding inode number, 1824. Inode 1824 is read (fifth I/O), and from this the kernel finally learns that block 2240 is the one that contains the contents of the /etc/passwd file. Block 2240 is read (sixth I/O), and the kernel finally has the data it set out to access. So, the answer to the question at the top of the slide, "How many logical I/Os does it take to access /etc/passwd?", is six.
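The pattern generalizes: in the best case (every directory and file reachable through direct pointers, nothing cached), each path component plus the root directory costs one inode read and one data block read. A small sketch of that rule of thumb:

```shell
# Best-case logical I/O count for an HFS path lookup: two I/Os (inode read
# plus directory/data block read) for the root directory and for each
# component of the path.
path=/etc/passwd
components=$(echo "$path" | tr '/' ' ' | wc -w)   # "etc" and "passwd" -> 2
echo "$(( 2 * (components + 1) )) logical I/Os"   # the root adds the extra pair
# prints: 6 logical I/Os
```

A deeper path such as /opt/app/conf/settings would cost ten logical I/Os by the same rule, which is why the next pages warn against deep subdirectories.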
Student Notes
The concept of blocks and fragments was introduced when the HFS file system was designed. There is always a tradeoff when managing a resource based on a fixed allocation unit size (the file system "block" in this case). If the block size is large, blocks can be managed with fewer pointers (less system overhead); but if it is too large, there is an opportunity for inefficient utilization of the resource (very small files still require a whole block). In the case of the HFS file system, this concern was addressed by making the block capable of uniform subdivision. The fragment was created for this purpose.
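The tradeoff can be made concrete with a quick calculation (the file count here is hypothetical, chosen only for illustration):

```shell
# Space consumed by 1000 one-kilobyte files under the two allocation units:
# 1 KB fragments versus whole 8 KB blocks.
files=1000
frag_kb=$(( files * 1 ))
block_kb=$(( files * 8 ))
echo "1 KB fragments: ${frag_kb} KB used"
echo "8 KB blocks:    ${block_kb} KB used ($(( block_kb - frag_kb )) KB wasted)"
```

Without fragments, seven-eighths of every block holding a 1 KB file would be wasted; with them, small files consume only the fragments they need.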
Definitions
Sector     A sector is the smallest unit of space addressable on the physical disk. The sector size is used when the disk is formatted to appropriately place timing markers on the platter. The default sector size for HP-UX and most UNIX systems is 512 bytes.

Fragment   A fragment is the increment in which space is allocated to files within the HFS file system. The default fragment size is 1 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 1K, 2K, 4K, and 8K.
Block      A file system block is the minimum amount of data transferred to/from the disk when performing a disk I/O on an HFS file system. The default file system block size is 8 KB. This can be tuned when the HFS file system is initially created. Allowable sizes are 4K, 8K, 16K, 32K, and 64K.
File D (size 4 KB): The kernel searches for the first four contiguous 1 KB fragments available (within the same file system block). This is in the second file system block. The kernel does not allocate 3 fragments from the first file system block and 1 fragment from the second file system block, because that would require two logical I/Os to read in the entire 4 KB. This is inefficient, as only one I/O is required if the file is contained within the same file system block.

The second basic rule is: If the size of a file is 8 KB or less, the kernel will fit the entire file within a single file system block.

File E (size 5 KB) and File F (size 6 KB): The kernel searches for the first available file system block that can hold the entire file. On the slide, FileE is allocated in file system block 3, and FileF is allocated in file system block 4.
Student Notes
As an HFS file system becomes full, the performance impact of creating a new file becomes significant. This is due to the behavior of the kernel when creating a new file: When a new file is created on an HFS file system, the kernel tries to allocate a block-sized buffer in the buffer cache for the file to grow into. Upon the file being closed, the kernel allocates the file's fragments to an already allocated file system block, if possible.
FileG Is Created
In the example on the slide, FileG is opened/created as a new file. Not knowing the size to which FileG will grow, the kernel allocates a block-sized buffer in buffer cache for FileG to grow into. When FileG is closed, the kernel searches for a set of four contiguous 1KB fragments in a block. Since there are no shared blocks that have four contiguous fragments, the file is written to a new, empty block.
As it turns out, FileH is closed after writing only 1 KB worth of data. Upon closure, FileH is moved to file system block 1, first fragment. NOTE: Performance on HFS file systems typically degrades when free space falls below 10%, due to the length of time it takes to find free file system blocks for new files. For this reason, it is recommended that MINFREE always be 10% or greater, even for large file systems (greater than 4 GB).
The fourth basic rule is: No fragment belonging to another file will be moved to make room for this file.
Student Notes
When monitoring disk I/O activity, the main metrics to monitor are:

Percent utilization of the file systems: As utilization of the file system increases, so does the amount of time it takes to perform an I/O. According to the performance queuing theory, it takes twice as long to perform an I/O when the file system is 50% busy as it does when the file system is idle.

Requests in the file system I/O queue: The number of requests in the file system I/O queue is one of the best indicators of a file system performance problem. If the average number of requests is three or greater, then requests are waiting in the queue longer than the amount of time needed to service them.

Amount of physical I/O: If the amount of file system activity is high, it is important to investigate on which file system the activity is occurring.

File system free space: As an HFS file system becomes full (greater than 90%), it takes longer and longer to find an available free fragment for a new file or to grow an existing file. This creates additional disk activity, leading to slow file system performance.
Files opened with heavy access: For each process performing large amounts of file system I/O, the names of the files being read or written should be inspected. For files receiving high I/O activity (press <CR> frequently, then inspect how quickly the offset to each file changes), consider relocating those files to other disks that are less busy.
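The queuing rule cited above (an I/O takes twice as long at 50% utilization as on an idle device) is the standard single-queue result: expected response time grows as service time divided by (1 - utilization). A quick sketch of how sharply it climbs:

```shell
# Response time relative to the idle service time for a single I/O queue:
# response = service_time / (1 - utilization).
for util in 0.50 0.75 0.90; do
    awk -v u="$util" \
        'BEGIN { printf "%d%% busy -> %.0fx the idle service time\n", u*100, 1/(1-u) }'
done
# prints 2x at 50%, 4x at 75%, and 10x at 90% busy
```

This is why the 50%-utilization and queue-length-of-three thresholds recur throughout the module: past them, response time degrades much faster than utilization rises.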
9-8. SLIDE: Activities that Create a Large Amount of File System I/O
Student Notes
Common causes of disk-related performance problems are shown on the slide.

Full file systems cause excessive I/O due to locating free fragments.

Long, inefficient PATH variables cause excessive directory I/O (especially when the command is found in the last directory within the PATH variable).

Deep subdirectories cause many logical I/Os (two logical I/Os for each subdirectory in the full path name).

Sequential file access with a small file system block size causes excessive amounts of physical I/O.

Accessing many files on one file system, rather than spreading them across several, creates an imbalance of utilization. This leads to performance problems on the busy file systems and underutilization of the others.
    kbytes     used    avail  %used  Mounted on
     81920    38018    40901    48%  /
     47829    22403    20643    52%  /stand
    286720   257116    28003    90%  /usr
    360448   346127    13444    96%  /opt
   1177626  1113204        0   100%  /disk
    122880   102098    19257    84%  /var
     53248    22589    28549    44%  /tmp
Student Notes
The bdf report shows how much file system space is being used (and how much is free) for all file systems currently mounted on the system. The key fields are:

   avail    Indicates the amount of disk space available on the file system (in KB).
   %used    Indicates the percentage of disk space used.
The slide shows there are three file systems with 90% usage or more, and one of the file systems is at 100% utilization. Recall that when an HFS file system becomes full, performance on that file system suffers due to fragments being moved. The good news is that the amount of free space held back by the file system parameter MINFREE is already subtracted from these values. In fact, if you compare the kbytes, used, and avail columns, you'll see that something is missing: used + avail do not add up to kbytes. The difference is MINFREE. For example, look at /stand. Clearly, 22403 + 20643 does not equal 47829. In fact, (22403 + 20643) divided by 47829 equals 90%, indicating that MINFREE must be set to 10% for this file system.
B3692A GlancePlus B.10.12      06:39:52  e2403roc   9000/856  Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util                                                       |100%  100%  100%
Disk Util                                                      | 83%   22%   84%
Mem Util                                                       | 94%   95%   96%
Swap Util                                                      | 21%   21%   22%
--------------------------------------------------------------------------------
IO BY FILE SYSTEM                                                   Users=    4
Idx  File System   Device             Type       Logl IO       Phys IO
--------------------------------------------------------------------------------
  1  /             /dev/root          vxfs     0.3/  0.6     0.0/  0.0
  2  /stand        /dev/vg00/lvol1    hfs      0.0/  0.0     0.0/  0.0
  3  /var          /dev/vg00/lvol9    vxfs     1.0/  1.8     0.1/  0.3
  4  /usr          /dev/vg00/lvol8    vxfs     9.2/  2.8     1.5/  0.6
  5  /tmp          /dev/vg00/lvol7    vxfs     0.0/  0.0     0.1/  0.0
  6  /opt          /dev/vg00/lvol6    vxfs     0.0/  0.0     0.0/  0.0
  7  /home.lvol5   /dev/vg00/lvol5    vxfs     0.0/  0.0     0.0/  0.0
  8  /export       /dev/vg00/lvol4    vxfs     0.0/  0.0     0.0/  0.0
  9  /disk         /dev/vg01/lvol1    vxfs   463.8/ 86.4   105.8/ 20.1
 10  /cdrom        /dev/dsk/c1t2d0    cdfs     0.0/  0.0     0.0/  0.0
 11  /net          e2403roc:(pid604)  nfs      0.0/  0.0     0.0/  0.0
Top disk user:  PID 3603, disc                 104.0 IOs/sec   S - Select a Disk
Student Notes
The glance file system I/O report (i key) shows activity on a per file system basis. Only total I/O activity (not reads versus writes) is shown with this report. This report is similar to the logical volume report (discussed in the previous module) except this report shows logical I/O compared to physical I/O, and does not distinguish between read and write activities. The logical volume report shows reads compared against writes, but does not distinguish between logical and physical activities. From the report on the slide, we note that all the file system activity is being performed against one file system. Note: The file system I/O report shows I/O activity for all types of mounted file systems, including CDFS file systems and NFS-mounted file systems.
9-11. SLIDE: HFS I/O Monitoring glance File Opens per Process
B3692A GlancePlus B.10.12      06:44:39  e2403roc   9000/856  Current  Avg  High
--------------------------------------------------------------------------------
Cpu Util                                                       |100%  100%  100%
Disk Util                                                      | 83%   22%   84%
Mem Util                                                       | 94%   95%   96%
Swap Util                                                      | 21%   21%   22%
--------------------------------------------------------------------------------
Open Files for PID: 3911, disc           PPID: 2410   euid:    0   User:root
                                                            Open    Open
FD  File Name                                      Type    Mode    Count    Offset
--------------------------------------------------------------------------------
 0  /dev/pts/1                                     chr    rd/wr       6  13582826
 1  /dev/pts/1                                     chr    rd/wr       6  13582826
 2  /dev/pts/1                                     chr    rd/wr       6  13582826
 3  <reg,vxfs,inode:3024,/...ol9,vnode:0x00f9e000> reg    read        1        85
 4  /stand/file5                                   reg    write       1     32768
10  /dev/null                                      chr    read        2         0
Student Notes
The glance open files report (F key), available only from the select process report (s key), shows the names of files opened by the currently selected process. Sometimes the full path name of the file is shown. Otherwise, the inode number and device name are shown, and you would have to translate that information into the file name.

NOTE: To determine the full path name of a file, given its inode number and logical volume name, use the ncheck command:

   # ncheck -F vxfs -i [inode #] [device name]

Another way to determine the full path name of a file, given its inode number and logical volume name, is to use the find command:

   # find [mountpoint of device] -inum [inode #] -xdev
To determine whether I/O activity is occurring against a file, enter the open file report for a particular process, and press <CR> multiple times in succession. Watch the offset field for each file. If the offset field is constantly changing, it indicates the file is currently being accessed.
Performance Scenario
A system is experiencing slow performance due to high file system utilization. Upon further investigation, not all file systems are heavily utilized. In fact, some show no activity at all. By sorting the processes within glance by disk I/O activity, then selecting those processes to obtain further details, you can determine which files are getting the majority of the activity. To take advantage of the underutilized file system, move the heavily accessed files to this file system and create a symbolic link to the file from its original location, thereby removing a heavily accessed file from a busy file system and putting it on an underutilized file system.
9-12. SLIDE: Tuning a HFS I/O-Bound System Tune Configuration for Workload
Student Notes
Every workload and every application is different. Each has different resource requirements and each places different demands on the system. There is no one configuration that is optimal for all applications. For example, CAD/CAM applications stress memory (and graphics); accounting applications that do forecasting stress the CPU; NFS-based applications stress the disks (and the network); and RDBMS applications stress all resources.
Fragment sizes can be 1, 2, 4, or 8 KB. Fragments can be 1/8, 1/4, 1/2, or equal to the file system block size. For large files that are opened and closed frequently during their growth, large fragments are recommended. For file systems with many small files, small fragments are recommended.
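Since a fragment must be 1/8, 1/4, 1/2, or all of the block size, the legal fragment sizes for any block size can be enumerated directly. A small sketch of that rule (the function name is ours):

```shell
# Print the legal HFS fragment sizes (in bytes) for a given block size,
# per the 1/8, 1/4, 1/2, or whole-block rule described above.
frag_sizes() {
  b=$1
  echo "$((b / 8)) $((b / 4)) $((b / 2)) $b"
}
frag_sizes 8192   # for an 8 KB block: 1024 2048 4096 8192
```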
File system block sizes can be 4, 8, 16, 32, or 64 KB. For file systems with large files, large file system blocks are recommended. For file systems with large files, increase maxbpg (maximum blocks per group). For applications that perform a lot of sequential I/O (with read-aheads and write-behinds), large file system blocks are recommended.
no_fs_async Force rigorous (synchronous) posting of file system metadata to disk. This is the default.
mkfs Options
mkfs is usually not executed directly, but is called by newfs -F hfs instead. File system tuning is best accomplished when the file system is created. The workload for a file system should be well understood and dedicated before serious attempts are made to tune one. Many options are also dependent on the type of physical device on which a file system is being created. The HFS-specific options include:

size: The size of the file system in DEV_BSIZE blocks (the default is the entire device).

largefiles: The maximum size of a file can be up to 128 GB.
nolargefiles: The maximum size of a file will be limited to 2 GB.

ncpg: The number of cylinders per cylinder group (range 1-32; the default is 16).

minfree: The minimum percentage of free disk space reserved for non-root processes (the default is 10%). Beginning with HP-UX 10.20 the bdf command does not conceal this free space and as a result will report free disk space accurately. This means that a file system cannot show 111% utilization anymore.
nbpi: The number of bytes per inode. This value determines how many inodes are allocated given a file system of a certain size. (The default is 6144.)
tunefs Options

Some parameters can be changed after the file system has been created, with tunefs(1M). These are minfree and maxbpg. minfree is explained above.

maxbpg: The maximum number of data blocks that a single file can use out of a cylinder group before it is forced to continue its growth in a different cylinder group. This value does not apply to any file whose size is 12 blocks or less.
tunefs can also be used to display the tunable parameters of an HFS file system: # tunefs -v /dev//
Other Configurations
Optimize $PATH
The PATH variable in a user's environment specifies a list of directories to search when a command is entered. Having an excessive number of directories, or duplicate directories, in the search list can increase disk access, particularly when the user mistypes a command. The problem is greatly exacerbated if the user's PATH variable contains directories that are mounted automatically with the NFS automount utility, since a simple typographical error can then trigger the network mount of a file system.
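Duplicate PATH entries are easy to spot from the shell. A quick, portable check (pipe your own $PATH through it; the example path is invented):

```shell
# Report directories that appear more than once in a PATH-style string.
dup_path_entries() {
  printf '%s\n' "$1" | tr ':' '\n' | sort | uniq -d
}
# Example with a deliberately duplicated entry:
dup_path_entries "/usr/bin:/usr/local/bin:/usr/bin"   # prints /usr/bin
```

Removing the duplicates it reports shortens every command lookup, and in particular every failed lookup caused by a typo.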
Use Flat Directory Structures
Long directory path names create more work for the system because each directory file and its associated inode entry require a disk I/O in order to bring them into memory. Recall that six logical I/Os were required to read the /etc/passwd file. Conversely, you don't want thousands of files in the same directory, as it would take many I/O operations to read and search the directory.
Ensure Sufficient Freespace
As the file system becomes full (greater than 90%), the kernel begins to take longer and longer to find available free fragments. The algorithm gets very lengthy when the file system free space falls below 10%. Of course, if you do not have any files that grow and you are not adding any new files, this would waste 10% of your file system free space for no reason.
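A quick way to watch for file systems crossing that 90% threshold is to filter df output. A sketch using POSIX `df -P` (the threshold mirrors the guidance above; the helper name is ours):

```shell
# Print any file system whose utilization exceeds 90%.
# Reads `df -P` style output on stdin: column 5 is "Capacity" (e.g. 91%),
# column 6 is the mount point.
flag_full() {
  awk 'NR > 1 { pct = $5; sub(/%/, "", pct); if (pct + 0 > 90) print $6 " is " $5 " full" }'
}
df -P | flag_full
```

On HP-UX the same filter can be fed from bdf, whose output columns match.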
(Slide diagram: symbolic link inode 12 and data block 74, contrasting a standard symbolic link, whose inode points to a data block holding the target path, with an HP fast link, whose inode holds the target path directly.)
HP Fast Links
Student Notes
There are two ways symbolic links can be stored on HFS file systems.
HP Fast Links
HP fast links allow symbolic links to be resolved with one logical I/O instead of two. HP fast links store the name of the referenced file in the inode of the symbolic link itself, rather than in a data block that the inode references. In the example, when the inode (12) of the symbolic link is retrieved, the contents of the inode contain the name of the referenced file.
HP fast links can be configured by setting the tunable OS parameter create_fastlinks to 1, and recompiling the kernel. Upon booting from the new kernel, all future symbolic links created will use HP fast links. No existing standard symbolic links will be automatically converted to fast symbolic links. The standard symbolic links would have to be removed and then recreated to convert them. Fast symbolic links will only work for link destinations that can be expressed in 59 characters or less as this is the limit of the space within the inode where the fast link information is stored. If a symbolic link contains more than 59 characters, it will be stored as a standard symbolic link, regardless of the value of create_fastlinks.
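Whether an individual link qualifies for fast-link storage can be predicted by measuring its target length against the 59-character limit quoted above. A portable sketch (the limit is taken from the text; the link names are invented for the demo, and readlink may not exist on very old HP-UX releases):

```shell
#!/bin/sh
# Classify symbolic links by whether the target string fits in the
# 59-character inode field used by HP fast links.
set -e
dir=$(mktemp -d)
ln -s /usr/bin "$dir/short"                     # 8-character target
long_target=$(printf '%060d' 0 | tr '0' 'x')    # a 60-character target
ln -s "$long_target" "$dir/long"
for l in "$dir/short" "$dir/long"; do
  target=$(readlink "$l")
  if [ "${#target}" -le 59 ]; then
    echo "$(basename "$l"): fast-link candidate (${#target} chars)"
  else
    echo "$(basename "$l"): stored as a standard symlink (${#target} chars)"
  fi
done
```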
Transition Links
Saving one logical I/O when accessing a symbolic link may not seem significant, until you consider that HP-UX makes heavy use of transition links (which are an implementation of symbolic links). Transition links allow an HP-UX file system to contain older 9.x directory paths. The 9.x directory names are symbolic links that point to the correct, current location (for example, /bin -> /usr/bin). Many HP-UX installations have applications (including HP-UX applications) that rely on and make heavy use of transition links. A quick performance gain for all HP-UX systems is to convert these transition links from standard symbolic links to HP fast links. The procedure for making this conversion is:

1. Recompile the kernel to use HP fast links (that is, set create_fastlinks to 1).
2. Shut down and reboot the system.
3. Execute tlremove to remove all the transition links from the system. Over 500 links will be removed.
4. Execute tlinstall to reinstall (that is, recreate) the transition links. When the links are reinstalled, they will be created as HP fast links.
Next, execute the make_files program to create five 4-MB ASCII files.

# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.

# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs
4. Time how long it takes to read the files with the cat command. Record the results below:

# timex cat file* > /dev/null

real: user: sys:

5. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
6. Performance tuning: recreate the file system with larger fragment and file system block sizes. Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:

# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.

# cp /hfs/disk_long /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real: user: sys:

How do the results of step 8 compare to the default HFS block and fragment results from step 4? _______________________________________________________________________

9. Performance tuning: change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.

# cd /
# umount /hfs
# mount -o fsasync /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:

# sar -d 5 200

From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).

# timex ./disk_long

How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
How do the results of step 10 compare to the default mount options in step 5? _____________________________________________________________________
Objectives
Upon completion of this lesson, you will be able to:

- Understand JFS structure and version differences
- Explain how to enhance JFS performance
- Set block sizes to improve performance
- Set intent log size and rules to improve performance
- Understand and manipulate synchronous and asynchronous I/O
- Identify JFS tuning parameters
- Understand and control fragmentation issues
- Evaluate the overhead of online backup snapshots
Student Notes
Upon completion of this module, you will be able to do the following:

Understand JFS structure and version differences

These course notes are based on the JFS Version 3.5 file system, built on the Version 4 disk layout. The next few slides will describe the basic differences between versions and relate them to HP-UX releases. HP JFS 3.5 and HP OnlineJFS 3.5 are available for HP-UX 11i and later systems. The standard (base) version of HP JFS has been bundled with HP-UX since release 10.01. The advanced HP OnlineJFS is a purchasable product with additional administrative features for higher availability and tunable performance. These notes will make clear which features belong to the base product and which belong to the OnlineJFS version. The Operating Environment delivery model of HP-UX 11i includes JFS as follows:

HP-UX 11i OE                     BaseJFS 3.3
HP-UX 11i Enterprise OE          OnlineJFS 3.3
HP-UX 11i Mission Critical OE    OnlineJFS 3.3
You can download JFS 3.5 for HP-UX 11i for free from the HP Software Depot (http://www.software.hp.com), or you can request a free JFS 3.5 CD from the Software Depot. You can purchase HP OnlineJFS 3.3 (product number B3929CA for servers and product number B5118CA for workstations) for HP-UX 11.0 or HP-UX 11i from your HP sales representative. JFS 3.5 is included with HP-UX 11i systems.

Explain how to enhance JFS performance

The HFS file system uses block-based allocation schemes, which provide adequate random access and latency for small files but limit throughput for larger files. As a result, the HFS file system is less than optimal for commercial environments. VxFS addresses this file system performance issue through an alternative allocation scheme and increased user control over allocation, I/O, and caching policies.

Set block sizes to improve performance

It is often advantageous to match the block size of a file system to the I/O size of the application. We will show you how!

Set intent log size to improve performance

The JFS intent log provides for rapid fsck recovery after a system crash. In general the intent log is not protecting your data; the focus is on structural integrity, not data integrity! Fast fsck comes at a price, and that price is performance. Setting the correct intent log size is important, as it cannot be changed once a file system is created.

Understand and manipulate synchronous and asynchronous I/O

Programmers and database providers do different types of I/O to obtain the best possible balance between data integrity and performance. We will investigate all the gray areas and tune the JFS file system to meet our administrative and performance goals, which might be quite different from those of the programmer!

Identify JFS tuning parameters

The JFS is tunable through mount options, the command line, configuration files, and kernel parameters. We will learn where and how to tune.
Understand and control fragmentation issues

The extent-based file allocation design of JFS is ideal for large-file performance. One weakness of this approach is the potential fragmentation of files and free space over the life of the file system. In general this will only occur in dynamic, work-file-oriented JFS file systems (e.g. a mail server) and is unlikely in file systems of large fixed files where the major I/O rates occur to static files (e.g. a database). We will investigate ways of measuring and fixing fragmentation.
Evaluate the overhead of online backup snapshots OnlineJFS supports online backups via snapshot mounts. We will discuss the performance issues involved when working with snapshots.
Student Notes
The HP-UX Journaled File System (JFS) was introduced by HP in August, 1995, on the HP-UX 10.01 release. The journaled file system attempts to improve on the high-performance file system (HFS) by offering the following enhancements: Extent-based allocation of disk space Fast file system recovery through an Intent Log Greater control and flexibility of file system behavior through new mount options and tunable options.
The table below relates HP-UX releases to the JFS disk layout version in use:

HP-UX Release           Disk Layout Version
10.10                   2
10.20                   3
11.00 with JFS 3.1      3
11.00 with JFS 3.3      3
11i v1                  4
11i v2                  4
vxupgrade(1M)
The vxupgrade command can upgrade an existing Version 3 VxFS file system to the Version 4 layout while the file system remains online. vxupgrade can also upgrade a Version 2 file system to the Version 3 layout. See vxupgrade(1M) for details on upgrading VxFS file systems. You cannot downgrade a file system that has been upgraded.
NOTE:
You cannot upgrade the root (/) or /usr file systems to Version 4 on an 11.00 system running JFS 3.3. Additionally, we do not advise upgrading the /var or /opt file systems to Version 4 on an 11.00 system. These core file systems are crucial for system recovery. The HP-UX 11.00 kernel and emergency recovery media were built with an older version of JFS that does not recognize the Version 4 disk layout. If these file systems were upgraded to Version 4, your system might have errors booting with the 11.00 kernel as delivered, or booting with the emergency recovery media.
(Slide listing: kernel symbol-table output showing the JFS worklist and work-thread symbols, including vx_worklist_enq, vx_worklist_get, vx_worklist_thread, and vx_workthread entries; the listing was truncated in the original.)
JFS Extents
(Slide diagram: extent map listing the Start block and Length of each extent in the example file.)
Student Notes
JFS allocates space to files in the form of extents: adjacent blocks of disk space treated as a unit. Extents can vary in size from a single block (minimum 1 KB) to many megabytes. Organizing file storage in this manner allows JFS to better support large I/O requests, with more efficient reading and writing of contiguous disk areas. JFS extents are represented by a starting block number and a block count. In the example on the slide, the first extent starts at block 40 and has a length of 128 blocks (or 128 KB, assuming blocks are 1 KB in size). When the file grew past 128 KB, JFS tried to increase the size of the last extent. Since another file was already occupying this location, a new extent was allocated, starting at block 200. This extent grew to a size of 64 KB before encountering another file. At this point, a third extent was allocated at block 8. Initially, 8 KB were allocated to the third extent, but upon closing the file, any space not used by the last extent is returned to the operating system. Since only 5 KB were used, the extra 3 KB were returned.
Student Notes
Disk Space Allocation: The Block Size
Disk space is allocated by the system in 1024-byte device blocks (DEV_BSIZE). An integral number of device blocks are grouped together to form a file system block. VxFS supports file system block sizes of 1024, 2048, 4096, and 8192 bytes. The default block size is:

1024 bytes for file systems less than 8 gigabytes
2048 bytes for file systems less than 16 gigabytes
4096 bytes for file systems less than 32 gigabytes
8192 bytes for file systems 32 gigabytes or larger
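The default-size rule above is simple enough to encode and check. A sketch (the thresholds are exactly those listed; the function name is ours):

```shell
# Return the default VxFS block size (bytes) for a file system of the
# given size in gigabytes, per the table above.
default_bsize() {
  gb=$1
  if   [ "$gb" -lt 8 ];  then echo 1024
  elif [ "$gb" -lt 16 ]; then echo 2048
  elif [ "$gb" -lt 32 ]; then echo 4096
  else                        echo 8192
  fi
}
default_bsize 20   # a 20 GB file system defaults to 4096-byte blocks
```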
The block size may be specified as an argument to the mkfs or newfs utility and may vary between VxFS file systems mounted on the same system. VxFS allocates disk space to files in extents. An extent is a set of contiguous blocks (up to 2048 blocks in size).
Preferred Allocation
The following rules are satisfied wherever possible, starting with the preferred rules at the top and working down to less preferred rules:

- Allocate files using contiguous extents of blocks.
- Attempt to allocate each file in one extent of blocks.
- If not possible, attempt to allocate all extents for a file close to each other.
- If possible, attempt to allocate all extents for a file in the same allocation unit.
An allocation unit is an amount of contiguous (and therefore close together) file system space equal to 32 MB in size. It is roughly analogous to the HFS cylinder group, but is not dependent on the geometry of the disk drive in any way.
(Slide legend: timeline symbols marking each sync and the point of the system crash.)
Student Notes
A key advantage of JFS is that all file system transactions are written to an Intent Log. The logging of file system transactions helps to ensure the integrity of the file system, and allows the file system to be recovered quickly in the event of a system crash.
updates related to the transactions being logged, and other metadata updates related to the same transaction not being logged). The logging of only COMPLETED transactions prevents the file system from being out of sync due to a crash occurring in the middle of a transaction. Either the entire transaction is logged or none of it is logged. This allows the JFS intent log to be used in a recovery situation instead of a standard fsck. The JFS recovery is done in seconds, as opposed to a standard fsck that (on a big file system) could take minutes, or even hours.
Example
Using the example on the slide, assume that each file transaction requires from one to four metadata updates. After each successful file transaction, all the related metadata updates are written to the JFS intent log. After 30 seconds, all the metadata updates are written out to disk by the sync daemon, and a corresponding DONE record is written to the JFS intent log for each JFS transaction that was flushed during the sync. The system can now reuse that space in the JFS intent log for new JFS transactions. When a crash occurs (in our example, in the middle of a file transaction), the uncompleted transaction never has any metadata written to the JFS intent log; therefore only one transaction is in the JFS intent log since the last sync. Only this transaction needs to be redone and then the file system is recovered and in a stable state. Compare this with having to do a standard fsck.
Performance Impacts
The intent log size is chosen when a file system is created and cannot be changed subsequently. The mkfs utility uses a default intent log size of 1024 blocks, which is sufficient for most workloads. If the system is used as an NFS server, for intensive synchronous write workloads, or for dynamic work-file loads with many metadata changes, performance may be improved by using a larger log size.

File data is not normally written to the intent log. However, if the application has requested synchronous writes and the writes are 32 KB or smaller, the file data will be written to the intent log along with the metadata. This behavior can be modified by mount options (discussed later in this module).

With larger intent log sizes, recovery time is proportionately longer and the file system may consume more system resources (such as memory) during normal operation. There are several system performance benchmark suites for which VxFS performs better with larger log sizes. As with block sizes, the best way to pick the log size is to try representative system loads against various sizes and pick the fastest.

Performance degradation occurs when the entire JFS intent log becomes filled with pending JFS transactions. In this situation, all new JFS transactions must wait for DONE records to arrive for the existing transactions. Once the DONE records arrive, the space used by the corresponding transactions can be freed and reused for new transactions. Having to wait for DONE records can significantly decrease JFS performance. In such cases, it is suggested that the JFS file system be reinitialized with a larger intent log.
CAUTION:
Network file systems (NFS) can generate a large number of metadata updates if accessed concurrently by multiple systems. For JFS file systems being exported for network access via NFS, it is strongly recommended that these file systems have an intent log size of 16 MB (the maximum size for the intent log).
(Slide diagram: JFS transaction processing. A process updates the in-memory structures (superblock, inodes, bitmaps), a JFS transaction is packaged and written to the intent log on disk, and the metadata is later flushed from the buffer cache to its on-disk structures in the allocation units.)
Student Notes
The slide shows a graphical representation of how JFS transactions are processed when a system call (for example, a write call) is issued:

1. All in-memory data structures related to the transaction are updated. These in-memory structures include the superblock, the inode table, and the bitmaps.

2. Once the in-memory structures are updated, a JFS transaction is packaged containing the modifications to the in-memory structures. This packaged transaction contains all the data needed to reproduce the transaction (should that be necessary).

3. Once the JFS transaction is created, it is written to the intent log. (When it is written depends on mount options.) At this point, control is returned to the system call.

4. Since the transaction is now stored on disk (in the intent log), there is no hurry to flush the in-memory data structures to their corresponding disk-based data structures.
Therefore, the in-memory structures are transferred to the buffer cache, and the sync daemon flushes out these transactions within the next 30 seconds. 5. After the metadata structures are flushed out, a DONE record is written to the intent log indicating the transaction has been updated to disk, and the corresponding transaction no longer needs to be kept in the intent log.
- Is it attribute-intensive? (many files, small chunks being shuffled)
- Is the access pattern random or sequential I/O? (check for read(), write(), and lseek() system calls)
- What is the bandwidth and size of the I/Os? Are these consistent?
Student Notes
Understand your I/O Workload
Tuning the file system's parameters to optimize performance can only be done effectively when you know what type of I/O the application is doing. It would be wrong to tune for a large block size and maximum contiguous space allocation if the application does many small random I/Os to many small files.
Data Intensive?
Commercial database applications generally deal with very large files in the table space and large I/Os to those files. Any high degree of small random I/O should be absorbed by the database's own buffers (the System Global Area) and the HP-UX buffer cache (if it is being used). We may choose to increase the block size in this situation and tune for maximum read-ahead/write-behind. The following slides will cover this type of tuning.
Attribute Intensive?
Some applications generate many small I/Os to many small files. In this situation a large block size and maximum read-ahead/write-behind would be inappropriate, generating more I/O than is necessary. A mail server or web server could be regarded as such an application.
Disk Bandwidth
In the end we can only get so much performance out of a single spindle. Modern fast disks (10,000+ RPM, 5 ms access time) can only provide an absolute maximum of approximately 10 MB/s for purely sequential I/O and around 150 I/Os per second. Once your file system is extracting these sorts of numbers (or even 50% of them!), you can consider that the hardware has become the limiting factor. Stop tuning and buy more disks! Remember that spindles win prizes. LVM or VxVM striping will help in this situation, as single-spindle performance is aggregated across the number of spindles. Using expensive RAID technology like the HP XP256, XP512, or XP1024 disk arrays will also improve apparent spindle performance. The author has seen a single XP512 logical device provide a sustained 60 MB/s read performance for sequential I/O and over 1500 I/Os per second for a single-threaded random application test to a single logical device.
Performance Parameters
Things that an administrator can change to optimize JFS:

- Choosing a block size
- Choosing an intent log size
- Choosing mount options
- Kernel tunables
- Monitoring free space and fragmentation
- Changing extent attributes on individual files
- I/O tuning
Student Notes
We will discuss the following choices over the next slides. Note that some parameters can only be set when the file system is created.

At file system creation time (only):

- Choosing a block size
- Choosing an intent log size

After creation:

- Choosing mount options
- Kernel tunables (kernel inode table size)
- Monitoring free space and fragmentation
- Changing extent attributes on individual files
- I/O tuning (tunable VxFS I/O parameters)
- Small files will waste space
- System overhead will be less
- Files approaching 1 GB are large
- Consider the minimum block size (1K) for a small-file mail server or web server
- Use a large block size for sequential I/O applications
- Use a small block size for random I/O applications
Student Notes
You specify the block size when a file system is created; it cannot be changed later. The standard HFS file system defaults to a block size of 8K with a 1K fragment size. This means that space is allocated to small files (up to 12 blocks) in 1K increments. Allocations for larger files are done in 8K increments. Because many files are small, the fragment facility saves a large amount of space compared to allocating space 8K at a time. The unit of allocation in VxFS is a block. There are no fragments because storage is allocated in extents that consist of one or more blocks. The smallest block size available is 1K, which is also the default block size for VxFS file systems created on devices of less than 8 gigabytes. Choose a block size based on the type of application being run. For example, if there are many small files, a 1K block size may save space. For large file systems, with relatively few files, a larger block size is more appropriate. The trade-offs of specifying larger block sizes are: 1) a decrease in the amount of space used to hold the free extent bitmaps for each allocation unit, 2) an increase in the maximum extent size, and 3) a decrease in the number of extents used per file versus an increase in the amount of space wasted at the end of files that are not a multiple of the block size.
Larger block sizes use less disk space in file system overhead, but consume more space for files that are not a multiple of the block size. The easiest way to judge which block sizes provide the greatest system efficiency is to try representative system loads against various sizes and pick the fastest.
Student Notes
The intent log size is chosen when a file system is created and cannot be changed afterwards. The default intent log size chosen by mkfs is 1024 blocks and is suitable in most situations. For some types of applications (NFS server or intensive synchronous write loads), performance may be improved by increasing the size of the intent log. Note that recovery time will also be proportionally longer as the log size increases. Memory requirements for the log maintenance will also increase as the log size increases. Ensure that the log size is not more than 50% of the physical memory size of the system or fsck will not be able to fix it after a system crash. Ideal log size for NFS is 2048 with a file system block size of 8192.
While logsize is specified in blocks, the maximum size of the intent log is 16384 KB. This means the maximum values for logsize are:

16384 for a block size of 1024 bytes
8192 for a block size of 2048 bytes
4096 for a block size of 4096 bytes
2048 for a block size of 8192 bytes
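Those maxima are simply the 16384 KB ceiling divided by the block size, which is easy to verify (the helper name is ours):

```shell
# Maximum logsize (in file system blocks) for a given block size,
# given the 16384 KB intent-log ceiling stated above.
max_logsize() {
  echo $(( 16384 * 1024 / $1 ))
}
max_logsize 8192   # prints 2048, matching the list above
```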
Student Notes
JFS offers mount options to delay or disable transaction logging to the intent log. This allows the system administrator to make trade-offs between file system integrity and performance. Following are the logging options:

Full logging (log): File system structural changes are logged to disk before the system call returns to the application (synchronously). If the system crashes, fsck(1M) will complete logged operations that have not completed.

Delayed logging (delaylog): Some system calls return before the intent log is written. This improves the performance of the system, but some changes are not guaranteed until a short time later when the intent log is written. This mode approximates traditional UNIX system guarantees for correctness in case of system failure.

Temporary logging (tmplog): The intent log is almost always delayed. This improves performance, but recent changes may disappear if the system crashes. This mode is only recommended for temporary file systems.

No logging (nolog): The intent log is disabled. The other three logging modes provide for fast file system recovery; nolog does not. With nolog mode, a full structural check must be performed after a crash. This may result in loss of substantial portions of the file system, depending upon activity at the time of the crash. Usually, a nolog file system should be rebuilt with mkfs(1M) after a crash. The nolog mode should only be used for memory-resident or very temporary file systems.

nodatainlog: The nodatainlog mode should be used on systems with disks that do not support bad block revectoring. Normally, a VxFS file system uses the intent log for synchronous writes. The inode update and the data are both logged in the transaction, so a synchronous write only requires one disk write instead of two. When the synchronous write returns to the application, the file system has told the application that the data is already written. If a disk error causes the data update to fail, then the file must be marked bad and the entire file is lost. If a disk supports bad block revectoring, then a failure on the data update is unlikely, so logging synchronous writes should be allowed. If the disk does not support bad block revectoring, then a failure is more likely, so the nodatainlog mode should be used. A nodatainlog mode file system is approximately 50 percent slower than a standard mode VxFS file system for synchronous writes. Other operations are not affected.

blkclear: The blkclear mode is used in increased data security environments. The blkclear mode guarantees that uninitialized storage never appears in files. The increased integrity is provided by clearing extents on disk when they are allocated to a file. Extending writes are not affected by this mode. A blkclear mode file system is approximately 10 percent slower than a standard mode VxFS file system, depending on the workload.
http://education.hp.com
* NOTE: This is the only additional option available with Base JFS; all other options require OnlineJFS.
Student Notes
Understanding asynchronous, data synchronous (O_DSYNC) and fully synchronous (O_SYNC) application I/O.
When an application program opens a file with the open() system call, the programmer decides how I/O will occur between the application's memory and the file system. The following three options are available, in order, ranging from highest performance (lowest integrity) to lowest performance (best integrity). In this discussion, integrity refers to the potential damage to file system structures and customer data during a system crash.

1. Asynchronous I/O (Standard Mode)
High performance / Low integrity
In asynchronous mode, all application I/O goes to the buffer cache, including data and inode modifications. The write() system call returns quickly, and the application continues on the assumption that the data will eventually reach the disk. Data integrity can be fully compromised by a system crash, and newly created files may even disappear.
2. Data Synchronous I/O (O_DSYNC)

If the file is opened with the O_DSYNC flag, the file is in data synchronous mode. In this situation, write() system calls that modify data do not return until the disk has acknowledged receipt of the data. However, some inode changes (time stamps, etc.) are still performed asynchronously and may not have reached the disk at the time of a system crash.
3. Synchronous I/O (O_SYNC)
Fully synchronous behavior is obtained by opening the file with O_SYNC. All operations are now synchronous and write() system calls block for both data and inode modifications. Minimal damage will now occur in the event of a system crash.
mincache
mincache=closesync Flush data to disk synchronously when file is closed.
The mincache=closesync mode is useful in desktop environments where users are likely to shut off the power on the machine without halting it first. In this mode, any changes to the file are flushed to disk synchronously when the file is closed. To improve performance, most file systems do not synchronously update data and inode changes to disk. If the system crashes, files that have been updated within the past minute are in danger of losing data. With the mincache=closesync mode, if the system crashes or is switched off, only files that are currently open can lose data. A mincache=closesync mode file system should be approximately 15 percent slower than a standard mode VxFS file system, depending on the workload.

mincache=direct
Bypass the buffer cache for all data and inode changes. Forces fully synchronous behavior and totally skips the buffer cache.

mincache=unbuffered
Bypass the buffer cache for data only; inode changes are cached. Forces data synchronous-like behavior with no data in cache.
mincache=dsync
Equivalent to normal data synchronous behavior. Write does not return until data is on disk but data does go through buffer cache.
The mincache=direct, mincache=unbuffered, and mincache=dsync modes are used in environments where applications are experiencing reliability problems caused by the kernel buffering of I/O and delayed flushing of non-synchronous I/O. The mincache=direct and mincache=unbuffered modes guarantee that all non-synchronous I/O requests to files will be handled as if the VX_DIRECT or VX_UNBUFFERED caching advisories had been specified. The mincache=dsync mode guarantees that all non-synchronous I/O requests to files will be handled as if the VX_DSYNC caching advisory had been specified. Refer to vxfsio(7) for explanations of VX_DIRECT, VX_UNBUFFERED, and VX_DSYNC. The mincache=direct, mincache=unbuffered, and mincache=dsync modes also flush file data on close as mincache=closesync does.

mincache=tmpcache
Speeds up file growth by breaking data initialization rules.
The -o mincache=tmpcache option only affects write-extending calls and is not available to files performing synchronous I/O. Write-extending calls are write calls that cause new file system blocks to be assigned to the file, extending its size in blocks. The normal behavior for write-extending calls is to write the new user data first, and to write the metadata only after the user data. Write-extending calls are expensive from a performance standpoint, because the write call has to wait for both the user data and the metadata to be written. A non-extending write call only requires the call to wait for the metadata. With the -o mincache=tmpcache option, write-extending calls do not have to wait for the user data to be written. This option allows the metadata to be written before the user data (and the write call to return before the user data is written), significantly improving performance.

CAUTION: The -o mincache=tmpcache option significantly increases the likelihood of uninitialized file system blocks (i.e., junk) appearing in files after a system crash. This is because the file points to data blocks before the data is actually there. If the system crashes between the file's inode being updated (done first) and the user data being written (done second), uninitialized data will appear in the file. The tmpcache option should only be used for memory-resident or very temporary file systems.
convosync
NOTE: Use of the convosync=dsync option violates POSIX guarantees for synchronous I/O.
The convert osync (convosync) mode has five values: convosync=closesync, convosync=direct, convosync=dsync, convosync=unbuffered, and convosync=delay. The convosync=closesync mode converts synchronous and data synchronous writes to non-synchronous writes and flushes the changes in the file to disk when the file is closed. The convosync=delay mode causes synchronous and data synchronous writes to be delayed rather than to take effect immediately. No special action is performed when closing a file. This option effectively cancels any data integrity guarantees normally provided by opening a file with O_SYNC. See open(2), fcntl(2), and vxfsio(7) for more information on O_SYNC. Caution! Extreme care should be taken when using the convosync=closesync or convosync=delay mode because they actually change synchronous I/O into non-synchronous I/O. This may cause applications that use synchronous I/O for data reliability to fail, if the system crashes and synchronously written data is lost.
The convosync=direct and convosync=unbuffered modes convert synchronous and data synchronous reads and writes to direct reads and writes, bypassing the buffer cache. The convosync=dsync mode converts synchronous writes to data synchronous writes. As with closesync, the direct, unbuffered, and dsync modes flush changes in the file to disk when it is closed. These modes can be used to speed up applications that use synchronous I/O. Many applications that are concerned with data integrity specify O_SYNC in order to write the file data synchronously. However, this has the undesirable side effect of updating inode times and therefore slowing down performance. The convosync=dsync, convosync=unbuffered, and convosync=direct modes alleviate this problem by allowing applications to take advantage of synchronous writes without modifying inode times as well. NOTE: Before using convosync=dsync, convosync=unbuffered, or convosync=direct, make sure that all applications that use the file system do not require synchronous inode time updates for O_SYNC writes.
Slide: two I/O paths between an Oracle process with its ORACLE database cache in memory and the disk: through the system buffer cache (default), and bypassing it (mincache=direct).
Student Notes
The above slide illustrates the impact of setting the -o mincache=direct option. By default, all JFS file system I/O goes through the system's buffer cache. When an application does its own caching (e.g. an Oracle database application), there are two levels of caching: one cache is managed by the application; the other is managed by the kernel. Using two caches is inefficient from both a performance and a memory usage standpoint, because the data exists in both caches. When the file system is mounted with the -o mincache=direct option, the system's buffer cache is bypassed and data is written directly to disk. This improves performance and keeps the buffer cache available for other file systems that do not go through an application cache.
CAUTION:
Use of the -o mincache=direct option can lead to a significant decrease in performance if used in the wrong situation. This option should only be used if:
1. The application creates and maintains its own data cache, and
2. All the files on the file system are cached in the application's data cache.
If some files on the mounted file system are being accessed but are not cached by the application, this option should not be used.
Slide: write-extending call ordering between memory (buffer cache, JFS transaction, process) and disk structures (superblock, intent log, inode allocation unit, allocation unit). Left panel, default: (1) write the data block, (2) write the JFS transaction. Right panel, mincache=tmpcache: (1) write the JFS transaction, (2) write the data block.
Student Notes
By default, when a process performs a write-extending call, the new data is written to disk before the file's inode is updated. In the slide above, the left side shows the default behavior:
1. Write data to the newly allocated file system block.
2. Write the JFS transaction metadata out to the disk. The system call returns.
The advantage of this behavior is that uninitialized data will not be found within the file should a system crash occur. This is important from a data integrity standpoint. The disadvantage is slow performance, because the JFS transaction must wait for the user data I/O to complete before it can be written to the intent log.
With the -o mincache=tmpcache option (the right side of the slide), the order is reversed:
1. Write the JFS transaction out to disk. The system call returns.
2. Write data to the newly allocated file system block.
The advantage of this behavior is that write-extending calls are fast; the system does not wait for the user data to be written to disk. The disadvantage is that the data integrity of the file is jeopardized, especially if the file is being updated at the time of a system crash. By updating the file's inode first, the file points to uninitialized data blocks, which contain unknown data. The uninitialized file system blocks are expected to be initialized soon after the inode is updated; however, there is still a small window of time during which the file's inode references unknown data. If the system crashes during this window, the file will still reference the uninitialized data after the crash.

CAUTION: The -o mincache=tmpcache option should only be used for memory-resident or very temporary file systems.
Kernel Tunables
VxFS inodes are cached in memory, separate from HFS.
The kernel parameter ninode has no effect on VxFS.
When vx_ninode is zero (the default), the inode cache is sized in proportion to system memory (see table).
vx_ncsize sets the directory name lookup cache (1 KB).
Student Notes
Internal Inode Table Size
VxFS caches inodes in an inode table (see Table below, Inode Table Size). There is a tunable in VxFS called vx_ninode that determines the number of entries in the inode table. A VxFS file system obtains the value of vx_ninode from the system configuration file used for making the kernel (/stand/system for example). This value is used to determine the number of entries in the VxFS inode table. By default, vx_ninode is set to zero. The kernel then computes a value based on the system memory size.
Module 10 VxFS Performance Issues

Total Memory (MB)    Maximum Number of Inodes
       8                    400
      16                   1000
      32                   2500
      64                   6000
     128                   8000
     256                  16000
     512                  32000
    1024                  64000
    2048                 128000
    8192                 256000
   32768                 512000
  131072                1024000
If the available memory is a value between two entries, the value of vx_ninode is interpolated.
Fragmentation
Keep file system free space over 10%
Maintain free space distribution goals
Monitor with df(1M) or fsadm(1M)
Repack files and free space with fsadm -e
  Reduces the number of extents in large files
  Makes small files contiguous (one extent)
  Moves small, recently used files closer to inode structures
  Optimizes free space into larger extents
Repack directories with fsadm -d
  Remove empty entries from directories
  Place recently used files at the beginning of directory lists
  Pack small directories directly in the inode if possible
Student Notes
Keep file system free space over 10%

In general, VxFS works best if the percentage of free space in the file system does not drop below 10 percent, because file systems with 10 percent or more free space have less fragmentation and better extent allocation. Regular use of the df(1M) command to monitor free space is desirable. Full file systems should have some files removed, or should be expanded (see fsadm(1M) for a description of online file system expansion).

Maintain free space distribution goals

Three factors can be used to determine the degree of fragmentation:
percentage of free space in extents smaller than 8 blocks in length
percentage of free space in extents smaller than 64 blocks in length
percentage of free space in extents of 64 blocks or greater
An unfragmented file system will have the following characteristics:
less than 1% of free space in extents smaller than 8 blocks in length
less than 5% of free space in extents smaller than 64 blocks in length
more than 5% of total file system size available as free extents 64 or more blocks in length
A fragmented file system will have the following characteristics:
greater than 5% of free space in extents smaller than 8 blocks in length
more than 50% of free space in extents smaller than 64 blocks in length
less than 5% of total file system size available as free extents 64 or more blocks in length

Using df(1M)

The following example shows how to use df to map free space:
# df -F vxfs -o s /usr
/usr   (/dev/vg00/lvol7 ) :
Free Extents by Size
    1:  823     2:  206     4:  206     8:   55
   16:  158    32:   61    64:   43   128:   48
  256:   23   512:   14  1024:    3  2048:    3
 4096:    1  8192:    1 16384:    0
Repack files and free space

fsadm -e has the following goals for files and free data space:
Make small files (default: <64k) one contiguous extent
Ensure that large files are built from large extents
Move small and recently used (default: <14 days) files near the inode area
Move large or old (>14 days since last access) files to the end of the allocation unit
Consolidate free space in the center of the data area
Repack directories

fsadm -d has the following goals for directories:
Remove unused space from between used directory entries
Pack directories and symbolic links into the inode immediate area if possible
Place directories and symbolic links first, then other files
Sort each area by time of last access
fsadm(1M) Overview
Because blocks are allocated and deallocated as files are added, removed, expanded, and truncated, block space can become fragmented. This can make it more difficult for JFS to take advantage of the benefits provided by a contiguous extent allocation. To remove fragmentation, HP OnlineJFS includes a utility called fsadm, which will take fragmented blocks and reallocate them as contiguous extents. The fsadm utility can be run on a live file system (including one containing active databases) safely without interrupting data access.
The fsadm utility will bring the fragmented extents of files closer together, group them by type and frequency of access, and compact and sort directories. The fsadm utility is typically run as a recurring scheduled job and is an effective tool for the management of a high-performance online file system. Even if database software used on top of the file system has its own defragmenter, this additional defragmentation is necessary to make the storage that the database engine sees as contiguous as possible. You can defragment (reorganize) your HP OnlineJFS file system using SAM or with fsadm(1M) directly from the command line. To use SAM:
1. Invoke SAM.
2. Select the Disks and File Systems functional area.
3. Select the File Systems application.
4. Select the JFS file system that you wish to reorganize from the directories' list.
5. Select the Actions menu.
6. Select the VxFS Maintenance menu item.
7. View reports on extent and directory fragmentation, then select Reorganize Extents or Reorganize Directories to defragment your JFS file system.
Reorganizing Options
-F vxfs     Specify the JFS file system type.

-D          Report on directory fragmentation. If specified in conjunction with the -d option, the fragmentation report is produced both before and after the directory reorganization.

-E          Report on extent fragmentation. If specified in conjunction with the -e option, the fragmentation report is produced both before and after the extent reorganization.

-d          Reorganize directories. Directory entries are reordered to place subdirectory entries first, then all other entries in decreasing order of time of last access. The directory is also compacted to remove free space.

-e          Extent reorganization. Attempt to minimize fragmentation. Aged files are moved to the end of the allocation units to produce free space. Other files are reorganized to have the minimum number of extents possible.

-s          Print a summary of activity at the end of each pass.

-v          Verbose. Report reorganization activity.

-a days     Consider files not accessed within the specified number of days as aged files. The default is 14 days. Aged files are moved to the end of the directory by the -d option and reorganized differently by the -e option.

-p passes   Maximum number of passes to run. The default is 5 passes. Reorganizations are processed until reorganization is complete or until the specified number of passes have been run.

-t time     Maximum time to run. Reorganizations are processed until reorganization is complete or the time limit has expired. time is specified in seconds.
If both the -t and -p options are specified, the utility exits if either of the terminating conditions is reached. If both the -e and -d options are specified, the utility will run all the directory reorganization passes before any extent reorganization passes. fsadm uses the file .fsadm in the lost+found directory as a lock file. When fsadm is invoked, it opens the file lost+found/.fsadm in the root of the file system specified by mount_point. If the file does not exist, it is created. The fcntl(2) system call is used to obtain a write lock on the file. If the write lock fails, fsadm assumes that another fsadm is running and fails. fsadm reports the process ID of the process holding the write lock on the .fsadm file.
The directory fragmentation report produced by the -D option lists one row per allocation unit (au 0, au 1, ..., total), with the columns described below.
The Dirs Searched column contains the total number of directories. A directory is associated with the extent-allocation unit containing the extent in which the directory's inode is located. The Total Blocks column contains the total number of blocks used by directory extents. The Immed Dirs column contains the number of directories that are immediate, meaning that the directory data is in the inode itself as opposed to being in an extent. Immediate directories save space and speed path name resolution. The Immeds to Add column contains the number of directories that currently have a data extent, but that could be reduced in size and contained entirely in the inode. The Dirs to Reduce column contains the number of directories for which one or more blocks can be freed, if the entries in the directory are compressed to make the free space in the directory contiguous. Because directory entries vary in length, large directories may contain a block or more of total free space, but with the entries arranged in such a way that the space cannot be made contiguous. As a result, it is possible to have a non-zero Dirs to
Reduce calculation immediately after running a directory reorganization. The -v (verbose) option of directory reorganization reports occurrences of failure to compress free space. The Blocks to Reduce column contains the number of blocks that can be freed if the entries in the directory are compressed.
Directory Reorganization
If the -d option is specified, fsadm will reorganize the directories on the file system whose mount point is mountpoint_dir. Directories are reorganized in two ways: compressing and sorting. For compression, the valid entries in the directory are moved to the front of the directory and the free space is grouped at the end of the directory. If there are no entries in the last block of the directory, the block is released and the directory size is reduced. If the directory entries are small enough, the directory is placed in the inode immediate data area. The entries in a directory are also sorted to improve path name lookup performance. Entries are sorted based on the last access time of the entry. The -a option is used to specify a time interval; 14 days is the default if -a is not specified. The time interval is broken up into 128 buckets, and all times within the same bucket are considered equal. All access times older than the time interval are considered equal, and those entries are placed last. Subdirectory entries are placed at the front of the directory and symbolic links are placed after subdirectories, followed by the most recently accessed files. The directory reorganization runs in one pass across the entire file system. The command line to reorganize directories of a file system is:
fsadm -d [-s] [-v] [-p passes] [-t timeout] [-r rawdev] [-D] /mountpoint_dir
The following example illustrates the output of the fsadm -d -s command:

# fsadm -d -s /home
Directory Reorganization Statistics
        Dirs      Dirs     Total   Failed   Blocks    Blocks   Immeds
        Searched  Changed  Ioctls  Ioctls   Reduced   Changed  Added
au 0    2343      1376     2927    1        209       3120     72
au 1     582       254      510    0         47        586     28
au 2     142        26       38    0         21         54     16
au 3      88        24       29    1          5         36      2
total   3155      1680     3504    2        282       3796    118
The Dirs Searched column contains the number of directories searched. Only directories with data extents are reorganized. Immediate directories are skipped. The Dirs Changed column contains the number of directories for which a change was made. The Total Ioctls column contains the total number of VX_DIRSORT ioctls performed. Reorganization of directory extents is performed using this ioctl. The Failed Ioctls column contains the number of requests that failed. The reason for failure is usually that the directory being reorganized is active. A few failures should be no cause for alarm. If the -v option is used, all ioctl calls and status returns are recorded. The Blocks Reduced column contains the total number of directory blocks freed by compressing entries. The Blocks Changed column contains the total number of directory blocks updated while sorting and compressing entries. The Immeds Added column contains the total number of directories with data extents that were compressed into immediate directories.
Determining Fragmentation
To determine whether fragmentation exists for a given file system, the free extents for that file system need to be examined. If a large number of small extents are free, there is fragmentation. If more than half of the amount of free space is taken up by small extents, (smaller than 64 blocks) or there is less than 5 percent of total file system space available in large extents, then there is serious fragmentation.
Extents larger than a threshold size are considered immovable. How large an extent must be to qualify as immovable can be controlled with the -l option. By default, largesize is 64 blocks, meaning that any extent larger than 64 blocks is considered to be immovable. For the purposes of the extent fragmentation report, the value chosen for largesize will affect which extents are reported as immovable. The following is an example of the output generated by the fsadm -E command:

# fsadm -E /home
Extent Fragmentation Report
                 Consolidatable
        Extents        Blocks
au 0        928          2539
au 1        461          5225
au 2        729          8781
au 3        139          1463
total      2257         18008
au 0  Free Blocks   217, Smaller Than 8 - 48%
au 1  Free Blocks   286, Smaller Than 8 - 41%
au 2  Free Blocks   510, Smaller Than 8 - 15%
au 3  Free Blocks  6235, Smaller Than 8 -  3%
au 4  Free Blocks  8551, Smaller Than 8 - 2%, Smaller Than 64 - 22%
    1:   29     2:   33     4:   30     8:   38
   16:   28    32:   29    64:   26   128:   11
  256:    8   512:    3  1024:    0  2048:    0
 4096:    0  8192:    0 16384:    0

total Free Blocks 15799, Smaller Than 8 - 4%, Smaller Than 64 - 24%
    1:   99     2:  116     4:   97     8:  109
   16:   58    32:   43    64:   30   128:   14
  256:   10   512:    5  1024:    1  2048:    1
 4096:    0  8192:    0 16384:    0
The numbers in the Files with Extents column indicate the total number of files that have data extents. A file is considered to be in the extent-allocation unit that contains the extent holding the file's inode. The Total Extents column contains the total number of extents belonging to files in the allocation unit. The extents themselves are not necessarily in the same allocation unit. The Total Blocks column contains the total number of blocks used by files in the allocation unit. If the total number of blocks is divided by the total number of extents, the resulting figure is the average extent size. The Total Distance column contains the total distance between extents in the allocation unit. For example, if a file has two extents, the first containing blocks 100 through 107 and the second containing blocks 110 through 120, the distance between the extents is 110 - 107, or 3. In general, a lower number means that files are more contiguous. If an extent reorganization is run on a fragmented file system, the value for Total Distance should be reduced. The Consolidatable Extents column contains the number of extents that are candidates to be consolidated. Consolidation means merging two or more extents into one combined extent. For files that are entirely in direct extents, the extent reorganizer will attempt to consolidate extents into extents up to size largesize. All files of size largesize or less typically will be contiguous in one extent after reorganization. Since most files are small, this will usually include about 98 percent of all files. The Consolidatable Blocks column contains the total number of blocks in Consolidatable Extents. The Immovable Extents column contains the total number of extents that are considered to be immovable. In the report, an immovable extent appears in the allocation unit of the extent itself, as opposed to in the allocation unit of its inode.
This is because the extent is considered to be immovable, and thus permanently fixed in the associated allocation unit. The Immovable Blocks column contains the total number of blocks in immovable extents. The figures under the Free Extents by Size heading indicate per-allocation unit totals for free extents of each size. The totals are for free extents of size 1, 2, 4, 8, 16, . . . up to a maximum of the number of data blocks in an allocation unit. The totals should match the output of df -o s unless there has been recent allocation or deallocation activity (as this utility acts on
mounted file systems). These figures give an indication of fragmentation and extent availability on a per-allocation-unit basis. For each allocation unit, and for the complete file system, the total free blocks and total free blocks by category are shown. The figures are presented as follows: The Free Blocks figure indicates the total number of free blocks. The Smaller Than 8 figure indicates the percentage of free blocks that are in extents less than 8 blocks in length. The Smaller Than 64 figure indicates the percentage of free blocks that are in extents less than 64 blocks in length.
In the preceding example, 4 percent of free space is in extents less than 8 blocks in length, and 24 percent of the free space is in extents less than 64 blocks in length. This represents a typical value for a mature file system that is regularly reorganized. The total free space is about 10 percent.
Extent Reorganization
If the -e option is specified, fsadm will reorganize the data extents on the file system whose mount point is mountpoint_dir. The primary goal of extent reorganization is to defragment the file system. To reduce fragmentation, extent reorganization tries to place all small files in one contiguous extent. The -l option is used to specify the size of a file that is considered large. The default is 64 blocks. Extent reorganization also tries to group large files into large extents of at least 64 blocks. In addition to reducing fragmentation, extent reorganization improves performance. Small files can be read or written in one I/O operation. Large files can approach raw-disk performance for sequential I/O operations. Extent reorganization also tries to improve the locality of reference on the file system. Extents are moved into the same allocation unit as their inode. Within the allocation unit, small files and directories are migrated to the front of the allocation unit. Large files and inactive files are migrated towards the back of the allocation unit. (A file is considered inactive if the access time on the inode is more than 14 days old. The time interval can be varied using the -a option.) Extent reorganization should reduce the average seek time by placing inodes and frequently used data closer together. fsadm will try to perform extent reorganization on all inodes on the file system. Each pass through the inodes will move the file system closer to the organization considered optimal by fsadm . The first pass might place a file into one contiguous extent. The second pass might move the file into the same allocation unit as its inode. Then, since the first file has been moved, a third pass might move extents for a file in another allocation unit into the space vacated by the first file during the second pass. When the file system is more than 90 percent full, fsadm shifts to a different reorganization scheme. 
Instead of attempting to make files contiguous, extent reorganization tries to defragment the free-extent map into chunks of at least 64 blocks or the size specified by the -l option.
The following example illustrates the output from the fsadm -F vxfs -e -s command:

# fsadm -F vxfs -e -s
Allocation Unit 0, Pass 1 Statistics
        Extents     Consolidations Performed      Total Errors
        Searched    Number   Extents   Blocks     File Busy   Not Free
au 0    2467        11       30        310        0           0
au 1       0         0        0          0        0           0
au 2       0         0        0          0        0           0
au 3       0         0        0          0        0           0
au 4       0         0        0          0        0           0
total   2467        11       30        310        0           0
        In Proper Location        Moved to Proper Location
        Extents    Blocks         Extents    Blocks
au 0    1379       8484           794        10925
au 1       0          0             0            0
au 2       0          0             0            0
au 3       0          0             0            0
au 4       0          0             0            0
total   1379       8484           794        10925

        Moved to Free Area        In Free Area
        Extents    Blocks         Extents    Blocks
au 0     231       4851             4          133
au 1       0          0             0            0
au 2       0          0             0            0
au 3       0          0             0            0
au 4       0          0             0            0
total    231       4851             4          133

        Could not be Moved
        Extents    Blocks
au 0       0          0
au 1       0          0
au 2       0          0
au 3       0          0
au 4       0          0
total      0          0
Allocation Unit 0, Pass 2 Statistics
        Extents     Consolidations Performed      Total Errors
        Searched    Number   Extents   Blocks     File Busy   Not Free
au 0    2467        0        0         0          0           0
au 1       0        0        0         0          0           0
au 2       0        0        0         0          0           0
au 3       0        0        0         0          0           0
au 4       0        0        0         0          0           0
total   2467        0        0         0          0           0
Note that the default five passes were scheduled, but the reorganization finished in two passes. This file system had not had much activity since the last reorganization, with the result that little reorganization was required. The time it takes to complete extent reorganization varies, depending on fragmentation and disk speeds. In general, however, extent reorganization may be expected to take approximately one minute for every 10 megabytes of disk space used.

In the preceding example:

- Extents Searched: the total number of extents examined.
- Number (under Consolidations Performed): the total number of consolidations, or mergings of extents, performed.
- Extents (under Consolidations Performed): the total number of extents that were consolidated. (More than one extent may be consolidated in one operation.)
- Blocks (under Consolidations Performed): the total number of blocks that were consolidated.
- File Busy (under Total Errors): the total number of reorganization requests that failed because the file was active during reorganization.
- Not Free (under Total Errors): the total number of reorganization requests that failed because an extent that the reorganizer expected to be free was allocated at some time during the reorganization.
- In Proper Location: the total extents and blocks that were already in the proper location at the start of the pass.
- Moved to Proper Location: the total extents and blocks that were moved to the proper location during the pass.
- Moved to Free Area: the total number of extents and blocks that were moved into a convenient free area in order to free up space designated as the proper location for an extent in the allocation unit being reorganized.
- In Free Area: the total number of extents and blocks that were in areas designated as free areas at the beginning of the pass.
- Could not be Moved: the total number of extents and blocks that were in an undesirable location and could not be moved. This occurs when there is not
enough free space to allow sufficient extent movement to take place. This often occurs on the first few passes for an allocation unit if a large amount of reorganization needs to be performed.

If the next-to-last pass of the reorganization run indicates extents that cannot be moved, then the reorganization fails. A failed reorganization may leave the file system badly fragmented, since free areas are used when trying to free up reserved locations. To lessen this fragmentation, extents are not moved into the free areas on the final two passes of the extent reorganizer, and the last pass of the extent reorganizer only consolidates free space.

To defragment a BaseJFS you need to perform the same steps you would for an HFS:
1. Back up the file system (with fbackup).
2. Make a new file system (with newfs).
3. Restore the data from tape (with frecover).
Using setext
The setext command can manipulate the extent allocation policies of a JFS file system on a file-by-file basis:

- Use setext to override default VxFS extent allocation policies:
  - Specify the extent size
  - Force files to be contiguous
  - Pre-reserve space for future contiguous growth
  - Prevent files from growing past the reservation
- Use getext to view file parameters
- Use ls -le to view extent parameters
Student Notes
setext specifies a fixed extent size for a file, and reserves space for a file. The file must already exist.
setext [-F vxfs] [-e extentsize] [-r reservation] [[-f flag]... ] file
-e extentsize     Specify fixed extent size (in file system blocks)
-r reservation    Pre-allocate space (in file system blocks)
-f align          All extents aligned on extentsize boundaries relative to the start of allocation units
-f contig         Reservation must be allocated contiguously
-f noextend       File may not be extended once pre-allocated space has been used
-f chgsize        Reservation is incorporated into the file; on-disk inode updated with size and block count information that includes the reserved space
-f noreserve      Reservation made as non-persistent allocation to the file; on-disk inode not updated; associated with the file until last close, then trimmed to the current file size
-f trim           Reservation is trimmed to the current file size upon last close by all processes that have the file open
Example using setext:

# touch bigfile.0 bigfile.1 bigfile.2
# /usr/sbin/setext -F vxfs -r 4096 -f contig bigfile.1
# /usr/sbin/setext -F vxfs -f align -e 128 bigfile.2
# cp bigfile bigfile.0
# cp bigfile bigfile.1
# cp bigfile bigfile.2
# ls -l bigfile*
-rw-r--r--   1 root   other   2691000 Nov  2 10:52 bigfile.0
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.1
-rw-r--r--   1 root   other   2691000 Nov  2 10:53 bigfile.2
# /usr/sbin/getext -F vxfs bigfile.*
bigfile.0:  Bsize 1024  Reserve    0  Extent Size   0
bigfile.1:  Bsize 1024  Reserve 4096  Extent Size   0
bigfile.2:  Bsize 1024  Reserve    0  Extent Size 128
Student Notes
JFS Tunable Parameters
read_pref_io     The preferred read request size. The file system uses this in conjunction with the read_nstream value to determine how much data to read ahead. Default value is 64K.

write_pref_io    The preferred write request size. The file system uses this in conjunction with the write_nstream value to determine how to do flush-behind on writes. Default value is 64K.

read_nstream     The number of parallel read requests of size read_pref_io to have outstanding at one time. The file system uses the product of read_nstream multiplied by read_pref_io to determine its read-ahead size. Default value for read_nstream is 1.

write_nstream    The number of parallel write requests of size write_pref_io to have outstanding at one time. The file system uses the product of write_nstream multiplied by write_pref_io to determine when to do flush-behind on writes. Default value for write_nstream is 1.
(Only the first four parameters are described here. Refer to the vxtunefs(1M) manual page for the remainder.)
# vxtunefs /tondir
Filesystem i/o parameters for /tondir
read_pref_io = 65536               # Preferred read request size is 64k
read_nstream = 1                   # Desired number of parallel read_pref_ios
read_unit_io = 65536
write_pref_io = 65536              # Preferred write request size 64k
write_nstream = 1                  # Desired number of parallel write_pref_ios
write_unit_io = 65536
pref_strength = 10
buf_breakup_size = 131072
discovered_direct_iosz = 262144    # Large I/Os treated like direct for speed
max_direct_iosz = 131072
default_indir_size = 8192
qio_cache_enable = 0
max_diskq = 1048576
initial_extent_size = 8
max_seqio_extent_size = 2048
max_buf_data_size = 8192
Student Notes
The slide shows the output of the vxtunefs command being used to query the configuration of a VxFS file system.
-o parameter=value
    Specify parameters for the file systems listed on the command line. The parameters are listed below.

-p
    Print the tuning parameters for all the file systems specified on the command line.
-s
    Set the new tuning parameters for the VxFS file systems specified on the command line or in the tunefstab file.
vxtunefs sets or prints tunable I/O parameters of mounted file systems. vxtunefs can set parameters describing the I/O properties of the underlying device, parameters to indicate when to treat an I/O as direct I/O, or parameters to control the extent allocation policy for the specified file system. With no options specified, vxtunefs prints the existing VxFS parameters for the specified file systems. vxtunefs works on a list of mount points specified on the command line, or all the mounted file systems listed in the tunefstab file. The default tunefstab file is /etc/vx/tunefstab. You can change the default using the -f option. vxtunefs can be run at any time on a mounted file system, and all parameter changes take immediate effect. Parameters specified on the command line override parameters listed in the tunefstab file. If /etc/vx/tunefstab exists, the VxFS-specific mount command invokes vxtunefs to set device parameters from /etc/vx/tunefstab.
/etc/vx/tunefstab Configuration
- The file is read every time a VxFS file system is mounted.
- Automatic, permanent vxtunefs options are implemented here.
- The file format is as follows:
    block-device     tunefs-options
    system-default   tunefs-options
- Options may be set for individual file systems or globally for all VxFS file systems.
Student Notes
The tunefstab file contains tuning parameters for VxFS file systems. vxtunefs sets the tuning parameters for mounted file systems by processing command line options or by reading parameters in the tunefstab file. Each entry in tunefstab is a line of fields in one of the following formats:

block-device     tunefs-options
system-default   tunefs-options

block-device is the name of the device on which the file system exists. If more than one line specifies options for a device, each line is processed and the options are set in order. Specifying system-default in place of block-device sets tunables for every device processed. If there is an entry for both a block device and a system default, the system default value takes precedence. Lines in tunefstab that start with the pound (#) character are treated as comments and ignored.
The tunefs-options correspond to the tuneable parameters that vxtunefs and mount_vxfs set on the file system. Each option in this list is a name=value pair. Separate the options by commas, with no spaces or tabs between the options and commas. See the vxtunefs(1M) manual page for a description of the supported options.
Examples
If you have a four-column striped volume, /dev/vg01/lvol3, with a stripe unit size of 128 kilobytes per disk, set the read_pref_io and read_nstream parameters to 128K and 4, respectively. You can do this in two ways:

/dev/vg01/lvol3    read_pref_io=128k
/dev/vg01/lvol3    read_nstream=4

or:

/dev/vg01/lvol3    read_pref_io=128k,read_nstream=4
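The sizing behind this recommendation can be checked with a quick calculation (a sketch only; the 128-KB stripe unit and four columns come from the example above):

```python
# Match VxFS read-ahead sizing to a striped volume's geometry.
# Values from the example: four-column stripe, 128-KB stripe unit.
stripe_unit_kb = 128
stripe_columns = 4

read_pref_io_kb = stripe_unit_kb      # preferred read size = one stripe unit
read_nstream = stripe_columns         # one parallel stream per column

# The file system reads ahead read_pref_io * read_nstream bytes at a time,
# which here covers exactly one full stripe width across all four disks.
readahead_kb = read_pref_io_kb * read_nstream
print(readahead_kb)  # 512
```

Tuned this way, a single read-ahead operation keeps all four disks of the stripe busy at once.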
To set the discovered direct I/O size so that it is always lower than the default, add the following line to the /etc/vx/tunefstab file:
/dev/dsk/c3t1d0    discovered_direct_iosz=128K
Issues for the Backup Snapshot File System
- Snapshot performance should be equivalent to normal JFS.
- Read performance should not be affected.
- Any writes after the snap will be 2-3 times slower.
- Subsequent writes to the same area will perform normally.
- Have the snapshot on a separate physical disk.
- Tests of OLTP show 15-20% degradation.
Student Notes
Performance of the Advanced (Snapped) File System
The write performance of the online (snapped) file system will be degraded, but the read performance will stay the same. It is important to ensure that the snapshot file system (the backup) resides on a different physical disk; otherwise backup I/O will use up valuable bandwidth.

Initial writes to a block after the snapshot is started will be 2 to 3 times slower, because each one requires three operations:
1. Read the old data.
2. Write the old data to the snapshot.
3. Write the new data.

Multiple snapshots would make this process even slower. Only the initial write suffers; subsequent changes are not recorded in the snapshot and therefore proceed at normal speed.
Overall impact will depend on the read to write ratio and the mixing of the I/O operations. For example, Oracle running an OLTP workload on a snapped file system was measured about 15 to 20% slower than a file system that was not being snapped.
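The first-write penalty described above can be sketched as a toy copy-on-write model (illustrative only; the block numbers and dict-based "disks" are invented for the sketch and are not JFS data structures):

```python
# Toy copy-on-write snapshot: the first write to a block after the snap
# copies the old data to the snapshot area; later writes to the same
# block skip the copy, which is why only the initial write is slow.
primary = {0: "old-a", 1: "old-b"}   # snapped (online) file system
snapshot = {}                         # snapshot area, ideally a separate disk

def write_block(block, data):
    if block in primary and block not in snapshot:
        # Steps 1-2: read the old data and write it to the snapshot.
        snapshot[block] = primary[block]
    # Step 3: write the new data.
    primary[block] = data

write_block(0, "new-a")    # slow path: old data copied to the snapshot first
write_block(0, "newer-a")  # fast path: block already preserved, plain write
print(snapshot[0])         # the snapshot still sees the pre-snap contents
```

The model also shows why a backup taken from the snapshot sees a frozen image: the snapshot area only ever receives pre-snap block contents.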
3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20 MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.

# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long

Record middle results: Real: _____________ User: ____________ Sys: ____________

4. Remount the JFS file system using the delaylog option. This helps performance of noncritical transactions. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o delaylog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Based on the results, does the disk_long program perform many noncritical transactions?
5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o tmplog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Record middle results: Real: _____________ User: ____________ Sys: ____________ Based on the results, why does the disk_long program show little improvement when mounted with tmplog?
6. Remount the JFS file system using the tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.

# cd /
# umount /vxfs
# mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
7. Remount the JFS file system using the direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.

# cd /
# umount /vxfs
# mount -o mincache=direct /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
Record results: Real: _____________ User: ____________ Sys: ____________ Based on the results, why does the disk_long program show poor performance results when mounted with mincache=direct? When would this option be appropriate to use?
http://education.hp.com
(Slide: the networking stack between client and server; nfsd, mountd, telnetd, and ftpd at the top, then the XDR, RPC, UDP/TCP, IP, Data Link, and Physical layers)
Student Notes
Networking allows one computer (the server) to communicate with and share its local files and directories with other computers (clients), even in a heterogeneous environment.
Network Protocols
NFSD, MOUNTD, FTPD, TELNETD
    The networking server daemons respond to requests from clients and perform the requested operations.

BIOD, FTP, TELNET
    The networking user applications request operations to be performed for them on the server.

XDR
    External data representation is a machine-independent data format used by applications to translate machine-dependent data formats to a universal format that can be used by other networking hosts using XDR.
RPC/Session Layer
    The remote procedure call mechanism allows a server machine to define a procedure that a client program can call. This is how a client can perform file system operations, such as creating, deleting, modifying, and viewing a directory; creating, deleting, modifying, and copying a file; and so on.

UDP/TCP
    Network protocols that efficiently move large amounts of data. Because there is no acknowledgement from the receiver, UDP is considered unreliable, whereas TCP is considered reliable. However, TCP generally has more overhead and therefore does not perform as well as UDP.

IP
    Internet protocol is a network protocol responsible for getting packets between hosts on one or more networks that are linked together.

Data Link
    The data link defines how the packets are assembled on the physical wire. Examples of data link protocols include IEEE 802.3 (CSMA/CD), IEEE 802.4 (Token Bus), and IEEE 802.5 (Token Ring).

Physical
    The physical layer describes the actual transfer media and how data is transferred on the network. Examples of physical media include twisted pair, coaxial, and fiber optics.
(Slide: NFS read sequence; on the client, the user process, biod, buffer cache, and the NFS code in the kernel; on the server, nfsd, buffer cache, and the NFS code in the kernel; server:/data is mounted on the client's /data; the numbered steps 1-8 correspond to the sequence described below)
Student Notes
As a prime example of how network performance can affect applications, let's look at how NFS works. The above slide shows a high-level overview of the sequence of events that occurs when an NFS client attempts to access data on an NFS server:

1. A user process issues the read() system call against an NFS-mounted file system. The user process goes into a wait state, waiting for the system call to return.
2. Upon checking the buffer cache for the requested data (assume the data is not in the buffer cache), the biod daemon immediately follows the original read with a read-ahead request. This is done by biod so subsequent I/O requests have a better chance of being satisfied through the buffer cache.
3. The NFS subsystem within the kernel on the client issues an RPC read request on behalf of the process (and a second on behalf of biod) to the NFS server.
4. The NFS server receives the request and schedules an nfsd process to handle it.
5. The nfsd daemon performs the file system read, and the data is returned to the nfsd daemon through the server's buffer cache.
6. The NFS subsystem within the kernel on the server schedules a reply to the client containing the requested data.
7. The data is returned to the client process through the buffer cache on the client. The data, plus the data read ahead by the biod daemon, is stored in both the client's and server's buffer caches to allow future I/O requests to come from the buffer caches.
8. The read system call is returned (along with the data) to the client process.

As you can see, NFS initiates a fair amount of traffic over the network. Other services, such as telnet and ftp, have their own performance profiles. Some are interactive, and response time is important. Others are task-oriented and rely mostly on throughput.
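The client-side effect of the read-ahead in steps 2 and 7, priming the buffer cache so later reads avoid RPC traffic, can be sketched like this (a simulation only, not real NFS code; the block numbers and RPC counter are invented for illustration):

```python
# Simulate an NFS client read with one-block read-ahead, as biod provides.
rpc_reads = 0                                        # RPC reads sent to server
server_blocks = {n: f"data{n}" for n in range(8)}    # file on the server
cache = {}                                           # client buffer cache

def nfs_read(block):
    global rpc_reads
    if block not in cache:
        # Cache miss: fetch the block, plus a read-ahead block (step 2).
        for b in (block, block + 1):
            if b in server_blocks and b not in cache:
                cache[b] = server_blocks[b]
                rpc_reads += 1
    return cache[block]

nfs_read(0)       # miss: two RPC reads (block 0 plus read-ahead of block 1)
nfs_read(1)       # hit: satisfied from the buffer cache, no RPC traffic
print(rpc_reads)  # 2
```

For sequential access this halves (or better) the number of round trips the user process must wait on, which is exactly the benefit biod is after.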
(Slide: nfsd daemons on the server draining the UDP socket, ports 2048-2050)
- The UDP socket is a 256-KB FIFO queue.
- The UDP socket is emptied by the nfsds.
- Too few nfsds cause NFS packets to back up in the queue.
Student Notes
NFS packets come into the NFS server through the UDP receive queue (port 2049). The size of this queue is 256 KB. The NFS packets are processed sequentially, FIFO. Upon receipt of an NFS packet, an nfsd daemon is awakened, removes the request from the queue, and processes the request. If requests come into the server faster than the daemons can process them, the UDP queue quickly begins to back up with requests. If the UDP queue is full when a new request arrives, the new request is dropped off the back of the queue. This is known as a UDP socket overflow. To prevent this, always have a sufficient number of daemons running. Regardless of how many nfsd daemons are running, only one will be awakened for each incoming request, so idle daemons cost little. This allows a site to meet the demands of peak workload without suffering performance problems during periods of light demand. NFS tuning can thus focus on file system and network performance rather than on CPU performance, since running extra nfsd daemons does not itself degrade performance.
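The overflow behavior of the bounded UDP receive queue can be sketched with a toy FIFO (the 256-KB socket buffer is modeled here as a request count, purely for illustration):

```python
from collections import deque

# Toy model of the NFS server's UDP receive queue: requests that arrive
# while the queue is full are dropped (a UDP socket overflow).
QUEUE_LIMIT = 4               # stand-in for the 256-KB socket buffer
queue = deque()
overflows = 0

def arrive(request):
    global overflows
    if len(queue) >= QUEUE_LIMIT:
        overflows += 1        # new request dropped off the back of the queue
    else:
        queue.append(request)

def nfsd_service():
    # One nfsd is awakened per request and removes it FIFO.
    return queue.popleft() if queue else None

for r in range(6):            # burst of 6 requests before any nfsd runs
    arrive(r)
print(len(queue), overflows)  # 4 2: queue full, two overflows
nfsd_service()                # an nfsd drains the oldest request first
```

Dropped requests are not lost forever; the client eventually times out and retransmits, but each overflow adds a full client time-out to the response time, which is why enough nfsds matter.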
(Slide: NFS over TCP; the server's NFS kernel code servicing the TCP socket on port 2049, with kernel TCP threads running under the nfsktcpd process)
Student Notes
Network File System (NFS) is now supported over the connection-oriented protocol, TCP/IP for NFS versions 2 and 3, in addition to running over User Datagram Protocol (UDP). TCP transport increases dependability on wide-area networks (WANs). Generally, packets are successfully delivered more consistently because TCP provides congestion control and error recovery. As a result, with this new functionality, NFS is now supported over WANs. As long as TCP is supported on the WAN, then NFS is supported also. The mount_nfs command now supports a proto= option on the command line where the value for proto can be either UDP or TCP. (In the past, this option was ignored.) This change allows the administrator to specify which transport protocol they wish to use when mounting a remote file system. If the proto= option is not specified, by default, NFS will attempt a TCP connection. If that fails, it will then try a UDP connection. Thus, by default, you will begin using TCP instead of UDP for NFS traffic when you begin using the 11i version of HP-UX. This should have little impact on you. You do, however, have the option to specify either UDP or TCP connections.
If you specify a proto= option, only the specified protocol will be attempted. If the server does not support the specified protocol, the mount will fail.

nfsd now opens TCP transport endpoints to receive incoming TCP requests. For TCP, the nfsktcpd is multithreaded. For UDP, the nfsd is still multiprocessed. Kernel TCP threads execute under the process nfsktcpd. When counting the number of nfsd processes, keep in mind the following algorithm: NUM_NFSDS nfsds that support UDP will be created per processor, and only one nfsd that supports TCP will be created. In the case of a four-way machine and NUM_NFSDS=4 (set in /etc/rc.config.d/nfsconf), 17 nfsds will be created: 16 for UDP (4 per processor) and 1 for TCP.

nfsstat will now report TCP RPC statistics for both client and server. The TCP statistics will be under the connection-oriented tag and the UDP statistics will be under the connectionless-oriented tag.
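The daemon-count algorithm above reduces to a one-line calculation (a sketch; the four-CPU, NUM_NFSDS=4 figures come from the example):

```python
# UDP nfsds are created per processor; a single nfsd handles TCP.
num_nfsds = 4     # NUM_NFSDS from /etc/rc.config.d/nfsconf
processors = 4    # four-way machine

udp_nfsds = num_nfsds * processors   # 16 (4 per processor)
tcp_nfsds = 1                        # kernel TCP threads run under nfsktcpd
total = udp_nfsds + tcp_nfsds
print(total)  # 17
```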
biod on Client

(Slide: biod daemons on the client sitting between the user process and the buffer cache; biods issue read-ahead requests alongside the process's read() calls, and take over queued write() requests, sending the read and write RPCs from the client's buffer cache to the NFS server)
Student Notes
The biod daemons allow the NFS client to maintain the illusion of having its file systems on local disks. The biod daemons assist in improving NFS client performance by performing read-aheads and write-behinds for the client processes.
Read-Ahead Requests
The biod daemons help read performance on NFS clients by reading ahead (that is, prefetching) data into the buffer cache so that when the client needs the data, it will be in its buffer cache. When an NFS client initiates a read request, and the data is not in its local buffer cache, the process performs the RPC read itself. To prefetch data for the buffer cache, the kernel has the biod daemons send additional RPC read requests to the NFS server, just as if the NFS client process had requested this data. Subsequent read requests by the client (especially if reading sequentially) will find the data already in the buffer cache.
Write-Behind Requests
The biod daemons assist in write performance by allowing the NFS client process performing the write() call to return immediately rather than waiting for the write() call to complete. When an NFS client performs a write() call, the data is written to the client's buffer cache. Once the data is in the buffer cache, the kernel schedules an RPC write to occur. If there are available biod daemons, the kernel can hand the RPC write to a biod daemon rather than the NFS client process. This allows the client process to continue its execution without having to wait for the write() call to return; the biod daemon waits for the write call instead. NOTE: Without any biod daemons on the client, NFS still works. The difference is that no read-aheads are done, causing NFS read performance to suffer, and all NFS clients performing writes are forced to wait for the RPC write requests to return, causing NFS write performance to suffer.
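The write-behind idea, in which the client process returns as soon as the data reaches the buffer cache and a biod handles the RPC later, can be sketched as a toy model (not real biod code; the dict-based cache and disk are invented for illustration):

```python
# Toy write-behind: write() returns after the data reaches the buffer
# cache; queued RPC writes are flushed later by a biod stand-in.
buffer_cache = {}
pending_rpc_writes = []       # RPC writes a biod will push to the server
server_disk = {}

def nfs_write(block, data):
    buffer_cache[block] = data            # data lands in the client cache
    pending_rpc_writes.append(block)      # RPC write scheduled for a biod
    return len(data)                      # the call returns immediately

def biod_flush():
    while pending_rpc_writes:             # the biod, not the client, waits
        block = pending_rpc_writes.pop(0)
        server_disk[block] = buffer_cache[block]

nfs_write(0, "hello")         # returns without touching the server
print(server_disk)            # {} -- nothing flushed to the server yet
biod_flush()
print(server_disk)            # the biod has completed the RPC write
```

The client's perceived write latency is the cache insert, not the network round trip, which is the whole benefit of write-behind.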
TELNET

(Slide: a telnet connection; the telnet client process bound to an ephemeral port on the client, the telnetd daemon on port 23 on the server, with numbered steps 1-7 showing the round trip through the two kernels)
Student Notes
Telnet also uses sockets. A socket is simply a system-port pair. A connection is a pair of sockets. On the client (when the user enters the telnet command), a port is assigned to the process from a pool of available ports. Thus a socket is formed on the client. A connection is established between that port and port 23 on the server (used exclusively to handle incoming telnet requests). On the server (as a result of the connection), a telnetd daemon is spawned and linked to port 23. Now, the telnet process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the telnetd daemon (4) to execute. The telnetd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the telnet process (7).
By default, telnet uses TCP for its transfers, since it needs to establish a firm connection between the client process and the server daemon.
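The socket mechanics described above, an ephemeral client port connecting to a well-known server port over TCP, can be demonstrated generically (plain sockets on localhost, not telnet itself; the echo-style "command" and "result" strings are invented for the sketch):

```python
import socket
import threading

# Generic TCP client/server: the server listens on a well-known port,
# the client gets an ephemeral port from the kernel's pool.
def server(listener):
    conn, addr = listener.accept()        # a connection = a pair of sockets
    cmd = conn.recv(1024)                 # the "command" packet arrives
    conn.sendall(b"result:" + cmd)        # the result is sent back
    conn.close()

listener = socket.socket()
listener.bind(("127.0.0.1", 0))           # 0 = any free port (stand-in for 23)
listener.listen(1)
port = listener.getsockname()[1]
threading.Thread(target=server, args=(listener,)).start()

client = socket.socket()
client.connect(("127.0.0.1", port))       # kernel assigns the client's port
client.sendall(b"uname")                  # command to the server
reply = client.recv(1024)                 # result back to the client
client.close()
print(reply)
```

The client never names its own port; the kernel picks it from the available pool, exactly as described for telnet.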
FTP

(Slide: an ftp session; the ftp client process using two ephemeral ports on the client, the ftpd daemon on ports 20/21 on the server, with numbered steps 1-14 showing the command and data round trips through the two kernels)
Student Notes
FTP also uses sockets. It uses a pair of connections to perform all its operations: one connection passes the commands and their results back and forth, while the other passes file data back and forth. On the client (when the user enters the ftp command), a port is assigned to the process from a pool of available ports. Thus a socket is formed on the client. A connection is established between that port and port 21 on the server (used exclusively to handle incoming ftp requests). On the server (as a result of the connection), an ftpd daemon is spawned and linked to port 21. Now, the ftp process running on the client (1) issues a request to execute some command on the server. The command is placed in a packet and sent through the socket on the client (2) to the socket on the server (3). The command is removed from the packet and given to the ftpd daemon (4) to execute.
The ftpd daemon executes the command and places the result in a packet. That packet is sent through the socket on the server (5) to the socket on the client (6). The results are removed from the packet and sent to the ftp process (7). If the command involves the transfer of some file data, the ftp process on the client (or the ftpd daemon on the server) initiates the transfer of the data from one socket to the other using port 20 on the server and another available port on the client. For example, let's say that the user entered the ftp command:

get /etc/hosts /tmp/hosts

When the command arrives at the ftpd daemon, it triggers a read of the /etc/hosts file from the server's file system into the server's buffer cache. Once there, the daemon (8) places the contents of the file into one or more packets (as necessary) and sends them to port 20 (9). The packets arrive at the socket in the client (10) and are reassembled into the image of the file in a network buffer. Then it is copied into a buffer in the buffer cache. The ftp process (11) acknowledges the receipt of the file by sending a packet to the socket (12) across the network to the socket on the server (13), where it is extracted and sent on to the daemon (14). By default, ftp uses TCP for its transfers, since it needs to establish two firm connections between the client process and the server daemon.
Student Notes
Number of nfsd daemons
Too few nfsd daemons can hinder performance on the NFS server. If all the nfsd daemons are busy when new NFS requests come in, then the requests have to wait until one of the daemons becomes free.
Number of nullrecv
If the nfsd daemons are not being kept busy, this counter will be incremented. If this counter is incrementing, try reducing the number of nfsd daemons on the system until nullrecv is static.
Number of badcalls
Bad calls indicate that the NFS server cannot process RPC requests. This could be due to authentication problems caused by having a user in too many groups, attempts to access exported file systems as root, or an improper secure RPC configuration. This can also be due to the server being down, or soft-mounted NFS file systems timing out.
Student Notes
Some key metrics to monitor from an overall network perspective include:

Amount of Traffic. The amount of network traffic should be monitored across the entire LAN. However, unless network probes are available, this is very difficult to do. At a minimum, the amount of traffic into and out of the servers should be monitored with the netstat command. When monitoring network traffic, it is important to know the maximum packets per second on the LAN. In the case of 10-Mbit Ethernet, this would be:

10 Mb / 8 bits_per_byte            = ~1.2 MB per second   (total MB per second)
1.2 MB / 1 KB_average_packet_size  = ~1,200 packets       (total packets per second)
1,200 * 30%_saturation_point       = 360 packets          (max packets per second with minimal collisions)
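The same budget, scripted (a sketch; note the exact division gives 1.25 MB/s, which the text rounds to ~1.2):

```python
# Packets-per-second budget for a 10-Mbit/s Ethernet segment.
bits_per_second = 10_000_000
mb_per_second = bits_per_second / 8 / 1_000_000    # 1.25, quoted as ~1.2 above
packets_per_second = 1200          # ~1.2 MB/s divided by a 1-KB average packet
max_packets = int(packets_per_second * 0.30)       # 30% saturation point
print(max_packets)  # 360
```

Sustained traffic above roughly this packet rate means the segment is past its minimal-collision operating point.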
Type of Network Topology: Each network topology has different limitations. Ethernet is the most common, but it is the slowest. More recent Ethernet technologies are faster, offering 100 Mbits/sec or even 1000 Mbits/sec. FDDI is the fastest, but it is somewhat expensive. Token Ring has no collision issues (since it is token based), but it is not as pervasive. Number of Subnets: Subnetting is a method for localizing traffic to help reduce packet congestion. If too much traffic exists on a network, it may need to be split into multiple subnets. Number of Routers: Routers are another possible solution to help segment network traffic. In addition, routers can help with network security issues and routing of diverse packet types.
Student Notes
The NFS workload on a server is defined as the total number of NFS packets received and processed. The NFS workload on a client is defined as the total number of NFS requests initiated from the client. It is important to establish a baseline regarding the NFS workload being placed on an NFS machine. This allows the system administrator to determine periods when the NFS workload is particularly high or low.
nullrecv    badlen      xdrcall     nfsdrun
0           0           0           549734423

Client rpc:
Connection oriented: N/A
Connectionless oriented:
calls       badcalls    retrans     badverfs    timers      toobig
17547240    0           0           0           7           0
3. Calculate the average number of NFS calls per second by dividing the total RPC calls by 5 days, 8 hours per day, 60 minutes per hour, and 60 seconds per minute:

((((171792344 calls / 5 days) / 8 hours) / 60 min) / 60 sec) = 1193 RPC calls/sec
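The same average, scripted (values from the step above):

```python
# Average RPC calls per second over a 5-day, 8-hour-per-day sample.
total_calls = 171_792_344
seconds = 5 * 8 * 60 * 60        # 144,000 seconds of sampled activity
rate = round(total_calls / seconds)
print(rate)  # 1193
```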
http://education.hp.com
# nfsstat -s
Server rpc:
Connection oriented:
calls       badcalls    nullrecv    badlen      xdrcall     dupchecks   dupreqs
0           0           0           0           0           0           0
Connectionless oriented:
calls       badcalls    nullrecv    badlen      xdrcall     dupchecks   dupreqs
428         0           6           0           0           0           0

# nfsstat -c
Client rpc:
Connection oriented:
calls       badcall     badxids     timeouts    newcreds    badverfs    timers
0           0           0           0           0           0           0
cantconn    nomem       interrupts
0           0           0
Connectionless oriented:
calls       badcalls    retrans     badxids     timeouts    newcreds    badverfs
25345       304         1109        49          0           0           0
timers      toobig      nomem       waits       bufulocks
16          0           0           0           0
Student Notes
The nfsstat -s report shows NFS statistics on an NFS server. The report shows overall RPC statistics and detailed NFS type packets received.
nullrecv
The example on the slide shows that all the RPC packets received are NFS related. The six nullrecvs account for the difference between the RPC calls and NFS calls.
Nullrecvs may be caused by duplicate requests resulting from client retransmissions. For example, if a client sends an NFS read request and does not receive a response within its time-out period, it re-sends the same request, which causes a duplicate entry to be in the server's UDP queue. When the first nfsd daemon removes the first NFS read request, it will also remove the duplicate request. This causes the second nfsd daemon to find an empty UDP queue when it executes.

nfsstat -s (Full Output)
# nfsstat -s
Server rpc:
Connection oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0
Connectionless oriented:
calls      badcalls   nullrecv   badlen     xdrcall    dupchecks  dupreqs
0          0          0          0          0          0          0

Server nfs:
calls      badcalls
0          0
Version 2: (0 calls)
null       getattr    setattr
0 0%       0 0%       0 0%
root       lookup     readlink
0 0%       0 0%       0 0%
read       wrcache    write
0 0%       0 0%       0 0%
create     remove     rename
0 0%       0 0%       0 0%
link       symlink    mkdir
0 0%       0 0%       0 0%
rmdir      readdir    statfs
0 0%       0 0%       0 0%
Version 3: (0 calls)
null       getattr    setattr
0 0%       0 0%       0 0%
lookup     access     readlink
0 0%       0 0%       0 0%
read       write      create
0 0%       0 0%       0 0%
mkdir      symlink    mknod
0 0%       0 0%       0 0%
remove     rmdir      rename
0 0%       0 0%       0 0%
link       readdir    readdir+
0 0%       0 0%       0 0%
The nfsstat -c report shows NFS statistics on an NFS client. The report shows the amount of RPC calls generated by the client, as well as the specific NFS calls.
waits
timeouts
badxids
retrans

This indicates the number of NFS requests retransmitted due to timeouts. Keep in mind, not every timeout causes a retransmission, as most clients error out after two to three retries.

badcalls

This indicates an NFS request has reached its retry count and has returned an error. This is most often due to the NFS client not being able to reach the NFS server (either because the NFS server is down, or the network link between the client and server is down).
Module 11 Network Performance

lookup     access     readlink
0 0%       0 0%       0 0%
read       write      create
0 0%       0 0%       0 0%
mkdir      symlink    mknod
0 0%       0 0%       0 0%
remove     rmdir      rename
0 0%       0 0%       0 0%
link       readdir    readdir+
0 0%       0 0%       0 0%
fsstat     fsinfo     pathconf
0 0%       0 0%       0 0%
commit
0 0%
Network Management
ID                              = 4
Description                     = lan0 Hewlett-Packard LAN Interface Hw Rev 0
Type (value)                    = ethernet-csmacd(6)
MTU Size                        = 1500
Speed                           = 10000000
Station Address                 = 0x800097bfb43
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 4834
Inbound Octets                  = 426550151
Inbound Unicast Packets         = 3380123
Inbound Non-Unicast Packets     = 1992200
Inbound Discards                = 0
Inbound Errors                  = 1277
Inbound Unknown Protocols       = 53618
Outbound Octets                 = 1653363768
Outbound Unicast Packets        = 2626023
Outbound Non-Unicast Packets    = 1454
Outbound Discards               = 1
Outbound Errors                 = 0
Outbound Queue Length           = 0
Specific                        = 655367

Press <Return> to continue
Ethernet-like Statistics Group

Index                           = 4
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 6221
Multiple Collision Frames       = 10151
Deferred Transmissions          = 116267
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 0
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0

LAN Interface test mode. LAN Interface Net Mgmt ID = 4
Student Notes
The lanadmin command displays general network packet transmission statistics for a single system.
Inbound/Outbound
The primary metric for determining if you have a network bottleneck is the ratio of collisions to outbound packets. In this example, you would take the total number of collisions (6221 + 10151 = 16371) and divide it by the total number of outbound packets (2626023 + 1454 = 2627477) to get the percentage of collisions per outbound packet (16371 / 2627477 = 0.6%). The commonly used threshold is 5%: any network experiencing a collision rate greater than 5% is said to have a bottleneck. This system is well below that threshold. Of course, this metric only works on networks that experience collisions; standard Ethernet does, token rings do not.

The procedure for producing this report is:

1. Execute the lanadmin command.
2. From the main menu, select lan.
3. From the lan menu, select display.

Following is a complete output from this tool:
# lanadmin
LOCAL AREA NETWORK ONLINE ADMINISTRATION, Version 1.0
Thu, Mar 25,2004 11:22:51
Copyright 1994 Hewlett Packard Company.  All rights are reserved.

Test Selection mode.

        lan     = LAN Interface Administration
        menu    = Display this menu
        quit    = Terminate the Administration
        terse   = Do not display command menu
        verbose = Display command menu
Enter command: lan

LAN Interface test mode. LAN Interface PPA Number = 0

        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display this menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu
Enter command: display

                      LAN INTERFACE STATUS DISPLAY
                       Thu, Mar 25,2004 11:23:02

PPA Number                      = 0
Description                     = lan0 HP PCI 10/100Base-TX Core [100BASE-TX,FD,AUTO,TT=1500]
Type (value)                    =
MTU Size                        =
Speed                           =
Station Address                 =
Administration Status (value)   = up(1)
Operation Status (value)        = up(1)
Last Change                     = 780
Inbound Octets                  = 1144058672
Inbound Unicast Packets         = 3513729
Inbound Non-Unicast Packets     = 2575374
Inbound Discards                = 0
Inbound Errors                  = 0
Inbound Unknown Protocols       = 13895
Outbound Octets                 = 784916247
Outbound Unicast Packets        = 3600289
Outbound Non-Unicast Packets    = 379474
Outbound Discards               = 0
Outbound Errors                 = 0
Outbound Queue Length           = 0
Specific                        = 655367

Press <Return> to continue
<CR>

Ethernet-like Statistics Group

Index                           = 1
Alignment Errors                = 0
FCS Errors                      = 0
Single Collision Frames         = 0
Multiple Collision Frames       = 0
Deferred Transmissions          = 0
Late Collisions                 = 0
Excessive Collisions            = 0
Internal MAC Transmit Errors    = 0
Carrier Sense Errors            = 0
Frames Too Long                 = 0
Internal MAC Receive Errors     = 0
LAN Interface test mode. LAN Interface PPA Number = 0

        clear    = Clear statistics registers
        display  = Display LAN Interface status and statistics registers
        end      = End LAN Interface Administration, return to Test Selection
        menu     = Display this menu
        ppa      = PPA Number of the LAN Interface
        quit     = Terminate the Administration, return to shell
        reset    = Reset LAN Interface to execute its selftest
        specific = Go to Driver specific menu
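The collision-to-outbound ratio described in the notes can be computed with a quick sketch like the following (the counter values are the ones from the first lanadmin display):

```shell
# Collision counters and outbound packet counters from the sample
# lanadmin display (Single/Multiple Collision Frames, Unicast/Non-Unicast)
single=6221
multiple=10151
ucast=2626023
nucast=1454

collisions=$((single + multiple))
outbound=$((ucast + nucast))

# A rate above the commonly used 5% threshold suggests a network bottleneck
awk -v c="$collisions" -v o="$outbound" \
    'BEGIN { printf "collision rate = %.2f%%\n", 100 * c / o }'
```

On a live system the same arithmetic would be fed from the lanadmin display fields shown above.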
# netstat -i
Name  Mtu   Network        Address                    Ipkts    Ierrs  Opkts    Oerrs  Coll
lan0  1500  156.153.208.0  r265c75.cup.edunet.hp.com  4546682  0      4138618  0      0
lo0   4136  loopback       localhost                  1178171  0      1178171  0      0
# netstat -p udp
udp:
        0 incomplete headers
        0 bad data length fields (Deleted from later versions)
        0 bad checksums
        0 socket overflows
        0 data discards (Deleted from later versions)
Student Notes
The netstat command can be used to monitor total collisions and total packet traffic in and out of a LAN card, as well as any UDP socket overflows. The -i option monitors input packets, input errors, output packets, output errors, and collisions for every LAN card on the system. Some versions of this tool did not show the Input Errors, Output Errors, and the Collisions. The output of this tool can also be used to calculate the collision/outbound packet ratio described in the previous topic. The -p udp option monitors overflows related to the UDP socket queue. If there are not enough nfsd daemons, the volume of incoming client NFS requests can exceed the server's ability to drain these requests from the UDP socket queue. When the socket queue becomes full and new NFS requests are received, the NFS request falls off the queue and a UDP socket overflow occurs.
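The socket-overflow check described above can be scripted. This is a minimal sketch, assuming the HP-UX output layout shown in the example (count first, label after):

```shell
# Warn if `netstat -p udp` output reports any socket overflows (a sign that
# incoming NFS requests are being dropped for lack of nfsd daemons)
check_udp_overflows() {
    awk '/socket overflows/ && $1 > 0 {
        print "WARNING: " $1 " UDP socket overflows -- consider more nfsd daemons"
    }'
}

# On a live system:  netstat -p udp | check_udp_overflows
```
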
B3692A GlancePlus B.10.12      10:47:57   e2403roc    9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                       |100%   100%  100%
Disk Util                                                       | 83%    22%   84%
Mem  Util                                                       | 94%    95%   96%
Swap Util                                                       | 21%    21%   22%
--------------------------------------------------------------------------------
NFS BY SYSTEM                                                   Users=   13
                        Server (inbound)           Client (outbound)
Idx System          ReadRt  WriteRt  SvcTm     ReadRt  WriteRt  SvcTm  NetwkTm
--------------------------------------------------------------------------------
  1 e2403roc           0.0      0.0   0.00        0.0      0.0   0.00     0.00
  2 e2403sto           0.0      0.0   0.00        0.0      0.0   0.00     0.00
  3 e2403alf           0.0      0.0   0.00        0.0      0.0   0.00     0.00
S - Select a System
C - cum/interval toggle
Page 1 of 1
Student Notes
The glance NFS report (the n key) monitors total inbound requests for NFS servers and total outbound requests for NFS clients. For NFS servers, the total number of inbound read/write requests received from each client is shown, along with the average amount of time for the server to service each request. For NFS client systems, the total number of outbound read/write requests sent to each NFS server is shown, along with the average amount of time, from the client perspective, for the requests to be serviced. For a detailed inspection of the types of NFS requests being sent (client) or received (server), the specific client or server can be selected with the S key.
B3692A GlancePlus B.10.12      10:45:26   e2403roc    9000/856   Current  Avg  High
--------------------------------------------------------------------------------
Cpu  Util                                                       |100%   100%  100%
Disk Util                                                       | 83%    22%   84%
Mem  Util                                                       | 94%    95%   96%
Swap Util                                                       | 21%    21%   22%
--------------------------------------------------------------------------------
NFS OPERATIONS for: e2403sto    Address = 15.19.83.75    PID = 1275
NFS GLOBAL ACTIVITY                                             Users=    1
                        Server (inbound)        Client (outbound)
                        Current      Cum        Current      Cum
--------------------------------------------------------------------------------
Read Rate                   0.0      0.0            0.0      0.0
Write Rate                  0.0      0.0            0.0      0.0
Read Byte Rate              0.0      0.0            0.0      0.0
Write Byte Rate             0.0      0.0            0.0      0.0
NFS Call Count                0        0              0        0
Bad Call Count                0        0              0        0
Service Time               0.00     0.00           0.00     0.00
Network Time                 na       na           0.00     0.00
Read/Write Qlen              na       na              0        0
Idle biods                   na       na             16       na
                                                                Page 1 of 3
Student Notes
The glance NFS system report (the N key) displays the activity of NFS packets being received by an NFS server, or being sent by an NFS client. If a system is both a client and a server, separate columns are maintained for each. The fields of most interest in this report are the read and write rates, as these typically put the greatest load on a system. Note that this is page one of three. On the following two pages, the individual RPCs are broken down by type and counted. There are version 2 and version 3 counts to accommodate earlier and later versions of NFS.
S - Select an Interface
Page 1 of 1
Student Notes
The glance Network by Interface report (the l key) displays the activity of inbound and outbound packets. The fields of most interest in this report are the inbound and outbound packet rates, as well as the KB transferred in and out by each network card. The lo0 interface is the internal loopback interface used for diagnostics.
Tuning NFS
Tune number of nfsd daemons
Turn on sticky bit for exported executables
Export file systems with asynchronous write option
Avoid using symbolic links on exported file systems
Tune number of biod daemons
Tune mount options when mounting NFS file system: rsize and wsize options
Student Notes
There are a number of NFS tuning solutions that can help to improve performance on NFS servers:

Tune number of nfsd daemons: The default number of nfsd daemons in HP-UX 11.00 and earlier was four. This is most likely too small. The best recommendation for performance is to have two nfsd daemons for each simultaneous disk operation that can be performed. This allows one request to be received while another is awaiting disk service. For example, on a system with four SCSI controllers and NFS-exported file systems spanning disks on these controllers, schedule eight nfsd daemons. In 11i, the default number of nfsd daemons was raised to 16, which is a more reasonable number. The best indicator of too few nfsd daemons is UDP socket overflows. Increase the number of nfsd daemons if even one UDP socket overflow occurs. The size of the UDP socket queue can be viewed with the netstat -an | grep udp | grep 2049 command. Another indicator of too few nfsd server daemons is a high total of badxids being returned to NFS clients. Remember, only UDP requires the number of nfsds to be tuned; TCP uses multiple threads in the same daemon.
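The sizing rule above works out to simple arithmetic; the controller count below is a hypothetical input:

```shell
# Rule of thumb from the notes: two nfsd daemons per simultaneous disk
# operation the server can perform
controllers=4                  # hypothetical: controllers behind the exported file systems
nfsds=$((2 * controllers))
echo "recommended nfsd daemons: $nfsds"
```
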
Turn on sticky bit for exported executables: By default, text segments are not paged to swap, as their pages already exist on the file system. In the case of an executable program being loaded from an NFS server across the network to an NFS client, it is desirable to page the text locally, rather than return to the NFS server when the text page is needed again. This behavior can be achieved by setting the sticky bit to ON for the executable program. Below is an example of setting the sticky bit to ON for an executable:

# chmod 1555 prgm
# ls -l prgm
-r-xr-xr-t   1 root   bin
This also requires modifying the following tunable kernel parameter on the client: page_text_to_local = 1

There are a number of NFS tuning solutions that can help to improve performance on NFS clients:

Tune number of biod daemons: The default number of biod daemons in HP-UX 11.00 and earlier was four. This is most likely too small. The best recommendation is to have a minimum of two biod daemons for every client process performing I/O to and from the NFS file system. Each biod daemon has, at most, one NFS request outstanding at any time, so as the number of biod daemons increases, the client can have more requests in flight. If the client has x processes performing file system I/O and y biod daemons, then the client could have x+y RPC requests outstanding at one time: one for each of the biod daemons, and one for each of the client processes. In 11i, the default number of biod daemons was raised to 16, which is a more reasonable number. The best indicator of too few biod daemons is the number of waits shown in the nfsstat -c command.

Tune the NFS Mount Options: There are a number of NFS mount options that can affect client performance, among them the NFS read and write buffer sizes. The NFS buffer size (specified with the rsize and wsize mount options) determines the increment in which data is transferred to and from the NFS file system. For example, if the file system block size is 8192 bytes and the NFS buffer size were 8500 bytes, two file system I/Os would be required before any NFS packet could be sent. The recommendation is to match the NFS buffer size to the file system block size. The default NFS buffer size is 8192 bytes, which does match the default file system block size on HFS. For JFS, try to match the buffer size to the size of a typical extent.
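The x + y rule for outstanding client requests can be sketched as follows; the process count is hypothetical, and 16 is the 11i default biod count mentioned above:

```shell
# A client with x I/O processes and y biod daemons can have x + y
# NFS RPC requests outstanding at one time
procs=6      # hypothetical processes doing NFS file system I/O
biods=16     # the 11i default number of biod daemons
echo "max outstanding NFS requests: $((procs + biods))"
```
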
Student Notes
Subnetting a network is an effective way to reduce congestion on a LAN. Using routers (as compared to bridges, Ethernet switches, and Ethernet-to-FDDI concentrators) provides a great deal of flexibility in the form of security, network segmentation, and routing of diverse types of packets. Routers usually provide good throughput and performance at a relatively low cost. Using an existing computer system as a gateway for traffic between NFS clients and the file server is often inefficient and limits the performance of the NFS clients. By making sure the maximum transmission unit (MTU) is the same on the client system, the file server, and all routers between them, the overhead on the routers caused by packet fragmentation and reassembly can be avoided.
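A minimal sketch of an MTU consistency check, assuming `netstat -i` style output on stdin (interface name in column 1, MTU in column 2, one header line):

```shell
# Report interfaces whose MTU differs from that of the first interface listed
mtu_check() {
    awk 'NR == 2 { ref = $2 }
         NR > 2 && $2 != ref { print $1 " MTU " $2 " differs from " ref }'
}

# Run on each host and router:  netstat -i | mtu_check
```
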
# netstat -i
Name   Mtu   Network      Address    Ipkts    Ierrs  Opkts    Oerrs  Coll
ni0*   0     none         none       0        0      0        0      0
ni1*   0     none         none       0        0      0        0      0
lo0    4608  loopback     localhost  6055     0      6055     0      0
lan0   1500  156.153.192  pr1w1      3724729  0      1705240  10     34739
lan1*  1500  none         none       0        0      0        0      0
The route command (-p option) can be used to set the Path MTU size for a host route only.
Add Subnets
100 Mb/s
Student Notes
If the average client demand on an NFS server is measured to be greater than the network bandwidth, and assuming 100 clients each demand 10 NFS requests per second, then a single 10-MB Ethernet segment (with a calculated maximum of 360 packets per second) could not handle this workload, even though the server itself may be able to (from a processing standpoint). To allow this client workload to be processed by the single NFS server, the following network configurations can be implemented:

1. Use at least three network interface cards, one for each segment, distributing 33-34 clients per segment.
2. Use one or more high-speed network connections, which connect to multiple lower bandwidth LAN segments.

In the first example, we have added multiple LAN interfaces to our NFS server. In the second example, we have a 100-MB/second FDDI card on the NFS file server. We also have a router on the same segment as the server that has an FDDI interface, as well as
several regular 10-MB/second Ethernet interfaces. Here, the issue of the router's ability to do packet fragmentation and reassembly efficiently may become important. In our last example, we have a 100-MB/second FDDI card on the NFS file server and a 100-MB/second translating Ethernet switch on the same FDDI segment. Since this is not routed, the file server and clients share the same subnet address. There are many other possible network topologies.
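The segment arithmetic behind the first recommendation can be sketched as follows, using the demand and capacity figures stated in the notes:

```shell
clients=100
reqs_per_client=10
demand=$((clients * reqs_per_client))      # 1000 NFS requests/second offered
capacity=360                               # stated maximum for one 10-Mb segment

# Ceiling division: how many segments are needed to carry the demand
segments=$(( (demand + capacity - 1) / capacity ))
echo "segments needed: $segments"
```

This yields the "at least three" segments the notes recommend.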
2. Export the JFS file system so the client can mount it.
# exportfs -i -o root=client_hostname /vxfs
# exportfs
3. From the client system, mount the NFS file system.
# mount server_hostname:/vxfs /vxfs
4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results:
# timex cat /vxfs/file* > /dev/null
Record results:
Real: _____________ User: ____________ Sys: ____________

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:
# timex cat /vxfs/file* > /dev/null
Record results:
Real: _____________ User: ____________ Sys: ____________

Moral: Try to have a buffer cache on the client system big enough for a lot of data to be cached. Also, biod daemons will help prefetch data.
6. Test to see if fewer biod daemons will change the initial performance.
# cd /
# umount /vxfs
# kill $(ps -e |grep biod|cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null
Record results:
Real: _____________ User: ____________ Sys: ____________

7. Once finished, remove the files and umount the file system.
# rm /vxfs/file*
# umount /vxfs
During this lab, the monitoring tools shown below should be used on the client and server.

CLIENT                                    SERVER
# nfsstat -c                              # nfsstat -s
# glance NFS report (n key)               # glance NFS report (n key)
# glance Global Process (g key)           # glance Global Process (g key)
  - monitor biod daemons                    - monitor nfsd daemons
                                          # glance Disk report (d key)
                                            - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.
# mount -o vers=2 server_hostname:/vxfs /vxfs
2. Terminate all the biod daemons on the client.
# kill $(ps -e |grep biod|cut -c1-7)
3. Time how long it takes to copy the vmunix file to the mounted NFS file system. Record the results. The first command buffers the file.
# cat /stand/vmunix > /dev/null
# timex cp /stand/vmunix /vxfs
Record results:
Real: _____________ User: ____________ Sys: ____________

4. Now, start up the biod daemons, and retry timing the copy. Record the results:
# /usr/sbin/biod 4
# timex cp /stand/vmunix /vxfs
Record results:
5. Change the mount options to version 3 and retime the transfer:
# cd /
# umount /vxfs
# mount -o vers=3 server_hostname:/vxfs /vxfs
# cd /
# timex cp /stand/vmunix /vxfs
Record results:
Real: _____________ User: ____________ Sys: ____________

6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.
# ftp server_hostname
# put /stand/vmunix /vxfs/vmunix.ftp
How long did the FTP transfer take? _________
Explain the difference in performance.

7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.
# umount /vxfs
# mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs
Perform the copy test again and compare the results with the TCP version 3 mount data in part 5. Is UDP quicker than TCP?
# timex cp /stand/vmunix /vxfs
Dynamic
Automatic
constantly being tuned by the kernel
can be set manually to a fixed value
Student Notes
There are a number of tunable parameters within the kernel that can have a big impact on performance. Changing these parameters may require that a new kernel be compiled. As of 11i v1, about 12 parameters were converted to dynamically tunable parameters. That is, their values could be changed without rebuilding the kernel and without rebooting the system. As of 11i v2, there are now around 36 dynamically tunable parameters, plus a few traditional parameters that are now tuned by the kernel, so no manual tuning of them need be done at all. Static kernel parameters have been around since UNIX was first designed. In order to change one of these parameters, it was necessary to alter the contents of a system configuration file, system, rebuild the kernel using this altered configuration file, move the new kernel into place, and reboot the system to activate the new kernel. This tended to be time consuming and forced the system to become unavailable for a time. Recently, with HP-UX 11i v1, a few kernel parameters were converted to dynamic tuning. These parameters could be altered, using SAM or kmtune, and the changes would become effective immediately. There was no longer a need to rebuild the kernel or reboot the system. However, this only applied to those few kernel parameters. The vast majority of kernel parameters were still static. The dozen parameters that were made dynamically tunable,
were ones that tended to be tuned by system administrators more frequently, but were relatively easy to convert to dynamic. More recently, with HP-UX 11i v2, several more parameters were converted to dynamic tuning. These parameters were also tuned fairly frequently by system administrators, but were more difficult to convert to dynamic. At the same time, a new class of parameters was introduced: automatic. These parameters are tuned by the kernel constantly in response to changing conditions in the system. However, the system administrator can override the automatic handling by the kernel and force the parameter to some fixed value, if needed.

At HP-UX 11i v1, the following kernel parameters became dynamic:

core_addshmem_read    core_addshmem_write    maxfiles_lim
maxtsiz               maxtsiz_64bit          maxuprc
msgmax                msgmnb                 scsi_max_qdepth
semmsl                shmmax                 shmseg

At HP-UX 11i v2, the following additional kernel parameters became dynamic:

aio_listio_max        aio_max_ops            aio_monitor_run_sec
aio_prio_delta_max    aio_proc_thread_pct    aio_proc_threads
aio_req_per_thread    alloc_fs_swapmap       alwaysdump
dbc_max_pct           dbc_min_pct            dontdump
fs_symlinks           ksi_alloc_max          max_acct_file_size
max_thread_proc       maxdsiz                maxdsiz_64bit
maxssiz               maxssiz_64bit          nfile
nflocks               nkthread
nproc                 nsysmap                nsysmap64
physical_io_buffers   shmmni                 vxfs_ifree_timelag

Also at HP-UX 11i v2, the following kernel parameters are obsolete or automatic:

bootspinlocks         clicreservedmem        maxswapchunks
maxusers              mesg                   ncallout
netisr_priority       nni                    ndilbuffers
sema                  semmap                 shmem
spread_UP_drivers
Student Notes
Some general rules and notes regarding tuning and recompiling the kernel:

View the existing tunable parameters with the kctune command (HP-UX 11i v2), the kmtune command (HP-UX 11.00 and 11i v1), or the sysdef or system_prep commands (HP-UX 10.x). You can also use SAM with any version of HP-UX to view the current values. Examples of the output are shown below.

Use the System Administration Manager (SAM) to tune the kernel parameters and rebuild the system. SAM has the advantage of displaying all available tunable parameters, their current values, and a range of acceptable values. SAM also knows which parameters can be tuned dynamically and will make changes to them immediately. As of HP-UX 11i v2, SAM calls a separate utility to do the actual tuning.

When tuning performance by modifying kernel parameters, modify only one value with each kernel rebuild. By changing several parameters at once, you may cloud the picture and make it much more difficult to determine what helped and what hurt the system's performance.
Avoid setting the tunable parameters too large. Many of the parameters create in-core memory data structures whose size is dependent upon the value of the tunable parameter (for example, nproc determines the size of the process table). Generally, it is a good rule of thumb to increase or decrease a parameter by no more than 20% while trying to find the best setting for it. Of course, if you are changing a parameter's value to accommodate some new application you are installing, always follow the manufacturer's suggested changes.

Use glance to monitor system table sizes. Ensure the system tables are not running out of entries. In general, there should be around 20% unused entries in any table. This will ensure that you have enough entries to handle any high-demand periods.
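The 20% headroom rule can be scripted as a quick check; the table name and usage figures below are hypothetical examples of values you might read from glance:

```shell
# Warn when a kernel table has less than 20% unused entries
check_table() {    # $1 = table name, $2 = entries in use, $3 = table size
    free=$(( 100 * ($3 - $2) / $3 ))
    if [ "$free" -lt 20 ]; then
        echo "$1: only ${free}% free entries -- consider raising it"
    fi
}

check_table nfile 14000 16384    # hypothetical usage figures
```
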
The step-by-step procedure for tuning and recompiling the kernel manually on HP-UX 11.X is shown below:

1. Log in as superuser.
2. Change directory:
   # cd /stand/build
3. Create a system file from your current kernel:
   # /usr/lbin/sysadm/system_prep -v -s system
4. Modify the /stand/build/system file as desired.
5. Build the kernel:
   # /usr/sbin/mk_kernel -s system
6. Save your old system and kernel files, just in case you want to go back:
   # cp /stand/system /stand/system.prev
   # cp /stand/vmunix /stand/vmunix.prev
   # cp /stand/dlkm /stand/dlkm_vmunix.prev
7. Schedule the kernel update on the next reboot:
   # kmupdate
8. Shut down and reboot from your new kernel:
   # /sbin/shutdown -ry 0
Example using kmtune to set and then activate a new value for a dynamic kernel variable:

# kmtune -q shmseg
Parameter           Current   Dyn  Planned          Module   Version
=====================================================
shmseg              120       Y    120
# kmtune -s shmseg=155
# kmtune -l -q shmseg
Parameter:    shmseg
Current:      120
Planned:      155
Default:      120
Minimum:
Module:
Version:
Dynamic:      Yes
# kmtune -u shmseg
shmseg has been set to 155 (0x9b).
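Before changing a parameter, a script can check the Dyn column of kmtune -q to decide whether the change will take effect without a rebuild or reboot. A sketch, assuming the tabular output format shown above:

```shell
# Succeeds (exit 0) if the named parameter's Dyn column is Y
is_dynamic() {
    awk -v p="$1" '$1 == p { found = 1; dyn = $3 }
                   END { exit (found && dyn == "Y") ? 0 : 1 }'
}

# On a live 11i system:
#   kmtune -q shmseg | is_dynamic shmseg && echo "no reboot needed"
```
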
Student Notes
The next few slides will present the tunable kernel parameters in these categories.
Student Notes
dbc_min_pct specifies the minimum size that the system's buffer cache may shrink to, as a percentage of physical memory. It is now dynamic in 11i v2.

dbc_max_pct specifies the maximum size that the system's buffer cache may grow to, as a percentage of physical memory. It is now dynamic in 11i v2.

nbuf is used to specify the number of file system buffer cache headers. Set nbuf to zero if you want to use the system's ability to grow and shrink this important table dynamically, based on demand. It is not yet obsolete, but expect it to be so in a future release.

bufpages specifies the number of 4-KB pages in memory that will be allocated for the file system buffer cache. Like nbuf, this parameter should be set to zero if you want to use the dynamic form of buffer cache allocation. If this value is non-zero, enough nbufs (one for every two bufpages) will be created as well, unless otherwise specified. It is not yet obsolete, but expect it to be so in a future release.
fs_async specifies that file system data structures may be posted to disk asynchronously. While this can speed file system performance for some applications, it increases the risk that a file system will be corrupted in the event of a system power loss.

maxfiles specifies the soft limit on the number of files that a single program may have open at one time. A program may exceed this soft limit up to the value of maxfiles_lim. In 11i v2, maxfiles is computed at boot and is set to 512 if memory is less than 1 GB; otherwise it is set to 2048.

maxfiles_lim is the hard limit on the number of files that a single program can have open at one time. This parameter was made dynamic in 11i v1 and the default value was set to 4096.

nfile is the size of the file table in memory, and therefore defines the maximum number of files that may be open at any one time on the system. Every process uses at least three file descriptors. Be generous with this number, as the required memory is minimal. nfile depends on the parameters nproc, maxusers, and npty. This parameter was made dynamic in 11i v2 and is no longer dependent on maxusers. Its value is computed at boot time and is set to 16384 if memory is less than 1 GB; otherwise it is set to 65536.

ninode is the size of the HFS in-core inode table. By caching inodes in memory, the amount of physical I/O is decreased when accessing files. Each unique HFS file open on the system has a unique inode. This table is hashed for performance. At boot time in 11i v2, it is set to 4880 if memory is less than 1 GB; otherwise it is set to 8196.

nflocks is the number of file locks available on the system. File locks are a kernel service that enables applications to safely share files. Databases or other applications that make use of the lockf() system call can be large consumers of file locks. Note that one file may have several locks associated with it. This parameter was made dynamic in 11i v2; at boot time, if memory is less than 1 GB, it is set to 1200; otherwise it is set to 4096.

vx_ncsize, along with ninode, controls the size of the DNLC (directory name lookup cache). Recent directory path names are stored in memory to improve performance. This parameter is set in bytes. It has been obsoleted in 11i v2; VxFS 3.5 now uses its own internal DNLC.
Student Notes
Message queues are used by applications to transfer a small to medium amount of information from exactly one process to another process. This information could be in the form of a structure, a string, a numerical value, or any combination thereof. SVIPC message queues have been around for a long time. They are controlled by a number of tunable kernel parameters.

mesg, when set (mesg = 1), enables the message queue services in the kernel. This parameter is obsolete as of 11i v2.

msgmap specifies the size of the free-space map used in allocating message buffer segments for messages.

msgmax specifies the maximum size in bytes of an individual message. This parameter is dynamic at HP-UX 11i v1.

msgmnb specifies the maximum total space consumed by all messages in a queue. This parameter is dynamic at HP-UX 11i v1.

msgmni specifies the maximum number of message queue identifiers allowed on the system at one time. Each message queue has an associated message queue identifier stored in non-swappable kernel memory. In 11i v2, the default was raised to 512.

msgseg is the number of segments in the system-wide message buffer. In 11i v2, the default was raised to 8192.

msgssz is the size in bytes of each message buffer segment. In 11i v2, the default was raised to 96.

msgtql is the total number of messages that can reside on the system at any one time. In 11i v2, the default was raised to 1024.
Any of these parameters could affect the performance of an application, simply by virtue of not having enough message queue resources available when needed. However, the msgssz and msgseg parameters also control the size of an in-memory message buffer that is shared by all SVIPC message queues. It needs to be large enough to handle all the messages that may be pending at any one time but, by the same token, should not be much larger than that, or it could take up far more memory than is necessary. It is not dynamic; it is fixed in size. POSIX message queues also exist in HP-UX 11.x. There are no tunable parameters for them. POSIX message queues have been shown to consistently out-perform SVIPC message queues.
Student Notes
Semaphores are another form of interprocess communication. Semaphores are used mainly to keep processes properly synchronized to prevent collisions when accessing shared data structures. Semaphores are typically incremented or decremented by a process to block other processes while it is performing a critical operation or using a shared resource. When finished, it decrements or increments the value, allowing blocked processes to then access the resource. Semaphores can be configured as binary semaphores with only two values, 0 and 1, or they can serve as general semaphores (or counters), where one process increments/decrements the semaphore and one or more cooperating processes decrement/increment it. SVIPC semaphores have been around for a long time. They are controlled by several tunable parameters.

sema (Series 700 only) enables or disables IPC semaphores at system boot time. This parameter is obsolete as of 11i v2.

semaem is the maximum value by which a semaphore can be changed in a semaphore undo operation.
semmap
semmap is the size of the free-semaphores resource map for allocating requested sets of semaphores. This semaphore is obsolete as of 11i v2. semmni is the maximum number of sets of IPC semaphores allowed on the system at any given time. In 11i v2, the default was raised to 2048. semmns is the total system-wide number of individual IPC semaphores available to system users. In 11i v2, the default was raised to 4096. semmnu is the maximum number of processes that can have undo operations pending on any given IPC semaphore on the system. In 11i v2, the default was raised to 256. semume is the maximum number of IPC semaphores on which a given process can have undo operations pending. In 11i v2, the default was raised to 100. semvmx, the maximum value any given IPC semaphore is allowed to reach, prevents undetected overflow conditions). Until 11i v2, semmsl was an untunable value in the kernel. It specified the maximum number of semaphores that could be allocated to a specific semaphore set. In 10.X it was set to 500. In 11.00, it was set to 2048. Now it is a dynamic tunable.
Any of these parameters could affect the performance of an application, simply by virtue of not having enough semaphore resources available when needed. HP-UX 11.x also provides POSIX semaphores, which have no tunable parameters. POSIX semaphores have been shown to consistently outperform SVIPC semaphores.
Student Notes
Shared memory is reserved memory space for storing data shared between or among cooperating processes. Sharing a common memory space eliminates the need to copy or move data to a separate location before it can be used by other processes, reducing processor time and overhead as well as memory consumption. Shared memory is allocated in swappable, shared memory space; the data structures for managing shared memory are located in the kernel. Shared memory segments are much preferred by memory-intensive applications, such as databases, since they can be very large and can be accessed without using system calls.

SVIPC shared memory uses the following tunable parameters:

shmem, when set to true, enables the shared memory subsystem at boot time. This parameter is obsolete in 11i v2.

shmmax specifies the maximum shared memory segment size. Dynamic in 11i v1. In 11i v2, the default was raised to 1 GB.

shmmni specifies the maximum number of shared memory segments allowed on the system at any one time. Dynamic in 11i v2. In 11i v2, the default was raised to 400.
shmseg specifies the maximum number of shared memory segments that can be simultaneously attached (via shmat()) to a single process. Dynamic in 11i v1. In 11i v2, the default was raised to 300.
Any of these parameters could affect the performance of an application, simply by virtue of not having enough shared memory resources available when needed. HP-UX 11.x also provides POSIX shared memory, which has no tunable parameters. POSIX shared memory segments are implemented through the memory-mapped file architecture, so they could be affected by some of the file system tunable parameters described earlier.
(Slide table: process-management tunables and default values, among them the maximum 32- and 64-bit process RSE stack size (IA-64 only), the maximum number of concurrent processes per user ID, the maximum number of processes system-wide, and the maximum time a process can have the CPU before yielding to the next highest priority, set in ticks (10 ms); listed defaults include 8 MB, 64 MB, and 8 MB.)
Student Notes
Configurable kernel parameters for process management serve several purposes:

- Manage the number of processes on the system and processes per user to keep system resources effectively distributed among users for optimal overall system operation.
- Manage allocation of CPU time to competing processes at equal and different priority levels.
- Allocate virtual memory among processes, protecting the system and competing users against the unreasonable demands of abusive or runaway processes.

maxdsiz defines the maximum size of the static data storage segment of an executing 32-bit process. In 11i v2, this default has been raised to 1 GB.

maxdsiz_64bit defines the maximum size of the static data storage segment of an executing 64-bit process. In 11i v2, this default has been raised to 4 GB.

maxssiz defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 32-bit process.
maxssiz_64bit defines the maximum size of the dynamic storage segment (DSS), also called the stack segment, of an executing 64-bit process. In 11i v2, this default has been raised to 256 MB.

maxtsiz defines the maximum size of the shared text segment (program storage space) of an executing 32-bit process; maxtsiz_64bit is the corresponding parameter for 64-bit processes on HP-UX 11.

maxrsessiz defines the maximum size of the register stack engine (RSE) stack segment of an executing 32-bit process. This parameter is found only on an IA-64 kernel.

maxrsessiz_64bit defines the maximum size of the RSE stack segment of an executing 64-bit process. This parameter is found only on an IA-64 kernel.

maxuprc establishes the maximum number of simultaneous processes available to each user on the system. The user ID number identifies a user; the superuser is immune to this limit. In 11i v2, this default is now set to 256.

nproc specifies the maximum total number of processes that can exist simultaneously in the system. This parameter has been made dynamic in 11i v2, and the new default setting is 4200.

timeslice is the amount of CPU time one thread is allowed to accumulate before the CPU is given to the next thread at the same priority. The value of timeslice is specified in units of 10-millisecond clock ticks.
Student Notes
Configurable kernel parameters for memory paging enforce operating rules and limits related to virtual memory (swap space).

vps_ceiling is provided as a means to minimize lost cycle time caused by TLB (translation lookaside buffer) misses on systems using newer PA-RISC devices, such as the PA-8000, and the Itanium processor family, which have smaller TLBs and may not have a hardware TLB walker. If a user application does not use the chatr command to specify a page size for program text and data segments, the kernel selects a page size that, based on system configuration and object size, appears to be suitable. This is called transparent selection.

vps_chatr_ceiling: user applications can use the chatr command to specify a page size for program text and data segments, providing some flexibility for improving overall performance, depending on system configuration and object size. The specified size is then compared to the page-size limit defined by vps_chatr_ceiling in the kernel at system boot time. If the value specified is larger than vps_chatr_ceiling, vps_chatr_ceiling is used.

vps_pagesize specifies the default user page size (in kilobytes) that is used by the kernel if the user application does not use the chatr command to specify a page size.

swapmem_on enables or disables the creation of pseudo-swap, swap space designed to increase the apparent total swap space, so that real swap can be used completely and large-memory systems don't need corresponding swap space.

nswapdev specifies an integer value equal to the number of physical disk devices that can be configured for device swap, up to the maximum limit of 25.

nswapfs specifies an integer value equal to the number of file systems that can be made available for file-system swap, up to the maximum limit of 25.

swchunk defines the chunk size for swap. This value must be an integer power of two. When the system needs swap space, one swap chunk is obtained from a device or file system. When that chunk has been used and another is needed, a new chunk is obtained. If the swap space is full, or if there is another swap space at the same priority, the new chunk is taken from a different device or file system, thus distributing swap use over several devices.

maxswapchunks specifies the maximum amount of configurable swap space on the system. In 11i v2 this parameter is obsolete.

page_text_to_local allows NFS clients to write the text segment to local swap and retrieve it later. This eliminates two separate text-segment data transfers to and from the NFS server, thus improving NFS client program performance. This parameter does not seem to be defined in 11i v2, even though it has not been identified as an obsolete parameter.
Student Notes
Two configurable kernel parameters relate to kernel interaction with the Logical Volume Manager.

maxvgs defines the maximum number of volume groups configured by the Logical Volume Manager on the system.

no_lvm_disks is a flag that notifies the kernel when no logical volumes exist on the system, i.e., LVM is disabled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.
(Slide: netmemmax, default 10% of memory)
Student Notes
Two configurable kernel parameters are related to the kernel's interaction with the networking subsystems.

netisr_priority sets the real-time interrupt priority for the networking interrupt service routine daemon. By default, it is set to 1 on uniprocessor systems and 100 on multiprocessor systems. This parameter is obsolete in 11i v2.

netmemmax specifies how much memory is reserved for use by networking for holding partial Internet Protocol (IP) messages, which are typically held in memory for up to 30 seconds. When messages are transmitted using IP, they are sometimes broken into multiple "partial" messages (fragments). netmemmax simply establishes a maximum amount of memory that can be used for storing network-message fragments until they are reassembled. This parameter does not seem to be defined in 11i v2, although it is not identified as an obsolete parameter.
Student Notes
The following parameters are more or less unrelated.

create_fastlinks, when non-zero, causes the system to create HFS symbolic links in a manner that reduces the number of disk-block accesses by one for each symbolic link in a pathname lookup.

default_disk_ir enables or disables immediate reporting. With immediate reporting on, disk drives that have data caches return from a write() system call when the data is cached, rather than returning after the data is written to the media. This sometimes enhances write performance, especially for sequential transfers. In 11i v2, this parameter is set to 0 by default.

maxusers does not itself determine the size of any structures in the system; instead, the default value of other global system parameters depends on the value of maxusers. When other configurable parameter values are defined in terms of maxusers, the kernel is made smaller and more efficient by minimizing wasted space due to improperly balanced resource allocations. In 11i v2, the use of maxusers has been eliminated from the formula of every parameter that depended on it; changing its value has no effect on 11i v2.

ncallout specifies the maximum number of timeouts that can be scheduled by the kernel at any given time. A general rule is to allow one callout per process unless you have processes that use multiple callouts. In 11i v2 this parameter is obsolete.

npty specifies the maximum number of pseudo-tty data structures available on the system.

rtsched_numpri specifies the number of distinct priorities that can be set for POSIX real-time processes running under the real-time scheduler.

unlockable_mem defines the minimum amount of memory that always remains available for virtual memory management and system overhead.
Student Notes
The above slide recaps the characteristics related to the three main performance bottlenecks.
CPU Bottlenecks
CPU bottlenecks often exhibit the following characteristics:
- High CPU usage due to lots of processes competing for the CPU
- Large number of processes in the CPU run queue
- No disk bottleneck problems; disk utilization is low, few to no I/O requests in the disk queues
- No memory bottleneck problems; vhand not needing much, no paging to swap devices
Disk Bottlenecks
Disk bottlenecks often exhibit the following characteristics:
- High CPU usage due to the disk device drivers constantly executing to perform the I/O, and user/system processes continually running to submit the I/O requests
- High disk utilization due to lots of I/O requests being continually submitted
- No memory bottleneck problems; vhand not needing much, no paging to swap devices
Memory Bottlenecks
Memory bottlenecks often exhibit the following characteristics:
- High CPU usage (system) due to vhand constantly running to free memory pages, the kernel spending lots of time in the memory management subsystem, and the device drivers for the disk writing memory pages to and from swap
- High disk utilization due to memory pages being constantly written to and from the swap devices
- High memory utilization (with swapping) due to free memory falling below LOTSFREE, DESFREE, and MINFREE
Given the above recap, in what order should the three main bottlenecks be checked? When arriving on the scene of an unknown system, where do you start? It would be wise to look first for the bottleneck with the most specific symptoms. Since the memory bottleneck is the only one to show signs of memory pressure, look for it first. Once you have eliminated that possibility, look for disk bottlenecks. Finally, look for CPU bottlenecks.
Start glance.

1. Look at the memory utilization bar graph. Is memory utilization > 95%? If yes, is there activity on the swap device? If yes, there is a potential memory bottleneck.
2. If not, look at the disk utilization bar graph. Is disk utilization > 50%? If yes, are there disk I/O requests in the queue? If yes, there is a potential disk bottleneck.
3. If not, look at the CPU utilization bar graph. Is CPU utilization > 90%? If yes, are there requests in the CPU run queue? If yes, there is a potential CPU bottleneck.
4. If not, look for other kinds of bottlenecks, e.g. network.
Student Notes
The above performance monitoring flow chart assumes glance is being used as the performance-monitoring tool. If glance is not available, the same information can be obtained from a variety of other tools, such as sar and vmstat. The flow chart starts by looking for symptoms of a memory bottleneck. Is memory utilization high? Is there activity to the swap device?
Memory bottlenecks are checked for first, since memory bottlenecks often exhibit symptoms of high disk and CPU utilization, which could initially be mistaken for disk or CPU bottlenecks. If the system is not bottlenecked on memory, the second bottleneck checked for through the flow chart is a disk bottleneck. Is disk utilization high? Are there disk I/O requests in the disk queue?
Disk bottlenecks are checked for second, as disk bottlenecks often exhibit symptoms of high CPU utilization, but not high memory utilization. If the system is not bottlenecked on disk, the final bottleneck to check for is a CPU bottleneck. Is CPU utilization high? Are there processes in the CPU run queue?
CPU bottlenecks are checked for after memory and disk bottlenecks, as CPU bottlenecks do not exhibit high memory or disk utilization. If none of these situations appears to exist, then it is time to check the less common bottlenecks. Networks would be a good possibility, but don't neglect other hardware or even software resources, such as file locks and semaphores.
Is there activity on the swap device? Check:
(m) Mem Report - look at VM writes
(d) Disk Report - look at Virt Memory
(v) I/O by LV - look at swap devices
(w) Swap Space - look at Used (ignore pseudo)
Student Notes
The primary symptoms of a memory bottleneck include high memory utilization and activity to the swap device. The glance reports that show activity on the swap device include:
(m) Memory Report - shows current number of VM reads/writes
(d) Disk Report - shows VirtMem I/O
(v) I/O by log. volume - shows I/O to the swap logical volumes
(w) Swap Space Report - shows currently used swap space
Also look at vhand and swapper as processes. Are they accumulating any CPU time? Look at the output of vmstat -S. Are pages being paged out? Are processes being swapped out?
Student Notes
The above slide reviews some of the ways to correct a memory bottleneck:
- Limit the maximum size of the dynamic buffer cache. This can help prevent unnecessary paging during periods when the dynamic buffer cache needs to shrink.
- Identify programs (and users) taking up large amounts of memory, and investigate whether the memory usage is warranted or whether the process has memory leaks.
- Consider using the serialize command to keep several memory-intensive programs from competing with each other.
- Consider using the Process Resource Manager (PRM) or Work Load Manager (WLM) to favor memory allocation to important processes.
- Add more physical memory; this will always help a memory-constrained system.
Are there disk I/O requests in the queue? Check:
(u) I/O by Disk - look at File System activity
(B) Global Waits - look at Blocked on Disk I/O
(d) Disk Report - look at the Logical I/O to Physical I/O ratio
If yes, there is a potential disk bottleneck.
Student Notes
The primary symptoms of a disk bottleneck include high disk utilization and multiple I/O requests in the disk queue. The glance reports that show disk I/O related activity include:
(u) I/O by Phys. Disk - shows current number of reads/writes
(B) Global Waits - shows percentage of processes blocked on Disk I/O
(d) Disk Report - shows Logical I/O and Physical I/O activity
Also check the output of sar -u (%wio), sar -d, and sar -b (for the read cache hit rate and write cache hit rate).
Student Notes
The above slide reviews some of the ways to correct a disk bottleneck:
- Spread the I/O activity, as evenly as possible, over the disk drives and disk controllers.
- Consider using asynchronous I/O so applications do not have to wait for a physical I/O to complete. The trade-off here is a greater exposure to data loss in the event of a system failure.
- For HFS file systems, increase the fragment and file system block size if large files are being accessed in a sequential manner.
- For VxFS file systems, increase the block size to improve read-ahead and write-behind. Consider using a fixed extent size.
- Look at customizing file system mount options (especially for VxFS file systems). Recall that, by default, VxFS is mounted to favor integrity, and HFS is mounted to favor performance.
- Consider using vxtunefs to tune the performance of VxFS. Match the preferred I/O size and read-ahead to the physical stripe depth.
- Verify (and tune) the hit ratio on the file system buffer cache. The ratio of logical reads to physical reads should be a minimum of 10 to 1. The ratio of logical writes to physical writes should be a minimum of 3 to 1.
- Add bigger, better, faster disks and disk controllers.
Is CPU utilization > 90%? If yes, are there processes in the CPU run queue? Check:
(a) CPU by Proc - look at Load Average
(g) Global Report - look at Processes Blocked on priority
Student Notes
The primary symptoms of a CPU bottleneck include high CPU utilization and multiple processes in the CPU run queue. The glance reports that show CPU activity include:
(a) CPU by Processor - shows CPU load average over last 1, 5, 15 minutes
(c) CPU Report - shows CPU activities
(g) Process Report - shows CPU hogs in order (see note)
Note
Make sure you are looking at processes in CPU order. Use the Thresholds page (o) of glance and set CPU as the sort criterion.
Also check sar -u and sar -q. Use the -M option if you have a multiprocessor.
Student Notes
The above slide reviews some of the ways to correct a CPU bottleneck:
- Use the nice or renice commands on lower-priority processes (set the nice value to 21-39). As a rule of thumb, favor I/O-bound programs over CPU-bound programs. I/O-bound programs will block frequently, allowing the CPU-bound programs to run.
- Use the nice or renice command on higher-priority processes (set the nice value to 0-19).
- Use the rtprio or rtsched commands on the highest-priority processes. BE CAREFUL! A poorly written real-time process could take over your system and render it useless.
- Schedule large batch jobs, long compiles, and other CPU-intensive activity for non-peak hours.
- Add an additional CPU or a faster CPU to the system.
Student Notes
Let's summarize the major bottlenecks and their symptoms:

Memory bottleneck: You know that you have a memory bottleneck if both vhand and swapper are active. This indicates severe memory pressure!

Disk bottleneck: A disk bottleneck is characterized by disk utilization of at least 50% and at least 3 requests waiting in the request queue. If a controller is the bottleneck, you will see multiple disks with lengthy queues on that controller. Their utilization may not be 50%! The queues are more important than the utilization.

CPU bottleneck: If all of your CPUs are at least 90% busy and they each have run queues with 3 or more processes in them, you have a CPU bottleneck. If one or more of the processors has empty (or mostly empty) queues, either you are at the limit of your CPU resource, or something is unbalancing the load on your processors.

Network bottleneck: If the ratio between your collisions/sec and your packets-out/sec is greater than 5%, you have a network bottleneck.
As with any bottleneck symptom, it must be a constant condition sustained over time to be considered a true bottleneck. Otherwise, it's a momentary spike, which we will keep an eye on but otherwise ignore.
Objectives
Upon completion of this module, you will be able to do the following: Use case studies to demonstrate how GlancePlus screens can be used to analyze system performance. Observe how a performance specialist approaches a tuning task.
Bottlenecks
A bottleneck is the most common type of problem on any system. It occurs whenever a hardware or software resource cannot meet the demands placed on it, and processes must wait until the resource becomes available. This results in blocked processes and long queues.

Your system handles processes much like a freeway system handles traffic. During normal hours, the freeway adequately carries the traffic load, and cars can travel at optimum speed. But during rush hour, when too many cars try to access the freeway, the lanes become clogged and traffic can slow to a halt. The freeway becomes bottlenecked. Similarly, a bottleneck can occur on your system if the processes you are running need more CPU time than is available or more memory than is configured for the system. A bottleneck also can occur if there isn't enough disk I/O bandwidth to move data, or if swap space isn't configured optimally.

A bottleneck can be a temporary problem that is easily fixed. The solution may be to rearrange workloads, such as rescheduling batch programs to run late at night. Solving a disk bottleneck may require only spreading disk loads among all the available disks.
A recurring bottleneck, however, can indicate a long-term situation that is worsening. Perhaps the system was configured to serve fewer users than are now using it, or workloads have gradually increased beyond the system's capacity. The only solution may be a hardware upgrade, but how do you know? If you can identify a bottleneck correctly, you can avoid randomly tuning the system (which can worsen the problem), and you can avoid adding extra hardware that doesn't help performance. You can also avoid expending resources solving a corollary bottleneck, one that is caused by the primary bottleneck.
Characteristics of Bottlenecks
Common system bottlenecks have several general characteristics or symptoms. By comparing these symptoms with the statistics on your GlancePlus screens, you can analyze the performance of your system and detect potential or existing bottlenecks. Although a single symptom may not indicate a problem, a combination of symptoms generally reflects a bottleneck situation.
You may discover that solving one bottleneck uncovers or creates another. It is possible to have more than one bottleneck on a system. In fact, changing workloads are constantly reflected in changing system performance. The goal is not to seek a final solution, but to seek optimal performance at any given time.
of which is virtual memory I/O. That looks slow to Jose, so he looks at the Wait States screen to find out what the process is waiting on. Jose learns that the process is spending about 7 percent of its time utilizing the CPU (executing), 27 percent of its time waiting for terminal input, and 66 percent of its time waiting on virtual memory. That's a significant amount of time. Jose checks other processes on the system and discovers that they are experiencing similar waits for virtual memory. He realizes that the new application overloads the system's memory. He makes copies of the relevant screens so he can explain the situation to his manager.
Had the process shown a high percentage of being blocked on priority, it would have meant the process was ready to run but was unable to do so because the CPU was being used by processes with higher dispatching priorities.
Yuki would like to spread the disk I/O more evenly among the drives to avoid potential I/O bottlenecks. With that goal in mind, he first checks the Disk Detail screen. It shows that logical disk activity is high. For details, he goes to the Logical Volumes screen, where he notices high write activity on logical volume /dev/vg00/lvol12. Getting out of Glance and into the UNIX shell, he types vgdisplay -v /dev/vg00 to ascertain the physical disk names associated with the volume. Back in Glance, Yuki views the Disk Queue Lengths screen to determine the busiest disks in the volume. Then he checks the Disk Detail screen to find out whether disk activity was caused by system or user activity. Yuki notices that the Virtual Memory physical accesses are low, indicating application rather than system activity. He checks the Open Files screen to find which application was creating so many writes to the disk. Voila! Fred is running his baseball pool again! Yuki pays a visit to Fred. After discussing Fred's I/O needs, Yuki returns to his console to balance the I/O load, using LVM commands to rearrange the logical volumes.

Now it's time to grab your toolbox, pop the hood, and take a look.
Good Luck!
Solutions
Answer: Varies with system configuration, on the order of tens of seconds to minutes. Example output follows from an rp2430 server:

# timex ./long
The last prime number is : 49999

real     3:37.89
user     3:35.68
sys         0.12
Example output follows from an rx2600 server:

# timex ./long
The last prime number is : 99991

real     2:53.24
user     2:51.74
sys         0.06
4. Time the execution of the med program. Make sure there is no activity on the system.

# timex ./med

Record Execution Time:  real: _______  user: _______  sys: _______
Answer: Varies with system configuration; should be about one half of long. Example output follows from an rp2430 server:

# timex ./med
The last prime number is : 49999

real     1:52.68
user     1:51.55
sys         0.08
Example output follows from an rx2600 server:

# timex ./med
The last prime number is : 99991

real     1:33.71
user     1:33.02
sys         0.04
5. Time the execution of the short program. Make sure there is no activity on the system.

# timex ./short

Record Execution Time:  real: _______  user: _______  sys: _______
Answer: Varies with system configuration; should be about one eighth to one tenth of med. Example output follows from an rp2430 server:

# timex ./short
The last prime number is : 49999

real       10.88
user       10.70
sys         0.05
# timex ./short
The last prime number is : 99991

real        8.56
user        8.49
sys         0.03
6. Time the execution of the diskread program.

# timex ./diskread

Record Execution Time:  real: _______  user: _______  sys: _______
Answer: Varies with system configuration, on the order of tens of seconds. Example output follows from an rp2430 server:

# timex ./diskread
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real       28.01
user        0.02
sys         0.53
Example output follows from an rx2600 server:

# timex ./diskread
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real       28.69
user        0.01
sys         0.13
7. In the case of the long, med, and short programs, the real time is (approximately) the sum of the usr and sys time. This is not the case with diskread. Explain why.

Answer: We first assume that there is no other load on the system. In the case of a classic number-crunching CPU hog (long, med, and short are all of this kind) there will be no system calls (except for the final terminal output), and the program only needs CPU time in usr mode.
As there is only one process, there is no waiting. This is shown by the real time being very close to the sum of the sys and usr times for the process. long, med, and short only do calculations and make no call on kernel resources during their execution, so the usr time is very high compared to the sys time. This is not the case for diskread. The program makes very little demand on the CPU, shown by the sum of usr and sys being quite small compared to the real or wall clock time. The huge difference between real time and usr+sys time proves that the program is waiting on disk I/O most of the time. Also note that sys is much higher than usr meaning that the program is bound on system calls (disk I/O) rather than computation when it does execute.
# timex ./short &
[1] 6486
rx2600: root@r265c145:/home/h4262/baseline
# The last prime number is : 99991

real        8.59
user        8.50
sys         0.03
How long did the program take to execute? 8 to 11 secs. How does this compare to the baseline measurement from earlier? A little longer, due to the overhead of sar.
3. Time how long it takes for three short programs to execute.

# timex ./short & timex ./short & timex ./short &

How long did the slowest program take to execute? _____________________
How did the CPU queue size change from the first window? __________________

Answer:

rp2430:

# timex ./short & timex ./short & timex ./short &
[1] 10203
[2] 10205
[3] 10206
# The last prime number is : 49999

real       29.86
user       10.68
sys         0.01

The last prime number is : 49999

real       32.07
user       10.67
sys         0.01

The last prime number is : 49999

real       32.35
user       10.67
sys         0.01

rx2600:

# timex ./short & timex ./short & timex ./short &
[1] 6690
[2] 6692
[3] 6694
# The last prime number is : 99991

real       25.08
user        8.48
sys         0.00

The last prime number is : 99991

real       25.56
user        8.48
sys         0.00
The last prime number is : 99991

real       25.60
user        8.48
sys         0.00
How long did the slowest program take to execute? 25 to 34 secs, around three times longer than one occurrence of the program. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in one-half the time it would take on a single-processor system. Since we're only running three processes here (not including sar), three or more processors would show the same results.

How did the CPU queue size change from the first window? The sar -q output shows that the average CPU queue length (the first field) increases by three times when three programs are run concurrently.

4. Time how long it takes for five short programs to execute.

# timex ./short & timex ./short & timex ./short & \
  timex ./short & timex ./short &

How long did the slowest program take to execute? _________
How did the CPU queue size change from the first window? ________

Answer:

rp2430:

# timex ./short & timex ./short & timex ./short & \
  timex ./short & timex ./short &
[1] 10212
[2] 10214
[3] 10216
[4] 10218
[5] 10220
# The last prime number is : 49999

real       53.98
user       10.68
sys         0.01

The last prime number is : 49999

real       54.08
user       10.68
sys         0.01
http://education.hp.com
Solutions
The last prime number is : real user sys 54.08 10.68 0.01
49999
The last prime number is : real user sys 54.08 10.67 0.01
49999
The last prime number is : real user sys 54.15 10.68 0.01 rx2600:
49999
# timex ./short & timex ./short & timex ./short & \
  timex ./short & timex ./short &
[1] 6737
[2] 6739
[3] 6741
[4] 6743
[5] 6745
The last prime number is : 99991
real 42.52   user 8.49   sys 0.00
The last prime number is : 99991
real 42.56   user 8.48   sys 0.00
The last prime number is : 99991
real 42.59   user 8.48   sys 0.00
The last prime number is : 99991
real 42.75   user 8.48   sys 0.00
How long did the slowest program take to execute?

43 to 54 secs. If you have a multiprocessor, the time will be distributed over the number of processors, with the lower limit being the time a single process would take. For example, if your system had two processors, the slowest process would complete in half the time it would take on a single-processor system. Since we're only running five processes here (not including sar), five or more processors would show the same results.

How did the CPU queue size change from first window?

It increased by 5 while the test was being run.

5. Is the relationship between elapsed execution (real) time and the number of running programs linear?

Answer:

Yes, very much so. The fastest program in the last case (where 5 programs are running) takes five times longer than with one program. You can draw a graph and go to 10 programs if you are unsure! Typing the command with more than 10 occurrences gets a little tedious, but you will find a linear relationship in any case.

6. Comment on the overhead of switching from one process to another.

Answer:

The overhead of task switching is very low. If it were not, the relationship in the above tests would not be linear. If there is an overhead, it looks like we will not see it unless there are hundreds of processes being switched.
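The linear behavior observed above can be summarized as a rule of thumb: with p identical CPU-bound processes on c processors, the slowest finisher takes roughly max(1, p/c) times the single-process run time. A minimal portable sketch (the function name is ours, and the 10.7-second figure is the rp2430 single-process user time from the lab):

```shell
# expected_real SINGLE_SECS NPROCS NCPUS
# Rough elapsed time for the slowest of NPROCS identical CPU-bound
# processes sharing NCPUS processors.
expected_real() {
    awk -v t="$1" -v p="$2" -v c="$3" \
        'BEGIN { s = p / c; if (s < 1) s = 1; printf "%.1f\n", t * s }'
}

expected_real 10.7 1 1   # one process alone: 10.7
expected_real 10.7 3 1   # three processes, one CPU: 32.1 (lab saw ~30-32)
expected_real 10.7 5 1   # five processes, one CPU: 53.5 (lab saw ~54)
```

The sketch ignores context-switch overhead, which the lab results show is negligible at this scale.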
Directions
Set up:

Change directories and execute the setup script:

# cd /home/h4262/tools
# ./RUN
Use glance (or gpm if you have a bit-mapped display), sar, top, vmstat, and any other available tools to answer the following questions. List as many as possible, and include the appropriate OPTION or SCREEN that will give the requested information. Specific numbers are not the important goal of this lab; the goal is to gain familiarity with a variety of performance tools. Always investigate what the basic UNIX tools can tell you before running glance or gpm. You may want to run through this lab with the solution from the back of this book for more guidance and discussion. These results were obtained on a C200 workstation running 11i. Remember, the absolute numbers are not important here, but you should be drawing similar conclusions.

1. How many processes are running on the system? Which tools can you use to determine this?

Answer:

top     Gives the number of running processes in the summary portion of the screen:
        119 processes: 96 sleeping, 17 running, 6 zombies
ps      ps -e | wc -l and subtract 1 for the header and 1 each for ps and wc.
glance  Look at the table screen (t page) and see the current size of the proc table.
sar     sar -v 2 10 -- look at the proc-sz field.
gpm     Gives the count at the top of the Process List report.
2. Are there any real-time priority processes running? If so, list the name and priority. What tools can you use to determine this?

Answer:

syncer, midaemon, lab_proc2, and sometimes swapper. ttisr and prm3d will also be seen on 11.0/11i systems running at priority 32. This is the POSIX real-time range, which is even higher than the normal UNIX real-time priorities.

glance  Global screen, PRI column (turn off all filters)
top     PRI column
gpm     Use the filters to filter priorities <128 (Process List/Configure/Filters)
ps -el  PRI column

(Try this command: # ps -el | grep -v PRI | sort -k 7,7n | more
The highest-priority processes will be listed at the top.)

Remember, a real-time priority is anything less than 128.

3. Are there any nice'd processes on the system? If so, list the name and priority for each. What tools can you use to determine this?

Answer:

glance  Go through each single process screen. (Default is 20; 21-39 is nice; 0 to 19 is "nasty", i.e. anti-nice.)
gpm     Process List; select a process by double-clicking.
top     NI column
ps -el  NI column
(Try this command: # ps -el | grep -v NI | sort -k 8,8n | more
The nasty processes will be listed at the top and the nice'd processes at the bottom.)

On 11i the following were nasty: diagmond, diaglogd, psmctd, memlogd, krsd.
The following were nice: all 6 <defunct> zombie processes (see below) and lab_proc4.

4. Are there any zombie processes on the system? If so, how many are there? What tools can you use to determine this?

Answer:

A zombie is a terminated process whose parent is still running but has not called wait() for the child. Zombies whose parent has terminated should eventually be adopted by the init process, which will issue a wait() on the zombie. Therefore, a zombie whose parent has terminated should eventually disappear.

What resources do zombies consume? Process table entries and memory (<= 20 pages).
top         The number of zombie processes is shown in the summary portion of the screen:
            119 processes: 96 sleeping, 17 running, 6 zombies
glance/gpm  By design, they do not currently report zombies, unless the process entered the zombie state during the interval.
ps -el      Z in the S(tate) column and <defunct> in the Comm(and) column.
5. What is the length of the run queue? What are the load averages? What tools can you use to determine this?

Answer:

glance  CPU screen (c), page 2, shows a RUNNING LOAD AVERAGE, but it has been labeled incorrectly in older versions as "run-queue". The All CPUs screen (a) shows the 1-, 5-, and 15-minute load averages, whereas the CPU screen shows the interval load average.
gpm     CPU button or Reports/CPU Graph shows the interval load average. Reports/CPU Report shows the interval load average. Reports/CPU by Processor shows the 1-, 5-, and 15-minute load averages.
uptime  1-, 5-, and 15-minute load averages.
top     Load averages: 5.39, 5.27, 5.20 and the interval load average.
sar -q  Average run queue size over the interval.
vmstat  The r column is the run queue size over the interval.
xload   10-second load average over time.
The run queue length on the test system was around 5 no matter how it was measured. There is also a hardware-dependent approach that can be used on servers, using the console hex display code.

The HEX display (front panel or console) shows the size of the runQ in the second digit. F31F means there are three processes in the runQ and one CPU. FA1F means 10 or more in the runQ. MPE uses this as a percent utilization number. The runQ is an instantaneous value and can never be a fractional number. The load average is based on the runQ, but includes short sleepers (discussed in the CPU section).
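The distinction between the instantaneous runQ and the load average is that the load average is an exponentially smoothed version of the runQ samples. A minimal sketch of one smoothing step in the classic UNIX style (illustrative only -- this is not the HP-UX kernel source, and the function name is ours):

```shell
# update_load OLD_LOAD RUNQ DT_SECS
# One decay step of a 1-minute load average:
#   new = old * exp(-dt/60) + runq * (1 - exp(-dt/60))
update_load() {
    awk -v old="$1" -v runq="$2" -v dt="$3" \
        'BEGIN { f = exp(-dt / 60); printf "%.2f\n", old * f + runq * (1 - f) }'
}

update_load 0 3 5   # a queue of 3 held for one 5-second tick: 0.24
update_load 1 1 5   # steady state: queue of 1 keeps the average at 1.00
```

This is why a brief burst of runnable processes barely moves the 1-minute figure, while a sustained queue drives the load average toward the queue length.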
6. How many system processes are running? What tools can you use to determine this? NOTE: A system process is defined as a process whose data space is the kernel's data space (e.g. swapper, vhand, statdaemon, unhashdaemon, supsched, etc.). ps reports their size as zero; other tools as below.
There are three ways this can be determined. If you get stuck on this question, move on. Don't spend more than a few minutes trying to answer this question.
Answer:

top     PA-RISC: RES = 16K (32-bit kernel) or 32K (64-bit kernel) per thread. IA-64: RES = 80K per thread.
glance  PA-RISC: 16K (32-bit kernel) or 32K (64-bit kernel) per thread on the global screen. IA-64: 80K per thread on the global screen.
ps -el  The second bit in the F column value indicates a system process. (See the man page for ps.) F column = 3, PPID column = 0, and SZ column = 0.
(Try this command: # ps -el | grep 3 | more
This will list all the system processes. No -- technically, init is NOT a system process.)

This amounts to 17 processes on the test 11i system.

7. What percentage of time is the CPU spending in different states? What tools can you use to determine this?

Answer:

glance  Bar graph; CPU screen (c) displays detailed CPU state information; per-process (S) details per-process CPU utilization. (The display can be toggled between cumulative/interval (C) and percent/absolute (%).)
gpm     Main window, Reports/CPU or Reports/CPU by Processor.
sar -u  user/system/waiting for I/O/idle.
top     user/nice/system/idle/block/swait/intr/ssys (context switch). See /usr/include/sys/dk.h for CPUSTATE (CP_USER, etc.). block is the spinlock percentage (on MP systems only); this figure is obsolete at 11i. swait is the alpha semaphore percentage (on MP systems only); this figure is obsolete at 11i.
vmstat  user/system/idle
iostat  us/ni/sy/id
Always watch the first line in vmstat or iostat. It is the average since bootup. Use vmstat -z to clear the sum structures for vmstat. There is no similar option for iostat.
8. What is the size of memory? What is the size of free memory? What tools can you use to determine this?

Answer:

glance  M(emory) screen: Free Memory, Phys Memory, Avail Memory, Total VM, Active VM, Buf Cache Size
gpm     Reports/Memory Report
vmstat  free (in pages); avm (active virtual memory, in pages) includes on-disk pages
top         Real (real active), virtual (virtual active), and free, in KB.
/etc/dmesg  Amount of physical and available memory.
The memory stats from top are misleading. The values in brackets are figures for processes that are regarded as "busy" -- whatever top means by that. In most utilities, busy or active means that the process is in the RUN state or has executed within the last 20 seconds. The "real" figures are a summation of the resident set sizes for all processes (the sum of the RES field). This is not the amount of physical memory in the system. The only way to get the true physical memory in the system is through glance/gpm or dmesg. The boot info seen in dmesg with the physical memory figure will be lost if general console messages (e.g. "file system full") have overwritten the limited buffer space.

vmstat figures are generally accurate, with the free field agreeing well with glance/gpm and top. Remember, top reports memory in 1K units while vmstat reports memory in 4K pages, so multiply the vmstat figures by 4 to compare them to top.

9. What is the size of the swap area(s)? What is the percentage of swap utilization? What tools can you use to determine this?

Answer:

glance  Bar graph (reserved/used); w(swap) screen: %used (device and filesys), MB reserved by swap device, MB available, and MB used.
gpm     Reports/Swap Space (glance w); Reports/System Table Info/System Tables Graph Report. NOTE: The graph shows the high water mark (nice!).
sar -w    N/A (only shows the size of the swap queue and swapping rates)
vmstat    N/A (only shows paging and swapping rates)
top       N/A
swapinfo  KB avail/used/free/% used by swap device; file system swap space used/avail
swapinfo can be misleading unless you know what you are looking at. To remove the confusing issue (pseudo swap), enter swapinfo -t, which correctly accounts for pseudo swap in the total figures. This will be explained in detail in the module on swap space management.
# swapinfo -t
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI NAME
dev      524288   11276  513012    2%       0       -   1 /dev/vg00/lvol2
reserve       -  204360 -204360
memory   180940   84488   96452   47%
total    705228  300124  405104   43%       -       0   -
Note that we have used 43% of our swap space and not 2%.
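The total-row percentage can be reproduced from its USED and FREE columns (figures taken from the swapinfo output above), which shows why the total line, not the dev line, is the one to read:

```shell
# PCT USED for the 'total' row: USED / (USED + FREE), rounded to a whole percent.
used=300124
free=405104
awk -v u="$used" -v f="$free" \
    'BEGIN { printf "%.0f%%\n", 100 * u / (u + f) }'   # prints 43%
```

The dev row's 2% only describes the physical swap device; reserved and memory (pseudo) swap carry the rest of the commitment.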
10. What is the size of the kernel's in-core inode table? How much of the inode table is utilized? What tools can you use to determine this?

Answer:

glance  t(able) screen, page two
gpm     Reports/System Table Info/System Tables Report (NOT Graph)
sar -v  used/size/overflows
Remember that the inode cache may contain entries for files that are closed, if it doesn't need to flush them out to open a new file. Its size is the maximum number of unique files that can be open system-wide.

11. Are there any CPU-bound processes running (processes using a lot of CPU)? If so, what is the name of the process? What steps did you take to determine this?

Answer:

glance  Global screen and single process screen
gpm     Can sort by CPU utilization and can filter by > 0
top     Automatically lists the processes by CPU utilization
ps -el  CPU hogs often have large C counts (<= 255)
(Try this command: # ps -el | grep -v C | sort -k 6,6n
This will list the most active processes at the end.)

lab_proc5 and lab_proc3 are the main CPU users. They are consuming close to 100% of the CPU between them. This is not normal behavior!

12. Are there any processes running which are using a lot of memory? (A "lot" is relative, i.e. a large RSS size compared to other processes.) If so, what is the name of the process? What steps did you take to determine this? Is memory utilization changing?

Answer:

glance  Global screen: RSS; per-process screen: RSS and VSS sizes
gpm     Reports/Process List (glance g)
top     SIZE (KB: text/data/stack), RES (KB: resident size)
ps -el  SZ column (size in pages of the core image, including only text + data + stack)
(A similar ps -el sort on the SZ column will list the largest processes at the end.)

lab_proc1 has a much larger SZ (ps -el output) than most other processes. This program is 8 MB in core and could be regarded as a memory hog. Remember that SZ is in pages; multiply by 4K to get the actual figure.

13. Are there any processes running which are doing any disk I/O? If so, what is the name of the process? What steps did you take to determine this? What are the I/O rates of the disk-bound processes? What files are open by this (these) process(es)?

NOTE: No processes are really doing a lot of physical disk I/O. However, lab_proc3 is doing a LOT of logical I/O.
Answer:

glance  The i screen will periodically show lab_proc3 as the largest disk user. The s(ingle) process screen, open files, will show the actual open files and offsets, which MIGHT be indicative of the amount of I/O.
gpm     Reports/Process List
iostat  Reports physical disk I/O for the system overall.
sar -b  Reports and compares logical I/O to physical I/O.
Notice sar -b reporting very high logical read I/Os. The lab_proc3 process is very busy with disk reads, but the system has cached all the data in the buffer cache, preventing physical disk I/O.
# sar -b 2 2
HP-UX workstn B.11.11 U 9000/782    01/22/01

15:39:38  bread/s  lread/s  %rcache  bwrit/s  lwrit/s  %wcache  pread/s  pwrit/s
15:39:40        0    19646      100        1        2       60        0        0
15:39:42        0    21454      100        0        3       83        0        0
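The %rcache column is simply the fraction of logical reads satisfied from the buffer cache without a physical read, so it can be recomputed from bread/s and lread/s (figures from the first sar -b sample above):

```shell
# Read cache hit ratio: %rcache = (lread - bread) / lread * 100
bread=0
lread=19646
awk -v b="$bread" -v l="$lread" \
    'BEGIN { printf "%.0f\n", 100 * (l - b) / l }'   # prints 100
```

With bread/s at zero, every one of the ~20,000 logical reads per second is a cache hit -- exactly the situation described for lab_proc3.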
14. What is the current rate of semaphore or message queue usage? What tools can you use to determine this?

Answer:

sar -m  The ONLY tool to show message and semaphore ops/sec.
glance  The single process screen shows messages sent/received.
Semaphore and message usage was effectively zero in the lab test, as none of the test programs manipulate semaphores or messages. These resources will be covered in a later module. Relational databases (Oracle, Informix, Sybase, etc.) are big users of such resources.

# sar -m 2 2
HP-UX workstn B.11.11 U 9000/782    01/22/01

15:41:58    msg/s  sema/s
15:42:00     0.00    3.98
15:42:02     0.00    2.00
15. Is there any paging or swapping occurring? What tools can you use to determine this?

Answer:

glance     m(emory) screen: page faults, paging requests, KB paged in, KB paged out, deactivations/reactivations, KB swapped in, KB swapped out, VM reads, VM writes
gpm        Reports/Memory Graph (or Memory button): page OUTS, swap OUTS; Reports/Memory Report: in/out/etc.
sar -w     Swapping only
vmstat     Paging (pi/po)
vmstat -S  Swapping (si/so) & paging (pi/po)
In terms of simple UNIX commands, vmstat is the way to go. The sar command does not understand paging (more in the module on memory management!) and is measuring the swap rate only. See the pi and po fields below from vmstat. This system is not paging at all so we can be confident that there is no swapping activity.
# vmstat 2 3
procs          memory               page                      faults       cpu
 r  b  w    avm   free  re at pi po fr de sr   in    sy   cs  us sy id
 4  0  0  78478   1593   6  1  0  0  0  0  0  108  2096  171   3  2 94
 4  0  0  78478   1552   2  0  0  0  0  0  0  117
glance/gpm do give good paging and swapping detail on the m screen, and the data should tie in with vmstat.

16. What is the system call rate? What tools can you use to determine this?

Answer:

glance  CPU screen, page 2; single process screen (s), then (L): reads/writes/opens/closes/ioctls/forks/vforks/messages sent and received
gpm     Reports/CPU Report; Reports/Process List, select for single process screen
sar -c  total/reads/writes/forks/execs
vmstat  first sy column (under faults)
sar and vmstat give good data that should agree here. The system call rate can be used as an indication of how busy your system is once you have established the normal range for its value. There is no absolute good or bad figure as this depends on:
a) How powerful your system is.
b) How many CPUs you have.
c) What processes you are running.

Example live data from the test system (C200 running 11i):
# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:03:24  scall/s  sread/s  swrit/s  fork/s  exec/s   rchar/s  wchar/s
16:03:26     4417     1205     1188    0.00    0.00  85899336     2038
16:03:28     4623     1252     1249    0.00    0.00  88524288     4096
Note the system is doing over 4,000 system calls per second (scall/s), over half of which can be attributed to reads (sread/s) and writes (swrit/s). See how dramatically this number can be changed by adding a simple extra process (you might like to try this while monitoring sar -c).
# dd if=/stand/vmunix of=/dev/null bs=64 &
# sar -c 2 10
HP-UX workstn B.11.11 U 9000/782    01/22/01

16:10:37  scall/s  sread/s  swrit/s  fork/s  exec/s    rchar/s  wchar/s
16:10:39    21528    10461     9586    4.98    4.98  172712128    16302
16:10:41    19882     7790     6878    5.00    5.00  148155392     7168
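The claim that over half of the baseline system calls are reads and writes can be checked directly from the first sar -c sample (4417 scall/s, 1205 sread/s, 1188 swrit/s):

```shell
# Share of total system calls accounted for by read() and write().
scall=4417
sread=1205
swrit=1188
awk -v t="$scall" -v r="$sread" -v w="$swrit" \
    'BEGIN { printf "%.1f%%\n", 100 * (r + w) / t }'   # prints 54.2%
```

Running the same arithmetic on the dd sample (21528, 10461, 9586) gives about 93%, which is why a single small-block dd so dramatically inflates the system call rate.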
The glance/gpm tools become invaluable when you want to know which processes are hitting the system with so many system calls. More on this later.

17. What is the buffer cache hit ratio? What tools can you use to determine this?

Answer:

sar -b  Read and write hit ratios
glance  Disk screen (d), page 2, shows both ratios
gpm     Reports/Disk Report shows both ratios
See answer 13 for example output of sar -b.

18. What is the tty I/O rate? What tools can you use to determine this?

Answer:

sar -y
iostat -t

The quickest tool to use here is iostat -t. In general, modern system administrators care less and less about terminal I/O, as almost all users connect to application servers and services over LAN networks. An exception to this rule would be the case of modems. A
system with multiple modems may experience a "modem storm", with meaningless data being fired at the host by a bad modem line. iostat -t will catch this problem as a high tin (tty characters read per second) value.
# iostat -t
      tty                cpu
 tin  tout       us  ni  sy  id
   0     5        3   1   3  94
19. Are there any traps (interrupts) occurring? What tools can you use to determine this?

Answer:

vmstat -s   NOTE: Traps since bootup (you should probably zero the counters out first).

Examples of trap calls: page faults, overflow/underflow (integer and floating point), HPMC/LPMC, floating point emulation traps.
When traps occur, the normal flow of a program is interrupted, and work is done to take care of a problem before normal program instructions can continue. For example, trying to access a data page which is not in memory and is out on disk results in a page fault, causing the execution of the program to stop while it waits for the required data to come in from disk. As with system call rates (see question 16), there is no absolute good or bad figure, but you are advised to monitor the trap rate as a sanity reference. Clear the vmstat counters with vmstat -z first. Note that the numbers seen have been generated in the time between the two vmstat commands! Below is example data from our C200 running 11i; the parameter list has been reduced.
# vmstat -z # vmstat -s 12 swap ins 12 swap outs 0 pages swapped in 0 pages swapped out 8636 total address trans. faults taken 2633 page ins 0 page outs 20 pages paged in 0 pages paged out 6594 cpu context switches 7640 device interrupts 11335 traps 153724 system calls
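The counters reported by vmstat -s are cumulative, so a rate comes from differencing two snapshots and dividing by the elapsed interval. A minimal sketch (the first trap count is from the lab output above; the second count and the 10-second interval are illustrative values, not from the lab system):

```shell
# Rate from two cumulative counter snapshots.
t0_traps=11335    # first 'vmstat -s' reading (from the lab output)
t1_traps=13551    # second reading, taken later (illustrative value)
interval=10       # seconds between the two readings (illustrative)
awk -v a="$t0_traps" -v b="$t1_traps" -v dt="$interval" \
    'BEGIN { printf "traps/sec: %.1f\n", (b - a) / dt }'   # prints traps/sec: 221.6
```

The same subtraction works for any of the vmstat -s counters (page ins, context switches, system calls), which is how a "since boot" listing is turned into a usable per-second figure.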
20. What information can you collect about network traffic? What tools can you use to determine this?

Answer:

glance   l(an) screen: packets in/out, collisions, errors. NFS global screen (N): rd/wr rates, calls, response time, etc. NFS by system (n): reads/writes/response time by system, for both client and server requests.
gpm      Reports/LAN Graph (or NW button): packets in/out per second; Reports/Network by LAN; Reports/NFS Global Activity; Reports/NFS by System; Reports/NFS by Operation.
netstat  Sockets in use; -m memory buffers in use (NOTE: no longer works in HP-UX 11.X); -i packets in/out, errors in/out, collisions by interface; -s packets, bytes, retransmissions, duplicates, acks, checksum errors, timeouts, etc.; -rs routing statistics.
nfsstat  Server rpc and nfs stats; client rpc and nfs stats.
A later module will cover networking performance issues in more detail. The most important performance metric in a CSMA/CD (Ethernet or 802.3) network is the collision rate. This is available in glance/gpm, and in the Network module we will learn how to extract this data using lanadmin.

21. What information can be gathered on CPUs in an SMP environment? What tools can you use to determine this?

Answer:

glance       a(ll CPU) detail: utilization and load averages by CPU
gpm          Reports/CPU Info/CPU by Processor: util, load avg, CS rate, fork rate, last PID
sar -Mu/-Mq  Utilization/queue lengths by CPU (with -u or -q)
top          CPU number on which a process is assigned, and utilization per CPU
sar -M output has changed at 11i. The output of sar -M will look identical to sar -u (or -q) if the system has only one CPU. For MP systems you are presented with the sar -u or -q data on a per-CPU basis. This becomes helpful in measuring the balance of processes across processors. top displays its MP information by default, giving the CPU reference for each process. This information is hidden if there is only one processor or if the -h option is used. The a page of glance also measures the balance per CPU and indicates the last process to run on any given CPU. Very useful.
22. What information can be gathered on logical volumes? What tools can you use to determine this?

Answer:

glance                           v(LVM) screen: reads/writes/MWC hits and misses by LV or VG
gpm                              Reports/Disk Info/I/O by LV
vgdisplay, lvdisplay, pvdisplay  General information on volume groups, logical volumes, and physical volumes; use -v for details
bdf, mount                       Information on file systems on logical volumes
Physical disk layout (the positioning of data on disk) is important for performance. The lvdisplay -v and pvdisplay -v commands are the best way of finding out the precise layout of logical volumes on physical disks. In a later module we will look in detail at the mirroring and striping techniques used to manipulate physical disk layout to our advantage.

23. What information can be gathered on disk I/O? What tools can you use to determine this?

Answer:

glance  d(isk): logical/physical reads/writes, user/VM/system/raw/NFS; i(o): by file system, logical/phys/VM; u(queue): queue length and utilization by spindle; v(LV): see above
gpm     Press the disk bottleneck button (queue); Reports/Disk Info/Disk Report (glance d); I/O by Disk (glance u + type [phys, logl, VM, FS, System, RAW]); I/O by FS (glance i + blocksize, util, logl, sys, VM); I/O by LV (glance v)
iostat  KB/sec, seeks/sec, millisec/seek by spindle (NOTE: millisec/seek is no longer reported -- it permanently reports 1.0)
sar -d  %busy, average queue, I/Os per sec, blocks/sec, average wait time, average service time
iostat is a redundant tool because its data is not as useful or as accurate as that obtained from sar -d. The most important place to start looking for disk I/O info lies with the disks themselves. sar cannot understand LVM layouts and only sees the disk as a whole. Use glance/gpm on the individual disks once you have identified them with sar -d. Below is some example data collected at the start of the tools lab. Stop the lab with ./KILLIT and start it again with ./RUN to see some disk I/O.
# ./KILLIT Killing the lab procs Removing the files # ./RUN cc -wall +DAportable cpu_hog.c -o cpu_hog cc -wall +DAportable vm_bnd.c -o vm_bnd cc -wall +DAportable io_bnd.c -o io_bnd
HP-UX workstn B.11.11 U 9000/782    01/22/01

17:22:54   device  %busy  avque  r+w/s  blks/s  avwait  avserv
17:22:56   c0t6d0   3.50   0.50      6     160    3.55    9.24
17:22:58   c0t6d0   2.50   0.50      4      84    1.90   10.75
17:23:00   c0t6d0   3.00   0.50      5      84    2.57    8.75
17:23:02   c0t6d0   5.50   0.50      6     159    3.76   14.07
Average    c0t6d0   3.62   0.50      5     122    3.04   10.90
We would not consider 3-5% busy as being a bottleneck here. We will see much higher disk loads later!
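A useful figure to derive from sar -d is the average total response time per request: the queue wait (avwait) plus the device service time (avserv). Using the Average row from the sample above:

```shell
# Average time a request spends at the device: queue wait + service time.
avwait=3.04    # ms spent queued (from the sar -d Average row)
avserv=10.90   # ms of device service time
awk -v w="$avwait" -v s="$avserv" \
    'BEGIN { printf "%.2f ms\n", w + s }'   # prints 13.94 ms
```

At ~14 ms per request and only 3-5% busy, the disk is clearly idle most of the time; on a saturated disk, avwait grows much faster than avserv, and this sum is the first number to watch.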
Start a program from another window:

# cd /home/h4262/cpu/lab1
# ./RUN &

4. Main Window.

Below each graph within the GlancePlus main window, you will find a button. These buttons display the status color of adviser symptoms. This is a powerful feature of GlancePlus that we will investigate later. Clicking on one of these buttons displays details of that particular graph. To view the adviser symptoms from the main window, select:

Adviser -> Edit Adviser Syntax

This will display the definitions of the current symptoms being monitored by GlancePlus. Close the Edit Adviser Syntax window.
View CPU details:

Click the CPU button. To view a detailed report regarding the CPU, select:

Reports -> CPU Report

Select:

Reports -> CPU by Processor

This is a useful report, even on a single-processor system.

5. On-Line Help.

One method for accessing online help within GlancePlus is to click on the question mark (?) button. The cursor changes to a "?". Click on the column heading NNice CPU %. This opens a new window describing the NNice CPU % column. View descriptions for other columns, including SysCall CPU %. When finished viewing online help for columns, click on the question mark one more time. This returns the cursor to normal.

6. Alarms and Symptoms.

A symptom is some characteristic of a performance problem. GlancePlus comes with predefined symptoms, or the user can define his own. An alarm is simply a notification that a symptom has been detected. From the main window, select:

Adviser -> Symptom History

For each defined symptom, a history of that particular symptom is displayed graphically. The duration is dependent on the glance history buffers, which are user-definable. Close the window.

Click on the ALARM button in the main window. This displays a history of all the alarms that have occurred since GlancePlus was started. Up to 250 alarms can be displayed. Close the window.

7. Process Details.

Close all windows except for the main window. Select:

Reports -> Process List

This shows the "interesting" processes on the system (interesting in terms of size and/or activity). To customize this listing, select:

Configure -> Choose Metrics
This will display an astonishing number of metrics which can be chosen for display in this report. This is also a quick way to get an overview of all of the process-related metrics available in GlancePlus. Note that the familiar ? button is also available from this window. Use the scroll bar to find the metric PROC_NICE_PRI. Select this metric and click on OK. Close this window by clicking on OK.

8. Customizations.

Most display windows can be customized to sort on any metric and to arrange the metrics in any user-defined order. To define the sort fields, select:

Configure -> Sort Fields

The sort order is determined by the order of the columns. Placing a particular metric into column one makes it the first sort field. If multiple entries have the same value within this field, then the second column is used to determine the order between those entries. If further sorting is needed, then the third column is used, and so forth down the line.

To sort on cumulative CPU percentage, click on the column heading CPU % Cum. The cursor will become a crosshair. Scroll the window back to column one, and click on column one. This makes CPU % Cum the first sort field. Arrange the sort order so that CPU % is followed by CPU % Cum. Click Done when finished. This sort order is automatically saved, so that the next time processes are viewed this will remain the sort order.

In a similar fashion, the order of the columns can also be arranged. To define the column order, select:

Configure -> Arrange Columns

Select a column to be moved (for example, CPU % Cum). The cursor will become a crosshair. Scroll the window to the location where the column is to be inserted. Click on the column where the column is to be inserted. Arrange the first four columns to be in the following order: Process Name, CPU %, CPU % Cum, Res Mem. Click Done when finished. This display order is automatically saved, so that the next time processes are viewed this will remain the display order.

9. More Customizations.
It is possible to modify the definition of interesting processes by selecting:

Configure -> Filters

An easy way to limit the processes shown is to AND all the conditions (the default is to OR the conditions). In the Configure Filters window, select AND logic, then click on OK. A much smaller list of processes should be displayed.

Return to the Configure Filters window. Modify the filter definition for CPU % Cum as follows:

Change Enable Filter to ON
Change Filter Relation to >=
Change Filter Value to 3.0
Change Enable Highlight to ON
Change Highlight Relation to >=
Change Highlight Value to 3.0
Change Highlight Color to any LOUD color
Reset the logic condition back to OR, then click OK. Verify the filter took effect.

10. Administrative Capabilities.

There are two administrative capabilities within GlancePlus. If working as root, processes in the Process List screen can be killed or reniced. In the Process List window, select the proc8 process. To access the admin tools, select:

Admin -> Renice

Use the slider to set the new nice value for this process to +19, then click OK. Note the impact on this process.

Now, select the proc8 process again. Select:

Admin -> Kill

Click OK, and note the process is no longer present.

11. Process Details.

Detailed metrics can be obtained on a per-process basis. To view process details, go to the Process List window and double-click on any process. Much of the detail in this report will be explained in the Process Management section of the course. The Reports menu provides much valuable information about the process, including the Files Open and the System Calls being generated. After surveying the information available through this window, close it and return to the main window.

There are many other features available in GlancePlus; there are close to 1000 metrics available with it. Notice that when you iconify the GlancePlus main window, all of the other windows are closed and the GlancePlus active icon is displayed. Alarms and histograms are displayed in this active icon. Exploding this icon will again open up all previously open windows.

12. Exit GlancePlus.

From the main window, select:

File -> Exit GlancePlus
13. Glance, the ASCII Version.

From a terminal window which has not been resized, type glance.

NOTE: Never run glance or gpm in the background.
If you are accessing the ASCII version of glance from an X terminal window, make sure you start up an hpterm window to enable the full glance softkeys. Do not resize the window, as ASCII glance expects a standard terminal size. You can make the hpterm window longer, but never wider; however, making it longer is frequently of no use.

# hpterm &

In the new window:

# glance

Display a list of keyboard functions by typing ?. This brings up a help screen showing all of the command keystrokes that can be used from the ASCII version of GlancePlus. Explore these to familiarize yourself with the interface.

14. Display Main Process Screen.

Type g to go to the main process screen. This lists all interesting processes on the system. Retrieve online help related to this window by typing h, which brings up a help menu. Select:

Current Screen Metrics

Use the cursor keys to select CPU Util.

NOTE: This metric has two values. Use the online help to distinguish the difference between the two values. Use the space bar or the Page Down key to toggle to the next page of help.

Exit the online help CPU Util description by typing e. Exit the Screen Summary topics by typing e. From the main Help menu, select:

Screen Summaries

Use the cursor keys to select Global Bars. From this help description, explain what R, S, U, N, and A mean in the CPU Util bar.

Exit the online help Global Bars description by typing e. Exit the Screen Summary topics by typing e. Exit the main Help menu by typing e. At any time, you can exit help completely, no matter how deep you are, by pressing the F8 key.
15. Modify Interesting Process Definition. From the main Process List window, (select g). View the interesting processes. What makes these processes interesting? Type o and select 1 (one) to view the process threshold screen. Cursor down to the Sort Key field, and indicate to sort the processes by CPU usage. Before confirming the other options are correct, note that any CPU usage (greater than zero), or any disk I/Os will cause the process to be considered interesting. Run the KILLIT command to stop all lab loads. 16. Glance Reports. This is the free form part of the lab. Spend the rest of your lab time going through the various Glance screens and GlancePlus windows. Use the table below to produce the different performance reports. Feel free to use this time to ask the instructor "How Do I . . .?" types of questions. Glance
COMMAND  FUNCTION
*a       All CPUs Performance Stats
b        Back one screen
*c       CPU Utilization Stats
*d       Disk I/O Stats
e        Exit
f        Forward one screen
*g       Global Process Stats
h        Help
*i       I/O by Filesystem
j        Change update interval
*l       Lan Stats
*m       Memory Stats
*n       NFS Stats
o        Change Threshold Options
p        Print current screen
q        Quit
r        Redraw screen
*s       Single process information
*t       OS Table Utilization
*u       Disk Queue Length
*v       Logical Volume Mgr Stats
*w       Swap Stats
y        Renice process
z        Zero all Stats
!        Shell escape
?        Help with options
<CR>     Update screen data
GlancePlus (gpm)
"REPORT" CPU by Processor
Process List I/O by Filesystem Network by LAN Memory Report NFS Report
Process List, double-click process System Table Report Disk Report,double-click disk I/O by Logical Volume Swap Detail Administrative Capabilities
The CPU should be balanced between the seven processes, with each getting around 14% of the CPU (i.e., 5/7 seconds each for a 5-second interval and 10/7 seconds each for a 10-second interval). This is seen in the CPU Util field of the main glance window. Notice that the programs all have similar priorities, around 248-249, which is towards the bottom of the pile. If you have a multiprocessor, the processes will quickly distribute themselves among all available processors. However, the overall metrics should stay the same, with the exception of the overall length of time that the processes take.
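The per-process CPU share quoted above can be sketched with a quick arithmetic check (a sanity check only, not part of the lab programs):

```shell
# seven equal-priority CPU-bound processes on one processor share
# the CPU evenly: each gets 100/7 percent of any interval
NPROCS=7
echo $((100 / NPROCS))    # integer percent per process -> 14
```

This matches the approximately 14% seen in the glance CPU Util field.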
How are the processes being context switched (forced or voluntary)? ______________
Answer:
Select one of the long processes using the glance s key. Make sure the PID being suggested is the right one, or enter the correct PID. In the first column of info you will find the Forced CSwitch and Voluntary CSwitch metrics. You will notice that (almost!) all context switches are forced when you compare the two figures. This is normal for a CPU-hog process: it never leaves the CPU of its own accord and is always told to leave by the scheduler. We saw 7.7-9.6 context switches per second for each of the processes on an rp2430. All of the context switches were forced. On a multiprocessor, there would be the same number of context switches taking place; however, fewer processes would be sharing the same processor. How many times over the interval is the process being dispatched? ___________
Answer:
Again, we can look to the first column of the selected process resource summary page. Find the Dispatches metric. This is a measure of how often the process is getting onto the CPU with the summation of Forced CSwitch + Voluntary CSwitch measuring how often the process gets switched out. On a multiprocessor, each processor would have fewer processes wanting its resource, so, each process would be selected more often. What is the ratio of system CPU time to user CPU time? ____________
Answer:
Look to the first column of the selected process info again and you will find the System CPU metric. This will be zero or close to zero. By using the C (upper case) key, we can switch between metrics for the last interval (10 seconds if you are following the solutions) and the total over the period of tracking. Either way you look at it, these processes do not make system calls. They are typical CPU hogs that crunch numbers and do nothing else. All the CPU is User/Nice/RT. What are the processes being blocked on? __________________
Answer PRIority
The most frequent event that is blocking the process is shown by the Wait Reason metric at the bottom of the first column of Process Resource info (the same page we have been looking at all along). In this case it is PRI, short for Priority.
The process has been blocked because it is timeslicing with all the other processes. Each time it is switched out, it is placed at the end of the queue in true round-robin fashion. Thus, it is no longer the most eligible process to run, and the scheduler has chosen another. For more stats, go to the Wait States page for this process (softkey F2, or hit W); notice that the process is blocked on Priority for 80-90% (6/7) of the time, and the rest of the time it is on the CPU. There are no other active wait states. The seven long processes are in a circular fight to get to the top of the pile(s). What are the nice values for the processes? _______
Answer 24
A Bourne-based shell (Bourne, Korn, Posix, bash) always places background processes at a nice level 4 higher than the calling shell. The standard nice value of our shell is 20 so the child background jobs inherit 24 as the nice value. One exception is the C shell which runs background processes at the same nice value as the shell. 4. Select one of the processes and favor it by giving it a more favorable nice value. What is the PID of the process being favored? ____________
Answer:
To change the process's nice value, enter:

# renice -n -5 <PID of selected process>

Be careful! This forces a negative offset of 5 from 20 (the standard nice value), not from the current nice value (24). The nice value in this case will end up at 15, which is more favorable than the others, still at 24. Watch that process's percentage of the CPU over several display intervals with glance or top. What effect did it have on the process? _____________________________ _______________________________________________________________________
Answer:
The effect on the process is that it will race away from the others, consuming approximately 50-60% of the CPU! This might take a little time to settle down at 50-60%. Give it several intervals to complete its adjustment.

5. Select another long process and set the nice value to 30.

# renice -n 10 <PID of another selected process>

What effect did that have on that process? ___________________________________ ______________________________________________________________________
Answer:
This really turns the process into a loser! The priority of the process drops to 251-252, preventing the process from getting much action. If you select the process and look in the first column of the Process Resource page, you will see that it is being dispatched, but not very often. You will see the process getting less than 2% of the CPU, but not much more. Each of the other processes will take up the excess, with the majority of the excess going to the process with the nice value of 15.

6. You can either let the processes finish up on their own as the next module is covered, or you can kill them now with:

# kill $(ps -el | grep long | cut -c18-22)
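The renice arithmetic used in steps 4 and 5 can be sketched as a quick check (the value 20 is the standard default nice, as described above):

```shell
# HP-UX renice -n offsets from the default nice value (20),
# not from the process's current nice value
BASE=20
echo $((BASE - 5))     # renice -n -5  -> nice 15
echo $((BASE + 10))    # renice -n 10  -> nice 30
```

This is why the favored process ends up at nice 15 and the penalized one at nice 30, even though both started at 24.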
5-24. LAB: CPU Utilization, System Calls, and Context Switches Directions
General Setup
Create a working data file in a separate file system (on a separate disk, if possible). If another disk is available:

# vgdisplay -v | grep Name     (note which disks are already in use by LVM)
# ioscan -fnC disk             (note any disks not mentioned above; select one)
# pvcreate -f <raw disk device file>
# vgextend vg00 <block disk device file>
In either case:

# lvcreate -n vxfs vg00
# lvextend -L 1024 /dev/vg00/vxfs <block disk device file>
# newfs -F vxfs /dev/vg00/rvxfs
# mkdir /vxfs
# mount /dev/vg00/vxfs /vxfs
# prealloc /vxfs/file <75% of main memory in bytes>
The lab programs are under /home/h4262/cpu/lab0:

# cd /home/h4262/cpu/lab0

The tests should be run on an otherwise idle system; otherwise, results are unpredictable. If the executables are missing, generate them by typing:

# make all
# timex dd if=/stand/vmunix of=/dev/null bs=2k

real __________   user __________   system __________

# timex dd if=/stand/vmunix of=/dev/null bs=64

real __________   user __________   system __________

Answer:
Results for an rp2430: # timex dd if=/stand/vmunix of=/dev/null bs=64k 282+1 records in 282+1 records out real user sys 0.04 0.00 0.03
# timex dd if=/stand/vmunix of=/dev/null bs=2k 9055+1 records in 9055+1 records out real user sys 0.15 0.02 0.12
# timex dd if=/stand/vmunix of=/dev/null bs=64 289765+1 records in 289765+1 records out real user sys 3.82 0.56 2.95
Results for an rx2600: # timex dd if=/stand/vmunix of=/dev/null bs=64k 728+1 records in 728+1 records out real user sys 0.03 0.00 0.03
# timex dd if=/stand/vmunix of=/dev/null bs=2k
23299+1 records in
23299+1 records out

real    0.18
user    0.02
sys     0.13

# timex dd if=/stand/vmunix of=/dev/null bs=64
745575+1 records in
745575+1 records out

real    4.57
user    0.54
sys     3.39
Notice that the last case is much slower due to the number of system calls being made. The block size is a factor of 1024 smaller than in the first case, causing roughly 1000 times more calls to the read() and write() system calls. Try a sar -c 2 10 in another window while the test is being run to see this effect. None of these effects have anything to do with physical disk I/O, as the whole vmunix file is coming from the buffer cache. Prove this to yourself with a sar -b 2 10 while the test is being run. Notice the 100% read cache hit rate.
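The inverse relationship between block size and syscall count can be sketched with a scratch file (the path and 1 MiB size are illustrative only, not the vmunix file used in the lab):

```shell
# write a 1 MiB scratch file, then compute how many read() calls dd
# must issue at each block size: the count scales inversely with bs
dd if=/dev/zero of=/tmp/blk.dat bs=1024 count=1024 2>/dev/null
SIZE=$((1024 * 1024))
echo $((SIZE / 65536))    # reads at bs=64k -> 16
echo $((SIZE / 64))       # reads at bs=64  -> 16384
rm -f /tmp/blk.dat
```

A 1024-times smaller block size means 1024 times as many read()/write() calls for the same amount of data, which is exactly the overhead sar -c makes visible.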
1. What is the system call rate when your system is "idle"? ________________
Answer: Around 400-500 on our test systems.

(rp2430)
# sar -c 2 2
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:18:56  scall/s sread/s swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:18:58      602       3       1    0.00    0.00   203272     8151
11:19:00      264       4       1    0.00    0.00     4096      512
Average       434       3       1    0.00    0.00   103741     4341
(rx2600)
# sar -c 2 2
HP-UX r265c145 B.11.23 U ia64    04/06/04

10:57:02  scall/s sread/s swrit/s  fork/s  wchar/s
10:57:04      719       3       1    0.00        0
10:57:06      434       3       1    0.00     4096
2. Run filestress in the background. What is the system call rate now? What system calls are generated by filestress? Take an average with sar over about 40 seconds, i.e.:

# sar -c 10 4

Answer: Around 20000-30000 on our test systems.
(rp2430)
# sar -c 10 4
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:19:43  scall/s sread/s swrit/s  fork/s  exec/s   rchar/s  wchar/s
11:19:53    17423    3112    1158  130.07  130.07  29710218   147104
11:20:03    12420    3577    2627   63.40   63.40  32159540     8192
11:20:13    23240    4227    1337  192.60  192.60  39581900    17818
11:20:23    26279    3884     700  212.10  212.00  40309248   134963
Average     19840    3700    1456  149.54  149.51  35438766    77037
(rx2600)
# sar -c 10 4
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:02:40  scall/s sread/s  fork/s  exec/s   rchar/s  wchar/s
11:02:50    39624    4530  290.51  290.51  92426384    77746
11:03:00    28069    5618  171.70  171.60  69435392    80282
11:03:10    27178    5214  189.40  189.40  67771592    62259
11:03:20    31592    5057  222.70  222.60  72799840    91750
Average     31618    5105  218.60  218.55  75612445    78009
3. Terminate the filestress process by entering the following commands:

# kill $(ps -el | grep find | cut -c24-28)
# kill $(ps -el | grep find | cut -c18-22)

4. Run the syscall program and again answer question 2. Is the system call rate lower or higher than with filestress? Why?

Answer: The syscall rate is higher than with filestress. Non-blocking system calls produce rates up to 138,000 per second on an rp2430 and up to 290,000 on an rx2600.
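Fixed character columns in cut can shift between ps output formats; a field-based alternative with awk is sketched below (the ps lines and PID are hypothetical sample data, not output from the lab system):

```shell
# match the command name in the last field of ps -el style output
# and print the PID field ($4), instead of cutting fixed columns
printf 'F S UID  PID PPID CMD\n1 R   0 4321    1 filestress\n1 S   0 4400    1 sh\n' |
  awk '$NF == "filestress" { print $4 }'
```

On real ps -el output the same awk pattern avoids the brittleness of column offsets like -c18-22 when PIDs grow or shrink in width.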
(rp2430)
# sar -c 10 4
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:36:11  scall/s sread/s swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:36:21   137619       2       0    0.00    0.00    42863     3376
11:36:31   136788       2       0    0.00    0.00     4506     1946
11:36:41   137887       2       0    0.00    0.00     5734     3277
11:36:51   138224       2       0    0.00    0.00     3686     1229
Average    137629       2       0    0.00    0.00    14171     2457

(rx2600)
# sar -c 10 4
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:15:51  scall/s sread/s swrit/s  fork/s  exec/s  rchar/s  wchar/s
11:16:01   287322      27       1    0.50    0.40    60560     4092
11:16:11   288439       7       1    0.00    0.00   233472    20480
11:16:21   289239       9       1    0.00    0.00    27853     4096
11:16:31   290331       4       0    0.00    0.00    14746     3277
Average    288832      12       1    0.12    0.10    84104     7985
The syscall program uses the open() and close() system calls and does no I/O as such. These system calls do not block the process, which turns into a CPU hog, blocking only on Priority in the glance Wait States page. Kill the syscall program before proceeding:

# kill $(ps -el | grep syscall | cut -c18-22)

5. Using cs, compare the number of context switches on an idle system and a loaded system. Idle ________
Loaded ______________

Answer:
(rp2430)
# sar -w 2 2
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:39:27  swpin/s bswin/s swpot/s bswot/s pswch/s
11:39:29     0.00     0.0    0.00     0.0      86
11:39:31     0.00     0.0    0.00     0.0      83
Average      0.00     0.0    0.00     0.0      85

# ./cs &
# sar -w 2 2
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

11:41:43  swpin/s bswin/s swpot/s bswot/s pswch/s
11:41:45     0.00     0.0    0.00     0.0   47733
11:41:47     0.00     0.0    0.00     0.0   47471
Average      0.00     0.0    0.00     0.0   47602
(rx2600)
# sar -w 2 2
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:07  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:09     0.00     0.0    0.00     0.0     150
11:22:11     0.00     0.0    0.00     0.0     177
Average      0.00     0.0    0.00     0.0     164

# ./cs &
# sar -w 2 2
HP-UX r265c145 B.11.23 U ia64    04/06/04

11:22:57  swpin/s bswin/s swpot/s bswot/s pswch/s
11:22:59     0.00     0.0    0.00     0.0   81912
11:23:01     0.00     0.0    0.00     0.0   82728
Average      0.00     0.0    0.00     0.0   82319
Notice that we go from an idle context switch rate (pswch/s) of approximately 100 per second up to 47000 or 82000! Additionally, you can look at the glance CPU Report (c). Note how much of the CPU time is spent doing context switching (about 15%).

6. Kill the cs program, remove /vxfs/file, and dismount the /vxfs filesystem.

# kill $(ps -el | grep cs | cut -c18-22)
# rm -f /vxfs/file
# umount /vxfs
Lab 1
1. Change directory to /home/h4262/cpu/lab1 # cd /home/h4262/cpu/lab1
3. Start a glance session and answer the following questions. What is the CPU utilization? _______
Answer At or near 100%
What are the nice values of the processes receiving the most CPU time? _______
Answer 10
What is the average number of jobs in the CPU run queue? ______
Answer
# uptime
12:05pm
#
4. Characterize the 8 lab processes that are running (proc1-8). Which are CPU hogs? Memory hogs? Disk I/O hogs etc. Identify processes that you think are in pairs. Glance global (g) page output (rp2430):
PROCESS LIST Users= 1 User CPU Util Cum Disk Thd Process Name PID PPID Pri Name ( 100% max) CPU IO Rate RSS Cnt -------------------------------------------------------------------------------proc8 27425 1 215 root 50.1/49.4 138.6 0.0/ 0.0 168kb 1 proc3 27420 1 221 root 48.4/49.2 138.0 0.0/ 0.0 168kb 1 prm3d 1462 1 168 root 0.0/ 0.2 1125.1 0.0/ 0.0 26.6mb 19 proc5 27422 1 168 root 0.0/ 0.2 0.5 4.0/ 4.0 168kb 1
proc2 27419 1 168 root 0.0/ 0.2 0.5 3.8/ 3.9 168kb 1
proc3 and proc8 are the main CPU hogs. They have been run with nice values of 10! This pair accounts for almost 100% of the CPU between them. With the same CPU rates and RSS (Resident Set Size), it is likely that these are identical programs. Selecting one of these processes in glance reveals no disc I/O and a context switch profile which is always forced.

proc5 and proc2 also manage to execute, with 0.2% CPU utilization each. Again, these look like a pair. If you select one of these programs and look at the Process Resource page, you can see a small amount of write disk I/O, most of which is logical. The main Wait Reason for this process is SLEEP. It would appear that these processes do a small amount of disk I/O and then call sleep() and pause for some time intentionally.

proc1 and proc7 are a pair. On selecting one of these, we see a nice value of 39! These processes find it nearly impossible to get CPU, with proc3 and proc8 taking all the CPU resource. If you watch the Dispatches metric on the Process Resource page, they can be seen to get one or two slices of CPU very infrequently. You should also see that for every Dispatch (these are rare), there is always an accompanying Forced CSwitch. You can conclude that these processes would be CPU hogs if they were not so crippled by their own high nice values and the aggression of proc3 and proc8.

proc4 and proc6 are the last pair. They have standard nice values of 20 and seem to do nothing but call the sleep() system call. They are being dispatched slightly more frequently than proc1 and proc7, and they are always subject to Voluntary CSwitch. These processes are not CPU hogs. They also do no disk I/O of any kind.

None of the above processes had any significant memory size.

5. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______
Answer:
(rp2430)
# timex /home/h4262/baseline/short &
The last prime number is : 49999
(rx2600)
# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real    1:02.38
user       8.48
sys        0.00
6. Compare your results to the baseline established in the lab exercise in module 1, step 7.
Answer:
Lab 2
1. Change directory to /home/h4262/cpu/lab2.

# cd /home/h4262/cpu/lab2

2. Start the processes running in the background.

# ./RUN

3. In one terminal window, start glance. In a second terminal window, run # sar -u 5 200. Answer the following questions:

What does glance report for CPU utilization? _______

Answer: Should be greater than 50% (the more, the merrier!). Output of the rp2430 glance (g) page is below.
PROCESS LIST Users= 1 User CPU Util Cum Disk Thd Process Name PID PPID Pri Name ( 100% max) CPU IO Rate RSS Cnt -------------------------------------------------------------------------------proc2 27761 1 1 root 92.0/92.3 723.2 0.0/ 0.0 168kb 1 prm3d 1462 1 168 root 0.0/ 0.2 1137.2 0.0/ 0.0 26.6mb 19
Answer: sar reports the CPU is mostly idle. Util is less than 10%.
# sar -u 5 200
HP-UX r206c42 B.11.11 U 9000/800    03/16/04

13:45:58  %usr  %sys  %wio  %idle
13:46:03     4     2     0     94
13:46:08     0     1     0     99
13:46:13     1     1     0     98
13:46:18     0     0     0    100
13:46:23     1     1     0     98
This is very strange; the tools totally disagree with each other. sar is reporting over 90% idle with glance reporting over 80% busy! They cannot both be right. Which one do you trust? The output of top is also confused. It sees the busy process but still reports 90% idle!
(rp2430)
Load averages: 0.50, 0.56, 1.41
112 processes: 99 sleeping, 13 running
Cpu states:
LOAD   USER   NICE   SYS    IDLE    BLOCK  SWAIT  INTR   SSYS
0.50   0.6%   0.0%   2.2%   97.2%   0.0%   0.0%   0.0%   0.0%

Memory: 91236K (64076K) real, 365020K (299140K) virtual, 30120K free   Page# 1/8

TTY     PID    USERNAME  PRI  NI  SIZE    RES   STATE
pts/tb  27761  root        1  20  1664K   148K  sleep

(rx2600)
Load averages: 0.03, 0.12, 0.68
128 processes: 107 sleeping, 20 running, 1 zombie
Cpu states:
LOAD   USER   NICE   SYS    IDLE    BLOCK  SWAIT  INTR   SSYS
0.03   0.2%   0.0%   0.0%   99.8%   0.0%   0.0%   0.0%   0.0%

Memory: 197664K (154768K) real, 608492K (523032K) virtual, 23516K free   Page# 1/10

TTY     PID    USERNAME  PRI  NI  SIZE    RES   STATE
tty1p0  26469  root        1  20  3304K   252K  sleep
What is the priority of the process receiving the most CPU time? _______
Answer
The proc2 process is the culprit and is running with the high UNIX real time priority of 1. How much time is the process spending in the sigpause system call? ______
Answer
The Wait States for proc2 show that it is blocked on SLEEP when it is not running. This wait state is the result of the process putting itself to sleep. To see the system calls that the process is making, hit the F6 softkey or the L key once you have selected the process. glance will collect the data and present it after about 10-20 seconds. rp2430:
System Calls PID: 27761, proc2 1 euid: 0 User: root Elapsed Elapsed System Call Name ID Count Rate Time Cum Ct CumRate CumTime -------------------------------------------------------------------------------sigpause 111 449 99.7 0.35218 1497 74.1 1.17095 sigcleanup 139 450 100.0 0.00166 1500 74.2 0.00553 PPID:
rx2600:
System Calls PID: 26469, proc2 1 euid: 0 User: root Elapsed Elapsed System Call Name ID Count Rate Time Cum Ct CumRate CumTime -------------------------------------------------------------------------------sigpause 111 525 100.9 1.49255 1500 74.2 4.26847 sigcleanup 139 525 100.9 0.00143 1500 74.2 0.00408 PPID:
The sigpause() call is causing the sleep blocks that we see in the Wait States page. The interesting thing is that the rate at which the program calls sigpause() is always 100 times per second. That is 10 ms (milliseconds) between calls. How can a program be so coordinated with the wall clock, and what is it using to achieve this synchronization? Can you tell what it is yet? How is the process being context switched (forced or voluntary)? ______
Answer
Review the Resource Summary page again for proc2 and you will see that all the context switches are Voluntary. This is not the expected case for a CPU hog. How is it that a process can use so much CPU and never be seen by the scheduler and thrown off the CPU?

The Bottom Line

If you examine the code of the lab, you will see that the process arms a trap waiting for the system hardware clock (the tick) to pop. When this occurs, the program wakes up and wastes CPU for an amount of time that your instructor has tuned to be just under 10 ms (see waste.c). The program then arms the trap again and voluntarily goes to sleep waiting for the next hardware tick. Remember, the UNIX scheduler analyzes system activity on the hardware tick intervals, and our program has done a good job of never being around at these times! It's a free lunch.
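The tick arithmetic behind this trick can be sketched briefly (assuming the traditional 100 Hz clock tick described above):

```shell
# at 100 ticks per second, the interval between hardware clock ticks
# is 1000/100 milliseconds, matching the observed 100 sigpause()
# calls per second
HZ=100
echo $((1000 / HZ))    # milliseconds per tick -> 10
```

A process that sleeps across every tick boundary and burns CPU only between ticks is invisible to tick-based accounting.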
The standard UNIX tools (sar and top, for example) feed on the scheduler's internal statistics for measurement data, and so they get the wrong story. glance, however, uses the midaemon, which recalculates performance stats every time a process returns from a system call. And you cannot play this game without system calls.

4. Determine the impact of this load on user processes. Time how long it takes for the short baseline to execute.

# timex /home/h4262/baseline/short &

How long did the program take to execute? _______
Answer:
(rp2430)
# timex /home/h4262/baseline/short &
The last prime number is : 49999

real    2:32.86
user      10.88
sys        0.07

(rx2600)
# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real      30.86
user       8.51
sys        0.01
Our old benchmark figure was around 10 seconds (real), so this is significantly slower. This program is running in the gaps that the proc2 process is leaving. You could further modify waste.c to use more of the tick period.

5. End the CPU load by executing the KILLIT script.

# ./KILLIT
Description          process (bytes)
Module               vm
Current Value        1073741824 [Default]
Value at Next Boot   1073741824 [Default]
Value at Last Boot   1073741824
Default Value        1073741824
Constraints          maxdsiz >= 262144
                     maxdsiz <= 4294963200
Can Change           Immediately or at Next Boot
The number in decimal is 1073741824 = 1 GB. The default maxdsiz on 11i v2 is 1 GB. This will make proc1 very slow in reaching its limits. You can change maxdsiz to a more reasonable number for this lab exercise by:
# kctune maxdsiz=0x10000000 WARNING: The automatic 'backup' configuration currently contains the configuration that was in use before the last reboot of this system. ==> Do you wish to update it to contain the current configuration before making the requested change? n NOTE: The backup will not be updated. * The requested changes have been applied to the currently running system. Tunable Value Expression Changes maxdsiz (before) 1073741824 Default Immed (now) 0x10000000 0x10000000
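The hex value passed to kctune converts as follows (a quick check of the sizes involved; POSIX shell arithmetic accepts the 0x prefix directly):

```shell
# 0x10000000 bytes in decimal, and in megabytes
echo $((0x10000000))                  # 268435456 bytes
echo $((0x10000000 / 1024 / 1024))    # 256 MB
```

So the tunable drops from the 1 GB default (0x40000000) to 256 MB, which is why the leak programs hit their limit much sooner.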
Also take some vmstat readings to satisfy yourself that the system is not under memory pressure. How much free memory do you have? rp2430:
# vmstat 2 2
procs          memory              page                          faults        cpu
r  b  w    avm    free  re at pi po fr de sr   in   sy   cs  us sy id
3  0  0  75182   92519   3  0  0  0  0  0  0  104  408  138   1  0  99
3  0  0  75182   92465   3  0  1  0  0  0  0  106  214   75   0  0 100
(A further vmstat sample showed approximately 96427 pages free.)
We have around 97000 free pages, which equates to 388 MB.

3. Use the RUN script to start the background processes:

# ./RUN

4. Open another window. Start glance. Sort the processes by CPU utilization (should be the default), and answer the following questions fairly quickly, before the memory leaks get too large. Go to the m page of glance for the best info. You have to be quick off the mark after starting the leak programs!
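The page-to-megabyte conversion quoted above can be sketched as follows (assuming the 4 KB base page size; the decimal divisor reproduces the rounded figure used here):

```shell
# ~97000 free 4 KB pages expressed in (decimal) megabytes
PAGES=97000
echo $((PAGES * 4 / 1000))    # -> 388 MB
```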
MEMORY REPORT Users= 1 Event Current Cumulative Current Rate Cum Rate High Rate ------------------------------------------------------------------------------Page Faults 588 1301 113.0 116.1 137.1 Page In 1 33 0.1 2.9 6.1 Page Out 0 0 0.0 0.0 0.0 KB Paged In 0kb 36kb 0.0 3.2 6.9 KB Paged Out 0kb 0kb 0.0 0.0 0.0 Reactivations 0 0 0.0 0.0 0.0 Deactivations 0 0 0.0 0.0 0.0 KB Deactivated 0kb 0kb 0.0 0.0 0.0 VM Reads 0 3 0.0 0.2 0.5 VM Writes 0 0 0.0 0.0 0.0 Total VM : 384.9mb Active VM: 342.1mb Sys Mem : 182.3mb Buf Cache: 32.4mb User Mem: 96.9mb Free Mem: 328.4mb Phys Mem: 640.0mb
What is the current amount of free memory?
Answer: Varies with configuration. Already this has dropped to 328.4 MB.

What is the size of the buffer cache?
Answer: Varies with configuration. In our case this is 32.4 MB.

Is there any paging to the swap space?
Answer: Varies with configuration. Not in the last sample; see KB Paged Out above.

How much swap space is currently reserved?
Answer: Varies with configuration. Get this from swapinfo. Again, you need to do this just after the programs start. In our case, around 249 MB.
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI NAME
dev        2048       0    2048    0%       0       -   1 /dev/vg00/lvol2
reserve       -     379    -379
memory     1013     330     683   33%
total      3061     709    2352   23%
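As a quick check of the swapinfo arithmetic, the Used column rows (dev + reserve + memory) sum to the total row:

```shell
# sum the Used column rows from the swapinfo report above
DEV=0; RESERVE=379; MEMORY=330
echo $((DEV + RESERVE + MEMORY))    # -> 709
```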
The total swap space used (used = really used + reserved) is the figure in the total row above. More detail on swap management is in Module 7. For now, take that bottom-line figure.

Which process has the largest Resident Set Size (RSS)?

Answer: proc1. You can see that from the global process list in glance (the g key). As you watch it, it will grow until vhand kicks in and limits its RSS. However, the VSS will continue to grow. Select that process (with s) and observe the RSS/VSS figure.
PROCESS LIST Users= 1 User CPU Util Cum Disk Thd Process Name PID PPID Pri Name ( 100% max) CPU IO Rate RSS Cnt ------------------------------------------------------------------------------proc1 3267 1 168 root 0.0/ 0.2 1.0 0.0/ 0.0 275.8mb 1 proc2 3268 1 168 root 0.0/ 0.1 0.4 0.0/ 0.0 114.6mb 1 proc3 3269 1 168 root 0.0/ 0.0 0.2 0.0/ 0.0 56.7mb 1 proc4 3270 1 168 root 0.0/ 0.0 0.1 0.0/ 0.0 27.7mb 1 alarmgen 3277 3276 168 root 0.0/ 0.0 0.1 1.3/ 0.1 1.6mb 6 vhand 2 0 128 root 0.4/ 0.2 2.0 81.7/44.2 64kb 1
Resources PID: 3267, proc1 PPID: 1 euid: 0 User: root ------------------------------------------------------------------------------CPU Usage (util): 0.0 Log Reads : 0 Wait Reason : SLEEP User/Nice/RT CPU: 0.0 Log Writes: 0 Total RSS/VSS :275.7mb/479.1mb System CPU : 0.0 Phy Reads : 0 Traps / Vfaults: 0/ 542 Interrupt CPU : 0.0 Phy Writes: 0 Faults Mem/Disk: 0/ 0 Cont Switch CPU : 0.0 FS Reads : 0 Deactivations : 0 Scheduler : HPUX FS Writes : 0 Forks & Vforks : 0 Priority : 168 VM Reads : 0 Signals Recd : 0 Nice Value : 20 VM Writes : 0 Mesg Sent/Recd : 0/ 0 Dispatches : 5 Sys Reads : 0 Other Log Rd/Wt: 0/ 0 Forced CSwitch : 0 Sys Writes: 0 Other Phy Rd/Wt: 0/ 0 VoluntaryCSwitch: 5 Raw Reads : 0 Proc Start Time Running CPU : 0 Raw Writes: 0 Tue Apr 6 14:29:16 2004 CPU Switches : 0 Bytes Xfer: 0kb :
What is the data segment size of the process with the largest RSS? Answer:select the memory regions page for proc1 with the M key.
Memory Regions PID: 3267, proc1 PPID: 1 euid: 0 User: root
Type RefCt RSS VSS Locked File Name ------------------------------------------------------------------------------NULLDR/Shared 87 4kb 4kb 0kb <nulldref> TEXT /Shared 2 4kb 4kb 0kb /home/.../leak/proc1 DATA /Priv 1 301.0mb 716.2mb 0kb /home/.../leak/proc1 MEMMAP/Priv 1 0kb 16kb 0kb /usr/lib/tztab
MEMMAP/Priv     1      4kb      4kb    0kb  <mmap>
MEMMAP/Priv     1      4kb      8kb    0kb  <mmap>
MEMMAP/Priv     1      0kb      8kb    0kb  <mmap>
MEMMAP/Priv     1     24kb     28kb    0kb  /usr/lib/hpux32/libc.so.
MEMMAP/Priv     1     40kb     40kb    0kb  <mmap>
Text RSS/VSS:  4kb/ 4kb   Shmem RSS/VSS:  0kb/ 0kb   Stack RSS/VSS:  4kb/ 8kb
The data segment size in this example is 301/716 MB and growing!

5. After several minutes, the proc1 process should reach its maximum data size. If your maxdsiz is set to 1 GB, this could take a while. Please be patient. Observe the behavior of the system when this occurs.

What happens when the process reaches its maximum data size?

Answer: This is going to take several minutes. The maxdsiz limit is probably either 256 MB or 1 GB on the test system. Be careful! maxdsiz is a limit on the VSS (Virtual Set Size), not the RSS (Resident Set Size). The system starts doing a LOT of disk I/O. Look for the large F bar in the Disc Util global meter.

Why does disk utilization become so high at this point?

Answer: The kernel is dumping the core file of the user process in our case. You will probably run out of disc space in the /home file system. You may want to remove the /home/h4262/memory/leak/core file! Remember, it is not the process that is doing the disk I/O; it is the kernel that is doing it to produce the core file.

6. As the other processes grow towards their maximum data segment size, continue to monitor the following:

Free memory
# vmstat 2 2
procs           memory                page                             faults        cpu
r  b  w     avm    free  re at  pi  po  fr de  sr   in    sy   cs  us sy id
2  0  0  321403   91118  54 19  79 285  16  0 359  548  4962  326   2  3  95
2  0  0  321403   90413   1  0 115  12   0  0   0  397   552  191   0  0 100
Not a lot of free memory now. The system is under memory pressure and is paging out to stabilize the memory system Swap space reserved
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI NAME
dev        2048     715    1333   35%       0       -   1 /dev/vg00/lvol2
reserve       -     341    -341
memory     1013     340     673   34%
total      3061    1396    1665   46%
All the proc(n) processes continue to grow (see VSS) just like proc1 did and they are aborted in the same way when they cross the line (maxdsiz). The RSS of the processes
The running memory hog processes compete for the limited real memory resource. We didn't have a lot free at the start of the test, and the lab procs all want to grow to the maxdsiz limit. They cannot all fit together, so they fight. This is a classic memory thrash situation.

The number of page-outs/page-ins to the swap space
This depends on when you look! These figures were taken while proc2 was still on the move and free memory was approaching its minimum.
# vmstat 2 10
procs           memory               page                                faults        cpu
r  b  w     avm   free  re at  pi   po   fr de    sr   in   sy   cs  us sy id
2  0  0  166464   2692   0  0   0    0    0  0     0  103  173   82   0  0 100
2  1  0  170444   1649   0  0  13    0    0  0     0  123  209   92   0  0 100
2  1  0  170444   1028   0  0   8    5    4  0  1256  122  189   88   0  6  94
2  1  0  170444   1146   8  0   6  101  109  0  9869  225  176  129   0  5  95
2  1  0  170444   1392  12  0   5  263   69  0  9659  316  175  112   0  0 100
2  1  0  170444   1366  12  0   5  312   44  0  8186  331  190  156   0  0 100
1  0  0  169455   1090   9  0   5  304   28  0  6410  316  209  201   0  0 100
1  0  0  169455   1112   6  0   3  351   31  0  5334  359  193  163   0  1  99
1  0  0  169455   1048   3  0   2  332   19  0  3902  339  180  133   5  0  95
1  0  0  169455   1600   5  0   0  396   12  0  2576  370  240  119   0  4  96
7. Run the two baseline programs, short and diskread. # timex /home/h4262/baseline/short # timex /home/h4262/baseline/diskread
http://education.hp.com
Solutions
rp2430:
# timex /home/h4262/baseline/short
The last prime number is : 49999

real       12.00
user       10.86
sys         0.02

# timex /home/h4262/baseline/diskread
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c1t15d0]
DiskRead: Start reading : 1024MB
1024+0 records in
1024+0 records out

real       31.79
user        0.02
sys         0.53

rx2600:

# timex /home/h4262/baseline/short &
# The last prime number is : 99991

real        8.54
user        8.48
sys         0.00

# timex /home/h4262/baseline/diskread &
[1] 3841
root@r265c145:/home/h4262/memory/leak #
DiskRead: System  : [HP-UX]
DiskRead: RawDisk : [/dev/rdsk/c2t1d0s2]
DiskRead: Start reading : 2048MB
2048+0 records in
2048+0 records out

real       29.60
user        0.01
sys         0.16
How does the performance of these programs compare to their earlier runs?

Answer: short takes a little longer. The CPU is not under much pressure at this time, so compute-bound processes will not be affected (unless they need memory!). It is a different story for diskread: in the first test case, it took noticeably longer due to the disk load already in progress from the paging activity. It is not good to have swap space on your application disks!

8. When finished monitoring the behavior of processes with memory leaks, clean up the processes.
# kctune maxdsiz=0x40000000
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this
         system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? n
  NOTE: The backup will not be updated.
       * The requested changes have been applied to the currently
         running system.
Tunable    Value        Expression  Changes
maxdsiz    (before)  0x10000000     0x10000000  Immed
           (now)     0x40000000     0x40000000
Directions
The following lab illustrates swap reservation, configures and de-configures pseudo-swap, and adds additional swap partitions with different swap priorities.

1. Use the swapinfo -m command to display the current swap space statistics on the system. List the MB Avail and MB Used for the dev, reserve, and memory entries.

Answer

(rp2430)
TYPE      Mb AVAIL  Mb USED
dev            512        0
reserve          -      139
memory         451       27

(rx2600)
# swapinfo -m
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      75    1973    4%       0       -   1  /dev/vg00/lvol2
reserve       -     189    -189
memory     1013     339     674   33%
2. To see total swap space available and total swap space reserved, enter:
# swapinfo -mt
What is the total swap space available (including pseudo swap)?

Answer

Varies with configuration; in our case it is 963 Mb (rp2430) or 3061 Mb (rx2600), as seen in the totals below.
# swapinfo -tm    (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     139    -139
memory      451      27     424    6%
total       963     166     797   17%       -       0   -

# swapinfo -mt    (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      74    1974    4%       0       -   1  /dev/vg00/lvol2
reserve       -     190    -190
memory     1013     339     674   33%
total      3061     603    2458   20%       -       0   -
What is the total space "reserved"?

Answer

Varies with configuration. Swap space is first reserved, and then it may (or may not) be used by the process that reserved it. The bottom line is that reserved swap space is no more available than used swap space, so the only figures that really matter here are the totals (166 Mb and 603 Mb). This space is unavailable to any other process.

3. Start a new shell process by typing sh. Re-execute the swapinfo command and verify whether any additional swap space was reserved when the new shell process started. In this case, the difference is going to be pretty small, so let's not use the -m option. Upon verification, exit the shell. Is the swap space returned upon exiting the shell process?

Answer

It should, and it does. But you have to be careful when you look. It is easy for some other activity on the system to spoil the results. You may want to try it 2 or 3 times to see if your results change. What SHOULD happen is that the reserve USED entry increases and then decreases by exactly the same amount.

rp2430:
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144768 -144768
memory   462248   28384  433864    6%
# sh
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28384  433864    6%
# exit
# swapinfo
             Kb      Kb      Kb   PCT  START/      Kb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev      524288       0  524288    0%       0       -   1  /dev/vg00/lvol2
reserve       -  144444 -144444
memory   462248   28388  433860    6%
rx2600:
# swapinfo
             Kb       Kb       Kb   PCT  START/      Kb
TYPE      AVAIL     USED     FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152    75652  2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -   194900  -194900
memory  1037064   346740   690324   33%
# sh
# swapinfo
             Kb       Kb       Kb   PCT  START/      Kb
TYPE      AVAIL     USED     FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152    75652  2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -   195540  -195540
memory  1037064   346740   690324   33%
# exit
# swapinfo
             Kb       Kb       Kb   PCT  START/      Kb
TYPE      AVAIL     USED     FREE  USED   LIMIT RESERVE PRI  NAME
dev     2097152    75652  2021500    4%       0       -   1  /dev/vg00/lvol2
reserve       -   194900  -194900
memory  1037064   346740   690324   33%
If you see that some swap was reserved and not released, then there is something else going on in the background that is skewing the figures.

4. Start glance and observe the Global bars at the top of the display for the duration of this step. Start a large memory process and note how much the Current Swap Util. percentage increases in glance. Type:
# /home/h4262/memory/paging/mem256 &
This should reserve a large amount of swap space. Start as many mem256 processes as possible. For best results, wait until each swap reservation is complete by observing the incremental increases in Current Swap Util. in glance. The system will get slower and slower as you start more mem256 processes. What was the maximum number of mem256 processes that could be started?

Answer

Varies with configuration; it depends on your swap space.
On the rp2430, after 12 copies of mem256 the test system swap space was almost gone. Below is what happened when the 13th process was introduced.
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512     461      51   90%       0       -   1  /dev/vg00/lvol2
reserve       -      51     -51
memory      451     399      52   88%
total       963     911      52   95%       -       0   -

# /home/h4262/memory/paging/mem256 &
[13]    2864
# exec(2): insufficient swap or memory available.
[13] +  Done(9)    /home/h4262/memory/paging/mem256 &
On the rx2600, after 37 copies of mem256 the test system swap space was almost gone. Below is what happened when the 38th process was introduced.
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048    1978      70   97%       0       -   1  /dev/vg00/lvol2
reserve       -      70     -70
memory     1013     991      22   98%
total      3061    3039      22   99%       -       0   -
What prevented an additional mem256 process from being started?

Answer

Insufficient swap or memory available.
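A launch script can guard against this failure mode by checking swap commitment before starting another large process. The sketch below parses a captured swapinfo -mt total line; the inlined sample text and the 90% threshold are illustrative assumptions — on a live system, substitute the real command output:

```shell
# Before launching another large process, check how much of the machine's
# swap is already committed. The sample below mirrors the rp2430 figures
# above; on a live HP-UX system use:  swapinfo -mt
swapinfo_out='TYPE AVAIL USED FREE PCT
dev 512 461 51 90%
reserve - 51 -51 -
memory 451 399 52 88%
total 963 911 52 95%'

# Pull the PCT USED figure from the total line and strip the % sign.
pct=$(echo "$swapinfo_out" | awk '$1 == "total" { sub(/%/, "", $5); print $5 }')
if [ "$pct" -ge 90 ]; then
    echo "swap ${pct}% committed: a new large process may fail in exec(2)"
fi
```

Because reservation happens at exec/malloc time, this check catches the problem before the "insufficient swap or memory available" error does.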
Kill all mem256 processes to restore performance.

5. Recompile the kernel, disabling pseudo-swap. Use the following procedure:

11i v1 and earlier:
# cd /stand/build
# /usr/lbin/sysadm/system_prep -s system
# echo "swapmem_on 0" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0
11i v2 and later:
# kctune swapmem_on=0
  -- The tunable swapmem_on cannot be changed in a dynamic fashion.
WARNING: The automatic 'backup' configuration currently contains the
         configuration that was in use before the last reboot of this
         system.
     ==> Do you wish to update it to contain the current configuration
         before making the requested change? no
  NOTE: The backup will not be updated.
       * The requested changes have been saved, and will take effect
         at next boot.
Tunable      Value  Expression
swapmem_on   (now)         1  Default
       (next boot)         0  0
# shutdown -ry 0
6. Reboot from the new kernel.

rp2430:
Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

rx2600: (Nothing special needs to be done.)

7. Once the system reboots, log in and execute swapinfo. Is there a memory entry? Why or why not?

Answer

No. Pseudo-swap has been disabled.
Will the same number of mem256 processes be able to execute as earlier? Answer No. How many mem256 processes can be started now? Answer Varies with configuration
On the rp2430, only 6 processes could be started successfully. On the rx2600, only 27 processes could be started successfully. Kill all mem256 processes to restore performance.

8. If you have a two-disk system, add the second disk to vg00 (if this was not already done in a previous exercise) and build a second swap logical volume on it. This lvol should be the same size as the primary swap volume. If you do not have a second disk, continue this lab at question 13.
If you did not add the second disk earlier:
# vgdisplay -v | grep Name      (Note the physical disks used by vg00.)
# ioscan -fnC disk              (Note which disk is unused by LVM.)
# pvcreate -f <raw_dev_file_of_second_disk>
# vgextend /dev/vg00 <block_dev_file_of_second_disk>
To create the new swap device on the second disk:
# lvcreate -n swap1 /dev/vg00
# lvextend -L 512 /dev/vg00/swap1 <dev_file_of_second_disk>
Note: In our case the primary swap was 512 MB. See swapinfo on your system and match the size of the new swap device to the primary swap.

9. Now add the new logical volume to swap space. Ensure that the priority is the same as the primary swap:
# swapon -p 1 /dev/vg00/swap1
Check your work.

Answer:
# swapinfo -tm
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       -   1  /dev/vg00/lvol2
reserve       -     130    -130
total       512     130     382   25%       -       0
# swapon -p 1 /dev/vg00/swap1
swapon: Device /dev/vg00/swap1 contains a file system. Use -e to page after the end of the file system, or -f to overwrite the file system with paging.
Oops! Problem 1: swapon is being overly cautious. If you get this message, the memory manager has detected what appears to be a file system already on the device (probably left over from some previous use). You need to override.
# swapon -p 1 -f /dev/vg00/swap1
swapon: The kernel tunable parameter "maxswapchunks" needs to be increased to add paging on device /dev/vg00/swap1.
Oops! Problem 2: the kernel cannot deal with this amount of swap. If you get this message, the tunable parameter maxswapchunks is set too small to accommodate all of the new swap space. We need to modify maxswapchunks and reboot. If you have this problem, use sam to double maxswapchunks. In 11i v2, maxswapchunks has been obsoleted and will not have to be modified. Recompile the kernel (if necessary) to increase maxswapchunks. Use the following procedure:
11i v1 and earlier (ONLY!)
# cd /stand/build
# echo "maxswapchunks 512" >> system
# mk_kernel -s system
# cd /
# shutdown -ry 0

10. If you had to rebuild the kernel to increase maxswapchunks, reboot the system. Otherwise, skip to step 11.

11i v1 and earlier (ONLY!)
Press any key to interrupt the boot process
Main menu> boot pri isl
Interact with IPL> y
ISL> hpux (;0)/stand/build/vmunix_test

And now add the new swap device:
# swapon -p 1 -f /dev/vg00/swap1
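The 512 echoed into the system file earlier is not arbitrary. Each swap chunk spans swchunk blocks of DEV_BSIZE bytes — with the defaults (2048 blocks of 1 KB), that is 2 MB per chunk — so maxswapchunks must be at least the total swap in MB divided by 2. A sketch of the arithmetic, assuming the lab's 512 MB + 512 MB configuration and the default chunk size:

```shell
# Size maxswapchunks to cover all configured device swap.
total_swap_mb=1024          # primary 512 MB + new swap1 of 512 MB
chunk_mb=2                  # default swchunk (2048) * DEV_BSIZE (1 KB) = 2 MB
needed=$((total_swap_mb / chunk_mb))
echo "maxswapchunks must be at least $needed"
```

This reproduces the 512 used in the kernel rebuild; with a larger swchunk the same total swap would need proportionally fewer chunks.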
Verify that the new swap space has been recognized by the kernel:
# swapinfo -mt    (rp2430)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev         512       0     512    0%       0       0   1  /dev/vg00/lvol2
dev         512       0     512    0%       0       0   1  /dev/vg00/swap1
reserve       -     141    -141
total      1024     141     883   14%       -       0   -

# swapinfo -tm    (rx2600)
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE PRI  NAME
dev        2048      86    1962    4%       0       0   1  /dev/vg00/lvol2
dev        2048       0    2048    0%       0       0   1  /dev/vg00/swap1
reserve       -     158    -158
total      4096     244    3852    6%       -       0   -
Done!

11. Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on an rp2430 with 640 MB, I found that 8 processes got things paging nicely! On an rx2600, 10 should do nicely.
# vmstat 2 2
         procs        memory                 page                 faults          cpu
    r  b  w    avm  free  re  at  pi  po  fr  de    sr   in   sy   cs  us sy  id
    9  0  0 180106  5064  34   0 192 340  99   0  3136  339  213  471   0  0 100
    9  0  0 180106  5056  23   0 122 217  63   0  2006  216  191  355   0  0 100
Note the system is paging constantly in the vmstat output and free memory is very low. 12. Measure the disk I/O to see what is happening with swap space. Go to question 15 when you have finished. Answer: The I/O should be balanced across both disks!
# sar -d 5 2    (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/18/04

14:22:12   device    %busy  avque  r+w/s  blks/s  avwait  avserv
14:22:17   c1t15d0   87.03  24.73    409   12222   33.45   13.86
           c3t15d0   60.68  23.21    406   12093   31.03    9.24
14:22:22   c1t15d0   82.60  22.01    395   12209   28.53   12.26
           c3t15d0   72.20  19.57    385   11976   25.00   10.57
Average    c1t15d0   84.82  23.39    402   12216   31.03   13.08
Average    c3t15d0   66.43  21.43    396   12034   28.10    9.89
# sar -d 5 2    (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/07/04

11:28:10   device   %busy  avque  r+w/s  blks/s  avwait  avserv
11:28:15   c2t1d0    9.38   0.50     25     542    0.00    6.05
           c2t0d0    3.79   0.50     14     271    0.01    4.71
11:28:20   c2t1d0   21.40   6.75     79    2373    2.85    5.35
           c2t0d0    6.60  10.42     47    1229    3.86    3.94
Average    c2t1d0   15.38   5.25     52    1456    2.16    5.51
Average    c2t0d0    5.19   8.13     31     750    2.97    4.12
This has doubled the effective performance of swap space. The results would be even better if the swap disks were on different controllers.

13. If you have a single-disk system, create three additional swap devices with sizes of 20 MB.
# lvcreate -L 20 -n swap1 vg00
# lvcreate -L 20 -n swap2 vg00
# lvcreate -L 20 -n swap3 vg00
Prior to activating these swap devices, make note of the amount of swap space currently in use. When the new swap devices are activated with equal priority, all new paging activity will be spread evenly over these swap devices.
List the current amount of swap space in use.

Answer

Varies with configuration. Use swapinfo -tm.
If 10 MB is currently in use on a single swap device, and we activate an equal-priority swap device, what is the distribution if an additional 10 MB is paged out?
A) The distribution would be 10 MB and 10 MB, or
B) The distribution would be 15 MB and 5 MB.

Answer

B. vhand does not consider what the previous utilization was.
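The even split can be sketched numerically: with two equal-priority devices, vhand distributes each new page-out round-robin, regardless of what each device already holds. The starting figures below are the hypothetical ones from the question:

```shell
# Swap A already holds 10 MB; a freshly added equal-priority swap B holds 0.
a=10 b=0                    # MB already used on each device
pageout=10                  # additional MB about to be paged out

# New page-outs are split evenly; prior utilization is ignored.
a=$((a + pageout / 2))
b=$((b + pageout / 2))
echo "A=${a}MB B=${b}MB"    # A=15MB B=5MB -- answer B
```

Only once the devices reach equal utilization does the overall distribution even out.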
14. Activate the newly created swap devices. Activate two with a priority of 1, and the third with a priority of 2.
# swapon -p 1 /dev/vg00/swap1
# swapon -p 2 /dev/vg00/swap2
# swapon -p 1 /dev/vg00/swap3
Start enough mem256 processes to make the system start paging.

Answer: This depends on how much memory you have, but on a 640 MB system I found that 8 processes got things paging nicely!
# vmstat 2 2
         procs        memory      faults         cpu
    r  b  w    avm  free    sy   cs   us sy  id
   10  0  0 175597  6489   271   58   26  4  70
   10  0  0 175597  6414   300  254  100  0   0
Note the system is paging constantly in the vmstat output and free memory is very low. Is the new paging activity being distributed evenly across the paging devices? Answer No. It is confined to lvol2 (primary swap), swap1, and swap3.
15. When finished with the lab, reboot the system as normal (do not boot vmunix_test) to re-enable pseudo-swap and remove the additional swap devices. For 11i v1 and earlier, follow this procedure:
# cd /
# shutdown -ry 0

For 11i v2 and later, follow this procedure:
# cd /
# kctune swapmem_on=1
# shutdown -ry 0
Next, execute the make_files program to create five 4-MB ASCII files.
# cd /vxfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.
# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs
4. Open a second terminal window and start glance. While in glance, display the Disk Report (d key). Zero out the data with the z key. From the first window, time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real: _______ user: _______ sys: _______

Answer:

# timex cat file* > /dev/null    (rp2430)
real        0.73
user        0.01
sys         0.11

glance Disk Report: Logl Rds: 2560   Phys Rds: 500

# timex cat file* > /dev/null    (rx2600)
real        0.34
user        0.00
sys         0.06

glance Disk Report: Logl Rds: 2560   Phys Rds: 2560
5. At this point, all 20 MB of data is resident in the buffer cache. Re-execute the same command and record the results below:
# timex cat file* > /dev/null
real: _______ user: _______ sys: _______

Answer:

# timex cat file* > /dev/null    (rp2430)
real        0.06
user        0.01
sys         0.05

glance Disk Report: Logl Rds: 2560   Phys Rds: 0

# timex cat file* > /dev/null    (rx2600)
real        0.02
user        0.00
sys         0.02

glance Disk Report: Logl Rds: 2560   Phys Rds: 0
NOTE: The conclusion is that I/O is much faster coming from the buffer cache than having to go to disk to get the data.
6. The sar -d report. Exit glance, and in the second window start:
# sar -d 5 200
From the first window, execute the disk_long program, which writes 400 MB to the VxFS file system (and then removes the files).
# timex ./disk_long
How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
Answer: The disk got over 80% busy. The average number of requests in the I/O queue reached around 53 on the rp2430 and 442 on the rx2600. The average wait time of a request was around 65 ms on the rp2430 and 182 ms on the rx2600. The task took around 12.5 seconds on the rp2430 and 7.5 seconds on the rx2600.

7. The glance I/O by Disk report. Exit from the sar -d report, and start glance again. While in glance, display the I/O by Disk report (u key). From the first window, re-execute disk_long. Record the results below:
# ./disk_long
glance I/O by Disk Report: Util: _______   Qlen: _______

Answer: Utilization reached 86% and queue length reached 55 on the rp2430. Utilization reached 85% and queue length reached 414 on the rx2600.

8. The glance I/O by File System report. Reset the data with the z key, and display the I/O by File System report (i key). From the first window, re-execute disk_long. Record results below:
# ./disk_long
glance I/O by File System Report: Logl I/O: _______   Phys I/O: _______
Answer: Logical I/Os reached 4059 and Physical I/Os reached 806 on the rp2430. Logical I/Os reached 4702 and Physical I/Os reached 1528 on the rx2600.

9. Performance tuning: immediate reporting. Ensure the immediate reporting option is set for the disk that the file system is located on. If immediate reporting is not set, set it.
# scsictl -m ir /dev/rdsk/cXtXdX       (to report current "ir" status)
# scsictl -m ir=1 /dev/rdsk/cXtXdX     (ir=1 to set, ir=0 to clear)
Purge the contents of buffer cache.
# cd /
# umount /vxfs
# mount /dev/vg00/vxfs /vxfs
# cd /vxfs
10. The sar -d report. Exit glance, and in the second window start: # sar -d 5 200 From the first window, execute the disk_long program (which writes 400 MB to the file system and then removes the files). # timex ./disk_long How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
Next, execute the make_files program to create five 4-MB ASCII files.
# cd /hfs
# ./make_files

3. Purge the buffer cache of this data by unmounting and remounting the file system.
# cd /
# umount /hfs
# mount /dev/vg00/hfs /hfs
# cd /hfs
4. Time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real: _______ user: _______ sys: _______

Answer:

# timex cat file* > /dev/null    (rp2430)
real        1.04
user        0.01
sys         0.16

# timex cat file* > /dev/null    (rx2600)
real        0.45
user        0.00
sys         0.05
The cat command took 1.04 seconds to complete on the rp2430 and 0.45 seconds on the rx2600.

5. In a second window start:
# sar -d 5 200
From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).
# timex ./disk_long
How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
Answer:
# sar -d 5 200    (rp2430)

HP-UX r206c41 B.11.11 U 9000/800    03/23/04

11:53:15   device    %busy    avque  r+w/s  blks/s   avwait  avserv
11:53:20   c1t15d0    5.20     0.50     13      66     5.09    4.54
           c3t15d0   33.60  6922.08    950   15049   629.53   14.85
11:53:25   c1t15d0    7.57     0.50     10      36     5.40    6.82
           c3t15d0   55.98  5215.11   1758   27980  2113.38   13.70
11:53:30   c1t15d0    2.01     0.50      6      44     3.92    5.01
           c3t15d0  100.00  8156.62   2983   47696  2591.43   16.45
11:53:35   c1t15d0    8.00     5.80     18     108    25.31   18.95
           c3t15d0   84.20  1237.19    558    8670  1555.06   17.68
11:53:40   c1t15d0    6.00     0.50     15      76     4.69    4.72
           c3t15d0   71.20  7379.94   2168   34537  1322.90   14.77
11:53:45   c1t15d0    0.20     0.50      1       5     0.08    8.35
           c3t15d0   25.80  2375.50    950   15206  3478.83   14.42
11:53:50   c3t15d0    9.20     0.50     16     258     5.06    5.21
The disk got up to 100% busy. The average number of requests in the request queue was about 5200. The average wait time in the request queue was about 1950 ms.
# timex ./disk_long
real       22.76
user        4.57
sys         3.45
# sar -d 5 200    (rx2600; only the %busy and avque columns were captured)
 %busy     avque
  4.39      0.50
 27.15      0.50
 41.00    104.29
 99.20  24004.63
  1.40      0.50
100.00  20020.69
  4.00      0.50
 57.20   5030.77
  2.40      0.50
  1.40      0.50
The disk got up to 100% busy. The average number of requests in the request queue was about 50,000. The average wait time in the request queue was about 6100 ms.
# timex ./disk_long
real       16.87
user        0.83
sys         1.96
6. Performance tuning: recreate the file system with larger fragment and file system block sizes. Tuning the size of the fragments and file system blocks can improve performance for sequentially accessed files. The procedure for creating a new file system with customized fragments of 8 KB and file system blocks of 64 KB is shown below:
# lvcreate -n custom-lv vg00
# lvextend -L 512 /dev/vg00/custom-lv /dev/dsk/cXtYdZ
# newfs -F hfs -f 8192 -b 65536 /dev/vg00/rcustom-lv
# mkdir /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs

7. Copy the lab files to the customized HFS file system, execute the make_files program, and purge the buffer cache.
# cp /hfs/disk_long /cust-hfs
# cp /hfs/make_files /cust-hfs
# cd /cust-hfs
# ./make_files
# cd /
# umount /cust-hfs
# mount /dev/vg00/custom-lv /cust-hfs
# cd /cust-hfs

8. Time how long it takes to read the files with the cat command. Record the results below:
# timex cat file* > /dev/null
real: _______ user: _______ sys: _______

Answer:

# timex cat file* > /dev/null    (rp2430)
real        0.84
user        0.01
sys         0.10
# timex cat file* > /dev/null    (rx2600)
real        0.43
user        0.00
sys         0.03
The cat command took 0.84 seconds to complete on the rp2430 and 0.43 seconds on the rx2600. How do the results of step 8 compare to the default HFS block and fragment results from step 4?

Answer: The larger block and fragment sizes resulted in I/O operations which were almost 20% faster on the rp2430 and marginally faster on the rx2600.

9. Performance tuning: change file system mount options. The manner in which the file system is mounted can impact performance. The fsasync mount option can improve performance, but data (metadata) integrity is not as reliable in the event of a crash, and fsck could run into difficulties.
# cd /
# umount /hfs
# mount -o fsasync /dev/vg00/hfs /hfs
# cd /hfs

10. In a second window start:
# sar -d 5 200
From the first window, execute the disk_long program, which writes 400 MB to the HFS file system (and then removes the files).
# timex ./disk_long
How busy did the disk get? What was the average number of requests in the I/O queue? What was the average wait time in the I/O queue? How much real time did the task take?
device     %busy     avque  r+w/s  blks/s   avwait  avserv
c1t15d0     6.20      0.50      9      38     4.18    6.19
c3t15d0    61.20   5592.30   2120   33818  1376.80   13.94
c1t15d0     7.00      0.50     16      81     4.31    5.28
c3t15d0    58.60   7186.64   1675   26765  1295.53   17.00
c1t15d0     8.40      3.94     24     146    20.12   13.03
c3t15d0    92.80   4986.82   1860   29579  2678.62   16.11
c1t15d0     6.60      0.50     17     120     4.84    3.79
c3t15d0   100.00  15588.44   2344   37493  2943.35   16.95
c3t15d0    71.20   5725.86   2292   36664  6159.69   15.69
The disk got up to 100% busy. The average number of requests in the request queue was about 7800. The average wait time in the request queue was about 2900 ms.

# timex ./disk_long
real       17.17
user        4.61
sys         3.72
The operation completed in 17.17 seconds.

# sar -d 5 200    (rx2600)

HP-UX r265c145 B.11.23 U ia64    04/07/04

13:39:39   device    %busy     avque  r+w/s  blks/s    avwait  avserv
13:39:44   c2t1d0     1.00      0.50      4      67      0.00    2.51
           c2t0d0    46.11  22190.48   1274   20184   1026.94    2.54
13:39:49   c2t1d0     2.00      0.50      5      77      0.00    5.94
           c2t0d0   100.00  30303.60   3684   58941   4021.91    2.15
13:39:54   c2t1d0     3.20      5.20      9     141     11.85   12.77
           c2t0d0    99.80  11176.41   3888   62008   8740.46    2.05
13:39:59   c2t1d0     0.80      0.50      2      30      0.00    4.42
           c2t0d0     5.60    716.00    287    4562  11067.58    1.51
13:40:04   c2t1d0     4.00      0.50      9      43      0.00    4.45
The disk got up to 100% busy. The average number of requests in the request queue was about 17500. The average wait time in the request queue was about 6100 ms.

# timex ./disk_long
real       14.46
user        0.86
sys         3.04
How do the results of step 10 compare to the default mount options in step 5?

Answer: With fsasync turned on, the operation was about 25% faster on the rp2430 and 14% faster on the rx2600.
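Those percentages can be reproduced from the timex real times recorded in steps 5 and 10 with a small awk helper (a sketch; the figures are the lab's own measurements):

```shell
# Percentage improvement from fsasync, computed from the recorded real times.
speedup() {
    # $1 = default-mount time, $2 = fsasync time, both in seconds
    awk -v d="$1" -v t="$2" 'BEGIN { printf "%.0f%%\n", (d - t) / d * 100 }'
}
speedup 22.76 17.17    # rp2430: 25%
speedup 16.87 14.46    # rx2600: 14%
```

The same helper works for any before/after pair in these labs, which makes comparing mount options across runs quick.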
3. Change directory to /vxfs. Time the execution of the disk_long program, which writes 400 MB of data to the file system in 20 MB increments. After each 20 MB is written, the files are deleted. Run the command three times and record the middle results.
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Record middle results: Real: _____________ User: ____________ Sys: ____________
Answer Varies with configuration, live data from test
(rp2430)
If you look back to the HFS results, you will see that this is faster. See question 5 from the previous lab; the test time there was 23 seconds (rp2430) or 17 seconds (rx2600)!
4. Remount the JFS file system using the delaylog option. This helps the performance of non-critical transactions. Run the command three times and record the middle results.
# cd /
# umount /vxfs
# mount -o delaylog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Record middle results: Real: _____________ User: ____________ Sys: ____________ Answer Varies with configuration, should be faster than before: (rp2430)
(rx2600)
Based on the results, does the disk_long program perform any non-critical transactions?
Answer
The answer is yes; the disk_long program is performing some non-critical transactions. This is seen in the slight improvement in execution time. Since the program writes data in 1 MB increments (that's it), just about every JFS transaction is critical, so mounting with delaylog versus the default log does not greatly affect performance in this case. It will in other cases.

5. Remount the JFS file system using the tmplog option. This causes the system call to return after the JFS transaction is updated in memory (step 1 from lecture), and before the transaction is written to the intent log. Run the command three times and record the middle results.
# cd /
# umount /vxfs
# mount -o tmplog /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Record middle results: Real: _____________ User: ____________ Sys: ____________ Answer Varies with configuration, live test data: (rp2430)
(rx2600)
Based on the results, why does the disk_long program show little or no improvement when mounted with tmplog?
Answer
The disk_long program shows little performance improvement because the program is performing extending write calls. When an extending write call is issued, by default JFS writes the user data first before writing the JFS transaction to the intent log. As a result, even JFS file systems mounted with tmplog or nolog will still have to wait for the user data to be written to disk. This waiting for the user data to be written hurts the performance of JFS.
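The distinction can be demonstrated with two plain write patterns. This is a portable sketch only — the file names are illustrative, and dd stands in for the lab program's write calls: growing a file block by block issues extending writes, while preallocating once and rewriting with conv=notrunc keeps every later write non-extending, which is the pattern tmplog and nolog can actually accelerate.

```shell
rm -f /tmp/extend.dat /tmp/inplace.dat

# Pattern 1: every write grows the file -- each one is an extending write,
# so JFS must flush the user data before the transaction completes.
dd if=/dev/zero of=/tmp/extend.dat bs=1024 count=512 2>/dev/null

# Pattern 2: preallocate the file once, then rewrite it in place;
# conv=notrunc keeps the existing size, so the rewrites do not extend.
dd if=/dev/zero of=/tmp/inplace.dat bs=1024 count=512 2>/dev/null
dd if=/dev/zero of=/tmp/inplace.dat bs=1024 count=512 conv=notrunc 2>/dev/null

ls -l /tmp/extend.dat /tmp/inplace.dat    # both end up 524288 bytes
```

Applications that preallocate their files (many databases do) are therefore the ones that see the benefit of the relaxed intent-log options.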
6. Remount the JFS file system using the tmpcache option. This allows the JFS transaction to be created without having to wait for the user data to be written in extending write calls. Run the command three times and record the middle results.
# cd /
# umount /vxfs
# mount -o mincache=tmpcache /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
# timex ./disk_long
# timex ./disk_long
Record middle results: Real: _____________ User: ____________ Sys: ____________ Answer Varies with configuration, live test data. Fastest yet!
(rp2430)
# timex ./disk_long
Answer
When the mincache=tmpcache option is specified, under 2 MB out of 400 MB is physically written to disk. When this option is not specified, all 400 MB (400 out of 400) is physically written to disk. Major performance improvements should be seen when using this option, especially for applications doing lots of extending write calls (like the one in the lab).

7. Remount the JFS file system using the mincache=direct option. This option requires all user data and all JFS transactions to bypass the buffer cache and go directly to disk. Run the command just once and record the results.
# cd /
# umount /vxfs
# mount -o mincache=direct /dev/vg00/vxfs /vxfs
# cd /vxfs
# timex ./disk_long
Based on the results, why does the disk_long program show such poor performance results when mounted with mincache=direct? When would this option be appropriate to use?
Answer
The performance is poor because system calls have to wait while user data and JFS transactions are written out to disk. Normally, the JFS transactions are written to buffer cache, and the system calls do not have to wait for the transaction to be written to disk. This option is appropriate when the application performs its own caching, like with an RDBMS (for example, Oracle).
2. Export the JFS file system so the client can mount it.
# exportfs -i -o root=<client_hostname> /vxfs
# exportfs

3. From the client system, mount the NFS file system.
# umount /vxfs
# mount server_hostname:/vxfs /vxfs
4. Time how long it takes to read the 20 MB of files from the mounted file system. Record the results: # timex cat /vxfs/file* > /dev/null Record results: Real: _____________ User: ____________ Sys: ____________ Answer Varies with configuration, live test data below, (rp2430)
# timex cat /vxfs/file* > /dev/null
real        1.80
user        0.01
sys         0.07
(rx2600)
sys         0.02

5. Now that the data is in the client's buffer cache, time how long it takes to read the exact same files again. Record the results:
# timex cat /vxfs/file* > /dev/null
Record results: Real: _____________ User: ____________ Sys: ____________

Answer

Varies with configuration; live data below. Much faster once buffered.

(rp2430)
# timex cat /vxfs/file* > /dev/null
real        0.05
user        0.01
sys         0.04

(rx2600)

# timex cat /vxfs/file* > /dev/null
real        0.02
user        0.00
sys         0.01
Moral: Try to have a big enough buffer cache on the client system so a lot of data can be cached. Also, biod daemons will help by prefetching data.

6. Test to see if fewer biod daemons will change the initial performance.
# cd /
# umount /vxfs
# kill $(ps -e | grep biod | cut -c1-7)
# /usr/sbin/biod 4
# mount server_hostname:/vxfs /vxfs
# timex cat /vxfs/file* > /dev/null
Record results: Real: _____________ User: ____________ Sys: ____________

Answer

Varies with configuration, but no significant change here. Large sequential access appears to be independent of the number of biods. Not what theory suggests? Well, this depends!

(rp2430)
# timex cat /vxfs/file* > /dev/null
real        1.80
user        0.01
sys         0.07
(rx2600)
real    1.15
user    0.00
sys     0.02

7. Once finished, remove the files and unmount the file system.

   # rm /vxfs/file*
   # umount /vxfs
During this lab, the monitoring tools shown below should be used on the client and the server:

   CLIENT                                  SERVER
   # nfsstat -c                            # nfsstat -s
   # glance NFS report (n key)             # glance NFS report (n key)
   # glance Global Process (g key)         # glance Global Process (g key)
     - monitor biod daemons                  - monitor nfsd daemons
                                           # glance Disk report (d key)
                                             - monitor Remote Rds/Wrts

1. From the NFS client, mount the NFS file system as a version 2 file system.

   # mount -o vers=2 server_hostname:/vxfs /vxfs

2. Terminate all the biod daemons on the client.

   # kill $(ps -e | grep biod | cut -c1-7)

3. Time how long it takes to copy the vmunix file to the mounted NFS file system. Record the results. The first command buffers the file:

   # cat /stand/vmunix > /dev/null
   # timex cp /stand/vmunix /jfs

   Record results:
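When watching `nfsstat -c` on the client, the raw counters are easier to judge as a retransmission rate. A minimal sketch; the helper name and the sample counter values are hypothetical, and on a live client the numbers would be read from the `calls` and `retrans` fields of `nfsstat -rc` output:

```shell
# Hypothetical helper: percentage of RPC calls that were retransmitted.
retrans_rate() {
    # $1 = total RPC calls, $2 = retransmissions
    awk -v c="$1" -v r="$2" 'BEGIN { printf "%.2f\n", (c ? 100 * r / c : 0) }'
}

# Example with made-up counter values:
retrans_rate 51423 87    # prints 0.17
```

A rate persistently above a few percent usually points at network loss or an overloaded server rather than a client-side tuning issue.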
Answer
(rx2600)
4. Now, start up the biod daemons, and retry timing the copy. Record the results:

   # /usr/sbin/biod 4
   # timex cp /stand/vmunix /jfs

   Record results:  Real: _____________  User: ____________  Sys: ____________

Answer

Varies with configuration; the test data shows marked improvement. The biods provide the write-behind service, which reduces the wait time experienced by the cp command.

(rp2430)
(rx2600)
5. Change the mount options to version 3 and retime the transfer:

   # cd /
   # umount /vxfs
   # mount -o vers=3 server_hostname:/vxfs /vxfs
   # cd /
   # timex cp /stand/vmunix /vxfs
Record results:
Real: _____________  User: ____________  Sys: ____________

Answer

Interesting: it would appear that version 3 mounting is far better than version 2. The results were obtained using the same 4 biods started in question 4.

# timex cp /stand/vmunix /vxfs
real    2.63
user    0.00
sys     0.18

# timex cp /stand/vmunix /vxfs
real    4.13
user    0.00
sys     0.13
(rp2430)
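Part of the version 3 gain comes from safe asynchronous writes; version 3 also allows larger transfer sizes than version 2's 8 KB limit. A sketch only, not from the lab, reusing the lab's server_hostname placeholder; support for these sizes depends on both client and server:

```shell
# Request 32 KB read/write transfers explicitly with an NFSv3 mount.
umount /vxfs
mount -o vers=3,rsize=32768,wsize=32768 server_hostname:/vxfs /vxfs

# Verify the options actually in effect:
mount -v | grep /vxfs
```

Fewer, larger RPCs generally mean less per-call overhead for sequential transfers like the vmunix copy.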
(rx2600)
6. Compare the speed of FTP to NFS. Transfer the file to the server using the ftp utility.

   # ftp server_hostname
   # put /stand/vmunix /vxfs/vmunix.ftp

   How long did the FTP transfer take? _________
   Explain the difference in performance.

Answer

The data below shows that ftp is well optimized for data transfer. The good news is that NFS version 3 keeps up with it; remember that at 11i, NFS uses TCP/IP rather than UDP/IP.
(rp2430)
# ftp r265c69
Connected to r265c69.cup.edunet.hp.com.
220 r265c69.cup.edunet.hp.com FTP server (Version 1.1.214.4(PHNE_23950) Tue May 22 05:49:01 GMT 2001) ready.
Name (r265c69:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
27573440 bytes sent in 2.55 seconds (10554.31 Kbytes/s)
ftp>

(rx2600)
# ftp r265c145
Connected to r265c145.
220 r265c145.cup.edunet.hp.com FTP server (Revision 1.1 Version wuftpd-2.6.1 Tue Jul 15 07:42:07 GMT 2003) ready.
Name (r265c145:root):
331 Password required for root.
Password:
230 User root logged in.
Remote system type is UNIX.
Using binary mode to transfer files.
ftp> put /stand/vmunix /vxfs/vmunix.ftp
200 PORT command successful.
150 Opening BINARY mode data connection for /vxfs/vmunix.ftp.
226 Transfer complete.
47716848 bytes sent in 4.03 seconds (11557.24 Kbytes/s)
ftp>
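The Kbytes/s figure ftp prints is just bytes divided by elapsed seconds and by 1024. The check below reproduces it from the rx2600 transfer with awk; the small difference from ftp's 11557.24 comes from ftp using a less rounded elapsed time internally:

```shell
# Recompute the transfer rate ftp reported for the rx2600 run:
# 47716848 bytes in 4.03 seconds.
awk 'BEGIN { printf "%.1f Kbytes/s\n", 47716848 / 4.03 / 1024 }'
# prints 11562.9 Kbytes/s
```
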
7. Test the potential performance benefit of turning off the new TCP feature of HP-UX 11i. First, mount the file system with the UDP protocol rather than the default TCP.

   # umount /vxfs
   # mount -o vers=3 -o proto=udp server_hostname:/vxfs /vxfs

   Perform the copy test again and compare the results with the TCP version 3 mount data in part 5. Is UDP quicker than TCP?

   # timex cp /stand/vmunix /vxfs
Answer

# timex cp /stand/vmunix /vxfs
real    2.44
user    0.00
sys     0.15
(rx2600) (rp2430)
It would appear that UDP is marginally quicker than TCP, but the difference is very small and probably not worth the risk. On HP-UX 11i, NFS version 3 over TCP provides good performance and reliability.