Oracle Real Application Clusters on Linux on IBM System z: Set Up and Network Performance Tuning
The paper also includes introductions to topics such as Linux on IBM System z, LPAR technology, and large disk storage units, so that the content can be useful to someone who is familiar with Oracle on other platforms and wants to know about the product on Linux on IBM System z.
This study describes the system setup by stepping through the installation process that was followed, emphasizing features that are specific to IBM System z.
Summary
This paper is about installing and running Oracle RAC on Linux on IBM System z, with emphasis on performance-related topics.
This study covers Oracle RAC used in workload-balanced mode, which means that client requests are sent equally to both nodes. The resulting lock contention is intentional for this test. The system setup is presented in detail. Several tests were conducted, varying different parameters, to determine optimal values and performance trade-offs, with the focus on how the contention inside the cluster could be mitigated by the setup without changing the workload.
The amount of network traffic through the Interconnect between both nodes is an important
indicator of the communication requirements needed to keep the caches consistent and
coordinate access conflicts. The test setup ran with a high cache hit ratio, which led to very low disk I/O throughput values. This was important to ensure that the results were not influenced by disk I/O waits. The synchronization of the shared cache is managed on the
database block level, meaning that even the smallest update statement causes a full block
exchange over the interconnect when the data is accessed from both nodes. This exchange
leads to much higher network traffic over the Interconnect than for client communications.
Comparing OCFS2 with Oracle Automatic Storage Management (ASM) showed that ASM can be recommended as a cluster file system, providing better or at least comparable throughput at lower CPU utilization.
For an Oracle RAC in workload-balanced mode, the smallest possible block size is recommended, at least for tables that are updated from both nodes. Reducing the database
block size from 8 KB to 4 KB leads, as expected, to a decrease in the Interconnect traffic of
approximately one-half. The transaction throughput increased by 4%, while the cluster wait events were reduced by 4%. This block size reduction is also correlated with reduced lock
contention. For example, locks for two different 4 KB blocks in the same 8 KB block are now
two independent locks, making the sending of the block unnecessary.
An important tuning improvement for the test environment was the change from shared
server processes to dedicated server processes. With that change, the same system was able to
drive so many more transactions that adding more CPU capacity finally led to an improvement of a factor of three. This effect is attributable to our workload having a moderate number of clients (up to 80), where each client is 100% active, generating a continuous workload that easily keeps one Oracle server process busy. If several of these workload
generating clients are sharing a server process, this server process becomes a bottleneck. For
this type of workload pattern, the use of dedicated server processes is recommended. This
workload pattern is typically produced by a batch job, an application server, or a connection
concentrator.
The opposite scenario occurs with connections that have a low utilization, such as those from manual interactions, combined with a very high number (1000 or more) of users. Here, the use of dedicated server processes results in thousands of under-utilized server processes, with their corresponding memory requirements. In the test case, the memory usage increased when scaling the number of users up to 40; beyond that, the total memory usage leveled off.
When scaling the number of users, the behavior of the system is determined by two types of lock contention: global, between the two nodes, and local, inside each node. The local lock contention becomes the dominant factor when the number of users increases. The lock contention is a bottleneck that needs to be resolved if user activity is to be increased. In our case, the cause of lock contention is updates being made to a small shared table. Lock contention occurs when either the two nodes in the cluster or two users inside one node try to update the same data. The local contention is based on row locking, while the cluster contention locks the whole data block.
For tuning the network setup, it seems that the typical network tuning recommendations should be applied carefully to Oracle RAC running in workload-balanced mode. Tuning actions that increase the network throughput might also increase the cluster contention, which could lead to performance degradation. For example, increasing the buffer counts or the MTU sizes did not improve the throughput in the test scenarios. The best results were obtained with the buffer count at its default (16 buffers). For the MTU size, it seems that Oracle RAC and the Oracle client use a maximum packet size of 1330 bytes for all TCP/IP connections; therefore, the recommendation is to use an MTU size of 1492 for the LAN connection and 8192 for the interconnect.
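As an illustration only, these MTU values could be applied with the ifconfig command; the interface names eth0 (LAN) and hsi0 (HiperSockets interconnect) are assumptions and depend on the actual setup:
ifconfig eth0 mtu 1492    # public LAN connection
ifconfig hsi0 mtu 8192    # HiperSockets interconnect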
The device queue length was found to be a very important tuning parameter. Increasing it to 2000 (from the default of 1000) caused an improvement of approximately 40%, which is a very high value. However, the observation that too large a value results in degradation indicates the need for careful use of this parameter. Increasing this parameter is best done in combination with monitoring of the network or transaction throughput.
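A minimal sketch of such an adjustment, assuming an OSA interface named eth1 carries the traffic in question (the real interface name depends on the setup):
ifconfig eth1 txqueuelen 2000    # raise the device queue length from the default of 1000
netstat -i                       # watch the interface counters while evaluating the change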
For the Interconnect network device type, HiperSockets are the best choice for the Oracle
RAC interconnect, when the server nodes are on the same physical IBM System z machine.
Because it is not recommended to run all Oracle RAC nodes on one physical machine, the
second choice would be to use a 10 Gb OSA card. The use of a 10 Gb OSA connection does
not necessarily require a full 10 Gb infrastructure with switches and so forth. The test
scenarios used a direct coupling of two OSA ports with a standard cable. This provides a
connection in point-to-point mode, which additionally reserves the full bandwidth for the
interconnect and avoids any interaction with other network workloads, ensuring a constant
response time. With a 1 Gb OSA connection for the interconnect, a new wait event, 'cr request retry', occurred, indicating that the 1 Gb OSA interconnect is not sufficient for the workload processed.
The database servers ran on IBM System z hardware. The storage server was an IBM System Storage DS8000; its 16 ranks are spread across the logical configuration of the control unit to avoid congestion. The storage server is connected to the IBM System z with eight FICON paths.
Client hardware
The client system was an IBM System x server, which was used to drive the workload against the database servers on IBM System z.
When IBM System z became a Linux platform, there were early development versions of Oracle RDBMS with Oracle Real Application Clusters for 31-bit Linux. Oracle RAC 10g Release 2 later became available for Linux on IBM eServer zSeries.
Note: Although you can download Oracle RAC and its patches and associated software at any
time, do not install these products until after you have properly prepared the target Linux
systems. See Preparing Linux for the installation of Oracle Real Application Clusters.
Oracle HOME directories and multiple Oracle databases
The test system was built with all the Oracle components (ASM, CRS, and RDBMS) at the
same release level. However, the Oracle RAC model can be built with additional databases at
different release levels on the same server, as long as the clusterware (CRS) on any server is at
the same or a later release level than the level of any of the other components, such as
RDBMS or ASM. The additional databases can be single instance or in a cluster. Each
database can be upgraded individually, while support continues for databases at an earlier
level.
Operating in an Oracle environment with many possibilities is accomplished by maintaining sets of environment variables such as $HOME, which would have a related $SID to identify a particular database, and $PATH, $LIBPATH, and other environment variables as needed. As an example, $HOME_CRS is set to /product/crs, which is the location in this installation of the Oracle RAC 10.2.0.4 CRS binaries, CRS libraries, CRS logs, and CRS parameters.
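A minimal sketch of such a profile fragment, assuming a database SID of rac1 (only the /product/crs location is taken from this installation; on Linux the library path variable is LD_LIBRARY_PATH):
export HOME_CRS=/product/crs
export ORACLE_SID=rac1                                  # example SID only
export PATH=$HOME_CRS/bin:$PATH
export LD_LIBRARY_PATH=$HOME_CRS/lib:$LD_LIBRARY_PATH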
The database administrator's files, such as scripts and parameters for creating and
maintaining databases, are located outside of the Oracle HOMEs. The actual data is located in
the shared storage environment such as ASM.
In an Oracle clustered environment, each database can be upgraded and expanded
individually while support continues for databases at earlier releases.
Oracle server processes - shared versus dedicated
After the clustered database is installed, one of the decisions to make is whether to use
dedicated server processes or shared server processes. Dedicated server processes establish a one-to-one relationship between a client connection and a server process. With the other choice, shared server processes, requests are placed on a queue by dispatchers and handled by any available server process.
The efficiency of each approach is related to the type of activity that takes place. With the
dedicated server setup, a client request to the database connects to a server process and holds
it until the request is completed. In this study, this approach provided very good throughput results. This configuration might use a large amount of memory as the number of client connections increases, because each server process requires a certain amount of memory. The shared server setup reuses the same server process for multiple client processes, which has the opposite trade-off: reduced memory usage with possibly lower throughput.
When the workload is determined by many (1000 or more) connected users who spend
significant time between actions, such as online shoppers, the shared server process setup
might be more efficient. For workloads with a concentrated high utilization on each connection, for example when the Oracle Database is the backend of an application server or sits behind a connection concentrator, dedicated server processes might be more appropriate.
Details on how to configure shared or dedicated servers are in the section Setting up the
listeners.
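For orientation, a hedged client-side sketch: a tnsnames.ora entry can request either mode through the SERVER attribute. The service name below is an example only; the VIP host names are those from the /etc/hosts example later in this paper.
# SERVER = DEDICATED requests a dedicated server process;
# SERVER = SHARED requests a shared server process through a dispatcher.
RACDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = db-node1vip)(PORT = 1521))
    (ADDRESS = (PROTOCOL = TCP)(HOST = db-node2vip)(PORT = 1521))
    (LOAD_BALANCE = yes)
    (CONNECT_DATA =
      (SERVICE_NAME = racdb)
      (SERVER = DEDICATED)
    )
  )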
Installation and configuration
The installation and configuration of an Oracle Real Application Clusters system entails first
preparing the target Linux systems, and then performing the installation tasks.
Preparing Linux for the installation of Oracle Real Application Clusters
Linux preparation before the installation of Oracle Real Application Clusters consists of
preparing users, specifying various parameters and settings, setting up the network, setting up
the file system, and performing some disk tasks.
To prepare Linux for Oracle RAC installation, complete these tasks in the order that they are
listed.
Creating users and authentication parameters
To set up users and authentication for Oracle RAC, complete the following steps:
1. Log in with root authority.
2. Create the user named oracle on all of the nodes.
3. Create the user group named oinstall on all of the nodes.
4. Use an editor such as vi to add these lines to the body of the /etc/security/limits.conf file, to increase the limits for open files and processes:
oracle hard nofile 65536
oracle soft nproc 2047
oracle hard nproc 16384
# End of file
5. Check these four files for the following line, and add it if the line is not already present:
session required pam_limits.so
/etc/pam.d/sshd
/etc/pam.d/login
/etc/pam.d/su
/etc/pam.d/xdm
Using ulimit for shell settings
Set shell limits by editing the file with suffix .profile for the user oracle. Insert the following lines in the file:
ulimit -n 65536
ulimit -u 16384
export OPATCH_PLATFORM_ID=211
The environment variable OPATCH_PLATFORM_ID indicates Linux on System z.
Setting Linux kernel parameters
Check the Linux kernel parameters in the /etc/sysctl.conf file. The values of some of the parameters must be increased if they are not equal to or greater than these values:
kernel.sem = 250 32000 100 128
fs.file-max = 65536
net.ipv4.ip_local_port_range = 1024 65000
net.core.rmem_default = 1048576
net.core.rmem_max = 1048576
net.core.wmem_default = 262144
net.core.wmem_max = 262144
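After editing /etc/sysctl.conf, the new values can be activated without a reboot, for example:
sysctl -p                    # reload the settings from /etc/sysctl.conf
sysctl net.core.rmem_max     # verify an individual value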
Setting up ssh user equivalence
Oracle uses the ssh protocol for issuing remote commands. Oracle uses scp to perform the
installation. Therefore, it is necessary for the users oracle and root to have user equivalence,
or the ability to use ssh to move from one node to another without authenticating with
passwords or passphrases. Set up ssh user equivalence between each of the interfaces on all of
the nodes in the cluster with public and private authentication key pairs that are generated
with the Linux command ssh-keygen. Instructions on how to use ssh-keygen to create public and
private security keys are available at:
http://www.novell.com/documentation/sles10/sles_admin/?page=/documentation/sles10/sles
_admin/data/sec_ssh_authentic.html
When the ssh-keygen command issues a prompt to enter a passphrase, just press Enter so that no passphrase is required.
For example, an installation with two Oracle RAC servers will have a total of twenty-four key pairs (three interfaces: public, vip, and interconnect; two users: root and oracle; and the connections node1 to node1, node1 to node2, and the reverse).
Note: After the user equivalence has been set up, try to use each of the new authorizations
one time before proceeding with the installation. This step is necessary because the first time
that the new authorizations are used, they generate an interactive message that the Oracle
Universal Installer cannot properly handle.
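A sketch of the key setup for one user and one direction; the host name is taken from the /etc/hosts example later in this paper, and the steps must be repeated for the users root and oracle, for every interface name, and in both directions:
ssh-keygen -t rsa                                                      # press Enter at the passphrase prompts
cat ~/.ssh/id_rsa.pub | ssh db-node2 'cat >> ~/.ssh/authorized_keys'   # assumes ~/.ssh already exists on the target node
ssh db-node2 date                                                      # use the new authorization once before installing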
Setting up the network for Oracle RAC
Each server node in the cluster needs two physical connections and three IP interfaces, which
must be set up before beginning the installation. The section Network device type for the Oracle interconnect explained the need to use the device that is fastest and can handle the most throughput and traffic for the private interconnect between server nodes in the cluster. For this reason, the study used HiperSockets on IBM System z.
For the public access, a 1 Gb Ethernet was used on each server, which was configured to have
two interfaces, the second one being an alias.
For example, on the first server (node1) the interface for a particular OSA device port was set up as an Ethernet connection with a hardware definition in /etc/sysconfig/network, where a file named ifcfg-qeth-bus-ccw-0.0.07c0 (07C0 is the device address of an OSA port) contains the following lines:
NAME='IBM OSA Express Network card (0.0.07c0)'
IPADDR='10.10.10.200'
NETMASK='255.255.255.0'
Using the IP address 10.10.10.200 with a netmask of 255.255.255.0 leaves the hardware device available to any other interface with an address of the form 10.10.10.xxx; the clusterware startup scripts create an alias for the public interface when the node starts CRS.
This environment does not use a DNS server, so the alias interface name must be included in the file /etc/hosts, where the two host addresses and names looked like this:
ff02::3 ipv6-allhosts
10.10.10.200 rac-node1 rac-node1.pdl.pok.ibm.com
10.10.10.202 rac-node1vip rac-node1vip.pdl.pok.ibm.com
In an Oracle RAC system, the string VIP in the interface name stands for virtual IP and identifies its role. Having two interfaces for the same Ethernet connection supports immediate failover within the cluster. If a node is not responding, the VIP address is attached to another node in the cluster, faster than the time it would take for a hardware timeout to be recognized and processed.
The larger section of the file /etc/hosts below shows how the aliases were used for the two nodes; all of the entries are present in the file, and the file is identical on the other node:
10.10.10.200 db-node1 db-node1.pdl.pok.ibm.com
10.10.10.201 db-node2 db-node2.pdl.pok.ibm.com
10.10.10.202 db-node1vip db-node1vip.pdl.pok.ibm.com
10.10.10.203 db-node2vip db-node2vip.pdl.pok.ibm.com
10.10.50.200 db-node1priv db-node1priv.pdl.pok.ibm.com
10.10.50.201 db-node2priv db-node2priv.pdl.pok.ibm.com
The Linux command ifconfig displays interface net0 and its alias named net0:1:
net0      Link encap:Ethernet  HWaddr 00:14:5E:78:1D:14
          inet addr:10.10.10.200  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::14:5e00:578:1d14/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1492  Metric:1
          RX packets:12723074 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13030111 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:2164181117 (2063.9 Mb)  TX bytes:5136519940 (4898.5 Mb)

net0:1    Link encap:Ethernet  HWaddr 00:14:5E:78:1D:14
          inet addr:10.10.10.203  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1492  Metric:1
When Oracle RAC uses the public interface, it selects the interface with the number zero after the prefix. By default this interface would be eth0. If the public interface must be forced to use number 0, use udev to name the interfaces. Naming the interfaces can be done by modifying or creating the /etc/udev/rules.d/30-net_persistent_names.rules file so that it contains a line like this example, where 07C0 is the device and net is a prefix that was chosen in order to be different from eth:
SUBSYSTEM=="net", ACTION=="add", ENV{PHYSDEVPATH}=="*0.0.07c0", IMPORT="/lib/udev/rename_netiface %k net0"
A separate physical connection becomes the private interconnect used for Oracle cache fusion, where the nodes in the cluster exchange cache contents and messages in order to maintain a unified cache for the cluster. This study used IBM System z HiperSockets in the first installation, and named the interface db-node1priv, since the Oracle convention is to call the interconnect the private connection.
The choice of connectivity type for the private interconnect is an important decision for a new installation; the objectives are high speed and the ability to handle the transfer of large amounts of data.
The interconnect addresses for both nodes are shown in the /etc/hosts example above. The first node (node1) uses IP address 10.10.50.200 and host name db-node1priv. The second node (node2) uses IP address 10.10.50.201 and host name db-node2priv.
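Basic reachability of the interconnect can then be verified from node1, for example:
ping -c 3 db-node2priv    # private interconnect address of the second node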
Using a 10.x.x.x network for the external (public) interfaces
The setup for the study required circumventing an Oracle RAC requirement that the external RAC IP address and its alias be IP addresses from the range of public IP addresses. The setup needed to use an IP address of the form 10.10.x.x, which is classified as an internal range.
Oracle RAC would not work until a change was made in $HOME_CRS/bin/racgvip. To do the same thing on your system, either set the variable DEFAULTGW to an IP address that can always be pinged successfully (it will never be used; Oracle only checks whether it is available), or else search for FAIL_WHEN_DEFAULTGW_NOT_FOUND and change it to a value of 0.
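A sketch of the two alternatives inside racgvip; the gateway address shown is hypothetical and should be replaced by any address that is always reachable:
# in $HOME_CRS/bin/racgvip, either point DEFAULTGW at an always-pingable address ...
DEFAULTGW=10.10.10.1
# ... or disable the gateway check entirely
FAIL_WHEN_DEFAULTGW_NOT_FOUND=0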
Using udev when preparing to install Automatic Storage Manager
To set up shared storage for Oracle RAC on IBM System z, some of the Linux attributes of the DASD storage devices must be modified. In SLES, configuration files located in /etc/udev/rules.d are read by Linux shortly after the kernel is loaded, before Linux has created the file structures for disk storage and before Linux assigns file names and attributes to all the known disk devices and partitions.
The udev command can be used to change ownership of the block devices that are used for the OCR and Voting Disks. It is also necessary to alter the attributes of the block devices that will be given to ASM to manage as shared storage for data. Shared DASD used for data and managed by ASM must be assigned the owner oracle and the group named dba.
For this study, a new file was created and given a high number (98) in the file name, so that it
would be read last and the setup changes would not be overwritten by other startup processes.
The udev rule syntax differs even between SLES 10 SP1 and SLES 10 SP2, so it may be necessary to check the man pages or documentation for the udev command on your system, to ensure that it works as expected.
This is a sample of the udev file 98-oracle.permissions.rules:
# for partitions import parent information
KERNEL=="*[0-9]", IMPORT{parent}=="ID_*"
# OCR disks
KERNEL=="dasdf1", OWNER="oracle", GROUP="oinstall", MODE="0660"
KERNEL=="dasdp1", OWNER="oracle", GROUP="oinstall", MODE="0660"
# VOTING DISKS
KERNEL=="dasdg1", OWNER="oracle", GROUP="oinstall", MODE="0660"
KERNEL=="dasdq1", OWNER="oracle", GROUP="oinstall", MODE="0660"
# ASM
KERNEL=="dasdh1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="dasdi1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="dasdj1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="dasdk1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="dasdm1", OWNER="oracle", GROUP="dba", MODE="0660"
KERNEL=="dasdn1", OWNER="oracle", GROUP="dba", MODE="0660"
To make the changes take effect immediately, run this command:
/etc/init.d/boot.udev restart
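A quick check that the rules took effect is to list one or two of the affected devices; based on the sample file above, the output should show the owners oracle:oinstall and oracle:dba respectively:
ls -l /dev/dasdf1 /dev/dasdh1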
Setting up persistent names for disk devices
Linux assigns names to all devices that it discovers at startup, in the order in which it
discovers them, assigning the device names starting with the name dasda (or sda for SCSI)
and continuing using that pattern. Even with a small number of disks used from a SAN, the
order can change from one Linux startup to the next. For example, if one disk in the
sequence becomes unavailable, then all the disks that follow it will shift to a different name in
the series. The naming order might change in a way that affects the individual nodes
differently, which makes the management of the disks complicated and error-prone.
Producing device names that are the same on different Linux systems and that persist across reboots requires the use of unambiguous Linux device names such as /dev/disk/by-path, /dev/disk/by-id, or /dev/disk/by-uuid. The problem is that those types of names did not fit into the spaces provided in the ASM GUI installer for that purpose. It is possible to use these names with the silent install method, which runs a script and uses a response file to complete the installation. The problem with the silent install approach when doing the first installation is that there is no interactive error checking, so if any of the input is unacceptable there is no way to remove a failed installation.
This study employed a workaround for this issue, which was to use the file /etc/zipl.conf to set the disks in the same order of discovery at startup using the dasd= parameter. With the order controlled this way, it is possible to use the names of the
partitioned files with confidence that the naming is consistent among the nodes and will not
change with a reboot.
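A sketch of the relevant zipl.conf fragment; the device numbers shown are examples only, and the real list depends on the installation:
# /etc/zipl.conf (excerpt)
[ipl]
    target = /boot/zipl
    image = /boot/image
    ramdisk = /boot/initrd
    parameters = "root=/dev/dasda1 dasd=7500,7501,7600-7603"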
When the installation was based on Linux as a guest on z/VM