RAC System Test Plan Outline 11gr2 v2 4
Purpose
Before a new computer/cluster system is deployed in production, it is important to test the system thoroughly to validate
that it will perform at a satisfactory level relative to its service level objectives. Testing is also required when
introducing major or minor changes to the system. This document provides an outline consisting of basic guidelines and
recommendations for testing a new RAC system. This test plan outline can be used as a framework for building a
system test plan specific to each company’s RAC implementation and their associated service level objectives.
In addition to the specific system testing outlined in this document, additional testing needs to be defined and executed
for RMAN, backup and recovery, and Data Guard (for disaster recovery). Each component area of testing also requires
specific operational procedures to be documented and maintained to address site-specific requirements.
Testing Objectives
In addition to application functionality testing, overall system testing is normally performed for one or more of the
following reasons:
• Verify that the system has been installed and configured correctly. Check that nothing is broken. Establish a
baseline of functional behavior so that we can answer the question down the road: ‘has this ever worked in this
environment?’
• Verify that basic functionality still works in a specific environment and for a specific workload. Vendors normally
test their products very thoroughly, but it is not possible to test all possible hardware/software combinations and
unique workloads.
• Make sure that the system will achieve its objectives, in particular, availability and performance objectives. This can
be very complex and normally requires some form of simulated production environment and workload.
• Test operational procedures. This includes normal operational procedures and recovery procedures.
• Train operations staff.
Generating a realistic application workload can be complex and expensive but it is the most important factor for effective
testing. For each individual test in the plan, a clear understanding of the following is required:
• What is the objective of the test and how does this relate to the overall system objectives?
• Exactly how will the test be performed and what are the execution steps?
• What are the success/failure criteria, and what are the expected results?
• How will the test result be measured?
• Which tools will be used?
• Which logfiles and other data will be collected?
• Which operational procedures are relevant?
• What are the expected results of the application for each of the defined tests (TAF, FCF, RCLB)?
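Where it helps, each test can be driven through a small wrapper script so that timings and logs are captured consistently and the questions above are answered in the same way for every run. The sketch below is illustrative only; the script name, log locations and the placeholder command are assumptions to be replaced by the site-specific procedure for each test.

#!/bin/bash
# run_test.sh - hypothetical wrapper: time a test step and collect the relevant logs
# Usage: ./run_test.sh "<test name>" "<command that performs the test step>"
TEST_NAME="$1"
TEST_CMD="$2"
LOG_DIR=/tmp/system_test/$(date +%Y%m%d_%H%M%S)     # assumed collection area
mkdir -p "$LOG_DIR"

echo "Starting test: $TEST_NAME at $(date)" | tee "$LOG_DIR/test.log"
START=$(date +%s)

# Execute the documented test step and capture its output
eval "$TEST_CMD" >> "$LOG_DIR/test.log" 2>&1

END=$(date +%s)
echo "Completed test: $TEST_NAME in $((END - START)) seconds" | tee -a "$LOG_DIR/test.log"

# Collect the Clusterware state and alert log called out in the test plan
crsctl stat res -t > "$LOG_DIR/crs_resources.txt" 2>&1
cp "$GI_HOME/log/$(hostname -s)/alert$(hostname -s).log" "$LOG_DIR/" 2>/dev/null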
This list only covers testing for RAC-related components and procedures. Additional tests are required for other parts of
the system. These tests should be performed with a realistic workload on the system. Procedures for detecting and
recovering from these failures must also be tested.
In some worst-case scenarios it might not be possible to recover the system within an acceptable time frame and a
disaster recovery plan should specify how to switch to an alternative system or location. This should also be tested.
The result of a test should initially be measured at a business or user level to see if the result is within the service level
agreement. If a test fails it will be necessary to gather and analyze the relevant log and trace files. The analysis can result
in system tuning, changing the system architecture or possibly reporting component problems to the appropriate vendor.
Also, if the system objectives turn out to be unrealistic, they might have to be changed.
Test 4: Reboot all nodes at the same time
Procedure:
• Issue a reboot on all nodes at the same time
  o For AIX, HPUX, Windows: ‘shutdown -r’
  o For Linux: ‘shutdown -r now’
  o For Solaris: ‘reboot’
Expected Results:
• All nodes, instances and resources are restarted without problems.
Measures:
• Time for all resources to become available again. Check with “crsctl stat res -t”.
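To attach a number to the "time for all resources to become available" measure, a polling loop along the following lines can be started once the nodes begin to boot. This is a sketch; the 10-second polling interval is an arbitrary choice.

#!/bin/bash
# Poll the cluster until no resource reports OFFLINE, then print the elapsed time.
START=$(date +%s)
while true; do
    STATUS=$(crsctl stat res -t 2>/dev/null)
    # Wait until crsctl responds and no resource is still OFFLINE
    if [ -n "$STATUS" ] && ! echo "$STATUS" | grep -q OFFLINE; then
        echo "All resources ONLINE after $(( $(date +%s) - START )) seconds"
        break
    fi
    sleep 10
done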
Test 11: SCAN Listener Failure
Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the SCAN listener process:
  # ps -ef | grep tnslsnr
  Kill the listener process:
  # kill -9 <listener pid>
• For Windows:
  Use Process Explorer to identify the tnslistener.exe process for the SCAN listener. This will be the tnslistener.exe registered to the “<home name>TNSListenerLISTENER_SCAN<n>” service (not the “<home name>TNSListener” service). Once the proper tnslistener.exe is identified, kill the process by right-clicking the executable and choosing “Kill Process”.
Expected Results:
• No impact on connected database sessions.
• New connections are redirected to a listener on another node (depends on client configuration).
• The listener failure is detected by the CRSD ORAAGENT and the listener is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log
Measures:
• Same as Listener Failure.
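For the Unix/Linux case, the kill-and-verify sequence can be made more precise by filtering for the SCAN listener name and confirming the restart with srvctl. This is a sketch; the listener name LISTENER_SCAN1 below is only an example.

# Identify the SCAN listener process (its command line contains LISTENER_SCAN<n>)
ps -ef | grep tnslsnr | grep -i scan

# Kill the chosen SCAN listener process
kill -9 <scan listener pid>

# Verify that the agent has restarted the SCAN listener and where it is now running
srvctl status scan_listener
lsnrctl status LISTENER_SCAN1      # example listener name; use the one that was killed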
Test 17: Interconnect Switch Failure (Redundant Switch Configuration)
Procedure:
• In a redundant network switch configuration, power off one switch.
Expected Results:
• Network traffic should fail over to the other switch without any impact on interconnect traffic or instances.
Measures:
• Time to fail over to the other NIC card. With bonding/teaming or 11.2 Redundant Interconnect configured, this should be less than 100ms.
Test 20: Node Loses Access to Single Path of Disk Subsystem (OCR, Voting Device, Database files)
Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem.
Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency.
• No impact to database instances.
Measures:
• Monitor database status under load to ensure no service interruption occurs.
• Path failover should be visible in the OS logfiles.
Test 21: ASM Disk Lost
Procedure:
• Assuming ASM normal redundancy, power off / pull out / offline (depending on configuration) one ASM disk.
Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view the ASM alert logs).
Measures:
• Monitor progress: select * from v$asm_operation
Test 22: ASM Disk Repaired
Procedure:
• Power on / insert / online the ASM disk.
Expected Results:
• No impact on database instances.
• ASM starts rebalancing (view the ASM alert logs).
Measures:
• Monitor progress: select * from v$asm_operation
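For Tests 21 and 22, the rebalance can be watched until it completes with a small polling loop such as the sketch below. The instance name +ASM1 and the 30-second interval are assumptions for a typical first node, and the Grid Infrastructure environment (ORACLE_HOME, PATH) is assumed to be set.

#!/bin/bash
# Poll v$asm_operation on the local ASM instance while the rebalance runs.
# Interrupt with Ctrl-C once the query stops returning rows.
export ORACLE_SID=+ASM1            # assumed ASM instance name on this node
while :; do
    sqlplus -s / as sysasm <<'EOF'
set lines 120
select group_number, operation, state, power, sofar, est_work, est_minutes
  from v$asm_operation;
EOF
    sleep 30
done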
Test 24: Lose and Recover one copy of OCR
Procedure:
1. Remove access to one copy of the OCR or force a dismount of the ASM diskgroup (asmcmd umount <dg_name> -f).
2. Replace the disk or remount the diskgroup; ocrcheck will report the OCR to be out of sync.
3. Delete the corrupt OCR (ocrconfig -delete +<diskgroup>) and re-add the OCR (ocrconfig -add +<diskgroup>). This avoids having to stop CRSD.
Expected Results:
• There will be no impact on the cluster operation. The loss of access and restoration of the missing/corrupt OCR will be reported in:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/alert<nodename>.log
Measures:
• There is no impact on the cluster operation.
• The OCR can be replaced online, without a cluster outage.
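Before and after this test it is worth capturing the OCR state with the standard Clusterware tools (run as root); this is a verification sketch rather than part of the procedure itself.

# Verify OCR integrity and its configured locations before and after the test
ocrcheck

# List the automatic and manual OCR backups that would be available for a restore
ocrconfig -showbackup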
Test 5: CRSD ORAAGENT (Grid Infrastructure) Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD oraagent for the GI software owner:
  # cat $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for GI oraagent process>
• For Windows:
  Use Process Explorer to identify the crsd oraagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing “Kill Process”.
Expected Results:
• The Grid Infrastructure ORAAGENT process failure is detected by CRSD and the process is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.log
Measures:
• Time to restart the ORAAGENT process.
Test 6: CRSD ORAROOTAGENT Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the CRSD orarootagent:
  # cat $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.pid
  # kill -9 <pid for orarootagent process>
• For Windows:
  Use Process Explorer to identify the crsd orarootagent.exe process that is a child process of crsd.exe (or obtain the PID for the crsd orarootagent.exe as shown in the Unix/Linux instructions above). Once the proper orarootagent.exe process is identified, kill the process by right-clicking the executable and choosing “Kill Process”.
Expected Results:
• The ORAROOTAGENT process failure is detected by CRSD and it is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/crsd/crsd.log
  o $GI_HOME/log/<nodename>/agent/crsd/orarootagent_root/orarootagent_root.log
Measures:
• Time to restart the ORAROOTAGENT process.
Test 7: OHASD ORAAGENT Process Failure
Procedure:
• For AIX, HPUX, Linux and Solaris:
  Obtain the PID for the OHASD oraagent:
  # cat $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
  # kill -9 <pid for oraagent process>
• For Windows:
  Use Process Explorer to identify the ohasd oraagent.exe process that is a child process of ohasd.exe (or obtain the PID for the ohasd oraagent.exe as shown in the Unix/Linux instructions above). Once the proper oraagent.exe process is identified, kill the process by right-clicking the executable and choosing “Kill Process”.
Expected Results:
• The ORAAGENT process failure is detected by OHASD and it is automatically restarted. Review the following logs:
  o $GI_HOME/log/<nodename>/ohasd/ohasd.log
  o $GI_HOME/log/<nodename>/agent/ohasd/oraagent_<GI_owner>/oraagent_<GI_owner>.log
Measures:
• Time to restart the ORAAGENT process.
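For Tests 5 through 7, the "time to restart" measure can be captured by watching the agent's .pid file, which the procedures above already use to locate the process. The sketch below is illustrative, uses the CRSD oraagent paths as an example, and assumes the agent rewrites its .pid file when it is respawned.

#!/bin/bash
# Kill an agent process and time how long CRSD/OHASD takes to respawn it.
PIDFILE=$GI_HOME/log/<nodename>/agent/crsd/oraagent_<GI_owner>/oraagent_<GI_owner>.pid
OLDPID=$(cat "$PIDFILE")
START=$(date +%s)
kill -9 "$OLDPID"

# Wait until a new PID appears in the agent's pid file
while [ "$(cat "$PIDFILE" 2>/dev/null)" = "$OLDPID" ] || [ -z "$(cat "$PIDFILE" 2>/dev/null)" ]; do
    sleep 1
done
echo "Agent restarted as PID $(cat "$PIDFILE") after $(( $(date +%s) - START )) seconds"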
Cluster Infrastructure
To simplify testing and problem diagnosis it is often very useful to do some basic testing on the cluster infrastructure
without Oracle software or a workload running. Normally this testing will be performed after installing the hardware and
operating system, but before installing any Oracle software. If problems are encountered during System Stress Test or
Destructive Testing, diagnosis and analysis can be facilitated by testing the cluster infrastructure separately. Typically,
a subset of the destructive tests described in this plan will be used for this purpose.
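One readily available way to exercise the cluster infrastructure on its own at this stage is the Cluster Verification Utility shipped on the Grid Infrastructure media. The node names below are placeholders, and the stages chosen will depend on the site.

# From the Grid Infrastructure installation media, before installing any Oracle software:
# Verify hardware and operating system setup on all nodes
./runcluvfy.sh stage -post hwos -n node1,node2 -verbose

# Verify that the nodes meet the Clusterware installation prerequisites
./runcluvfy.sh stage -pre crsinst -n node1,node2 -verbose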
Test 2: Create an external redundancy ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “create diskgroup <dg name> external redundancy disk '<candidate path>';”
Expected Results:
• A successfully created diskgroup. This diskgroup should also be listed in v$asm_diskgroup.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).
Test 3: Create a normal or high redundancy ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “create diskgroup <dg name> normal redundancy disk '<candidate1 path>', '<candidate2 path>';”
Expected Results:
• A successfully created diskgroup with normal redundancy and two failure groups. For high redundancy, three failure groups will be created.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).
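A quick way to verify the outcome of Tests 2 and 3 is sketched below; it queries v$asm_diskgroup from the ASM instance and lists the corresponding Clusterware resources. It assumes the ASM environment (ORACLE_SID, ORACLE_HOME) is already set.

# Verification sketch: confirm the new diskgroup is mounted and registered
sqlplus -s / as sysasm <<'EOF'
select name, type, total_mb, free_mb, state from v$asm_diskgroup;
EOF

# Diskgroup resources are registered as ora.<DG NAME>.dg
crsctl stat res -t | grep -i "\.dg"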
Test 4: Add a disk to an ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <dg name> add disk '<candidate1 path>';”
Expected Results:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup.
Test 5: Drop an ASM disk from a diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <dg name> drop disk <disk name>;”
  NOTE: Progress can be monitored by querying v$asm_operation.
Expected Results:
• The data from the removed disk will be rebalanced across the remaining disks in the diskgroup. Once the rebalance is complete the disk will have a header_status of “FORMER” (v$asm_disk) and will be a candidate to be added to another diskgroup.
Test 7: Drop an ASM diskgroup using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “drop diskgroup <dg name>;”
Expected Results:
• The diskgroup will be successfully dropped.
• The diskgroup will be unregistered as a Clusterware resource (crsctl stat res -t).
Test 8: Modify rebalance power of an active operation using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <dg name> add disk '<candidate1 path>';”
• Before the rebalance completes, run the following command via SQL*Plus:
  “alter diskgroup <dg name> rebalance power <1 - 11>;” (1 is the default rebalance power)
Expected Results:
• The rebalance power of the current operation will be increased to the specified value. This is visible in the v$asm_operation view.
Test 10: Check the internal consistency of diskgroup metadata using SQL*Plus
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <name> check all;”
Expected Results:
• If there are no internal inconsistencies, the statement “Diskgroup altered” will be returned (asmcmd will return back to the asmcmd prompt). If inconsistencies are discovered, then appropriate messages are displayed describing the problem.
Test 2: Create an external redundancy ASM diskgroup using ASMCMD
Procedure:
• Identify the candidate disks for the diskgroup by running:
  “lsdsk -candidate”
• Create an XML config file to define the diskgroup, e.g.:
  <dg name="<dg name>" redundancy="external">
    <dsk string="<disk path>" />
    <a name="compatible.asm" value="11.1"/>
    <a name="compatible.rdbms" value="11.1"/>
  </dg>
• Login to ASM via ASMCMD and run:
  “mkdg <config file>.xml”
Expected Results:
• A successfully created diskgroup. This diskgroup can be viewed using the “lsdg” ASMCMD command.
• The diskgroup will be registered as a Clusterware resource (crsctl stat res -t).
Test 4: Add a disk to an ASM diskgroup using ASMCMD
Procedure:
• Identify the candidate disk to be added by running:
  “lsdsk -candidate”
• Create an XML config file to define the diskgroup change, e.g.:
  <chdg name="<dg name>">
    <add>
      <dsk string="<disk path>"/>
    </add>
  </chdg>
• Login to ASM via ASMCMD and run:
  “chdg <config file>.xml”
Expected Results:
• The disk will be added to the diskgroup and the data will be rebalanced evenly across all disks in the diskgroup. Progress of the rebalance can be monitored by running the “lsop” ASMCMD command.
Test 6: Modify rebalance power of an active operation using ASMCMD
Procedure:
• Add a disk to a diskgroup (as shown above).
• Identify the rebalance operation by running “lsop” via ASMCMD.
• Before the rebalance completes, run the following command via ASMCMD:
  “rebal -power <1-11> <dg name>”
Expected Results:
• The rebalance power of the current operation will be increased to the specified value. This is visible with the lsop command.
Test 2: Apply an ASM template
Procedure:
• Use the template above and apply it to a new tablespace to be created on the database.
• Login to the database via SQL*Plus and run:
  “create tablespace test datafile '+<dg name>/my_files(unreliable)' size 10M;”
Expected Results:
• The datafile is created using the attributes of the ASM template.
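The 'unreliable' template referenced above is assumed to have been created in an earlier step that is not shown in this excerpt; a statement along the following lines, run against the ASM instance, would create such a template (the unprotected/fine attributes are an assumption).

# Sketch only: create the 'unreliable' template (unprotected, fine striping) in ASM
sqlplus -s / as sysasm <<'EOF'
alter diskgroup <dg name> add template unreliable attributes (unprotected fine);
EOF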
Test 3: Drop an ASM template
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <dg name> drop template unreliable;”
Expected Results:
• This template should be removed from v$asm_template.
Test 4: Create an ASM directory
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup <dg name> add directory '+<dg name>/my_files';”
Expected Results:
• You can use the asmcmd tool to check that the new directory name was created in the desired diskgroup.
• The created directory will have an entry in v$asm_directory.
Test 5: Create an ASM alias
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup DATA add alias '+DATA/my_files/datafile_alias' for '+<dg name>/<db name>/DATAFILE/<file name>';”
Expected Results:
• Verify that the alias exists in v$asm_alias.
Test 6: Drop an ASM alias
Procedure:
• Login to ASM via SQL*Plus and run:
  “alter diskgroup DATA drop alias '+<dg name>/my_files/datafile_alias';”
Expected Results:
• Verify that the alias does not exist in v$asm_alias.
Test 8: Drop an inactive database file within ASM
Procedure:
• Identify a datafile that is no longer used by a database.
• Login to ASM via SQL*Plus and run:
  “alter diskgroup data drop file '+<dg name>/<db name>/DATAFILE/<file name>';”
Expected Results:
• Observe that the file's entry in v$asm_file is now removed.
Test 5: Create a file on the ACFS filesystem
Procedure:
• Perform the following:
  “echo "Testing ACFS" > <mount point>/testfile”
• Perform a “cat” command on the file on all nodes in the cluster.
Expected Results:
• The file will exist on all nodes with the specified contents.
Test 6: Remove an ACFS filesystem from the ACFS mount registry
Procedure:
• Use acfsutil to remove the ACFS filesystem from the registry:
  “/sbin/acfsutil registry -d <volume device path>”
Expected Results:
• The filesystem will be unregistered from the ACFS registry. This can be validated by running “/sbin/acfsutil registry -l”.
• The filesystem will NOT be automounted on all nodes in the cluster on reboot.
Test 11: Delete a snapshot of an ACFS filesystem
Procedure:
• Use acfsutil to delete a previously created snapshot of an ACFS filesystem:
  “/sbin/acfsutil snap delete <name> <ACFS mount point>”
Expected Results:
• The specified snapshot will be deleted and will no longer appear under <ACFS mount point>/.ACFS/snaps.
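The snapshot removed in this test is assumed to have been created in an earlier step that is not shown in this excerpt, for example with:

# Sketch only: create a snapshot of an ACFS filesystem
/sbin/acfsutil snap create <name> <ACFS mount point>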
Test 2: Use dbms_file_transfer to copy files from ASM to a filesystem
Procedure:
• Use the dbms_file_transfer.put_file and get_file functions to copy database files (datafiles, archives, etc.) into and out of ASM.
  NOTE: This requires that a database directory object be pre-created and available for the source and destination directories. See the PL/SQL Guide for dbms_file_transfer details.
Expected Results:
• The put_file and get_file functions will copy files successfully to/from the filesystem. This provides an alternate option for migrating to ASM, or to simply copy files out of ASM.
Test 1: Create an OCFS2 filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS2.
• Create the appropriate partition table on the disk and use “partprobe” to rescan the partition tables.
• Create the OCFS2 filesystem by running:
  “/sbin/mkfs -t ocfs2 <device path>”
• Add the filesystem to /etc/fstab on all nodes.
• Mount the filesystem on all nodes.
Expected Results:
• The OCFS2 filesystem will be created.
• The OCFS2 filesystem will be mounted on all nodes.
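For reference, an /etc/fstab entry for an OCFS2 filesystem typically uses the _netdev mount option so that mounting waits for the network and the O2CB stack; the device and mount point below are placeholders only.

# Example /etc/fstab entry (device and mount point are placeholders)
/dev/sdb1    /u02/ocfs2    ocfs2    _netdev,defaults    0 0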
Test 2: Create a file on the OCFS2 filesystem
Procedure:
• Perform the following:
  “echo "Testing OCFS2" > <mount point>/testfile”
• Perform a “cat” command on the file on all nodes in the cluster.
Expected Results:
• The file will exist on all nodes with the specified contents.
Test 3: Verify that the OCFS2 filesystem is available after a system reboot
Procedure:
• Issue a “shutdown -r now”.
Expected Results:
• The OCFS2 filesystem will automatically mount and be accessible to all nodes after a reboot.
Test 4: Enable database archive logs to OCFS2
Procedure:
• Modify the database archive log settings to utilize OCFS2.
Expected Results:
• Archivelog files are created, and available to all nodes, on the specified OCFS2 filesystem.
Test 8: Validate OCFS2 functionality during disk/disk subsystem path failures
(NOTE: Only applicable in multipath storage environments.)
Procedure:
• Unplug the external storage cable connection (SCSI, FC or LAN cable) from the node to the disk subsystem.
Expected Results:
• If multi-pathing is enabled, the multi-pathing configuration should provide failure transparency.
• No impact to the OCFS2 filesystem.
• Path failover should be visible in the OS logfiles.
Test 9: Perform a FSCK of an OCFS2 filesystem
Procedure:
• Dismount the OCFS2 filesystem to be checked on ALL nodes.
• Execute fsck on the OCFS2 filesystem as follows:
  “/sbin/fsck -v -y -t ocfs2 <device path>”
  This command will automatically answer yes to any prompts (-y) and provide verbose output (-v).
Expected Results:
• FSCK will check the specified OCFS2 filesystem for errors.
Test 10: Check the OCFS2 cluster status
Procedure:
• Check the OCFS2 cluster status on all nodes by issuing “/etc/init.d/o2cb status”.
Expected Results:
• The output of the command will be similar to:
  Module "configfs": Loaded
  Filesystem "configfs": Mounted
  Module "ocfs2_nodemanager": Loaded
  Module "ocfs2_dlm": Loaded
  Module "ocfs2_dlmfs": Loaded
  Filesystem "ocfs2_dlmfs": Mounted
  Checking O2CB cluster ocfs2: Online
  Checking O2CB heartbeat: Active
Test 1: Create an OCFS filesystem
Procedure:
• Add a Disk/LUN to the RAC nodes and configure the Disk/LUN for use by OCFS.
• Create the appropriate partition table on the disk and validate that the disk and partition table are visible on ALL nodes (this can be achieved via diskpart).
• Assign a drive letter to the logical drive.
• Create the OCFS filesystem by running:
  cmd> %GI_HOME%\cfs\ocfsformat /m <drive_letter> /c <cluster size> /v <volume name> /f /a
Expected Results:
• The OCFS filesystem will be created.
• The OCFS filesystem will be mounted on all nodes.
Test 2: Create a file on the OCFS filesystem
Procedure:
• Use notepad to create a text file containing the text “TESTING OCFS” on an OCFS drive.
• Use notepad to validate that the file exists on all nodes.
Expected Results:
• The file will exist on all nodes with the specified contents.
Test 3: Verify that the OCFS filesystem is available after a system reboot
Procedure:
• Issue a “reboot”.
Expected Results:
• The OCFS filesystem will automatically mount and be accessible to all nodes after a reboot.
Test 4: Enable database archive logs to OCFS
Procedure:
• Modify the database archive log settings to utilize OCFS.
Expected Results:
• Archivelog files are created, and available to all nodes, on the specified OCFS filesystem.
Test 5: Create an RMAN backup on an OCFS filesystem
Procedure:
• Back up ASM based datafiles to the OCFS filesystem.
• Execute baseline recovery scenarios (full, point-in-time, datafile).
Expected Results:
• RMAN backupsets are created, and available to all nodes, on the specified OCFS filesystem.
• Recovery scenarios complete with no errors.
Test 6: Create a datapump export on an OCFS filesystem
Procedure:
• Using datapump, take an export of the database to an OCFS filesystem.
Expected Results:
• A full system export should be created without errors or warnings.
Test 7: Validate OCFS functionality during node failures
Procedure:
• Issue a “reboot” from a single node in the cluster.
Expected Results:
• The OCFS filesystem should remain available to the surviving nodes.
Test 8: Remove a drive letter and ensure that the letter is re-established for that partition
Procedure:
• Using Windows disk management, use the ‘Change Drive Letter and Paths …’ option to remove a drive letter associated with an OCFS partition.
Expected Results:
• OracleClusterVolumeService should restore the drive letter assignment within a short period of time.
Test 9: Run the ocfscollect tool
Procedure:
• OCFSCollect is available as an attachment to Note: 332872.1.
Expected Results:
• A .zap file (rename to .zip and extract). Can be used as a baseline regarding the health of the available OCFS drives.