HA cluster takeover takes too long on HANA indexserver failure

This document (000020845) is provided subject to the disclaimer at the end of this document.

Environment

SUSE Linux Enterprise Server for SAP Applications 15
SUSE Linux Enterprise Server for SAP Applications 12

Situation

With a regular configuration of a HANA database, the resource agent (RA) for HANA in a Linux cluster does not trigger a takeover to the secondary site when:

A software failure causes one or more HANA processes to be restarted in place by the HANA daemon (hdbdaemon).
A hardware error causes the HANA indexserver (hdbindexserver) to restart locally.

For large HANA databases, the local restart can take so long that the resulting service outage exceeds the acceptable downtime.

Resolution

The SAP HANA nameserver provides a Python-based API ("HA/DR providers").
The API method srServiceStateChanged() is called when HANA processes are failing, starting or stopping.
The SUSE hook script susChkSrv.py can be called on any srServiceStateChanged() event. It executes a predefined action on HANA. As soon as the HANA landscapeHostConfiguration status changes to 1, the Linux cluster takes action. The cluster action depends on the HANA system replication status and the RA's configuration parameters PREFER_SITE_TAKEOVER and AUTOMATED_REGISTER.
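To make the status value easier to read in scripts or logs, the following minimal sketch maps the numeric landscapeHostConfiguration status to a label. The function name describe_lss is hypothetical. The mapping (0 fatal, 1 error, 2 warning, 3 info, 4 ok) reflects the return codes as commonly documented for landscapeHostConfiguration.py and should be verified against the documentation for your HANA version.

```shell
# Hypothetical helper: translate a landscapeHostConfiguration status code
# into a short label. The mapping is an assumption to be verified against
# the SAP HANA documentation for your version.
describe_lss() {
    case "$1" in
        0) echo "fatal" ;;
        1) echo "error" ;;    # the status at which the cluster takes action
        2) echo "warning" ;;
        3) echo "info" ;;
        4) echo "ok" ;;
        *) echo "unknown" ;;
    esac
}
```

A possible use is to pass the exit code of the status script to the helper, for example describe_lss "$rc" after capturing rc=$? as <sid>adm.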

The resolution is described below for SAP HANA scale-up systems.
It can be adapted for scale-out. See manual page susChkSrv.py(7) for details.

The resolution is implemented in four steps:

1. Updating the software packages
Update the package SAPHanaSR on all nodes. The installed version must provide the hook script susChkSrv.py.

# zypper up SAPHanaSR SAPHanaSR-doc
# rpm -ql SAPHanaSR | grep susChkSrv.py

2. Adapting the HANA global configuration
The section [ha_dr_provider_suschksrv] has to be added to the HANA global.ini at both sites.

---
[ha_dr_provider_suschksrv]
provider = susChkSrv
path = /usr/share/SAPHanaSR/
execution_order = 3
action_on_lost = stop
---

Refer to the SAP HANA documentation on how to change the global.ini.
Alternatively, you may use SAPHanaSR-manageProvider. See manual pages susChkSrv.py(7) and SAPHanaSR-manageProvider(8).
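After editing the file, it can be useful to confirm that the section is present at each site. The following is a minimal sketch, not part of the SAPHanaSR package: the function name show_suschksrv_section is hypothetical, and the path to global.ini is passed as an argument because it depends on your installation.

```shell
# Hypothetical helper: print the entries of the [ha_dr_provider_suschksrv]
# section of the given global.ini. Returns non-zero if the section is missing.
show_suschksrv_section() {
    ini_file="$1"
    awk '
        # A new section header starts: remember whether it is the one we want.
        /^\[/ {
            in_section = ($0 == "[ha_dr_provider_suschksrv]")
            if (in_section) found = 1
            next
        }
        # Print non-empty lines while inside the wanted section.
        in_section && NF { print }
        END { exit(found ? 0 : 1) }
    ' "$ini_file"
}
```

Called with the path of the custom global.ini of your system, the function should print the four entries shown above. A non-zero return code indicates that the section has not been added yet.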

3. Loading the new HADR provider hook script
The newly added HADR provider hook script needs to be loaded.

# su - <sid>adm
~> hdbnsutil -reloadHADRProviders; echo rc=$?

Refer to the SAP HANA documentation for details on loading HADR provider hook scripts.

4. Checking if the hook script has been loaded
The hook script should appear in the HANA nameserver trace files at both sites. It should also write to its own log file nameserver_suschksrv.trc.

# su - <sid>adm
~> cdtrace
~> grep HADR.*load.*susChkSrv nameserver_*.trc
~> grep susChkSrv.init nameserver_*.trc

See manual page susChkSrv.py(7).
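The two grep checks above can be wrapped into one function that reports a clear result per pattern, which may help when checking both sites. This is a minimal sketch under the assumption that the trace directory is passed as an argument. The function name check_hook_loaded is hypothetical, and the patterns are the ones used in the commands above.

```shell
# Hypothetical helper: search the nameserver trace files in the given
# directory for the two susChkSrv patterns. Returns non-zero if one is missing.
check_hook_loaded() {
    trace_dir="$1"
    rc=0
    for pattern in 'HADR.*load.*susChkSrv' 'susChkSrv.init'; do
        if grep -q "$pattern" "$trace_dir"/nameserver_*.trc 2>/dev/null; then
            echo "found: $pattern"
        else
            echo "missing: $pattern"
            rc=1
        fi
    done
    return "$rc"
}
```

As <sid>adm, after changing to the trace directory with cdtrace, the function could be called as check_hook_loaded . to check the current directory.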

Additional Information

ocf_suse_SAPHana(7)
susChkSrv.py(7)
SAPHanaSR-manageProvider(8)
zypper(8)

https://www.suse.com/c/emergency-braking-for-sap-hana-dying-indexserver/
https://documentation.suse.com/sbp/all/single-html/SLES4SAP-hana-sr-guide-PerfOpt-15/

Disclaimer

This Support Knowledgebase provides a valuable tool for SUSE customers and parties interested in our products and solutions to acquire information, ideas and learn from one another. Materials are provided for informational, personal or non-commercial use within your organization and are presented "AS IS" WITHOUT WARRANTY OF ANY KIND.