FY24 EMEA TAC Sec Workshop - Firewall - ASAFTD High-Availability


ASA/FTD High-Availability

Common issues and methodology of troubleshooting


Konrad Adamczyk
High Touch Technical Consulting Engineer – Security Support Services

October 2023
Session Goal

• Understanding the methodology for troubleshooting the most common issues with High-Availability setups on both ASA and FTD.

• Using verification commands in real scenarios to determine the causes of failover events.

• Showing how resolving some HA issues can be sped up before opening a TAC case.

© 2023 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Agenda

1 Few words about High Availability
2 Differences in HA for ASA/FTD
3 Troubleshooting workflow
4 Common issues
5 Best practices to avoid issues

Disclaimer
• Rebranding
• Cisco Next-Generation Firewall (NGFW) is now Cisco Secure Firewall.
• Rebranded names in version 7.2:

Former Name                                     Rebranded Name
Firepower Threat Defense (FTD)                  Secure Firewall Threat Defense
Firepower Threat Defense Virtual (FTDv)         Secure Firewall Threat Defense Virtual
Firepower Device Manager (FDM)                  Secure Firewall Device Manager
Firepower Management Center (FMC)               Secure Firewall Management Center
Firepower Management Center Virtual (FMCv)      Secure Firewall Management Center Virtual
Firepower eXtensible Operating System (FXOS)    Secure Firewall eXtensible Operating System
Firepower Chassis Manager                       Secure Firewall Chassis Manager

• Lina is the Data Plane module.

• Snort is the Deep Packet Inspection module.
Few words about HA

• Failover (Active/Standby) is known as High Availability (Cluster is known as High Scalability).

• High availability refers to the failover configuration. High availability or failover setup joins two devices
so that if one of the devices fails, the other device can take over.

• Primary and Secondary are roles; they stay with the units and are specified during the initial HA configuration.

• Active and Standby are states and change depending on the health status of each unit.

Few words about HA

• Both ASA/FTD units in a pair must be identical in hardware, software, memory, interfaces and mode.

• For FP9300, HA is supported only between same-type modules.

• Configuration Replication is done through the FAILOVER link.
• The failover configuration itself is not replicated.
• State Replication is done through the STATE link.
• STATE and FOVER can be the same interface or different interfaces.

Classic ASA vs FTD failover

• While in classic ASA the State link is optional, on FTD it is mandatory.

• ASA monitors the state of the interfaces. FTD also monitors Snort and disk space.

• Failover replication command options are not configurable on FTD, which uses the default settings:

failover replication http
failover replication rate 40000

• On ASA you can configure encryption for the failover link in two different ways: a simple key or an IPsec tunnel. FTD supports only the IPsec tunnel option.
• On ASA you can use a sub-interface as the failover or state interface. On FTD you must use a physical interface.

HA state flow diagram

Primary (without any connected peer):

Secondary (with an Active connected peer):

Verification commands

• show running-config failover:

Primary:
> show running-config failover
failover
failover lan unit primary
failover lan interface failover GigabitEthernet0/4
failover replication http
failover link failover GigabitEthernet0/4
failover interface ip failover 17.17.17.1 255.255.255.252 standby 17.17.17.2

Secondary:
> show running-config failover
failover
failover lan unit secondary
failover lan interface failover GigabitEthernet0/4
failover replication http
failover link failover GigabitEthernet0/4
failover interface ip failover 17.17.17.1 255.255.255.252 standby 17.17.17.2
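
In the sample outputs, the two units differ only in the `failover lan unit` line. A minimal Python sketch of that comparison is shown here; it assumes the two outputs have been saved as text, and the function names are illustrative and not part of any Cisco tooling.

```python
def failover_lines(config_text):
    """Collect the 'failover ...' lines of a running-config dump (prompt lines are skipped)."""
    return {
        line.strip()
        for line in config_text.splitlines()
        if line.strip().startswith("failover")
    }


def unexpected_differences(primary_text, secondary_text):
    """Return failover lines found on only one unit, ignoring the role line."""
    role_prefix = "failover lan unit"  # expected to differ: primary vs secondary
    primary = {l for l in failover_lines(primary_text) if not l.startswith(role_prefix)}
    secondary = {l for l in failover_lines(secondary_text) if not l.startswith(role_prefix)}
    return sorted(primary ^ secondary)


PRIMARY = """\
> show running-config failover
failover
failover lan unit primary
failover lan interface failover GigabitEthernet0/4
failover replication http
failover link failover GigabitEthernet0/4
failover interface ip failover 17.17.17.1 255.255.255.252 standby 17.17.17.2
"""
SECONDARY = PRIMARY.replace("lan unit primary", "lan unit secondary")

# An empty list means the failover sections agree apart from the unit role.
print(unexpected_differences(PRIMARY, SECONDARY))
```

Any line returned by `unexpected_differences` is worth checking by hand, since the failover configuration is not replicated between the units.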

• show running-config all monitor-interface:


> show running-config all monitor-interface
monitor-interface Outside
monitor-interface Inside
monitor-interface diagnostic
monitor-interface service-module

Verification commands

• show failover:

Primary/Active:

> show failover
Failover On
Failover unit Primary
Failover LAN Interface: failover GigabitEthernet0/4 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 361 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.18(2)219, Mate 9.18(2)219
Serial Number: Ours 9AD2AL87FDQ, Mate 9ALU58NUM7A
Last Failover at: 06:24:15 UTC Jul 5 2023
This host: Primary - Active
Active time: 102448 (sec)
slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (192.168.2.10): Normal (Waiting)
Interface Inside (192.168.28.1): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Secondary - Standby Ready
Active time: 0 (sec)
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (0.0.0.0): Normal (Waiting)
Interface Inside (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)

Secondary/Standby:

> show failover
Failover On
Failover unit Secondary
Failover LAN Interface: failover GigabitEthernet0/4 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 361 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.18(2)219, Mate 9.18(2)219
Serial Number: Ours 9ALU58NUM7A, Mate 9AD2AL87FDQ
Last Failover at: 19:07:10 UTC Jul 5 2023
This host: Secondary - Standby Ready
Active time: 0 (sec)
slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)
Interface Outside (0.0.0.0): Normal (Waiting)
Interface Inside (0.0.0.0): Normal (Waiting)
Interface diagnostic (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Primary - Active
Active time: 102512 (sec)
Interface Outside (192.168.2.10): Normal (Waiting)
Interface Inside (192.168.28.1): Normal (Waiting)
Interface diagnostic (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
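
The role and state of each unit can be read from the `This host:` and `Other host:` lines of `show failover`. The following Python sketch extracts them from a saved single-unit output; the regular expression is an assumption based only on the line format shown in this deck.

```python
import re

# Matches e.g. "This host: Primary - Active" or "Other host: Secondary - Standby Ready"
HOST_RE = re.compile(r"^\s*(This|Other) host:\s*(Primary|Secondary)\s*-\s*(.+?)\s*$")


def host_states(show_failover_text):
    """Map 'this'/'other' to a (role, state) tuple taken from show failover output."""
    states = {}
    for line in show_failover_text.splitlines():
        match = HOST_RE.match(line)
        if match:
            states[match.group(1).lower()] = (match.group(2), match.group(3))
    return states


SAMPLE = """\
Last Failover at: 06:24:15 UTC Jul 5 2023
This host: Primary - Active
Active time: 102448 (sec)
Other host: Secondary - Standby Ready
Active time: 0 (sec)
"""

print(host_states(SAMPLE))
```

The role (Primary/Secondary) should stay the same across runs, while the state (Active/Standby Ready/Failed) is what changes after a failover event.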

Verification commands
• show failover:

Primary/Active:

Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/4 (up)
Stateful Obj         xmit       xerr       rcv        rerr
General              79005      0          78326      0
sys cmd              78333      0          78326      0
up time              0          0          0          0
RPC services         0          0          0          0
TCP conn             117        0          0          0
UDP conn             402        0          0          0
ARP tbl              143        0          0          0
Xlate_Timeout        0          0          0          0
IPv6 ND tbl          0          0          0          0
VPN IKEv1 SA         0          0          0          0
VPN IKEv1 P2         0          0          0          0
VPN IKEv2 SA         0          0          0          0
VPN IKEv2 P2         0          0          0          0
VPN CTCP upd         0          0          0          0
VPN SDI upd          0          0          0          0
VPN DHCP upd         0          0          0          0
SIP Session          0          0          0          0
SIP Tx               0          0          0          0
SIP Pinhole          0          0          0          0
Route Session        0          0          0          0
Router ID            0          0          0          0
User-Identity        5          0          0          0
CTS SGTNAME          0          0          0          0
CTS PAC              0          0          0          0
TrustSec-SXP         0          0          0          0
IPv6 Route           0          0          0          0
STS Table            0          0          0          0
Umbrella Device-ID   0          0          0          0
Rule DB B-Sync       0          0          0          0
Rule DB P-Sync       4          0          0          0
Rule DB Delete       1          0          0          0

Logical Update Queue Information
                     Cur        Max        Total
Recv Q:              0          10         78326
Xmit Q:              0          11         397533

Secondary/Standby:

Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/4 (up)
Stateful Obj         xmit       xerr       rcv        rerr
General              7601       0          7607       0
sys cmd              7601       0          7601       0
up time              0          0          0          0
RPC services         0          0          0          0
TCP conn             0          0          0          0
UDP conn             0          0          0          0
ARP tbl              0          0          5          0
Xlate_Timeout        0          0          0          0
IPv6 ND tbl          0          0          0          0
VPN IKEv1 SA         0          0          0          0
VPN IKEv1 P2         0          0          0          0
VPN IKEv2 SA         0          0          0          0
VPN IKEv2 P2         0          0          0          0
VPN CTCP upd         0          0          0          0
VPN SDI upd          0          0          0          0
VPN DHCP upd         0          0          0          0
SIP Session          0          0          0          0
SIP Tx               0          0          0          0
SIP Pinhole          0          0          0          0
Route Session        0          0          0          0
Router ID            0          0          0          0
User-Identity        0          0          1          0
CTS SGTNAME          0          0          0          0
CTS PAC              0          0          0          0
TrustSec-SXP         0          0          0          0
IPv6 Route           0          0          0          0
STS Table            0          0          0          0
Umbrella Device-ID   0          0          0          0
Rule DB B-Sync       0          0          0          0
Rule DB P-Sync       0          0          0          0
Rule DB Delete       0          0          0          0

Logical Update Queue Information
                     Cur        Max        Total
Recv Q:              0          7          38514
Xmit Q:              0          1          7601
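
The `xerr` and `rerr` columns of the Logical Update Statistics are the ones to watch when state replication is in doubt. The Python sketch below lists the stateful objects with a non-zero error counter; it assumes each data row ends with four integers, as in the sample, and the row with errors in the usage example is hypothetical.

```python
import re

# A data row is a name followed by four integer counters: xmit xerr rcv rerr
ROW_RE = re.compile(r"^(.+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)$")


def objects_with_errors(stats_text):
    """Return {object name: (xerr, rerr)} for rows where either error counter is non-zero."""
    errors = {}
    for line in stats_text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue  # headers, link line, queue rows with three counters
        xerr, rerr = int(match.group(3)), int(match.group(5))
        if xerr or rerr:
            errors[match.group(1)] = (xerr, rerr)
    return errors


SAMPLE = """\
Stateful Obj xmit xerr rcv rerr
General 79005 0 78326 0
sys cmd 78333 0 78326 0
TCP conn 117 0 0 0
VPN IKEv1 SA 0 0 0 0
Recv Q: 0 10 78326
"""

print(objects_with_errors(SAMPLE))                          # healthy sample from this deck
print(objects_with_errors(SAMPLE + "UDP conn 402 3 0 0\n"))  # hypothetical row with xerr
```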
Verification commands

• show failover state:

Primary/Active:

> show failover state

               State          Last Failure Reason      Date/Time
This host  -   Primary
               Active         Comm Failure             06:22:47 UTC Jul 5 2023
Other host -   Secondary
               Standby Ready  Comm Failure             19:01:26 UTC Jul 5 2023

====Configuration State===
        Sync Done
====Communication State===
        Mac set

Secondary/Standby:

> show failover state

               State          Last Failure Reason      Date/Time
This host  -   Secondary
               Standby Ready  None
Other host -   Primary
               Active         None

====Configuration State===
====Communication State===
        Mac set

Verification commands

• show failover history:

Primary/Active:

> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
06:45:28 UTC Jun 27 2023
Not Detected               Disabled                   No Error

11:54:36 UTC Jun 27 2023
Disabled                   Negotiation                Set by the config command
                                                      (failover)
11:55:21 UTC Jun 27 2023
Negotiation                Just Active                No Active unit found

11:55:21 UTC Jun 27 2023
Just Active                Active Drain               No Active unit found

11:55:21 UTC Jun 27 2023
Active Drain               Active Applying Config     No Active unit found

11:55:21 UTC Jun 27 2023
Active Applying Config     Active Config Applied      No Active unit found

11:55:21 UTC Jun 27 2023
Active Config Applied      Active                     No Active unit found
==========================================================================

Secondary/Standby:

> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
19:07:15 UTC Jul 5 2023
Not Detected               Negotiation                No Error

19:07:19 UTC Jul 5 2023
Negotiation                Cold Standby               Detected an Active peer

19:07:20 UTC Jul 5 2023
Cold Standby               App Sync                   Detected an Active peer

19:08:38 UTC Jul 5 2023
App Sync                   Sync Config                Detected an Active peer

19:04:44 UTC Jul 5 2023
Sync Config                Sync File System           Detected an Active peer

19:04:44 UTC Jul 5 2023
Sync File System           Bulk Sync                  Detected an Active peer

19:04:58 UTC Jul 5 2023
Bulk Sync                  Standby Ready              Detected an Active peer
==========================================================================

Common issues related to HA

• There are common situations where failover happens without a clear reason:
• Issue with monitored interfaces.
• Disk issue.
• Traceback (reboot).

• App-Sync error while a unit is joining the HA pair.

• Split-Brain, where both units are in the Active state at the same time.

Unexpected failover – Monitored interfaces

• When a unit does not receive hello messages on a monitored interface for 15 seconds, it runs interface tests.
• If one of the interface tests fails for an interface, but the same interface on the other unit continues to pass traffic successfully, the interface is considered failed and the device stops running tests.
• If the faulty interface is on the Active unit, failover will happen.

• If the faulty interface is on the Standby unit, no failover happens and the Standby unit is marked as Failed.

• If a unit is Failed because of a monitored interface failure, that interface needs to be verified.

• Statuses of monitored interfaces which can cause failover:
• No Link
• Link Down
• Failed
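
The three statuses above can be searched for directly in the `Interface ...` lines of `show failover`. The Python sketch below flags them; the line format is taken from the healthy samples in this deck, and the `No Link` sample line is a hypothetical variation of it.

```python
import re

# Statuses listed in this deck as able to cause failover
TRIGGER_STATUSES = {"No Link", "Link Down", "Failed"}

# e.g. "Interface Outside (192.168.2.10): Normal (Waiting)"
IFACE_RE = re.compile(r"^Interface (\S+) \(([^)]*)\): (.+?)(?: \(([^)]*)\))?$")


def interfaces_at_risk(show_failover_text):
    """Return (interface name, status) pairs whose status can trigger a failover."""
    hits = []
    for raw in show_failover_text.splitlines():
        line = " ".join(raw.split())  # normalise indentation and repeated spaces
        match = IFACE_RE.match(line)
        if match and match.group(3) in TRIGGER_STATUSES:
            hits.append((match.group(1), match.group(3)))
    return hits


SAMPLE = """\
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (192.168.2.10): No Link (Waiting)
Interface Inside (192.168.28.1): Normal (Waiting)
"""

print(interfaces_at_risk(SAMPLE))
```

Whether a hit leads to a failover depends on which unit reports it: on the Active unit a failover follows, on the Standby unit the unit is only marked Failed.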

Unexpected failover – Monitored interfaces
Primary:

> show failover state

               State          Last Failure Reason      Date/Time
This host  -   Primary
               Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
                              Outside: No Link
Other host -   Secondary
               Active         Comm Failure             18:44:37 UTC Jul 10 2023

====Configuration State===
        Sync Done
====Communication State===
        Mac set

> show failover history

06:24:15 UTC Jul 5 2023
Active Config Applied      Active                     No Active unit found

10:31:10 UTC Jul 17 2023
Active                     Failed                     Interface check
                                                      This host:1
                                                      single_vf: Outside
                                                      Other host:0

Secondary:

> show failover state

               State          Last Failure Reason      Date/Time
This host  -   Secondary
               Active         Comm Failure             18:44:01 UTC Jul 10 2023
Other host -   Primary
               Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
                              Outside: No Link

====Configuration State===
====Communication State===
        Mac set

> show failover history

18:49:49 UTC Jul 10 2023
Bulk Sync                  Standby Ready              Failover state check

10:31:10 UTC Jul 17 2023
Standby Ready              Just Active                Other unit wants me Active
                                                      (Interface check)
10:31:10 UTC Jul 17 2023
Just Active                Active Drain               Other unit wants me Active
                                                      (Interface check)
10:31:10 UTC Jul 17 2023
Active Drain               Active Applying Config     Other unit wants me Active
                                                      (Interface check)
10:31:10 UTC Jul 17 2023
Active Applying Config     Active Config Applied      Other unit wants me Active
                                                      (Interface check)
10:31:10 UTC Jul 17 2023
Active Config Applied      Active                     Other unit wants me Active
                                                      (Interface check)
Unexpected failover – Disk issue

• Snort may fail due to high disk space utilization.

• Reason in the failover history that points to a disk usage issue:
• "Detect Inspection engine failure due to disk failure"

Active unit got failed:

>show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
19:49:25 UTC Jul 15 2023
Standby Ready              Failed                     Detect Inspection engine failure due to disk failure

20:24:44 UTC Jul 15 2023
Failed                     Standby Ready              My Inspection engine is as good as peer due to disk recovery

20:24:46 UTC Jul 15 2023
Standby Ready              Failed                     Detect Inspection engine failure due to disk failure

20:29:47 UTC Jul 15 2023
Failed                     Standby Ready              My Inspection engine is as good as peer due to disk recovery

21:40:36 UTC Jul 15 2023
Standby Ready              Failed                     Detect Inspection engine failure due to disk failure

Standby went active:

>show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
13:01:47 UTC May 1 2023
Bulk Sync                  Standby Ready              Detected an Active mate

10:04:21 UTC Jul 15 2023
Standby Ready              Just Active                Inspection engine in other unit has failed due to disk failure

10:04:21 UTC Jul 15 2023
Just Active                Active Drain               Inspection engine in other unit has failed due to disk failure

10:04:21 UTC Jul 15 2023
Active Drain               Active Applying Config     Inspection engine in other unit has failed due to disk failure

10:04:21 UTC Jul 15 2023
Active Applying Config     Active Config Applied      Inspection engine in other unit has failed due to disk failure

10:04:21 UTC Jul 15 2023
Active Config Applied      Active                     Inspection engine in other unit has failed
Unexpected failover – Disk issue

• Troubleshooting to be performed:
• sudo df -hT (-h: prints sizes in human-readable form, -T: prints the file system type):
admin@firepower:~$ sudo df -hT
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 720917580 104508748 616408832 15% /
tmpfs 65536 0 65536 0% /dev
tmpfs 98385644 0 98385644 0% /sys/fs/cgroup
/dev/sda6 41943040 40814524 1128516 98% /opt
tmpfs 98385644 248 98385396 1% /run
shm 13331456 51400 13280056 1% /dev/shm
tmpfs 98385644 4 98385640 1% /var/config
tmpfs 98385644 42320 98343324 1% /var/volatile/tmp
/dev/sda5 51474044 53200 48799456 1% /var/data/cores
/dev/sda2 1001328 30664 918136 4% /opt/cisco/config/host-common
/dev/sda3 4722056 16760 4458768 1% /opt/cisco/csp/applications/cisco-ftd.7.2_ftd_001_/app_data/disk0/log/.ntp.log
tmpfs 98385644 0 98385644 0% /proc/acpi
tmpfs 98385644 0 98385644 0% /proc/scsi
tmpfs 98385644 0 98385644 0% /sys/firmware
none 514048 0 514048 0% /dev/shm/snort
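
In the sample above, /opt stands out at 98% usage. The Python sketch below flags such mount points from saved df output. It assumes the six-column layout shown in the sample (1K-blocks, no Type column); with `-T` the columns shift by one. The 90% threshold is an arbitrary illustration, not a Cisco-documented limit.

```python
def full_filesystems(df_output, threshold=90):
    """Return (mount point, use %) for filesystems at or above the threshold."""
    hits = []
    for line in df_output.splitlines():
        fields = line.split()
        # Expect: Filesystem, 1K-blocks, Used, Available, Use%, Mounted on
        if len(fields) < 6 or not fields[4].rstrip("%").isdigit():
            continue  # header or unrelated line
        use_percent = int(fields[4].rstrip("%"))
        if use_percent >= threshold:
            hits.append((fields[5], use_percent))
    return hits


SAMPLE = """\
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 720917580 104508748 616408832 15% /
/dev/sda6 41943040 40814524 1128516 98% /opt
/dev/sda5 51474044 53200 48799456 1% /var/data/cores
"""

print(full_filesystems(SAMPLE))
```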

• Depending on the output, additional commands can be run:

admin@firepower:~$ sudo find /opt/cisco/csp -type f -exec du -Sh {} + | sort -rh | head -n 15
admin@firepower:~$ sudo find /ngfw -size +10M -exec du -h {} \;
admin@firepower:~$ sudo ls -ltr /var/data/cores/
admin@firepower:~$ sudo ls -ltr /ngfw/var/common/

Unexpected failover – Disk issue

• High disk utilization is often caused by old files that are no longer needed.

• Cleaning old files from the disk must be done with extra caution.

• Linux has no concept of a "recycle bin"; deleted items practically cannot be restored.

• rm -fr a forcefully deletes directory a and its contents.

• rm -fr path/contents is NOT the same as rm -fr path/ contents.

• Removes only folders 2020* and their contents from all instances:
• rm -fr /ngfw/var/sf/detection-engines/74c996ae-4873-11ec-a276-35c0a9612b1e/instance-*/2020*

• Removes folder var and its contents (note the stray space after var/)!:
• rm -fr /ngfw/var/ sf/detection-engines/74c996ae-4873-11ec-a276-35c0a9612b1e/instance-*/2020*

• Do not use absolute paths; first enter the directory and then remove the file.

• If you are not sure whether a specific file can be removed, do not delete it.

• Avoid copy-pasting commands.
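
Why a single stray space is so dangerous can be seen by splitting the two rm commands the way a POSIX shell would. The Python sketch below uses the standard `shlex` module for that; nothing is deleted, and glob patterns are left unexpanded. The helper name `rm_targets` is illustrative.

```python
import shlex

ENGINE = "74c996ae-4873-11ec-a276-35c0a9612b1e"
INTENDED = f"rm -fr /ngfw/var/sf/detection-engines/{ENGINE}/instance-*/2020*"
STRAY_SPACE = f"rm -fr /ngfw/var/ sf/detection-engines/{ENGINE}/instance-*/2020*"


def rm_targets(command):
    """Return the path arguments rm would receive (options dropped, globs not expanded)."""
    argv = shlex.split(command)
    return [arg for arg in argv[1:] if not arg.startswith("-")]


print(rm_targets(INTENDED))     # one target: the 2020* folders only
print(rm_targets(STRAY_SPACE))  # two targets: the first one is /ngfw/var/ itself
```

With the stray space, rm receives `/ngfw/var/` as a separate argument and removes that whole directory tree, regardless of what the second argument resolves to.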

Unexpected failover – Disk issue

• High disk utilization because of old files that are no longer needed.

• Files which can be removed with extra caution (cd is a shell built-in, so it is run without sudo):
admin@firepower:~$ cd /var/sf
admin@firepower:~$ sudo rm -rf backup/*
admin@firepower:~$ sudo rm -rf SRU/*
admin@firepower:~$ sudo rm -rf updates/*
admin@firepower:~$ cd /var/sf/detection_engines/DETECTION_ENGINE_ID
admin@firepower:~$ sudo rm -rf instance-*/backup/*
admin@firepower:~$ sudo rm -rf instance-*/archive/*
admin@firepower:~$ sudo rm -rf instance-*/ssl-nse-debug.log*
admin@firepower:~$ sudo rm -rf instance-*/ssl-stats-unified.log*
admin@firepower:~$ sudo rm -rf instance-*/unified_events*
admin@firepower:~$ sudo rm -rf instance-*/fileperfstats.log*
admin@firepower:~$ sudo rm -rf instance-*/2020*

• Log files not rotated properly:
• bug: CSCwb34240
admin@firepower:~$ sudo lsof | grep deleted | grep process_std
syslog-ng 638 root 33w REG 253,7 124121812 527928 /var/log/process_stdout.log.1 (deleted)
syslog-ng 638 root 34w REG 253,7 161211401889 527776 /var/log/process_stderr.log.1 (deleted)

DEMO Unexpected failover – Disk issue

Unexpected failover – Traceback

• Failover will also occur if the Active unit hits a traceback (and reboots).

• The root cause of Lina/Snort tracebacks is usually investigated by TAC and the software engineering team.

• Steps which can be taken to collect the needed outputs before opening the case:
• Generate a Troubleshoot file for FTD or show tech-support for ASA.
• Check the show tech-support output to confirm the traceback.
• Collect the Lina crash-info (if it exists).
• Collect the core file (if it exists).

Unexpected failover – Traceback

• Outputs confirming traceback from show tech-support:

FW1 up 7 hours 58 mins
failover cluster up 3 years 46 days

FW2 up 98 days 20 hours
failover cluster up 3 years 46 days

------------------ dir all-filesystems ------------------

Directory of disk0:/

539609915 -rw- 0 19:40:28 Mar 23 2020 coredumpfsysimage.bin
2 drwx 4096 06:05:25 Jun 20 2023 coredumpfsys
269456734 drwx 4096 14:05:01 Jun 20 2023 log
539657510 drw- 25 19:41:28 Mar 23 2020 coredumpinfo
541907993 -rw- 610 06:12:27 Jun 20 2023 hitcnt_del_ruleid_list
541576985 -rwx 464383 17:44:48 Jun 13 2022 backup-config.cfg
541576986 -rwx 345025 17:42:12 Mar 05 2021 startup-config
541576988 -rwx 73658 17:44:49 Jun 13 2022 modified-config.cfg
16913648 drwx 4096 18:49:20 Jun 13 2022 csm
538558273 -rw- 340384 06:13:37 Jun 20 2023 asa-cmd-server.log
539196960 -rw- 39 06:07:36 Jun 20 2023 snortpacketinfo.conf
538450060 -rwx 494476 05:59:36 Jun 20 2023 crashinfo_20230620_054931_UTC
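
The crashinfo file name in the listing carries a date and time that can be lined up with the failover history. The Python sketch below extracts it from a saved `dir` listing; it assumes the name pattern `crashinfo_YYYYMMDD_HHMMSS_UTC`, which is inferred from the single file name shown here.

```python
import re
from datetime import datetime, timezone

# Assumed pattern, inferred from "crashinfo_20230620_054931_UTC"
CRASH_RE = re.compile(r"crashinfo_(\d{8})_(\d{6})_UTC")


def crash_times(dir_listing):
    """Return the UTC timestamps encoded in crashinfo file names found in the listing."""
    return [
        datetime.strptime(date + time, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        for date, time in CRASH_RE.findall(dir_listing)
    ]


SAMPLE = """\
539196960 -rw- 39 06:07:36 Jun 20 2023 snortpacketinfo.conf
538450060 -rwx 494476 05:59:36 Jun 20 2023 crashinfo_20230620_054931_UTC
"""

for when in crash_times(SAMPLE):
    print(when.isoformat())
```

A crash time shortly before the peer's Standby Ready to Just Active transition supports a traceback as the cause of the failover.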

Unexpected failover – Traceback

• Outputs confirming traceback from show tech-support:


------------------ show crashinfo ------------------

Thread Name: DATAPATH-2-19821


Abort: Unknown
r8 0x0000000000000000
r9 0x000014bee10fcc80
r10 0x0000000000000000
r11 0x0000000000000003
r12 0x0000000000000000
r13 0x0000000000000000
r14 0x0000000000000000
r15 0x000014bed9e67910
rdi 0x0000000000000000
rsi 0x0000000000000000
rbp 0x000014bf5606d350
rbx 0x000014bedbf09c60
rdx 0x000014bed9e97680
rax 0x0000000000000000
rcx 0x0000000000000000
rsp 0x000014bf5606dd90
rip 0x00005616266d7fd9
eflags 0x0000000000003293
csgsfs 0x002b000000000033
error code n/a
vector 0x0000000000000020
old mask 0xffffffde3e3bfa07
cr2 0x0000000000000000

Cisco Adaptive Security Appliance Software Version 9.16(3)11

Compiled on Wed 20-Apr-22 08:20 GMT by builders


Hardware: FPR-1120
Target: SSP
Crashinfo collected on 05:49:31.595 UTC Tue Jun 20 2023
ASLR enabled, text region 5616231c2000-5616277d4475

Unexpected failover – Traceback

• Outputs confirming traceback from show tech-support:

Failed unit:

------------------ show failover history ------------------

==========================================================================
From State                 To State                   Reason
==========================================================================
06:08:57 UTC Jun 20 2023
Not Detected               Disabled                   No Error

06:10:14 UTC Jun 20 2023
Disabled                   Negotiation                Set by the config command

06:10:17 UTC Jun 20 2023
Negotiation                Cold Standby               Detected an Active mate

06:10:18 UTC Jun 20 2023
Cold Standby               App Sync                   Detected an Active mate

06:13:14 UTC Jun 20 2023
App Sync                   Sync Config                Detected an Active mate

06:13:53 UTC Jun 20 2023
Sync Config                Sync File System           Detected an Active mate

06:13:53 UTC Jun 20 2023
Sync File System           Bulk Sync                  Detected an Active mate

06:14:05 UTC Jun 20 2023
Bulk Sync                  Standby Ready              Detected an Active mate

New Active unit:

------------------ show failover history ------------------

==========================================================================
From State                 To State                   Reason
==========================================================================
04:51:06 UTC May 13 2023
Bulk Sync                  Standby Ready              Failover state check

05:49:46 UTC Jun 20 2023
Standby Ready              Just Active                HELLO not heard from mate

05:49:46 UTC Jun 20 2023
Just Active                Active Drain               HELLO not heard from mate

05:49:46 UTC Jun 20 2023
Active Drain               Active Applying Config     HELLO not heard from mate

05:49:46 UTC Jun 20 2023
Active Applying Config     Active Config Applied      HELLO not heard from mate

05:49:46 UTC Jun 20 2023
Active Config Applied      Active                     HELLO not heard from mate

App Sync issues during joining HA

• If the show failover history output indicates an App Sync failure, there was a problem during the HA validation phase, in which the system checks that the units can function correctly as a high availability group.
• When validation succeeds, the message "All validation passed" appears with From State App Sync, and the node moves on to the Standby Ready state.
• Any validation failure moves the peer to the Disabled (Failed) state.

• Possible error messages in "show failover history":


==========================================================================
From State To State Reason
==========================================================================
15:10:16 CDT Sep 28 2021
Not Detected Disabled No Error
15:10:18 CDT Sep 28 2021
Disabled Negotiation Set by the config command
15:10:24 CDT Sep 28 2021
Negotiation Cold Standby Detected an Active mate
15:10:25 CDT Sep 28 2021
Cold Standby App Sync Detected an Active mate
15:10:55 CDT Sep 28 2021
App Sync Disabled CD App Sync error is App Config Apply Failed

HA state progression failed due to APP SYNC timeout

CD App Sync error is Failed to apply SSP config on standby

CD App Sync error is Rsync based file retrieval failed.
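
Text copies of `show failover history` often lose their column alignment, which makes From State, To State and Reason hard to separate. The Python sketch below recovers them by matching known state names, longest first. The state list contains only the states that appear in this deck and may be incomplete; wrapped continuation lines of a reason are ignored.

```python
import re

# States seen in this deck's histories; longest first so "Active Drain" wins over "Active"
STATES = sorted(
    ["Not Detected", "Disabled", "Negotiation", "Cold Standby", "App Sync",
     "Sync Config", "Sync File System", "Bulk Sync", "Standby Ready", "Just Active",
     "Active Drain", "Active Applying Config", "Active Config Applied", "Active", "Failed"],
    key=len, reverse=True,
)
TIMESTAMP_RE = re.compile(r"^\d{2}:\d{2}:\d{2} \S+ \S+ \d{1,2} \d{4}$")


def take_state(text):
    """Split a leading state name off the text; return (state or None, remainder)."""
    for state in STATES:
        if text == state or text.startswith(state + " "):
            return state, text[len(state):].strip()
    return None, text


def parse_history(history_text):
    """Return (timestamp, from_state, to_state, reason) tuples in order of appearance."""
    transitions, when = [], None
    for raw in history_text.splitlines():
        line = " ".join(raw.split())
        if TIMESTAMP_RE.match(line):
            when = line
            continue
        src, rest = take_state(line)
        dst, reason = take_state(rest) if src else (None, rest)
        if when and src and dst:
            transitions.append((when, src, dst, reason))
            when = None  # only the first row after a timestamp is a transition
    return transitions


SAMPLE = """\
15:10:25 CDT Sep 28 2021
Cold Standby App Sync Detected an Active mate
15:10:55 CDT Sep 28 2021
App Sync Disabled CD App Sync error is App Config Apply Failed
"""

for row in parse_history(SAMPLE):
    print(row)
```

The last tuple is usually the most useful one: its reason field is the message to search for when a unit ends up Disabled or Failed.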

App Sync issues during joining HA

• CD App Sync error is App Config Apply Failed.

• On the Standby FTD command line, check the log: /ngfw/var/log/action_queue.log

• Once the configuration error has been identified and the required changes made, HA can be resumed.

• HA state progression failed due to APP SYNC timeout.

• On the Standby FTD command line, check the log: /ngfw/var/log/ngfwmanager.log

• At this stage, policy deployments also fail because the Active unit thinks app sync is still in progress.

• Policy deployment throws the error: "since new Node join/AppSync process is in progress, Configuration Changes are not allowed, and hence rejects the deployment request. Please retry deployment after some time".
• Resuming high availability on the Standby node sometimes resolves the issue.

App Sync issues during joining HA

• CD App Sync error is Failed to apply SSP config on standby.

• On the Standby FTD command line, check the log: /ngfw/var/log/ngfwmanager.log.

• Resuming high availability on the Standby node sometimes resolves the issue.

• CD App Sync error is Rsync based file retrieval failed. Check app-sync-history CLI for details.

• Possible cause: CSCwh02757.

• The Standby unit can recover on its own, after a reboot, or after resuming HA.
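
The four App Sync messages and their first checks can be kept as a small lookup table. The Python sketch below is only a restatement of the bullets in this section; the dictionary and function names are illustrative.

```python
# First check per App Sync failure reason, as listed in this section
APP_SYNC_HINTS = {
    "App Config Apply Failed":
        "check /ngfw/var/log/action_queue.log on the Standby FTD",
    "APP SYNC timeout":
        "check /ngfw/var/log/ngfwmanager.log on the Standby FTD",
    "Failed to apply SSP config on standby":
        "check /ngfw/var/log/ngfwmanager.log on the Standby FTD",
    "Rsync based file retrieval failed":
        "possible cause CSCwh02757; reboot or resume HA on the Standby unit",
}


def hint_for(reason):
    """Return the first check for a failover-history reason, or None if it is not listed."""
    lowered = reason.lower()
    for fragment, hint in APP_SYNC_HINTS.items():
        if fragment.lower() in lowered:
            return hint
    return None


print(hint_for("CD App Sync error is App Config Apply Failed"))
print(hint_for("HA state progression failed due to APP SYNC timeout"))
```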

App Sync issues during joining HA

• Some Sync issues are temporary and can be resolved by resuming HA on the Standby unit:

• ASA:

ciscoasa(config)# failover

• FTD:

> configure high-availability resume

• If the issue persists after resuming, it needs further analysis and a TAC engineer should be involved.

App-Sync Issues
DEMO
Bug: CSCwh02757

Split-Brain (Active/Active)- What is it?

• Scenario in which the units of an ASA/FTD HA pair:
• Are unable to detect each other on the network.
• Because of that, both take the Active role.
• Both units will have the same interface IP address and MAC address.
• This may cause severe inconsistencies in your network, resulting in loss of services.

Primary:

>show failover state

               State          Last Failure Reason      Date/Time
This host  -   Primary
               Active         None
Other host -   Secondary
               Failed         Comm Failure             06:24:15 UTC Jul 6 2023

====Configuration State===
        Sync Done - STANDBY
====Communication State==

Secondary:

>show failover state

               State          Last Failure Reason      Date/Time
This host  -   Secondary
               Active         None
Other host -   Primary
               Failed         Comm Failure             06:24:15 UTC Jul 6 2023

====Configuration State===
        Sync Done - STANDBY
====Communication State==

Split-Brain (Active/Active)

Primary:

> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
06:45:28 UTC Jun 27 2023
Not Detected               Disabled                   No Error

11:54:36 UTC Jun 27 2023
Disabled                   Negotiation                Set by the config command
                                                      (failover)
11:55:21 UTC Jun 27 2023
Negotiation                Just Active                No Active unit found

11:55:21 UTC Jun 27 2023
Just Active                Active Drain               No Active unit found

11:55:21 UTC Jun 27 2023
Active Drain               Active Applying Config     No Active unit found

11:55:21 UTC Jun 27 2023
Active Applying Config     Active Config Applied      No Active unit found

11:55:21 UTC Jun 27 2023
Active Config Applied      Active                     No Active unit found

Secondary:

> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
19:04:58 UTC Jul 5 2023
Bulk Sync                  Standby Ready              Detected an Active peer

06:24:15 UTC Jul 6 2023
Standby Ready              Just Active                HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Just Active                Active Drain               HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Active Drain               Active Applying Config     HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Active Applying Config     Active Config Applied      HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Active Config Applied      Active                     HELLO not heard from peer
                                                      (failover link up, no response from peer)

Emergency Recovery from Split-Brain

• To minimize the impact of split-brain, you can disable failover on one of the units or disconnect it from the network.
• Disable failover on the unit not passing traffic:
• On the ASA platform, over CLI, enter configuration mode and enter the "no failover" command.
• On the FTD platform, over CLI, enter the "configure high-availability suspend" command.

• For ASA, shut down the data interfaces.

• For FTD, shut down the interfaces on the connected device. Alternatively, you can physically disconnect the interfaces.

• You can also power off the device, but you will then be unable to manage it.

Emergency Recovery from Split-Brain
> configure high-availability suspend

Result on Primary Peer:

> show high-availability config
Failover On
Failover unit Primary
Failover LAN Interface: FOVER GigabitEthernet1/3 (up)
…
This host: Primary - Active
Active time: 3542 (sec)
...
Other host: Secondary - Disabled
Active time: 27 (sec)
Interface INSINE (192.168.75.14): Unknown (Monitored)
Interface OUTSIDE (192.168.76.14): Unknown (Monitored)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)

Result on Secondary Peer:

> show high-availability config
Failover Off (pseudo-Standby)
Failover unit Secondary
Failover LAN Interface: FOVER GigabitEthernet1/3 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 60 maximum
MAC Address Move Notification Interval not set
failover replication http

Resume HA:
> configure high-availability resume
Successfully resumed high-availablity.

Split-Brain - Possible causes

• Split-Brain occurs when communication over the failover link is down (unidirectionally or
bidirectionally). This scenario can be seen if the failover and data links travel through
the same path. The most common reasons are:

• L1 issues – faulty cable/SFP/interface.
• An issue on an intermediate device.
• Lack of memory or CPU resources on ASA/FTD.
• The ASA/Lina engine uses 1550-byte memory blocks to store packets for processing. If no
free blocks of this size remain, the ASA/FTD can no longer process failover packets.
Run show blocks to check for block depletion.
• CSCwc10241 - Temporary HA split-brain following upgrade or device reboot.
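
The 1550-byte block check can be scripted once the show blocks output has been saved as text. The following is a minimal Python sketch, assuming the output has a header row with SIZE, MAX, LOW and CNT columns followed by numeric rows. The function names are hypothetical and the sample numbers are made up for illustration, not taken from a real device.

```python
from typing import Dict


def parse_show_blocks(output: str) -> Dict[int, Dict[str, int]]:
    """Map block size -> {column name: value}, using the header row for column names."""
    rows: Dict[int, Dict[str, int]] = {}
    header = None
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].upper() == "SIZE":
            header = [f.upper() for f in fields]
            continue
        # Only accept fully numeric rows that line up with the header.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            values = dict(zip(header, (int(f) for f in fields)))
            rows[values["SIZE"]] = values
    return rows


def block_status(rows: Dict[int, Dict[str, int]], size: int = 1550) -> str:
    """Classify one block size: 'depleted', 'was depleted', 'ok', or 'unknown'."""
    row = rows.get(size)
    if row is None or "CNT" not in row or "LOW" not in row:
        return "unknown"
    if row["CNT"] == 0:
        return "depleted"        # no free blocks right now
    if row["LOW"] == 0:
        return "was depleted"    # low-water mark hit zero at some point
    return "ok"


# Illustrative numbers only (assumption), not output from a real device.
SAMPLE = """SIZE    MAX    LOW    CNT
   0    950    946    950
1550   6274      0      0
"""
```

Here block_status(parse_show_blocks(SAMPLE)) reports the 1550-byte pool as depleted. Reading column names from the header row, rather than fixed positions, is a design choice that tolerates extra columns if a software version prints them.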

Procedure to Troubleshoot failover link - Flowchart

Start of troubleshooting

L1/L2: Is the status/protocol for the Failover LAN interface up on both units?
(show interface ip brief)

NO: The link on both units has to be up. Common reasons for the connection to be down include:
• Failed/shut interface on an intermediate device – check the intermediate device, if any.
• Issue with physical cabling or interface failure – check the physical connection and,
if possible, replace cables/SFP.

YES: Continue with the L3 check.
Procedure to Troubleshoot failover link - Flowchart

L3: Can both units ping each other over the failover link?

Apply captures on both units for protocol 105 on the failover link interface, e.g.:
cap test interface fover match 105 any any

You should see protocol 105 packets in the capture between the Primary and Secondary
units. You will see ESP packets instead if IPsec encryption is enabled on the failover
interface.

> show capture
capture Test type raw-data interface fover [Capturing - 452080 bytes]
match 105 any any

> show capture test
15 packets captured
1: 09:53:18.506611 10.197.200.69 > 10.197.200.89 ip-proto-105, length 54
2: 09:53:18.506687 10.197.200.89 > 10.197.200.69 ip-proto-105, length 54
3: 09:53:18.813800 10.197.200.89 > 10.197.200.69 ip-proto-105, length 46
4: 09:53:18.814121 10.197.200.69 > 10.197.200.89 ip-proto-105, length 50
5: 09:53:18.814151 10.197.200.69 > 10.197.200.89 ip-proto-105, length 62
6: 09:53:18.815143 10.197.200.89 > 10.197.200.69 ip-proto-105, length 62

NO: If you see only one-way traffic on one or both of the boxes:
• Check show blocks to verify whether the 1550-byte memory block has been depleted.
• Check show mac address-table on the intermediate L2 device, if any. Verify that the
MAC addresses are being learnt correctly.
• Another quick way to verify connectivity is to run the show failover command on
both units. A "normal" status on each interface indicates that the keepalive
packets are being received correctly.

YES: Continue with the latency check.
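
Checking for one-way traffic by eye is error-prone on longer captures. The following is a minimal Python sketch that counts protocol 105 packets per direction, assuming the show capture text has been saved and its lines follow the "N: time src > dst ip-proto-105, length L" form shown above. The function names are hypothetical. If IPsec encryption is enabled on the failover link, the packets appear as ESP and this pattern will not match them.

```python
import re
from collections import Counter

# Matches lines such as:
#   1: 09:53:18.506611 10.197.200.69 > 10.197.200.89 ip-proto-105, length 54
PACKET = re.compile(
    r"^\s*\d+:\s+\S+\s+(\S+)\s+>\s+(\S+)\s+ip-proto-105", re.MULTILINE
)


def direction_counts(capture_text: str) -> Counter:
    """Count protocol 105 packets per (source, destination) pair."""
    return Counter(PACKET.findall(capture_text))


def is_bidirectional(capture_text: str) -> bool:
    """True only if packets exist and every direction has traffic coming back."""
    counts = direction_counts(capture_text)
    if not counts:
        return False  # no failover packets captured at all
    return all((dst, src) in counts for (src, dst) in counts)


# The six packets from the capture shown above.
SAMPLE = """
1: 09:53:18.506611 10.197.200.69 > 10.197.200.89 ip-proto-105, length 54
2: 09:53:18.506687 10.197.200.89 > 10.197.200.69 ip-proto-105, length 54
3: 09:53:18.813800 10.197.200.89 > 10.197.200.69 ip-proto-105, length 46
4: 09:53:18.814121 10.197.200.69 > 10.197.200.89 ip-proto-105, length 50
5: 09:53:18.814151 10.197.200.69 > 10.197.200.89 ip-proto-105, length 62
6: 09:53:18.815143 10.197.200.89 > 10.197.200.69 ip-proto-105, length 62
"""
```

Run against the sample, the counts are three packets in each direction, so is_bidirectional(SAMPLE) is true. A capture with packets in only one direction, or with no packets, returns false and points to the NO branch of the flowchart.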
Procedure to Troubleshoot failover link - Flowchart

Check for latency: ping the peer firewall's failover interface. Usually the round-trip
time / 2 is a good indicator of peak and average latency.

For more accurate readings, captures on the failover interface from both units can be
exported and compared.

Is latency between the two units greater than 10 ms?

YES: Latency between the two units in a failover pair needs to be under 250 ms.
It is recommended to keep latency under 10 ms.
Although latency is unlikely to cause a split-brain scenario, high latency can cause
intermittent failovers and impact failover performance in general.

NO: Your problem is not a common one. Engage TAC by opening a case for
further troubleshooting.
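
The round-trip time / 2 estimate and the two thresholds can be worked through in a few lines. The following is a minimal Python sketch, assuming the ping round-trip times have already been collected in milliseconds. The function names and the sample values are hypothetical, and treating the 10 ms and 250 ms figures as one-way latency follows the RTT / 2 rule of thumb above.

```python
from statistics import mean
from typing import List, Tuple

RECOMMENDED_MS = 10.0   # recommended upper bound from the slide
MAXIMUM_MS = 250.0      # latency needs to stay under this value


def one_way_latency(rtt_samples_ms: List[float]) -> Tuple[float, float]:
    """Estimate (average, peak) one-way latency as round-trip time / 2."""
    one_way = [rtt / 2 for rtt in rtt_samples_ms]
    return mean(one_way), max(one_way)


def classify(rtt_samples_ms: List[float]) -> str:
    """Compare the peak one-way estimate against the two thresholds."""
    _average, peak = one_way_latency(rtt_samples_ms)
    if peak >= MAXIMUM_MS:
        return "exceeds limit"
    if peak > RECOMMENDED_MS:
        return "above recommendation"
    return "ok"
```

For example, round-trip times of 4, 6 and 8 ms give an estimated average of 3 ms and a peak of 4 ms one way, which is within the recommendation. A single 30 ms round trip gives 15 ms one way, which is above the recommended value but still under the 250 ms limit.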

To proactively prepare against a Split-Brain condition:

• Be on the Cisco Recommended Golden Release.

• Keep proper Network topology for Data and Failover links.

• Use a port-channel interface for the Failover interface.

• Ensure latency on the Failover interface is low (under 10 ms recommended; it must stay under 250 ms).

• Adjust Poll Timer/Hold Timer values as per your deployment.

• Configure a Virtual MAC Address for interfaces.

• Enable logging to an external syslog server and enable the logging timestamp option.

Verification Cheat Sheet
Commands:

show running-config failover
show running-config all monitor-interface
show failover
show failover history
show failover state
show failover interface

Logs:

FTD: /ngfw/var/log/ActionQueue.log
FTD: /ngfw/var/log/ngfwManager.log
FTD: /ngfw/var/cisco/ngfwManager/store/HA_STATE
FTD: /ngfw/var/cisco/ngfwManager/store/CD_STATE
FMC: /opt/CSCOpx/MDC/log/operation/usmsharedsvcs.log
FTD: /ngfw/var/cisco/ngfwManager/store/clustering_health_events
ASA/LINA (enhancement in new versions): Disk0/log/fover_trace.log | /mnt/Disk0/log/fover_trace.log

References

• FTD Release Notes:
• https://www.cisco.com/c/en/us/support/security/firepower-ngfw/products-release-notes-list.html
• Bug search tool help:
• https://www.cisco.com/c/en/us/support/web/tools/bst/bsthelp/index.html
• Collecting core files:
• https://www.cisco.com/c/en/us/support/docs/security/adaptive-security-appliance-asa-software/217663-troubleshoot-asa-or-ftd-unexpected-reloa.html
• Configure FTD High Availability on Firepower Appliances:
• https://www.cisco.com/c/en/us/support/docs/security/firepower-management-center/212699-configure-ftd-high-availability-on-firep.html

Summary

What we talked about during this session:

• Brief introduction to High Availability feature.

• Difference between HA in ASA and FTD.

• HA setup and health verification commands.

• Troubleshooting steps for unexpected failover due to issues with monitored interfaces, disk or
traceback.
• Explanation of App-sync errors and troubleshooting steps.

• Steps to troubleshoot and recover from Split-Brain.

• HA best practices.
