FY24 EMEA TAC Sec Workshop - Firewall - ASAFTD High-Availability
October 2023
Session Goal
• Understanding the methodology for troubleshooting the most common High-Availability issues on both ASA and FTD.
• Using verification commands in real scenarios to determine the causes of failover events.
© 2023 Cisco and/or its affiliates. All rights reserved. Cisco Confidential
Agenda
1 Few words about High Availability
3 Troubleshooting workflow
4 Common issues
Disclaimer
• Rebranding
• Cisco Next-Generation Firewall (NGFW) is now Cisco Secure Firewall.
• Rebranded names in version 7.2:
Former Name | Rebranded Name
Firepower Threat Defense Virtual (FTDv) | Secure Firewall Threat Defense Virtual
Firepower Management Center Virtual (FMCv) | Secure Firewall Management Center Virtual
Firepower eXtensible Operating System (FXOS) | Secure Firewall eXtensible Operating System
• High availability refers to the failover configuration. High availability or failover setup joins two devices
so that if one of the devices fails, the other device can take over.
• Primary and Secondary are roles; they stay with the units and are specified during the initial HA configuration.
• Active and Standby are states and change depending on the health status of each unit.
Few words about HA
• Both ASA/FTD units in a pair must be identical in hardware, software, memory, interfaces and mode.
Classic ASA vs FTD failover
• ASA monitors the state of the interfaces. FTD also monitors Snort and Disk space.
• Failover replication command options are not configurable for FTD and use default setting:
• On ASA you can configure encryption for the failover link in two different ways: a simple key or an IPsec tunnel. FTD supports only the IPsec tunnel option.
• On ASA you can use a sub-interface as the failover or state interface. On FTD you must use a physical interface.
HA state flow diagram
Verification commands
• show failover (Primary/Active unit):
> show failover
Failover On
Failover unit Primary
Failover LAN Interface: failover GigabitEthernet0/4 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 361 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.18(2)219, Mate 9.18(2)219
Serial Number: Ours 9AD2AL87FDQ, Mate 9ALU58NUM7A
Last Failover at: 06:24:15 UTC Jul 5 2023
This host: Primary - Active
Active time: 102448 (sec)
slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (192.168.2.10): Normal (Waiting)
Interface Inside (192.168.28.1): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Secondary - Standby Ready
Active time: 0 (sec)
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (0.0.0.0): Normal (Waiting)
Interface Inside (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)

• show failover (Secondary/Standby unit):
> show failover
Failover On
Failover unit Secondary
Failover LAN Interface: failover GigabitEthernet0/4 (up)
Reconnect timeout 0:00:00
Unit Poll frequency 1 seconds, holdtime 15 seconds
Interface Poll frequency 5 seconds, holdtime 25 seconds
Interface Policy 1
Monitored Interfaces 3 of 361 maximum
MAC Address Move Notification Interval not set
failover replication http
Version: Ours 9.18(2)219, Mate 9.18(2)219
Serial Number: Ours 9ALU58NUM7A, Mate 9AD2AL87FDQ
Last Failover at: 19:07:10 UTC Jul 5 2023
This host: Secondary - Standby Ready
Active time: 0 (sec)
slot 0: ASAv hw/sw rev (/9.18(2)219) status (Up Sys)
Interface Outside (0.0.0.0): Normal (Waiting)
Interface Inside (0.0.0.0): Normal (Waiting)
Interface diagnostic (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
Other host: Primary - Active
Active time: 102512 (sec)
Interface Outside (192.168.2.10): Normal (Waiting)
Interface Inside (192.168.28.1): Normal (Waiting)
Interface diagnostic (0.0.0.0): Normal (Waiting)
slot 1: snort rev (1.0) status (up)
slot 2: diskstatus rev (1.0) status (up)
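When comparing both units, the lines that usually matter first in the show failover output are the "This host" / "Other host" state lines and the per-interface status lines. The following is a minimal Python sketch for pulling those fields out of saved output. The function name parse_show_failover and the SAMPLE text are illustrative assumptions, not part of any Cisco tooling, and the regular expressions only cover the line shapes shown in the output above.

```python
import re

# Matches e.g. "This host: Primary - Active" or "Other host: Secondary - Standby Ready"
HOST_RE = re.compile(r"^(This|Other) host:\s+(Primary|Secondary)\s+-\s+(.+)$")
# Matches e.g. "Interface Outside (192.168.2.10): Normal (Waiting)"
IFACE_RE = re.compile(r"^Interface (\S+) \(([^)]*)\):\s+(.+)$")


def parse_show_failover(text):
    """Return {'this': {...}, 'other': {...}} with role, state and interfaces.

    Hypothetical helper: only handles the line formats seen in this deck.
    """
    hosts = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        host = HOST_RE.match(line)
        if host:
            current = host.group(1).lower()
            hosts[current] = {
                "role": host.group(2),
                "state": host.group(3).strip(),
                "interfaces": {},
            }
            continue
        iface = IFACE_RE.match(line)
        if iface and current:
            hosts[current]["interfaces"][iface.group(1)] = {
                "ip": iface.group(2),
                "status": iface.group(3).strip(),
            }
    return hosts


# Shortened excerpt of the Primary/Active output shown above.
SAMPLE = """\
This host: Primary - Active
Active time: 102448 (sec)
Interface diagnostic (0.0.0.0): Normal (Waiting)
Interface Outside (192.168.2.10): Normal (Waiting)
Other host: Secondary - Standby Ready
Active time: 0 (sec)
Interface Outside (0.0.0.0): Normal (Waiting)
"""

if __name__ == "__main__":
    for side, info in parse_show_failover(SAMPLE).items():
        print(side, info["role"], info["state"], sorted(info["interfaces"]))
```

Running the parser on the output of each unit and comparing the two dictionaries gives a quick view of whether both units agree on who is Active.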
Verification commands
• show failover (continued, Primary/Active unit):
Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/4 (up)
Stateful Obj        xmit    xerr    rcv     rerr
General             79005   0       78326   0
sys cmd             78333   0       78326   0
up time             0       0       0       0
RPC services        0       0       0       0
TCP conn            117     0       0       0
UDP conn            402     0       0       0
ARP tbl             143     0       0       0
Xlate_Timeout       0       0       0       0
IPv6 ND tbl         0       0       0       0
VPN IKEv1 SA        0       0       0       0
VPN IKEv1 P2        0       0       0       0
VPN IKEv2 SA        0       0       0       0
VPN IKEv2 P2        0       0       0       0
VPN CTCP upd        0       0       0       0
VPN SDI upd         0       0       0       0
VPN DHCP upd        0       0       0       0
SIP Session         0       0       0       0
SIP Tx              0       0       0       0
SIP Pinhole         0       0       0       0
Route Session       0       0       0       0
Router ID           0       0       0       0
User-Identity       5       0       0       0
CTS SGTNAME         0       0       0       0
CTS PAC             0       0       0       0
TrustSec-SXP        0       0       0       0
IPv6 Route          0       0       0       0
STS Table           0       0       0       0
Umbrella Device-ID  0       0       0       0
Rule DB B-Sync      0       0       0       0
Rule DB P-Sync      4       0       0       0
Rule DB Delete      1       0       0       0

• show failover (continued, Secondary/Standby unit):
Stateful Failover Logical Update Statistics
Link : failover GigabitEthernet0/4 (up)
Stateful Obj        xmit    xerr    rcv     rerr
General             7601    0       7607    0
sys cmd             7601    0       7601    0
up time             0       0       0       0
RPC services        0       0       0       0
TCP conn            0       0       0       0
UDP conn            0       0       0       0
ARP tbl             0       0       5       0
Xlate_Timeout       0       0       0       0
IPv6 ND tbl         0       0       0       0
VPN IKEv1 SA        0       0       0       0
VPN IKEv1 P2        0       0       0       0
VPN IKEv2 SA        0       0       0       0
VPN IKEv2 P2        0       0       0       0
VPN CTCP upd        0       0       0       0
VPN SDI upd         0       0       0       0
VPN DHCP upd        0       0       0       0
SIP Session         0       0       0       0
SIP Tx              0       0       0       0
SIP Pinhole         0       0       0       0
Route Session       0       0       0       0
Router ID           0       0       0       0
User-Identity       0       0       1       0
CTS SGTNAME         0       0       0       0
CTS PAC             0       0       0       0
TrustSec-SXP        0       0       0       0
IPv6 Route          0       0       0       0
STS Table           0       0       0       0
Umbrella Device-ID  0       0       0       0
Rule DB B-Sync      0       0       0       0
Rule DB P-Sync      0       0       0       0
Rule DB Delete      0       0       0       0
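The xerr and rerr columns of the stateful statistics are the ones worth scanning first, since non-zero values suggest replication errors on the state link. As a rough sketch, the Python below parses rows of the form "name xmit xerr rcv rerr" and lists the objects whose error counters are non-zero. The helper names parse_stateful_stats and objects_with_errors are assumptions made for illustration, and the parser only understands the row layout shown above.

```python
import re

# A statistics row: object name (may contain spaces) followed by four counters.
ROW_RE = re.compile(
    r"^(?P<obj>.+?)\s+(?P<xmit>\d+)\s+(?P<xerr>\d+)\s+(?P<rcv>\d+)\s+(?P<rerr>\d+)$"
)


def parse_stateful_stats(text):
    """Map each stateful object name to its xmit/xerr/rcv/rerr counters."""
    stats = {}
    for raw in text.splitlines():
        match = ROW_RE.match(raw.strip())
        if match:
            stats[match.group("obj")] = {
                key: int(match.group(key))
                for key in ("xmit", "xerr", "rcv", "rerr")
            }
    return stats


def objects_with_errors(stats):
    """Return the sorted names of objects with a non-zero xerr or rerr."""
    return sorted(
        name for name, counters in stats.items()
        if counters["xerr"] or counters["rerr"]
    )


# Excerpt of the Primary/Active statistics shown above.
SAMPLE_STATS = """\
Stateful Obj xmit xerr rcv rerr
General 79005 0 78326 0
TCP conn 117 0 0 0
VPN IKEv1 P2 0 0 0 0
User-Identity 5 0 0 0
"""

if __name__ == "__main__":
    parsed = parse_stateful_stats(SAMPLE_STATS)
    print(objects_with_errors(parsed) or "no transmit/receive errors")
```

The header line is skipped automatically because it contains no numeric counters.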
• show failover state (Primary/Active unit):
> show failover state
State          Last Failure Reason      Date/Time
This host - Primary
Active         Comm Failure             06:22:47 UTC Jul 5 2023
Other host - Secondary
Standby Ready  Comm Failure             19:01:26 UTC Jul 5 2023

• show failover state (Secondary/Standby unit):
> show failover state
State          Last Failure Reason      Date/Time
This host - Secondary
Standby Ready  None
Other host - Primary
Active         None
Common issues related to HA
• There are common situations where failover happens without a clear reason:
• Issue with monitored interfaces.
• Disk issue.
• Traceback (reboot).
Unexpected failover – Monitored interfaces
• When a unit does not receive hello messages on a monitored interface for 15 seconds, it runs
interface tests.
• If one of the interface tests fails for an interface, but the same interface on the other unit continues to
successfully pass traffic, then the interface is considered to be failed, and the device stops running
tests.
• If the faulty interface is on the Active unit, a failover happens.
• If the faulty interface is on the Standby unit, no failover happens; the Standby unit is marked as Failed.
• If a unit is Failed because of a monitored interface failure, that interface needs to be verified.
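The outcome rules in the bullets above can be written down as a small decision function. This is only a sketch of the logic as stated on this slide, not of the device's internal implementation; the function name and return strings are hypothetical, and the case where the peer interface is also not passing traffic is deliberately left undecided because the slide does not cover it.

```python
def monitored_interface_outcome(unit_state, peer_interface_passes_traffic):
    """Describe what happens when an interface test fails on one unit.

    unit_state: "Active" or "Standby", the state of the unit with the
                failing interface.
    peer_interface_passes_traffic: True if the same interface on the
                other unit still passes traffic.
    """
    if unit_state not in ("Active", "Standby"):
        raise ValueError(f"unexpected unit state: {unit_state!r}")
    if not peer_interface_passes_traffic:
        # Not described on the slide, so this sketch does not guess.
        return "undetermined"
    if unit_state == "Active":
        return "failover"
    return "no failover, Standby marked Failed"


if __name__ == "__main__":
    for state in ("Active", "Standby"):
        print(state, "->", monitored_interface_outcome(state, True))
```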
Unexpected failover – Monitored interfaces
Primary:
> show failover state
State          Last Failure Reason      Date/Time
This host - Primary
Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
               Outside: No Link
Other host - Secondary
Active         Comm Failure             18:44:37 UTC Jul 10 2023
====Configuration State===
Sync Done
====Communication State===
Mac set

Secondary:
> show failover state
State          Last Failure Reason      Date/Time
This host - Secondary
Active         Comm Failure             18:44:01 UTC Jul 10 2023
Other host - Primary
Failed         Ifc Failure              10:31:10 UTC Jul 17 2023
               Outside: No Link
====Configuration State===
====Communication State===
Mac set
• Troubleshooting to be performed:
• sudo df -hT (-h: print disk utilization in human-readable form, -T: print the file system type):
admin@firepower:~$ sudo df -hT
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 720917580 104508748 616408832 15% /
tmpfs 65536 0 65536 0% /dev
tmpfs 98385644 0 98385644 0% /sys/fs/cgroup
/dev/sda6 41943040 40814524 1128516 98% /opt
tmpfs 98385644 248 98385396 1% /run
shm 13331456 51400 13280056 1% /dev/shm
tmpfs 98385644 4 98385640 1% /var/config
tmpfs 98385644 42320 98343324 1% /var/volatile/tmp
/dev/sda5 51474044 53200 48799456 1% /var/data/cores
/dev/sda2 1001328 30664 918136 4% /opt/cisco/config/host-common
/dev/sda3 4722056 16760 4458768 1% /opt/cisco/csp/applications/cisco-ftd.7.2_ftd_001_/app_data/disk0/log/.ntp.log
tmpfs 98385644 0 98385644 0% /proc/acpi
tmpfs 98385644 0 98385644 0% /proc/scsi
tmpfs 98385644 0 98385644 0% /sys/firmware
none 514048 0 514048 0% /dev/shm/snort
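In the listing above, /opt at 98% is the entry that stands out. A short Python sketch can pick out such mount points from saved df output. The function name high_usage_mounts and the 90% default threshold are assumptions chosen for illustration, not Cisco-defined values. The parser reads the Use% and mount-point columns from the right-hand side of each row, so it works with or without the extra Type column, but it does not handle mount points that contain spaces.

```python
def high_usage_mounts(df_output, threshold=90):
    """Return (mount_point, use_percent) pairs at or above the threshold."""
    flagged = []
    for line in df_output.splitlines():
        fields = line.split()
        # Use% is the second-to-last column, the mount point is the last.
        if len(fields) < 2 or not fields[-2].endswith("%"):
            continue  # header or unrelated line
        percent = fields[-2][:-1]
        if not percent.isdigit():
            continue
        if int(percent) >= threshold:
            flagged.append((fields[-1], int(percent)))
    return flagged


# Excerpt of the df output shown above.
SAMPLE_DF = """\
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 720917580 104508748 616408832 15% /
/dev/sda6 41943040 40814524 1128516 98% /opt
/dev/sda5 51474044 53200 48799456 1% /var/data/cores
"""

if __name__ == "__main__":
    for mount, used in high_usage_mounts(SAMPLE_DF):
        print(f"{mount} is at {used}%")
```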
Unexpected failover – Disk issue
• High disk utilization is often caused by old files that are no longer needed.
• Old files can be cleaned from the disk, but only with extra caution.
• Linux has no concept of a "recycle bin"; deleted items practically cannot be restored.
• Do not use absolute paths; first enter the directory, then remove the file.
• If you are not sure whether a specific file can be removed, do not delete it.
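In keeping with the cautions above, it helps to review what is taking up space before removing anything. The sketch below is read-only: it lists the largest regular files in a single directory and never deletes. The function name largest_files is hypothetical, and whether a Python interpreter is available in the environment where you review the files is an assumption.

```python
import os


def largest_files(directory=".", top=5):
    """Return up to `top` (name, size_in_bytes) pairs, largest first.

    Only regular files directly inside `directory` are considered;
    symlinks and subdirectories are skipped. Nothing is modified.
    """
    entries = []
    with os.scandir(directory) as listing:
        for entry in listing:
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                entries.append((entry.name, size))
    entries.sort(key=lambda item: item[1], reverse=True)
    return entries[:top]


if __name__ == "__main__":
    # Defaults to the current directory, matching the advice to enter
    # the directory first instead of working with absolute paths.
    for name, size in largest_files("."):
        print(f"{size:>14}  {name}")
```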
DEMO Unexpected failover – Disk issue
Unexpected failover – Traceback
• The root cause of a Lina/Snort traceback is usually investigated by TAC and the software engineering team.
• Several steps can be taken to collect the needed outputs before opening a case:
• Generate a troubleshoot file for FTD or show tech-support for ASA.
• Check the show tech-support output to confirm the traceback.
• Collect the Lina crash-info (if it exists).
• Collect the core file (if it exists).
Unexpected failover – Traceback
Directory of disk0:/
Unexpected failover – Traceback
First output:
------------------ show failover history ------------------
==========================================================================
From State                 To State                   Reason
==========================================================================
06:08:57 UTC Jun 20 2023
Not Detected               Disabled                   No Error

Second output:
------------------ show failover history ------------------
==========================================================================
From State                 To State                   Reason
==========================================================================
04:51:06 UTC May 13 2023
Bulk Sync                  Standby Ready              Failover state check
App Sync issues during joining HA
• If the show failover history output indicates an App Sync failure, the problem occurred during the HA validation phase, in which the system checks that the units can function correctly as a high availability group.
• When validation succeeds, the message “All validation passed” appears with App Sync as the From State, and the node moves to the Standby Ready state.
• Any validation failure transitions the peer to the Disabled (Failed) state.
App Sync issues during joining HA
• At this stage, policy deployments also fail because the active unit thinks app sync is still in progress.
• Policy deployment throws the error: "since new Node join/AppSync process is in progress, Configuration Changes are not allowed, and hence rejects the deployment request. Please retry deployment after some time".
• Resuming high availability on the Standby node sometimes resolves the issue.
App Sync issues during joining HA
• Example error: "CD App Sync error is Rsync based file retrieval failed. Check app-sync-history CLI for details."
• The Standby unit can recover on its own, after a reboot, or after HA is resumed.
App Sync issues during joining HA
• Some sync issues are temporary and can be resolved by resuming HA on the standby unit:
• ASA:
ciscoasa(config)# failover
• FTD:
> configure high-availability resume
• If the issue persists after resuming, it needs further analysis, so a TAC engineer needs to be involved.
App-Sync Issues
DEMO
Bug: CSCwh02757
Split-Brain (Active/Active)- What is it?
Primary:
> show failover state
State          Last Failure Reason      Date/Time
This host - Primary
Active         None
Other host - Secondary
Failed         Comm Failure             06:24:15 UTC Jul 6 2023

Secondary:
> show failover state
State          Last Failure Reason      Date/Time
This host - Secondary
Active         None
Other host - Primary
Failed         Comm Failure             06:24:15 UTC Jul 6 2023
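The signature of split-brain in the outputs above is that each unit reports itself as Active and its peer as Failed. The following Python sketch checks for that condition from the saved show failover state output of both units. The helper names are hypothetical, and only the three states that appear in this deck (Active, Standby Ready, Failed) are recognised; any other state is returned as None.

```python
import re

HOST_RE = re.compile(r"^This host\s*[-\u2013]\s*(Primary|Secondary)\s*$")
STATE_RE = re.compile(r"^(Standby Ready|Active|Failed)\b")


def this_host_state(output):
    """Return the state reported on the line after 'This host - <role>'."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    for index, line in enumerate(lines[:-1]):
        if HOST_RE.match(line):
            state = STATE_RE.match(lines[index + 1])
            return state.group(1) if state else None
    return None


def is_split_brain(primary_output, secondary_output):
    """True when both units claim the Active state at the same time."""
    return (
        this_host_state(primary_output) == "Active"
        and this_host_state(secondary_output) == "Active"
    )


# Outputs taken from the split-brain example above.
PRIMARY = """\
This host - Primary
Active None
Other host - Secondary
Failed Comm Failure 06:24:15 UTC Jul 6 2023
"""
SECONDARY = """\
This host - Secondary
Active None
Other host - Primary
Failed Comm Failure 06:24:15 UTC Jul 6 2023
"""

if __name__ == "__main__":
    print("split-brain" if is_split_brain(PRIMARY, SECONDARY) else "healthy")
```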
Split-Brain (Active/Active)
Primary:
> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
06:45:28 UTC Jun 27 2023
Not Detected               Disabled                   No Error
11:54:36 UTC Jun 27 2023
Disabled                   Negotiation                Set by the config command
                                                      (failover)
11:55:21 UTC Jun 27 2023
Negotiation                Just Active                No Active unit found
11:55:21 UTC Jun 27 2023
Just Active                Active Drain               No Active unit found
11:55:21 UTC Jun 27 2023
Active Drain               Active Applying Config     No Active unit found

Secondary:
> show failover history
==========================================================================
From State                 To State                   Reason
==========================================================================
19:04:58 UTC Jul 5 2023
Bulk Sync                  Standby Ready              Detected an Active peer
06:24:15 UTC Jul 6 2023
Standby Ready              Just Active                HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Just Active                Active Drain               HELLO not heard from peer
                                                      (failover link up, no response from peer)
06:24:15 UTC Jul 6 2023
Active Drain               Active Applying Config     HELLO not heard from peer
                                                      (failover link up, no response from peer)
Emergency Recovery from Split-Brain
• To minimize the impact of split-brain, you can disable failover on one of the units or disconnect it from the network.
• Disable failover on the unit not passing traffic:
• On the ASA platform, from the CLI, enter configuration mode and run the "no failover" command.
• On the FTD platform, from the CLI, run the "configure high-availability suspend" command.
• For FTD, shut down the interfaces on the connected device. Alternatively, you can physically disconnect the interfaces.
• You can also power off the device, but this prevents you from managing it.
Emergency Recovery from Split-Brain
Suspend HA:
> configure high-availability suspend
Resume HA:
> configure high-availability resume
Successfully resumed high-availablity.
Split-Brain - Possible causes
• Split-Brain occurs when communication between the failover link interfaces is down (unidirectionally or bidirectionally). This scenario can be seen if the failover and data links travel through the same path. The most common reasons are:
Procedure to Troubleshoot failover link - Flowchart
Start of troubleshooting
L1/L2: Is the status/protocol for the Failover LAN interface on both the units up? (show interface ip brief)
NO: The link on both of the units has to be UP. Common reasons for the connection to be down include:
• Failed/shut interface of an intermediate device – check the intermediate device, if any
• Issue with physical cabling or interface failure – check the physical connection and, if possible, replace cables/SFP
YES: proceed to the next check.
Procedure to Troubleshoot failover link - Flowchart
L3: Can both the units ping each other over the Failover Link?
NO: Apply captures on both the units for protocol 105 on the failover link interface, e.g.:
cap test interface fover match 105 any any
You should see protocol 105 packets in the above capture between the Primary and Secondary unit. You will see ESP packets in case IPsec encryption is enabled on the failover interface.
In case you see only one-way traffic on both/one of the boxes:
> Check show blocks to verify whether Memory Block 1550 has been depleted.
> Check show mac address-table on the intermediate L2 device, if any. Verify that the MAC addresses are being correctly learnt.
> Another quick way to verify connectivity is to run the show failover command on both units. A "normal" status on each interface indicates that the keepalive packets are correctly received.
YES: proceed to the latency check.

Is latency between the two units greater than 10ms?
YES: Check for latency: ping the peer firewall's failover interface. Usually the round-trip time/2 is a good indicator of peak and average latency.
For more accurate readings, captures on the failover interface from both units can be exported and compared.
Latency between the two units in a failover pair needs to be under 250ms. It's recommended to keep latency under 10ms.
Though the chances of latency causing a Split-Brain scenario are low, high latency can cause intermittent failovers and impact failover performance in general.
NO: Your problem is not a common problem. You should engage TAC by opening a case for further troubleshooting.
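The latency guidance in the flowchart can be expressed as a small helper: halve the measured round-trip time and compare the result with the 10 ms recommendation and the 250 ms limit. This is a sketch under the assumption that the one-way latency is estimated as round-trip time / 2, as the flowchart suggests; the function name and the returned labels are illustrative only.

```python
RECOMMENDED_MS = 10.0   # recommended upper bound from the flowchart
MAXIMUM_MS = 250.0      # latency needs to stay under this value


def classify_failover_latency(rtt_ms):
    """Return (estimated_one_way_ms, verdict) for a measured ping RTT."""
    if rtt_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    one_way = rtt_ms / 2.0
    if one_way < RECOMMENDED_MS:
        verdict = "ok"
    elif one_way < MAXIMUM_MS:
        verdict = "above recommended 10 ms"
    else:
        verdict = "exceeds 250 ms limit"
    return one_way, verdict


if __name__ == "__main__":
    for rtt in (8.0, 30.0, 600.0):
        print(rtt, classify_failover_latency(rtt))
```

For more precise numbers, the flowchart's suggestion of comparing exported captures from both units still applies; a ping-based estimate is only a first check.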
To proactively prepare against a Split-Brain condition:
• Enable logging to an external syslog server and enable the logging timestamp option.
Verification Cheat Sheet
Commands Logs
Disk0/log/fover_trace.log | /mnt/Disk0/log/fover_trace.log
Summary
• Troubleshooting steps for unexpected failover due to issues with monitored interfaces, disk or
traceback.
• Explanation of App-sync errors and troubleshooting steps.
• HA best practices.