NSX 6.3 Troubleshooting Guide
Update 6
Modified on 29 MAR 2018
VMware NSX for vSphere 6.3
NSX Troubleshooting Guide
You can find the most up-to-date technical documentation on the VMware website at:
https://docs.vmware.com/
If you have comments about this documentation, submit your feedback to
[email protected]
VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com
Copyright © 2010 – 2018 VMware, Inc. All rights reserved. Copyright and trademark information.
NSX Troubleshooting Guide 1
The NSX Troubleshooting Guide describes how to monitor and troubleshoot the VMware NSX® for
vSphere® system by using the NSX Manager user interface, the vSphere Web Client, and other NSX
components, as needed.
Intended Audience
This manual is intended for anyone who wants to use or troubleshoot any problem for NSX in a VMware
vCenter environment. The information in this manual is written for experienced system administrators who
are familiar with virtual machine technology and virtual datacenter operations. This manual assumes
familiarity with VMware vSphere, including VMware ESXi, vCenter Server, and the vSphere Web Client.
1 Go to the NSX Dashboard and check whether any errors or warnings are displayed for a
component. See Using the NSX Dashboard.
2 Go to the Monitor tab of the primary NSX Manager and check whether any system events have been
triggered. For more details on system events and alarms, refer to NSX Logging and System Events.
3 Use the GET api/2.0/services/systemalarms API to view alarms on NSX objects. For more
information on the API, refer to the NSX API Guide.
5 If your problem is not resolved, download the technical support logs and contact VMware support.
See "How to file a Support Request in My VMware". For more information on how to download logs,
refer to NSX Logging and System Events.
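The alarm query in step 3 can be sketched in Python. This is a minimal, hedged example rather than an official client: the host name, the credentials, HTTP basic authentication, and the XML Accept header are assumptions, and the request is only built here, not sent.

```python
import base64
import urllib.request


def build_system_alarms_request(nsx_host: str, user: str, password: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the system alarms endpoint."""
    url = f"https://{nsx_host}/api/2.0/services/systemalarms"
    # Assumption: the NSX Manager API accepts HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    # Assumption: the response format requested here is XML.
    request.add_header("Accept", "application/xml")
    return request


if __name__ == "__main__":
    # Hypothetical host name and credentials, for illustration only.
    req = build_system_alarms_request("nsxmgr.example.com", "admin", "secret")
    print(req.get_method(), req.full_url)
    # To send it against a real NSX Manager:
    # body = urllib.request.urlopen(req, timeout=30).read()
```

Building the request separately from sending it makes it easy to inspect the URL and headers before pointing the script at a production NSX Manager.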
2 Click Networking & Security, and then click Dashboard. The Dashboard page is displayed.
3 In a Cross-vCenter NSX environment, select the NSX Manager with primary role or secondary role.
- Replicator service: also monitors for replication errors (if Cross-vCenter NSX is enabled).
- Controller peer connectivity status. If a controller is down (shown in red), its peer
controllers are shown in yellow.
- Backup schedule.
- Last backup status (failed, successful, or not scheduled, along with the date and time).
- Deployment related:
- Firewall:
  - Yellow indicates that the distributed firewall is disabled on any of the clusters.
  - Red indicates that the distributed firewall could not be installed on any of the
  hosts/clusters.
- VXLAN:
  - Red (error) indicates that VTEP creation failed, the VTEP could not find the IP
  address, the VTEP was assigned a link-local IP address, and so on.
The Edge notifications dashboard highlights active alarms for certain services. It monitors the
critical events listed below and tracks them until the issue is resolved. Alarms are resolved
automatically when a recovery event is reported, or when the edge is force synced, redeployed, or upgraded.
- Appliance (edge VM, edge gateway, edge file system, NSX Manager, and edge services gateway
reports status):
Note Load balancer and VPN alarms are not cleared automatically on a configuration update. Once
the issue is resolved, you must clear the alarms manually through the API, using the alarm ID. The
following API calls can be used to view and clear the alarms. For details, refer to the NSX API Guide.
GET https://<<NSX-IP>>/api/2.0/services/alarms/{source-Id}
POST https://<<NSX-IP>>/api/2.0/services/alarms?action=resolve
GET https://<<NSX-IP>>/api/2.0/services/systemalarms/<alarmId>
POST https://<<NSX-IP>>/api/2.0/services/systemalarms/<alarmId>?action=resolve
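As a rough sketch, the four calls listed above can be assembled from an NSX Manager address, a source ID, and an alarm ID. The function name and the sample identifiers are hypothetical; the paths are copied from the calls above, and nothing is sent over the network.

```python
from typing import List, Tuple


def alarm_requests(nsx_ip: str, source_id: str, alarm_id: str) -> List[Tuple[str, str]]:
    """Return (HTTP method, URL) pairs for viewing and resolving alarms."""
    base = f"https://{nsx_ip}/api/2.0/services"
    return [
        ("GET", f"{base}/alarms/{source_id}"),                        # alarms for one source
        ("POST", f"{base}/alarms?action=resolve"),                    # resolve alarms
        ("GET", f"{base}/systemalarms/{alarm_id}"),                   # one system alarm
        ("POST", f"{base}/systemalarms/{alarm_id}?action=resolve"),   # resolve that alarm
    ]


if __name__ == "__main__":
    # Hypothetical address and identifiers, for illustration only.
    for method, url in alarm_requests("192.0.2.10", "edge-1", "42"):
        print(method, url)
```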
- Number of hosts with Firewall Publish status as failed. Status is Red when any host does not
successfully apply the published distributed firewall configuration.
- Flags when the backing distributed virtual port group is deleted from vCenter Server.
Table 1‑1. Checking the NSX Installation on ESXi Host—Commands Run from NSX Manager
Description: List all clusters to get the cluster IDs
Commands on NSX Manager: show cluster all
Notes: View all cluster information

Description: List all the hosts in the cluster to get the host IDs
Commands on NSX Manager: show cluster clusterID
Notes: View the list of hosts in the cluster, the host-ids, and the host-prep installation status

Description: List all the VMs on a host
Commands on NSX Manager: show host hostID
Notes: View particular host information, VMs, VM IDs, and power status
Table 1‑2. Names of VIBs and Modules Installed on Hosts to Use in Commands
NSX version: Any 6.3.x
ESXi version: 5.5
VIBs: esx-vxlan and esx-vsip
Modules: vdl2, vdrb, vsip, dvfilter-switch-security, bfd, traceflow

NSX version: 6.3.2 and earlier
ESXi version: 6.0 and later
VIBs: esx-vxlan and esx-vsip
Modules: vdl2, vdrb, vsip, dvfilter-switch-security, bfd, traceflow

NSX version: 6.3.3 and later
ESXi version: 6.0 and later
VIBs: esx-nsxv
Modules: nsx-vdl2, nsx-vdrb, nsx-vsip, nsx-dvfilter-switch-security, nsx-core, nsx-bfd, nsx-traceflow
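The version rules in Table 1-2 can be expressed as a small lookup. The following is a minimal sketch that covers only the combinations listed in the table; the function names are hypothetical.

```python
from typing import List, Tuple


def _version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a dotted version string such as '6.3.3' into (6, 3, 3)."""
    return tuple(int(part) for part in version.split("."))


def expected_vibs(nsx_version: str, esxi_version: str) -> List[str]:
    """Return the VIB names Table 1-2 lists for this NSX/ESXi combination."""
    nsx = _version_tuple(nsx_version)
    esxi = _version_tuple(esxi_version)
    if nsx[:2] != (6, 3):
        raise ValueError("Table 1-2 covers NSX 6.3.x only")
    if esxi < (5, 5):
        raise ValueError("Table 1-2 covers ESXi 5.5 and later only")
    # A single esx-nsxv VIB is used from NSX 6.3.3 on ESXi 6.0 and later.
    if esxi >= (6, 0) and nsx >= (6, 3, 3):
        return ["esx-nsxv"]
    # ESXi 5.5 with any 6.3.x, or ESXi 6.0+ with NSX 6.3.2 and earlier.
    return ["esx-vxlan", "esx-vsip"]


if __name__ == "__main__":
    print(expected_vibs("6.3.3", "6.5"))
    print(expected_vibs("6.3.2", "6.0"))
```

The returned names can be passed to esxcli software vib get --vibname <name> on the host.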
Table 1‑3. Checking the NSX Installation on ESXi Host—Commands Run from Host
Description: VIBs present depend on the NSX and ESXi versions. See table Names of VIBs and Modules Installed on Hosts for details on which modules to check on your installation.
Commands on Host: esxcli software vib get --vibname <name>
Notes: Check the version/date installed. esxcli software vib list displays a list of all VIBs on the system.

Description: List all the system modules currently loaded in the system
Commands on Host: esxcli system module list
Notes: Older equivalent command: vmkload_mod -l | grep -E "vdl2|vdrb|vsip|dvfilter-switch-security"

Description: Modules present depend on the NSX and ESXi versions. See table Names of VIBs and Modules Installed on Hosts for details on which modules to check on your installation.
Commands on Host: esxcli system module get -m <name>
Notes: Run the command for each module

Description: Check the user world agent (UWA) connections: port 1234 to controllers and port 5671 to NSX Manager
Commands on Host: esxcli network ip connection list | grep 1234
Commands on Host: esxcli network ip connection list | grep 5671
Notes: Port 1234 is the controller TCP connection. Port 5671 is the message bus TCP connection.
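The port checks for 1234 and 5671 can be scripted by scanning saved output of esxcli network ip connection list. This is a sketch only: the sample lines are illustrative, not captured from a real host, and the exact column layout of the command output is an assumption.

```python
def has_established(connection_list_output: str, port: int) -> bool:
    """Return True if any ESTABLISHED connection uses the given TCP port."""
    suffix = f":{port}"
    for line in connection_list_output.splitlines():
        fields = line.split()
        if "ESTABLISHED" not in fields:
            continue
        # Match the port on either the local or the foreign address column.
        if any(field.endswith(suffix) for field in fields):
            return True
    return False


# Illustrative sample in the general shape of the command output (assumed).
SAMPLE = """\
tcp  0  0  192.168.110.51:23251  192.168.110.31:1234  ESTABLISHED  35185  newreno  netcpa-worker
tcp  0  0  192.168.110.51:46132  192.168.110.15:5671  ESTABLISHED  33631  newreno  vsfwd
tcp  0  0  192.168.110.51:80     0.0.0.0:0            LISTEN       34234  newreno  rhttpproxy
"""

if __name__ == "__main__":
    print("controller (1234):", has_established(SAMPLE, 1234))
    print("message bus (5671):", has_established(SAMPLE, 5671))
```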
Table 1‑4. Checking the NSX Installation on ESXi Host—Host Networking Commands
Description: List physical NICs/vmnic
Host Networking Commands: esxcli network nic list
Notes: Check the NIC type, driver type, link status, MTU

Description: Physical NIC details
Host Networking Commands: esxcli network nic get -n vmnic#
Notes: Check the driver and firmware versions along with other details

Description: List vmk NICs with IP addresses/MAC/MTU, and so on
Host Networking Commands: esxcli network ip interface ipv4 get
Notes: To ensure VTEPs are correctly instantiated

Description: Details of each vmk NIC, including vDS information
Host Networking Commands: esxcli network ip interface list
Notes: To ensure VTEPs are correctly instantiated

Description: Details of each vmk NIC, including vDS info for VXLAN vmks
Host Networking Commands: esxcli network ip interface list --netstack=vxlan
Notes: To ensure VTEPs are correctly instantiated

Description: Find the VDS name associated with this host’s VTEP
Host Networking Commands: esxcli network vswitch dvs vmware vxlan list
Notes: To ensure VTEPs are correctly instantiated

Description: Ping from VXLAN-dedicated TCP/IP stack
Host Networking Commands: ping ++netstack=vxlan -I vmk1 x.x.x.x
Notes: To troubleshoot VTEP communication issues: add option -d -s 1572 to make sure that the MTU of transport network is correct for VXLAN

Description: View routing table of VXLAN-dedicated TCP/IP stack
Host Networking Commands: esxcli network ip route ipv4 list -N vxlan
Notes: To troubleshoot VTEP communication issues

Description: View ARP table of VXLAN-dedicated TCP/IP stack
Host Networking Commands: esxcli network ip neighbor list -N vxlan
Notes: To troubleshoot VTEP communication issues
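The -s 1572 value in the ping note corresponds to a 1600-byte transport MTU minus the 20-byte IPv4 header and the 8-byte ICMP header. The sketch below derives the payload size for other MTU values; the 1600-byte default, the function names, and the sample address are assumptions for illustration.

```python
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def vtep_ping_command(vmk: str, target_ip: str, mtu: int = 1600) -> str:
    """Build the VTEP ping command with don't-fragment (-d) and payload size (-s)."""
    return f"ping ++netstack=vxlan -I {vmk} -d -s {max_ping_payload(mtu)} {target_ip}"


if __name__ == "__main__":
    # Hypothetical VTEP address, for illustration only.
    print(vtep_ping_command("vmk1", "10.0.0.2"))
```

If the ping succeeds with the don't-fragment option at this size, the transport network carries frames of the tested MTU end to end.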
Table 1‑5. Checking the NSX Installation on ESXi Host—Host Log Files
Description: From NSX Manager
Log File: show manager log follow
Notes: Tails the NSX Manager logs. For live troubleshooting

Description: List all logical switches
Command: show logical-switch list all
Notes: List all the logical switches, their UUIDs to be used in API, transport zone, and vdnscope

Description: Find the controller that is the owner of the VNI
Command: show control-cluster logical-switches vni 5000
Notes: Note the controller IP address in the output and SSH to it

Description: Find all the hosts that are connected to this controller for this VNI
Command: show control-cluster logical-switch connection-table 5000
Notes: The source IP address in output is the management interface of host, and the port number is the source port of TCP connection

Description: Find the VTEPs registered to host this VNI
Command: show control-cluster logical-switches vtep-table 5002

Description: List the MAC addresses learned for VMs on this VNI
Command: show control-cluster logical-switches mac-table 5002
Notes: Map that the MAC address is actually on the VTEP reporting it
Description: List the ARP cache populated by the VM IP updates
Command: show control-cluster logical-switches arp-table 5002
Notes: ARP cache expires in 180 secs

Description: Check if the host VXLAN is in-sync or not
Command: esxcli network vswitch dvs vmware vxlan get
Notes: Shows the sync state and port used for encapsulation

Description: View VM attached and local switch port ID for datapath captures
Command: net-stats -l
Notes: A nicer way to get the VM switchport for a specific VM

Description: Verify VXLAN kernel module vdl2 is loaded
Command: esxcli system module get -m vdl2
Notes: Shows full detail of the specified module. Verify the version

Description: Verify correct VXLAN VIB version is installed. See table Names of VIBs and Modules Installed on Hosts for details on which VIBs to check on your installation.
Command: esxcli software vib get --vibname esx-vxlan
or
Command: esxcli software vib get --vibname esx-nsxv
Notes: Shows full detail of the specified VIB. Verify the version and date

Description: Verify the host knows about other hosts in the logical switch
Command: esxcli network vswitch dvs vmware vxlan network vtep list --vxlan-id=5001 --vds-name=Compute_VDS
Notes: Shows list of all the VTEPs that this host knows about that are hosting VNI 5001

Description: Verify control plane is up and active for a Logical switch
Command: esxcli network vswitch dvs vmware vxlan network list --vds-name Compute_VDS
Notes: Make sure the controller connection is up and the Port/Mac count matches the VMs on the LS on this host

Description: Verify host has learnt MAC addresses of all VMs
Command: esxcli network vswitch dvs vmware vxlan network mac list --vds-name Compute_VDS --vxlan-id=5000
Notes: This should list all the MACs for the VNI 5000 VMs on this host

Description: Verify host has locally cached ARP entry for remote VMs
Command: esxcli network vswitch dvs vmware vxlan network arp list --vds-name Compute_VDS --vxlan-id=5000
Notes: Verify host has locally cached ARP entry for remote VMs

Description: Verify VM is connected to LS & mapped to a local VMKnic. Also shows what vmknic ID a VM dvPort is mapped to
Command: esxcli network vswitch dvs vmware vxlan network port list --vds-name Compute_VDS --vxlan-id=5000
Notes: The vdrport will always be listed as long as the VNI is attached to a router
Description: Hosts are always connected to controllers hosting their VNIs
Log File: /etc/vmware/netcpa/config-by-vsm.xml
Notes: This file should always have all the controllers in the environment listed. The config-by-vsm.xml file is created by netcpa process

Description: The config-by-vsm.xml file is pushed by NSX Manager using vsfwd. If the config-by-vsm.xml file is not correct look at the vsfwd log
Log File: /var/log/vsfwd.log
Notes: Parse through this file looking for errors. To restart process: /etc/init.d/vShield-Stateful-Firewall stop|start

Description: Connection to controller is made using netcpa
Log File: /var/log/netcpa.log
Notes: Parse through this file looking for errors

Description: Logical switching module logs are in vmkernel.log
Log File: /var/log/vmkernel.log
Notes: Check logical switching module logs in /var/log/vmkernel.log “prefixed with VXLAN:”
Description: Commands for ESG
Command: show edge
Notes: CLI commands for Edge Services Gateway (ESG) start with 'show edge'

Description: Commands for DLR Control VM
Command: show edge
Notes: CLI commands for Distributed Logical Router (DLR) Control VM start with 'show edge'

Description: Commands for DLR
Command: show logical-router
Notes: CLI commands for Distributed Logical Router (DLR) start with show logical-router

Description: List all edges
Command: show edge all
Notes: List all the edges that support the central CLI

Description: List all the services and deployment details of an edge
Command: show edge edgeID
Notes: View Edge Service Gateway Information

Description: List the command options for edge
Command: show edge edgeID ?
Notes: View details, such as version, log, NAT, routing table, firewall, configuration, interface, and services

Description: View routing details
Command: show edge edgeID ip ?
Notes: View routing info, BGP, OSPF and other details

Description: View routing table
Command: show edge edgeID ip route
Notes: View the routing table at Edge

Description: View routing neighbor
Command: show edge edgeID ip ospf neighbor
Notes: View routing neighbor relationship

Description: View logical routers connection information
Command: show logical-router host hostID connection
Notes: Verify that the number of LIFs connected are correct, the teaming policy is right and the appropriate vDS is being used
Table 1‑10. Checking Logical Routing—Commands Run from NSX Manager (Continued)
Description: List all logical router instances running on the host
Commands on NSX Manager: show logical-router host hostID dlr all
Notes: Verify the number of LIFs and routes. Controller IP should be same on all hosts for a logical router. Control Plane Active should be yes. --brief gives a compact response

Description: Check the routing table on the host
Commands on NSX Manager: show logical-router host hostID dlr dlrID route
Notes: This is the routing table pushed by the controller to all the hosts in the transport zone. This must be same across all the hosts. If some of the routes are missing on a few hosts, try the sync command from controller mentioned earlier. The E flag means routes are learned via ECMP

Description: Check the LIFs for a DLR on the host
Commands on NSX Manager: show logical-router host hostID dlr dlrID interface (all | intName) verbose
Notes: The LIF information is pushed to hosts from the controller. Use this command to ensure the host knows about all the LIFs it should

Description: Find all the Logical Router Instances
Command: show control-cluster logical-routers instance all
Notes: This should list the logical router instance and all the hosts in the transport zone which should have the logical router instance on them. In addition, shows the Controller that is servicing this logical router

Description: View details of each logical router
Command: show control-cluster logical-routers instance 0x570d4555
Notes: The IP column shows the vmk0 IP addresses of all hosts where this DLR exists

Description: View all the interfaces CONNECTED to the logical router
Command: show control-cluster logical-routers interface-summary 0x570d4555
Notes: The IP column shows the vmk0 IP addresses of all hosts where this DLR exists

Description: View all the routes learned by this logical router
Command: show control-cluster logical-routers routes 0x570d4555
Notes: Note that the IP column shows the vmk0 IP addresses of all hosts where this DLR exists

Description: Shows all the network connections established, like a netstat output
Command: show network connections of-type tcp
Notes: Check if the host you are troubleshooting has netcpa connection Established to controller
Table 1‑11. Checking Logical Routing—Commands Run from NSX Controller (Continued)
Description: Sync interfaces from controller to host
Commands on NSX Controller: sync control-cluster logical-routers interface-to-host <logical-router-id> <host-ip>
Notes: Useful if new interface was connected to logical router but is not sync'd to all hosts

Description: Sync routes from controller to host
Commands on NSX Controller: sync control-cluster logical-routers route-to-host <logical-router-id> <host-ip>
Notes: Useful if some routes are missing on few hosts but are available on majority of hosts

Description: View the routes learned
Command: show ip route
Notes: Make sure the routing and forwarding tables are in sync

Description: View the forwarding table
Command: show ip forwarding
Notes: Make sure the routing and forwarding tables are in sync

Description: View the distributed logical router interfaces
Command: show interface
Notes: First NIC shown in the output is the distributed logical router interface. The distributed logical router interface is not a real vNIC on that VM. All the subnets attached to distributed logical router are of type INTERNAL

Description: View the other interfaces (management)
Command: show interface
Notes: Management/HA interface is a real vNIC on the logical router Control VM. If HA was enabled without specifying an IP address, 169.254.x.x/30 is used. If the management interface is given an IP address, it appears here

Description: Debug the protocol
Command: debug ip ospf
Command: debug ip bgp
Notes: Useful to see issues with the configuration (such as mismatched OSPF areas, timers, and wrong ASN). Note: output is only seen on the Console of Edge (not via SSH session)
Description: The config-by-vsm.xml file is pushed by NSX Manager using vsfwd. If the config-by-vsm.xml file is not correct look at the vsfwd log
Log File: /var/log/vsfwd.log
Notes: Parse through this file looking for errors. To restart process: /etc/init.d/vShield-Stateful-Firewall stop|start

Description: Connection to controller is made using netcpa
Log File: /var/log/netcpa.log
Notes: Parse through this file looking for errors

Description: Logical switching module logs are in vmkernel.log
Log File: /var/log/vmkernel.log
Notes: Check logical switching module logs in /var/log/vmkernel.log “prefixed with vxlan:”

Description: List all controllers with state
Command: show controller list all
Notes: Shows the list of all controllers and their running state
Description: Check controller cluster status
Command: show control-cluster status
Notes: Should always show 'Join complete' and 'Connected to Cluster Majority'

Description: Check the stats for flapping connections and messages
Command: show control-cluster core stats
Notes: The dropped counter should not change

Description: View the node's activity in relation to joining the cluster initially or after a restart
Command: show control-cluster history
Notes: This is great for troubleshooting cluster join issues

Description: View list of nodes in the cluster
Command: show control-cluster startup-nodes
Notes: Note that the list does not have to contain ONLY active cluster nodes. This should have a list of all the currently deployed controllers. This list is used by a starting controller to contact other controllers in the cluster

Description: Shows all the network connections established, like a netstat output
Command: show network connections of-type tcp
Notes: Check if the host you are troubleshooting has netcpa connection Established to controller

Description: To restart the controller process
Command: restart controller
Notes: Only restarts the main controller process. Forces a re-connection to the cluster

Description: View controller history and recent joins, restarts, and so on
Command: show control-cluster history
Notes: Great troubleshooting tool for controller issues especially around clustering

Description: Check for slow disk
Command: show log cloudnet/cloudnet_java-zookeeper<timestamp>.log filtered-by fsync
Notes: A reliable way to check for slow disks is to look for "fsync" messages in the cloudnet_java-zookeeper log. If sync takes more than 1 second, ZooKeeper prints this message, and it is a good indication that something else was utilizing the disk at that time

Description: Check for slow/malfunctioning disk
Command: show log syslog filtered-by collectd
Notes: Messages like the one in sample output about “collectd” tend to correlate with slow or malfunctioning disks

Description: Check for diskspace usage
Command: show log syslog filtered-by freespace:
Notes: There is a background job called “freespace” that periodically cleans up old logs and other files from the disk when the space usage reaches some threshold. In some cases, if the disk is small and/or filling up very fast, you’ll see a lot of freespace messages. This could be an indication that the disk filled up
Description: Find currently active cluster members
Command: show log syslog filtered-by Active cluster members
Notes: Lists the node-id for currently active cluster members. May need to look in older syslogs as this message is not printed all the time.

Description: View the core controller logs
Command: show log cloudnet/cloudnet_java-zookeeper.20150703-165223.3702.log
Notes: There may be multiple zookeeper logs, look at the latest timestamped file. This file has information about controller cluster master election and other information related to the distributed nature of controllers

Description: View the core controller logs
Command: show log cloudnet/cloudnet.nsx-controller.root.log.INFO.20150703-165223.3668
Notes: Main controller working logs, like LIF creation, connection listener on 1234, sharding
Description: View a VM's information
Command: show vm vmID
Notes: Details such as DC, Cluster, Host, VM Name, vNICs, dvfilters installed

Description: View particular virtual NIC information
Command: show vnic nicID
Notes: Details such as VNIC name, mac address, pg, applied filters

Description: View all cluster information
Command: show dfw cluster all
Notes: Cluster Name, Cluster Id, Datacenter Name, Firewall Status

Description: View particular cluster information
Command: show dfw cluster clusterID
Notes: Host Name, Host Id, Installation Status

Description: View dfw related host information
Command: show dfw host hostID
Notes: VM Name, VM Id, Power Status

Description: View details within a dvfilter
Command: show dfw host hostID filter filterID <option>
Notes: List rules, stats, address sets etc for each VNIC

Description: View DFW information for a VM
Command: show dfw vm vmID
Notes: View VM's name, VNIC ID, filters, and so on

Description: View VNIC details
Command: show dfw vnic vnicID
Notes: View VNIC name, ID, MAC address, portgroup, filter

Description: List the filters installed per vNIC
Command: show dfw host hostID summarize-dvfilter
Notes: Find the VM/vNIC of interest and get the name field to use in the next commands as filter

Description: View rules for a specific filter/vNIC
Command: show dfw host hostID filter filterID rules
Command: show dfw vnic nicID

Description: View details of an address set
Command: show dfw host hostID filter filterID addrsets
Notes: The rules only display address sets, this command can be used to expand what is part of an address set

Description: Spoofguard details per vNIC
Command: show dfw host hostID filter filterID spoofguard
Notes: Check if SpoofGuard is enabled and what is the current IP/MAC
Table 1‑17. Checking Distributed Firewall—Commands Run from NSX Manager (Continued)
Description: View details of flow records
Commands on NSX Manager: show dfw host hostID filter filterID flows
Notes: If flow monitoring is enabled, host sends flow information periodically to NSX Manager. Use this command to see flows per vNIC

Description: View statistics for each rule for a vNIC
Commands on NSX Manager: show dfw host hostID filter filterID stats
Notes: This is useful to see if rules are being hit

Description: Lists VIBs downloaded on the host. See table Names of VIBs and Modules Installed on Hosts for details on which VIBs to check on your installation.
Command: esxcli software vib list | grep esx-vsip
or
Command: esxcli software vib list | grep esx-nsxv
Notes: Check to make sure right vib version is downloaded

Description: Details on system modules currently loaded. See table Names of VIBs and Modules Installed on Hosts for details on which modules to check on your installation.
Command: esxcli system module get -m vsip
or
Command: esxcli system module get -m nsx-vsip
Notes: Check to make sure that the module was installed/loaded

Description: Process list
Command: ps | grep vsfwd
Notes: View if the vsfwd process is running with several threads

Description: View network connection
Command: esxcli network ip connection list | grep 5671
Notes: Check if the host has TCP connectivity to NSX Manager

Description: Packet logs dedicated file
Log File: /var/log/dfwpktlogs.log
Notes: Dedicated log file for packet logs
For example:
The host-check command can also be invoked through the NSX Manager API.
Troubleshooting NSX Infrastructure 2
NSX preparation is a 4-step process.
1 Connect NSX Manager to vCenter Server. There is a one-to-one relationship between NSX Manager
and vCenter Server.
a Register with vCenter Server.
2 Deploy NSX Controllers. Controllers are required only for logical switching, distributed routing,
or VXLAN in unicast or hybrid mode. If you use only distributed firewall (DFW), controllers are not required.
3 Host Preparation: Installs VIBs for VXLAN, DFW, and DLR on all hosts in the cluster. Configures the
Rabbit MQ-based messaging infrastructure. Enables firewall. Notifies controllers that hosts are ready
for NSX.
4 Configure IP pool settings and configure VXLAN: Creates a VTEP port group and VMKNICs on all
hosts in the cluster. During this step, you can set the transport VLAN ID, teaming policy, and MTU.
For more information about installation and configuration of each step, refer to NSX Installation Guide and
NSX Administration Guide.
Host Preparation
vSphere ESX Agent Manager deploys vSphere installation bundles (VIBs) onto ESXi hosts.
The deployment on hosts requires that DNS be configured on the hosts, vCenter Server, and
NSX Manager. Deployment does not require an ESXi host reboot, but any update or removal of VIBs
requires an ESXi host reboot.
VIBs are hosted on NSX Manager and are also available as a zip file.
# Single Version associated with all the VIBs pointed by above VDN_VIB_PATH(s)
VDN_VIB_VERSION=6.3.0.4744320
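The version line above can be read with a few lines of Python. This is a sketch that assumes the descriptor is a plain text file of KEY=value lines with # comments, as the excerpt suggests; the function name is hypothetical.

```python
from typing import Dict


def parse_descriptor(text: str) -> Dict[str, str]:
    """Parse KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


# The two lines shown in the excerpt above.
SAMPLE = """\
# Single Version associated with all the VIBs pointed by above VDN_VIB_PATH(s)
VDN_VIB_VERSION=6.3.0.4744320
"""

if __name__ == "__main__":
    print(parse_descriptor(SAMPLE)["VDN_VIB_VERSION"])
```

The parsed version can then be compared with the version reported by esxcli software vib list on a host.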
The VIBs installed on a host depend on the NSX and ESXi versions:
You can view the installed VIBs using the esxcli software vib list command.
or
- Might be due to a firewall blocking required ports between ESXi, NSX Manager, and vCenter
Server.
Most of the issues are resolved by clicking the Resolve option. Refer to Installation Status Is Not
Ready.
- A previous VIB of an older version is already installed. This requires user intervention to reboot hosts.
- NSX Manager and vCenter Server experience communication issues. The Host Preparation tab in
the Networking and Security Plug-in does not show all hosts properly:
If the problem is not fixed with the Resolve option, refer to Problem Not Fixed With the Resolve Option.
vCenter home > Administration > vCenter Server Extensions > vSphere ESX Agent Manager.
On vSphere ESX Agent Manager, check the status of agencies that are prefixed with “VCNS160”. If
an agency has a bad status, select the agency and view its issues.
- On the host that is having an issue, run the tail /var/log/esxupdate.log command.
[Figure labels: NSX Manager, NSX Controller Cluster, UWA, VXLAN, Security]
In rare cases, the installation of the VIBs succeeds but one or both of the user world
agents are not functioning correctly. This could manifest itself as:
- The control plane between hypervisors and the controllers being down. Check NSX Manager System
Events. Refer to NSX Logging and System Events.
If more than one ESXi host is affected, check the status of the message bus service on the NSX Manager
Appliance web UI under the Summary tab. If RabbitMQ is stopped, restart it.
- Check the messaging bus user world agent status on the hosts by running
the /etc/init.d/vShield-Stateful-Firewall status command on ESXi hosts.
- Run the esxcfg-advcfg -l | grep Rmq command on ESXi hosts to show all Rmq variables. There
should be 16 Rmq variables.
- Run the esxcli network ip connection list | grep 5671 command on ESXi hosts to check
for an active messaging bus connection.
For problems related to control plane agent, refer to Control Plane Agent (netcpa) Issues.
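The Rmq variable check described above can be automated against saved output of esxcfg-advcfg -l. The sketch below only counts lines that contain Rmq, which mirrors the grep; the variable names in the sample are hypothetical placeholders, not real setting names.

```python
EXPECTED_RMQ_VARIABLES = 16  # count stated in the check above


def count_rmq_variables(advcfg_output: str) -> int:
    """Count output lines that mention Rmq, like `grep Rmq` would."""
    return sum(1 for line in advcfg_output.splitlines() if "Rmq" in line)


def rmq_variables_ok(advcfg_output: str) -> bool:
    """Return True when exactly the expected number of Rmq variables is present."""
    return count_rmq_variables(advcfg_output) == EXPECTED_RMQ_VARIABLES


if __name__ == "__main__":
    # Hypothetical placeholder lines standing in for real command output.
    sample = "\n".join(f"/UserVars/RmqPlaceholder{i} [String] : value" for i in range(16))
    print("Rmq variables found:", count_rmq_variables(sample))
    print("OK:", rmq_variables_ok(sample))
```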
[Figure: host preparation architecture. Labels: NSX; Fabric; Network Fabric (Host Preparations); Security Fabric (Service Deployment); Service Insertion; vCenter; ESX Agent Manager Service; vCenter inventory; Create Agency; Fetch VIBs, OVF (https); Cluster 1; Cluster 2; Host 1 to Host 4; Network Fabric VIBs; Guest Introspection VIB; Guest Introspection SVM]
[Figure labels: NSX Object, EAM, IP Configuration, Agent 1, Agent 3]
The following terms can help you to understand the host preparation architecture:
Fabric
    Fabric is a software layer in NSX Manager which interacts with ESX Agent Manager to install
    network and security fabric services on hosts.
Network Fabric
    Network fabric services are deployed on a cluster. Network fabric services include host
    preparation, VXLAN, distributed routing, distributed firewall, and message bus.
Security Fabric
    Security fabric services are deployed on a cluster. Security fabric services include Guest
    Introspection and partner security solutions.
Fabric Agent
    A fabric agent is a combination of a fabric service and a host in the NSX Manager database.
    One fabric agent is created per host for a cluster on which a networking or security fabric
    service is deployed.
Deployment Unit
    A combination of a fabric service and a cluster in the NSX Manager database. A deployment unit
    must be created for networking and security services to get installed.
ESX Agent Manager Agent
    An ESX Agent Manager Agent is a combination of a service specification and a host in the
    vCenter Server database. An ESX Agent Manager agent maps to an NSX Fabric Agent.
ESX Agent Manager Agency
    An ESX Agent Manager Agency is a combination of a specification and a cluster in the vCenter
    Server database. The specification describes ESX Agent Manager agents and VIBs, OVFs and their
    configuration (such as datastore and network settings) that it manages.
The NSX Manager creates an ESX Agent Manager agency for each of the clusters that are being
prepared. NSX Manager creates a Deployment Unit in its database for each ESX Agent Manager
agency. One ESX Agent Manager agency = one Deployment Unit.
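The relationships among these terms can be pictured as a small data model. The classes and identifiers below are hypothetical and exist only to illustrate the cardinalities described above: one agency and one deployment unit per prepared cluster, and one fabric agent per host.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DeploymentUnit:
    """Fabric service + cluster, kept in the NSX Manager database."""
    service_id: str
    cluster_id: str


@dataclass(frozen=True)
class FabricAgent:
    """Fabric service + host, kept in the NSX Manager database."""
    service_id: str
    host_id: str


@dataclass(frozen=True)
class EamAgency:
    """Specification + cluster, kept in the vCenter Server database."""
    cluster_id: str
    deployment_unit: DeploymentUnit  # one agency = one deployment unit


def prepare_cluster(
    service_id: str, cluster_id: str, host_ids: List[str]
) -> Tuple[EamAgency, DeploymentUnit, List[FabricAgent]]:
    """Model what exists after a service is deployed on a cluster."""
    unit = DeploymentUnit(service_id, cluster_id)
    agency = EamAgency(cluster_id, unit)
    agents = [FabricAgent(service_id, host_id) for host_id in host_ids]
    return agency, unit, agents


if __name__ == "__main__":
    agency, unit, agents = prepare_cluster("host-prep", "cluster-1", ["host-1", "host-2"])
    print(agency)
    print(len(agents), "fabric agents")
```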
- Go to vCenter Solutions Manager > vSphere ESX Agent Manager > Manage.
- Under ESX Agencies, you can see the agencies (one for each prepared cluster).
The lifecycle of a deployment unit is tied to that of the agency: removing an agency from ESX Agent
Manager removes the corresponding deployment unit from NSX.
Install Workflow
[Figure: workflow boxes: User deploys service on a cluster; Fabric creates deployment unit with service ID and cluster ID; Fabric creates Agency in EAM with host preparation URLs and cluster ID]
Upgrade Workflow
[Figure: workflow boxes: User clicks “Upgrade Available” link for a cluster; Fabric updates deployment unit with information about the new version; Fabric updates Agency in EAM with new service URLs; EAM installs NSX VIBs on host; NSX exits host from maintenance mode; EAM marks the Agent as GREEN]
Install Workflow
[Figure: workflow boxes: EAM deploys VM on host; EAM sends provision signal to NSX; Fabric invokes Deployment Plugins (if any) to configure VM; Fabric acknowledges the signal from EAM; EAM marks the Agent as GREEN; Fabric invokes Deployment plugins (if any) to configure VM; NSX marks 3rd party service deployment as SUCCESS]
Upgrade Workflow
[Figure: workflow boxes: EAM deploys new VM on host; EAM sends provision signal to NSX; Fabric invokes Deployment Plugins (if any) to configure VM]
To check the communication channel health between NSX Manager and the firewall agent, NSX Manager
and the control plane agent, and the control plane agent and controllers, perform the following steps:
1 In vSphere Web Client, navigate to Networking & Security > Installation > Host Preparation.
2 Select a cluster or expand a cluster and select a host. Click Actions, and then click
Communication Channel Health.
If the status of any of the three connections for a host changes, a message is written to the NSX Manager
log. In the log message, the status of a connection can be UP, DOWN, or NOT_AVAILABLE (displayed as
Unknown in vSphere Web Client). If the status changes from UP to DOWN or NOT_AVAILABLE, a
warning message is generated. For example:
If the status changes from DOWN or NOT_AVAILABLE to UP, an INFO message that is similar to the
warning message is generated. For example:
If the control plane channel experiences a communication fault, a system event with one of the following
granular failure reasons is generated:
Also, heartbeat messages are sent from NSX Manager to hosts. A full configuration sync is
triggered if the heartbeat between NSX Manager and netcpa is lost.
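The logging rule for connection status changes described earlier (UP to DOWN or NOT_AVAILABLE produces a warning, and DOWN or NOT_AVAILABLE to UP produces an INFO message) can be sketched as a small function. The function name and the returned level strings are illustrative assumptions, not the actual NSX Manager log format.

```python
from typing import Optional

UP = "UP"
DOWN = "DOWN"
NOT_AVAILABLE = "NOT_AVAILABLE"  # displayed as Unknown in vSphere Web Client


def log_level_for_change(old_status: str, new_status: str) -> Optional[str]:
    """Return the log level for a connection status change, or None."""
    if old_status == new_status:
        return None  # no change, nothing is logged
    if old_status == UP and new_status in (DOWN, NOT_AVAILABLE):
        return "WARN"
    if old_status in (DOWN, NOT_AVAILABLE) and new_status == UP:
        return "INFO"
    # Changes between DOWN and NOT_AVAILABLE are not described in the guide.
    return None


if __name__ == "__main__":
    for old, new in [(UP, DOWN), (NOT_AVAILABLE, UP), (UP, UP)]:
        print(old, "->", new, ":", log_level_for_change(old, new))
```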
For more information on how to download logs, refer to NSX Administration Guide.
Problem
On the Host Preparation tab or Service Deployment tab, the installation status appears as Not Ready.
Solution
1 Go to the Networking & Security > Installation > Host Preparation tab or Service Deployment
tab.
To see the list of issues that are resolved by the Resolve option, refer to NSX Logging and System
Events.
4 If the status is still Not Ready and the error is not resolved, refer to Problem Not Fixed With the Resolve
Option.
Solution
[Flowchart: troubleshooting decision tree. Recoverable decision boxes: "Check if Agency is Yellow in EAM MOB" and "Is NSX FQDN resolvable?". Recoverable action: "Add entry of NSX FQDN in DNS server".]
Solution
[Flowchart: troubleshooting decision tree. Recoverable decision boxes: "Is it vCenter Server 5.5 and NSX has FIPS enabled" and "Is any external firewall blocking vCenter Server to NSX packet on port 443". Recoverable action: "Allow vCenter Server to NSX communication on port 443 in firewall". Recoverable check: "Worked".]
Problem
n Clicking the Not Ready link shows the error VIB module for agent is not installed on the
host.
n While changing from vShield Endpoint to NSX Manager, the status might appear as Failed.
Solution
1 Verify that the DNS is configured correctly on the vCenter Server, ESXi hosts and the NSX Manager.
Ensure that the forward and reverse DNS resolution from the vCenter Server, ESXi hosts,
NSX Manager and the vSphere Update Manager are working.
2 To determine if the problem is related to DNS, review the esxupdate.log file and look for the message
“esxupdate: ERROR: MetadataDownloadError:IOError: <urlopen error [Errno -2] Name
or service not known>”.
This message indicates that the ESXi host is unable to access the vCenter Server's Fully Qualified
Domain Name (FQDN). For more information, see Verifying the VMware vCenter Server Managed IP
Address (1008030).
3 Verify that Network Time Protocol (NTP) is configured correctly. VMware recommends configuring
NTP. To determine whether NTP out of sync issues are impacting your environment, check
the /etc/ntp.drift file in the NSX Manager support bundles with version 6.2.4 and later.
4 Verify that all ports required for NSX for vSphere 6.x are not blocked by a firewall. For related
information, refer to:
n TCP and UDP Ports required to access VMware vCenter Server, VMware ESXi and ESX hosts,
and other network components (1012382).
Note VMware vSphere 6.x supports VIB downloads over port 443 (instead of port 80). This port is
opened and closed dynamically. The intermediate devices between the ESXi hosts and
vCenter Server must allow traffic using this port.
5 Verify that the vCenter Server Managed IP Address is configured correctly. For more information,
see Verifying the VMware vCenter Server Managed IP Address (1008030).
6 Verify that the vSphere Update Manager is working correctly. Beginning with vCenter Server 6.0U3,
NSX installation and upgrade procedures no longer leverage vSphere Update Manager with ESX
Agent Manager. VMware strongly recommends running at least vCenter Server 6.0U3 or later. If you
cannot upgrade, ensure that the vSphere Update Manager service is running. You can configure the
vSphere Update Manager bypass option, as per KB 2053782.
7 If you specify non-default ports while deploying vCenter Server, ensure that these ports are not
blocked by the ESXi host firewall.
8 Verify that the vCenter Server vpxd process is listening on TCP port 8089. NSX Manager supports only
the default port 8089.
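The DNS checks in steps 1 and 2 can be scripted. The following Python sketch is an illustration only: the function names are hypothetical, and the log pattern is taken from the error message quoted in step 2. It checks forward and reverse resolution with the standard socket module and scans esxupdate.log text for the DNS failure.

```python
import re
import socket
from typing import List

# Pattern built from the esxupdate error quoted in step 2.
DNS_ERROR = re.compile(r"MetadataDownloadError.*Name or service not known")


def find_dns_errors(log_text: str) -> List[str]:
    """Return the esxupdate.log lines that report a name-resolution failure."""
    return [line for line in log_text.splitlines() if DNS_ERROR.search(line)]


def check_dns(fqdn: str) -> bool:
    """Return True if forward and reverse lookups both succeed for fqdn."""
    try:
        address = socket.gethostbyname(fqdn)  # forward lookup
        socket.gethostbyaddr(address)         # reverse lookup
        return True
    except OSError:  # covers socket.gaierror and socket.herror
        return False
```

Run `check_dns` from a machine that uses the same DNS servers as the component being tested; a result from a workstation with different resolvers says little about what the ESXi host sees.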
n Windows—C:\ProgramData\VMware\vCenterServer\logs\eam\eam.log
n VCSA—/var/log/vmware/vpx/eam.log
n ESXi—/var/log/esxupdate.log
Important Make sure to change the bypassVumEnabled flag to True before starting the NSX installation
and change it back to False after the installation. See https://kb.vmware.com/kb/2053782.
2 Click Administration > vCenter Server Extensions, and then click the vSphere ESX Agent
Manager.
The Manage tab shows information about running agencies, lists any orphaned ESX agents, and
logs information about the ESX agents that ESX Agent Manager manages.
For more information about agents and agencies, see vSphere documentation.
The Monitor > Events tab shows information about the events associated with ESX Agent
Manager.
Problem
Solution
Validate that each troubleshooting step is true for your environment. Each step provides instructions to
eliminate possible causes and take corrective action as necessary. The steps are ordered in the most
appropriate sequence to isolate the issue and identify the proper resolution. Do not skip a step.
Procedure
1 Check the NSX Release Notes for current releases to see if the problem is resolved in a bug fix.
2 Ensure that the minimum system requirements are met when installing VMware NSX Manager.
4 Installation issues:
n If configuring the lookup service or vCenter Server fails, verify that the NSX Manager and
lookup service appliances are in time sync. Use the same NTP server configurations on both
NSX Manager and the lookup service. Also ensure that DNS is properly configured.
n Verify that the OVA file is getting installed correctly. If an NSX OVA file cannot be installed, an
error window in the vSphere client notes where the failure occurred. Also, verify and validate
the MD5 checksum of the downloaded OVA/OVF file.
n Verify that the time on the ESXi hosts is in sync with NSX Manager.
n VMware recommends that you schedule a backup of the NSX Manager data immediately
after installing NSX Manager.
5 Upgrade issues:
n Before upgrading, see the latest interoperability information in the Product Interoperability
Matrixes page.
n VMware recommends that you back up your current configuration and download technical
support logs before upgrading.
n A force-resync with the vCenter Server may be required after the NSX Manager upgrade. To
do this, log in to the NSX Manager Web Interface GUI. Then go to Manage vCenter
Registration > NSX Management Service > Edit and re-enter the password for the
administrative user.
6 Performance issues:
n Verify that the root (/) partition has adequate space. You can verify this by logging in to the
ESXi host and running the df -h command.
For example:
[root@esx-01a:~] df -h
Filesystem Size Used Available Use% Mounted on
NFS 111.4G 80.8G 30.5G 73% /vmfs/volumes/ds-site-a-nfs01
vfat 249.7M 172.2M 77.5M 69% /vmfs/volumes/68cb5875-d887b9c6-a805-65901f83f3d4
vfat 249.7M 167.7M 82.0M 67% /vmfs/volumes/fe84b77a-b2a8860f-38cf-168d5dfe66a5
vfat 285.8M 206.3M 79.6M 72% /vmfs/volumes/54de790f-05f8a633-2ad8-00505603302a
n Use the esxtop command to check which processes are using large amounts of CPU and
memory.
n If the NSX Manager encounters any out-of-memory errors in the logs, verify that
the /common/dumps/java.hprof file exists. If this file exists, create a copy of the file and
include this with the NSX technical support log bundle.
7 Connectivity issues:
n If NSX Manager is having connectivity issues either with vCenter Server or the ESXi host, log
in to the NSX Manager CLI console, run the command: debug connection
IP_of_ESXi_or_VC, and examine the output.
n Verify that the Virtual Center Web management services are started and the browser is not in
an error state.
n If the NSX Manager Web User Interface (UI) is not updating, you can attempt to resolve the
issue by disabling and then re-enabling the Web services. See
https://kb.vmware.com/kb/2126701.
n Verify which port group and uplink NIC are used by the NSX Manager using the esxtop
command on the ESXi host. For more information, see https://kb.vmware.com/kb/1003893.
n Check the NSX Manager virtual machine appliance Tasks and Events tab from the vSphere
Web Client under the Monitor tab.
n If the NSX Manager is having connectivity issues with vCenter Server, attempt to migrate the
NSX Manager to the same ESXi host where the vCenter Server virtual machine is running to
eliminate possible underlying physical network issues.
Note that this only works if both virtual machines are on the same VLAN/port group.
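The disk space check from step 6 can be automated by parsing saved df -h output. This is a minimal Python sketch; the function name and the 90% default threshold are assumptions chosen for illustration, not values from this guide.

```python
from typing import List, Tuple


def filesystems_over(df_output: str, threshold_pct: int = 90) -> List[Tuple[str, int]]:
    """Return (mount point, use%) for every filesystem at or above threshold_pct."""
    flagged = []
    for line in df_output.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # prompt line or short/blank line
        try:
            use_pct = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # header row ("Use%") or malformed line
        if use_pct >= threshold_pct:
            # Join the remaining fields in case the mount point contains spaces.
            flagged.append((" ".join(fields[5:]), use_pct))
    return flagged
```

Applied to the sample output in step 6 with `threshold_pct=70`, the function reports the NFS datastore (73%) and the last vfat volume (72%).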
For the connection to work, you must have DNS and NTP configured on NSX Manager, vCenter Server
and the ESXi hosts. If you added ESXi hosts by name to the vSphere inventory, ensure that DNS servers
have been configured on the NSX Manager and name resolution is working. Otherwise, NSX Manager
cannot resolve the IP addresses. The NTP server must be specified so that the SSO server time and
NSX Manager time are in sync. On NSX Manager, the drift file at /etc/ntp.drift is included in the technical
support bundle for NSX Manager.
The account you use to connect NSX Manager to vCenter Server must have the vCenter role
"Administrator." Having the "Administrator" role enables NSX Manager to register itself with the Security
Token Service server. When a particular user account is used to connect NSX Manager to vCenter, an
"Enterprise Administrator" role for the user is also created on NSX Manager.
n User account without vCenter role of Administrator used to connect NSX Manager to vCenter.
n User logging into vCenter with an account that does not have a role on NSX Manager.
You need to initially log into vCenter with the account you used to link NSX Manager to vCenter Server.
Then you can create additional users with roles on NSX Manager using Home > Networking &
Security > NSX Managers > {IP of NSX Manager} > Manage > Users.
The first login can take up to 4 minutes while vCenter loads and deploys NSX UI bundles.
n Look for errors in the NSX Manager log to indicate the reason for not connecting to vCenter Server.
The command to view the log is show log manager follow.
n Run the command: debug connection IP_of_ESXi_or_VC, and examine the output.
The command runs in privileged mode only. To enter privileged mode, run the enable command and
provide the admin password.
nsxmgr# en
nsxmgr# debug packet display interface mgmt port_80_or_port_443
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on mgmt, link-type EN10MB (Ethernet), capture size 262144 bytes
23:40:25.321085 IP 192.168.210.15.54688 > 192.168.210.22.443: Flags [P.], seq 2645022162:2645022199,
ack 2668322748, win 244, options [nop,nop,TS val 1447550948 ecr 365097421], length 37
...
Current configuration:
!
ntp server 192.168.110.1
!
ip name server 192.168.110.10
!
hostname nsxmgr
!
interface mgmt
ip address 192.168.110.15/24
!
ip route 0.0.0.0/0 192.168.110.1
!
web-manager
There is a known issue in which the CMS silently fails to make API calls.
This happens when the certificate issuer is not known to the caller because it is an untrusted root
certificate authority or the certificate is self-signed. To resolve this issue, use a browser to navigate to the
NSX Manager IP address or hostname and accept the certificate.
Problem
3 Later you remove the secondary NSX Manager. The secondary NSX Manager is in transit mode.
4 You then restore the backup on the primary NSX Manager.
5 In the database, the transit NSX Manager is updated as Secondary, but the UI displays it as Transit,
and the sync fails.
6 You may not be able to remove the secondary NSX Manager, or promote it as a secondary again.
7 While promoting transit NSX Manager, an error message saying NSX Manager node with IP
address/hostname already exists is displayed.
8 While removing transit NSX Manager, an error message saying Incorrect user name or
password is displayed.
Solution
1 Log in to the vCenter linked to the primary NSX Manager using the vSphere Web Client.
2 Navigate to Home > Networking & Security > Installation, and then select the Management tab.
3 Select the secondary NSX Manager that you want to delete and click Actions, and then
click Remove Secondary NSX Manager.
4 Select the Perform operation even if NSX Manager is inaccessible check box.
5 Click OK.
The secondary NSX Manager gets deleted from the primary database.
What to do next
For more information about adding a secondary NSX Manager, refer to the NSX Installation Guide.
Problem
Solution
1 Connectivity issues:
n If NSX Manager is having connectivity issues either with vCenter Server or the ESXi host, log in
to the NSX Manager CLI console, run the command: debug connection IP_of_ESXi_or_VC,
and examine the output.
n Ping from NSX Manager to the vCenter Server with the IP address and FQDN to check for
routing, or static, or default route in NSX Manager, using this command:
Codes:
K – kernel route,
C – connected,
S – static
* – FIB route
2 DNS Issue
Ping from NSX Manager to vCenter Server with FQDN using this command:
If this does not work, navigate to Manage > Network > DNS Servers in NSX Manager and ensure
that DNS is properly configured.
3 Firewall Issue
If there is a firewall between NSX Manager and vCenter Server, verify that it allows SSL on TCP/443.
Also, ping to check connectivity.
4 Verify that the following required ports are open in NSX Manager.
443/TCP: downloading the OVA file on the ESXi host for deployment, using REST APIs, and using the
NSX Manager user interface.
5 NTP Issues
Verify that time is synchronized between vCenter Server and NSX Manager. To achieve this, use the
same NTP server configurations on the NSX Manager and vCenter Server.
To determine the time on the NSX Manager, run this command from the CLI:
To determine the time on the vCenter Server, run this command on the CLI:
vc-l-01a:~ # date
To register to vCenter Server or SSO Lookup Service, you must have administrative rights.
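The firewall check in step 3 can be approximated with a plain TCP connection test. The sketch below is illustrative: the function name is hypothetical, and it only confirms that a TCP connection to port 443 can be opened. It does not validate the SSL handshake or the certificate.

```python
import socket


def tcp_port_open(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

Run it from a machine on the same network segment as NSX Manager, for example `tcp_port_open("vcenter.example.local")`, where the host name is a placeholder. A False result can mean a firewall drop, a routing problem, or a DNS failure, so follow up with the ping and DNS checks above.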
The teaming policy, load balancing method, MTU, and VLAN ID of the VTEPs are chosen during VXLAN
configuration. The teaming and load balancing methods must match the configuration of the DVS
selected for VXLAN.
The MTU must be set to be at least 1600 and not less than what is already configured on the DVS.
The number of VTEPs created depends on the teaming policy selected and the DVS configuration.
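The MTU rule stated above (at least 1600, and not less than the MTU already configured on the DVS) can be expressed as a small check. This Python sketch uses hypothetical names and is not part of any NSX tool.

```python
from typing import List

MIN_VXLAN_MTU = 1600  # minimum stated in the text above


def validate_vtep_mtu(vtep_mtu: int, dvs_mtu: int) -> List[str]:
    """Return a list of problems with the chosen VTEP MTU (empty if it is valid)."""
    problems = []
    if vtep_mtu < MIN_VXLAN_MTU:
        problems.append(f"MTU {vtep_mtu} is below the minimum of {MIN_VXLAN_MTU}")
    if vtep_mtu < dvs_mtu:
        problems.append(f"MTU {vtep_mtu} is below the DVS MTU of {dvs_mtu}")
    return problems
```

For example, choosing 1600 on a DVS already set to 9000 passes the first rule but fails the second.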
n Teaming method chosen for VXLAN does not match what can be supported by the DVS. To review
supported methods, see the VMware NSX for vSphere Network Virtualization Design Guide at
https://communities.vmware.com/docs/DOC-27683.
n A VMkernel NIC is missing. Resolve the error as described in VXLAN VMkernel NIC Out Of Sync.
n Incorrect MTU setting is chosen for the VTEPs. You should investigate if there is an MTU mismatch
as described later in this topic.
n Incorrect VXLAN gateway is chosen. You should investigate if there is an error while configuring the
VXLAN gateway as described later in this topic.
Control plane status displays as disabled if the host does not have
any active VMs which need a controller connection
Use the show logical-switch commands to view VXLAN details on the host. For details, refer to NSX
Command Line Interface Reference.
The show logical-switch host hostID verbose command will display status of control plane as
disabled if the host has not been populated with any VMs which require a connection to the controller
cluster for forwarding table information.
Network count: 18
VXLAN network: 32003
Multicast IP: 0.0.0.0
Control plane: Disabled <<========
MAC entry count: 0
ARP entry count: 0
Port count: 1
<nwFabricFeatureStatus>
<featureId>com.vmware.vshield.nsxmgr.vxlan</featureId>
<featureVersion>5.5</featureVersion>
<updateAvailable>false</updateAvailable>
<status>RED</status>
<message>VXLAN Gateway cannot be set on host</message>
<installed>true</installed>
<enabled>true</enabled>
<errorStatus>VXLAN_GATEWAY_SETUP_FAILURE</errorStatus>
</nwFabricFeatureStatus>
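When many hosts or clusters are checked, a status document like the one above can be parsed programmatically. The following Python sketch assumes the element names shown in the sample; the function name is hypothetical, and how the XML is retrieved is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def failed_features(xml_text: str) -> List[Dict[str, str]]:
    """Return one summary dict per nwFabricFeatureStatus whose status is not GREEN."""
    root = ET.fromstring(xml_text)
    failures = []
    # iter() also matches the root element, so this works whether the
    # status element is the document root or nested inside a wrapper.
    for node in root.iter("nwFabricFeatureStatus"):
        status = (node.findtext("status") or "").strip()
        if status != "GREEN":
            failures.append({
                "feature": (node.findtext("featureId") or "").strip(),
                "status": status,
                "error": (node.findtext("errorStatus") or "").strip(),
                "message": (node.findtext("message") or "").strip(),
            })
    return failures
```

For the sample above, the single entry returned carries the error code VXLAN_GATEWAY_SETUP_FAILURE, which points to the gateway options described next.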
n Option 1: Remove VXLAN configuration for the host cluster, fix the underlying gateway setup in the
IP pool by making sure the gateway is properly configured and reachable, and then reconfigure
VXLAN for the host cluster.
a Fix the underlying gateway setup in the IP pool by making sure the gateway is properly
configured and reachable.
b Put the host (or hosts) into maintenance mode to ensure no VM traffic is active on the host.
d Take the host out of maintenance mode. Taking hosts out of maintenance mode triggers the
VXLAN VTEP creation process on NSX Manager. NSX Manager will try to re-create the required
VTEPs on the host.
where vmkx is the ID of your VMkernel port and hostname_or_IP is the IP or hostname of the
VMkernel port.
This allows you to check the validity of all uplinks. If you are working in a multi-VTEP environment,
you can validate all uplinks by running the ping command from each possible VTEP VMkernel
source/destination interface to validate all the paths.
n Check the physical infrastructure. The issue is often resolved by a configuration change to the
physical infrastructure.
n Determine whether the issue is confined to a single logical switch or affects other logical switches
as well.
For more information about the MTU check, see "Verify the NSX Working State" in the NSX Upgrade
Guide.
Prerequisites
Procedure
1 In the vSphere Web Client, navigate to Networking & Security > Installation > Logical Network
Preparation.
3 Click the Error icon to view information about the VMkernel NIC that was deleted on the host.
4 Click the Resolve All button to recreate the deleted VMkernel NIC on the host.
Problem
Solution
1 Retrieve information about all the VXLAN prepared switches using the
GET https://<NSX-Manager-IP-Address>/api/2.0/vdn/switches API.
In the output of the API, locate the switch that you would like to modify and note the name. For
example, dvs-35.
2 Now query with the specific vSphere Distributed Switch that you noted earlier.
<vdsContext>
<switch>
<objectId>dvs-35</objectId>
<objectTypeName>VmwareDistributedVirtualSwitch</objectTypeName>
<vsmUuid>423A993F-BEE6-1285-58F1-54E48D508D90</vsmUuid>
<nodeId>916287b3-761d-430b-8ab2-83878dfe3e7f</nodeId>
<revision>6</revision>
<type>
<typeName>VmwareDistributedVirtualSwitch</typeName>
</type>
<name>vds-site-a</name>
<scope>
<id>datacenter-21</id>
<objectTypeName>Datacenter</objectTypeName>
<name>Datacenter Site A</name>
</scope>
<clientHandle/>
<extendedAttributes/>
<isUniversal>false</isUniversal>
<universalRevision>0</universalRevision>
</switch>
<mtu>1600</mtu>
<teaming>FAILOVER_ORDER</teaming>
<uplinkPortName>Uplink 4</uplinkPortName>
<promiscuousMode>false</promiscuousMode>
</vdsContext>
3 You can modify the parameters such as teaming policy and/or MTU on a vSphere Distributed Switch
using the API call. The following example shows changing the teaming policy of dvs-35 from
FAILOVER_ORDER to LOADBALANCE_SRCMAC and MTU from 1600 to 9000.
<vdsContext>
<switch>
<objectId>dvs-35</objectId>
<objectTypeName>VmwareDistributedVirtualSwitch</objectTypeName>
<vsmUuid>423A993F-BEE6-1285-58F1-54E48D508D90</vsmUuid>
<nodeId>916287b3-761d-430b-8ab2-83878dfe3e7f</nodeId>
<revision>6</revision>
<type>
<typeName>VmwareDistributedVirtualSwitch</typeName>
</type>
<name>vds-site-a</name>
<scope>
<id>datacenter-21</id>
<objectTypeName>Datacenter</objectTypeName>
<name>Datacenter Site A</name>
</scope>
<clientHandle/>
<extendedAttributes/>
<isUniversal>false</isUniversal>
<universalRevision>0</universalRevision>
</switch>
<mtu>9000</mtu>
<teaming>LOADBALANCE_SRCMAC</teaming>
<uplinkPortName>Uplink 4</uplinkPortName>
<promiscuousMode>false</promiscuousMode>
</vdsContext>
Note Following is a list of valid teaming policy entries for the <teaming> parameter:
n FAILOVER_ORDER
n ETHER_CHANNEL
n LACP_ACTIVE
n LACP_PASSIVE
n LOADBALANCE_LOADBASED
n LOADBALANCE_SRCID
n LOADBALANCE_SRCMAC
n LACP_V2
4 Verify the syntax used is correct and the change is active for the vSphere Distributed Switch you are
working with using the GET command. For example,
GET https://<NSX-Manager-IP-Address>/api/2.0/vdn/switches/dvs-35.
5 Open the vSphere Web Client and confirm that the configuration changes are reflected.
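Editing the XML by hand in step 3 is error prone, so the request body can be derived from the GET response instead. The Python sketch below is an assumption-laden illustration: the function name is hypothetical, it only builds the modified vdsContext document, and it does not send anything to NSX Manager. The HTTP method for the update call is not stated in this section; refer to the NSX API Guide.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Teaming values listed in the note under step 3.
VALID_TEAMING = {
    "FAILOVER_ORDER", "ETHER_CHANNEL", "LACP_ACTIVE", "LACP_PASSIVE",
    "LOADBALANCE_LOADBASED", "LOADBALANCE_SRCID", "LOADBALANCE_SRCMAC", "LACP_V2",
}


def build_vds_update(current_xml: str, mtu: Optional[int] = None,
                     teaming: Optional[str] = None) -> str:
    """Return a copy of the vdsContext XML with mtu and/or teaming changed."""
    root = ET.fromstring(current_xml)
    if root.tag != "vdsContext":
        raise ValueError("expected a <vdsContext> document")

    def set_text(tag: str, value: str) -> None:
        node = root.find(tag)
        if node is None:
            raise ValueError(f"<{tag}> not found in vdsContext")
        node.text = value

    if teaming is not None:
        if teaming not in VALID_TEAMING:
            raise ValueError(f"invalid teaming policy: {teaming}")
        set_text("teaming", teaming)
    if mtu is not None:
        set_text("mtu", str(mtu))
    return ET.tostring(root, encoding="unicode")
```

Starting from the output of step 2, `build_vds_update(xml, mtu=9000, teaming="LOADBALANCE_SRCMAC")` produces a body equivalent to the example in step 3, with every other element carried over unchanged.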
Prerequisites
Procedure
1 In the vSphere Web Client, navigate to Home > Networking & Security > Logical Switches.
2 In the Status column, click the Out of sync link to see the detailed reason for this out of sync state.
Troubleshooting NSX Routing 3
NSX has two types of routing subsystems, optimised for two key needs.
n Routing within the logical space, also known as “East – West” routing, provided by the Distributed
Logical Router (DLR);
n Routing between the physical and logical space, also known as “North – South” routing, provided by
the Edge Services Gateways (ESG).
The DLR supports running a single dynamic routing protocol at a time (OSPF or BGP), while the ESG
supports running both routing protocols at the same time. The reason for this is that the DLR is designed
to be a “stub” router, with a single path out, which means more advanced routing configurations are typically
not required.
Both the DLR and the ESG support having a combination of static and dynamic routes.
Both provide L3 domain separation, meaning that each instance of a Distributed Logical Router or an
Edge Services Gateway has its own L3 configuration, similar to an L3VPN VRF.
[Figure: diagram showing NSX Manager, the Controller, the NSX Edge Distributed Logical Router, and the host components (UW Agent, Security, VXLAN, DR). Recoverable step labels include "New Logical Router Instance Created", "Host connects to DLR instance master controller and gets LIFs and Routes", and "Route updates sent to Hosts".]
n Up to 999 logical interfaces (LIFs) on each DLR (8 x uplink + 991 internal) + 1 x management
n Up to 10,000 LIFs per host distributed across all DLR instances (not enforced by NSX Manager)
n Cannot connect more than one DLR to any given VLAN or VXLAN.
n To route between VXLAN and VLAN, the transport zone must span a single DVS.
The DLR’s design at a high level is analogous to a modular router chassis, in the following ways:
n It runs dynamic routing protocols to exchange routing information with the rest of the network.
n It computes forwarding tables for “line cards” based on the configuration of interfaces, static
routes, and dynamic routing information.
n It programs these forwarding tables into the “line cards” (via the Controller Cluster, to enable
scale and resiliency).
Routing is always handled by a DLR instance on the same host where the source VM is running. This
means that when source and destination VMs are on different hosts, the DLR instance that provides
routing between them sees packets only in one direction, from source VM to destination. Return traffic is
only seen by the corresponding instance of the same DLR on the destination VM’s host.
After the DLR has completed routing, delivery to the final destination is the responsibility of the DVS via
L2 – VXLAN or VLAN if the source and destination VMs are on different hosts, or by the DVS locally if
they are on the same host.
Figure 3‑2 illustrates data flow between two VMs, VM1 and VM2, running on different hosts and
connected to two different Logical Switches, VXLAN 5000 and VXLAN 5001.
[Figure 3‑2: VM1 on VXLAN 5000 and VM2 on VXLAN 5001, with a DLR instance on each host.]
1 VM1 sends a packet toward VM2, which is addressed to VM1’s gateway for VM2’s subnet (or
default). This gateway is a VXLAN 5000 LIF on the DLR.
2 The DVS on ESXi Host A delivers the packet to the DLR on that host, where the lookup is performed,
and the egress LIF is determined (in this case – VXLAN 5001 LIF).
3 The packet is then sent out of that destination LIF, which essentially returns the packet to the DVS,
but on a different Logical Switch (5001).
4 The DVS then performs L2 delivery of that packet to the destination host (ESXi Host B), where the
DVS will forward the packet to VM2.
Return traffic will follow in the same order, where traffic from VM2 is forwarded to the DLR instance on
ESXi Host B, and then delivered via L2 on VXLAN 5000.
[Figure: VM1 (MAC1) on VXLAN 5000 and VM2 (MAC2) on VXLAN 5001, with a DLR instance on each host. Steps 1 to 7 are marked along the path.]
1 The DLR instance on Host A generates an ARP request packet, with SRC MAC = vMAC, and DST
MAC = Broadcast. The VXLAN module on Host A finds all VTEPs on the egress VXLAN 5001, and
sends each one a copy of that broadcast frame.
2 As the frame leaves the host via the VXLAN encapsulation process, the SRC MAC is changed from
vMAC to pMAC A, so that return traffic can find the originating DLR instance on Host A. Frame now is
SRC MAC = pMAC A, and DST MAC = Broadcast.
3 As the frame is received and decapsulated on Host B, it is examined and found to be sourced from
the IP address that matches the local DLR instance’s LIF on VXLAN 5001. This flags the frame as
a request to perform the proxy ARP function. The DST MAC is changed from Broadcast to vMAC so
that the frame can reach the local DLR instance.
4 The local DLR instance on Host B receives the ARP Request frame, SRC MAC = pMAC A, DST MAC
= vMAC, and sees its own LIF IP address requesting this. It saves the SRC MAC, and generates a
new ARP Request packet, SRC MAC = vMAC, DST MAC = Broadcast. This frame is tagged as “DVS
Local” to prevent it from being flooded via the dvUplink. The DVS delivers the frame to VM2.
5 VM2 sends an ARP Reply, SRC MAC = MAC2, DST MAC = vMAC. The DVS delivers it to the local
DLR instance.
6 The DLR instance on Host B replaces DST MAC with the pMAC A saved in step 4, and sends
the packet back to the DVS for delivery back to Host A.
7 After the ARP Reply reaches Host A, DST MAC is changed to vMAC, and the ARP Reply frame with
SRC MAC = MAC2 and DST MAC = vMAC reaches the DLR instance on Host A.
The ARP resolution process is complete, and the DLR instance on Host A can now start sending traffic to
VM2.
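The MAC address rewrites in the seven steps above are easier to follow as a trace. The Python sketch below replays only the SRC MAC and DST MAC values named in the steps; the function name and the string labels are illustrative, and no packets are built or sent.

```python
from typing import List, Tuple


def arp_proxy_trace() -> List[Tuple[int, str, str]]:
    """Return (step, SRC MAC, DST MAC) for each step of the DLR ARP resolution."""
    trace = []
    src, dst = "vMAC", "Broadcast"   # step 1: DLR on Host A generates the ARP request
    trace.append((1, src, dst))
    src = "pMAC_A"                   # step 2: SRC rewritten as the frame leaves Host A
    trace.append((2, src, dst))
    dst = "vMAC"                     # step 3: Host B redirects the frame to its local DLR
    trace.append((3, src, dst))
    saved_src = src                  # step 4: DLR on Host B saves pMAC A ...
    trace.append((4, "vMAC", "Broadcast"))  # ... and sends a new, DVS-local request
    src, dst = "MAC2", "vMAC"        # step 5: VM2 replies to the local DLR
    trace.append((5, src, dst))
    dst = saved_src                  # step 6: DST replaced with the saved pMAC A
    trace.append((6, src, dst))
    dst = "vMAC"                     # step 7: Host A delivers the reply to its DLR
    trace.append((7, src, dst))
    return trace
```

The key point the trace makes visible is that pMAC A appears only on frames crossing between hosts (steps 2, 3, and 6); every frame a DLR instance or VM handles locally uses vMAC.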
When VM1 wants to know the MAC address for VM2, it sends an ARP request. This ARP request is
intercepted by the logical switch and, if the logical switch already has an ARP entry for the target, it sends
the ARP response to the VM.
If not, it sends an ARP query to the NSX Controller. If the controller knows the VM IP to MAC binding, it
replies with the binding and the logical switch sends the ARP response. If the controller does not have the
ARP entry, the ARP request is re-broadcast on the logical switch. The NSX Controller learns the MAC
address via the Switch Security module, which snoops on ARP requests and DHCP packets.
ARP suppression has been extended to include the Distributed logical router (DLR) as well.
n ARP requests from distributed logical router are treated the same way as ARP requests from other
VMs and are subjected to suppression. When distributed logical router has to resolve ARP request of
a destination IP, the ARP request is suppressed by the logical switch, preventing flooding when the IP
to MAC binding is already known to the controller.
n When a LIF is created, distributed logical router adds the ARP entry for the LIF IP in the logical
switch, so ARP requests for the LIF IP are also suppressed by the logical switch.
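The ARP suppression decision described above follows a fixed lookup order: the logical switch's own entries, then the controller, then broadcast. A minimal Python sketch of that order follows. The function name and the dictionary-based tables are assumptions made for illustration and do not reflect how NSX stores these entries.

```python
from typing import Dict, Optional, Tuple


def handle_arp_request(target_ip: str,
                       switch_arp_entries: Dict[str, str],
                       controller_bindings: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (who answers, MAC) for an ARP request intercepted by the logical switch."""
    mac = switch_arp_entries.get(target_ip)
    if mac is not None:
        return ("logical_switch", mac)   # answered locally, request suppressed
    mac = controller_bindings.get(target_ip)
    if mac is not None:
        return ("controller", mac)       # controller supplies the binding, request suppressed
    return ("broadcast", None)           # unknown target, request is re-broadcast
```

The same function models requests from a DLR, since the text above states that they are treated like requests from any other VM.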
The ESG is essentially a router in a virtual machine. It is delivered in an appliance-like form factor with
four sizes, with its complete lifecycle managed by the NSX Manager. The ESG’s primary use case is as a
perimeter router, where it is deployed between multiple DLRs and between the physical world and the
virtualized network.
n Each ESG supports 8-way ECMP for path redundancy and scalability.
Figure 3‑4 shows the ESG and DLR packet flow when equal-cost multipath (ECMP) routing is enabled
between two ESGs and the physical infrastructure.
VM1 thus has access to 2x bi-directional throughput compared with a deployment with a single ESG.
The DLR has two LIFs – Internal on VNI 5000, and Uplink on VNI 5001.
The DLR has ECMP enabled and is receiving equal cost routes toward the IP subnet of VLAN 20 from a
pair of ESGs, ESG A and ESG B via a dynamic routing protocol (BGP or OSPF).
The two ESGs are connected to a VLAN-backed dvPortgroup associated with VLAN 10, where a physical
router that provides connectivity to VLAN 20 is also connected.
The ESGs receive external routes for VLAN 20, via a dynamic routing protocol from the physical router.
The physical router in exchange learns about the IP subnet associated with VXLAN 5000 from both
ESGs, and performs ECMP load balancing for the traffic toward VMs in that subnet.
Figure 3‑4. High-Level ESG and DLR Packet Flow with ECMP
[Diagram labels: VM1 on VXLAN 5000; DLR instances (ECMP); VXLAN 5001; ESG A and ESG B; VLAN 10; physical router (ECMP); VLAN 20; physical server.]
The DLR can receive up to eight equal-cost routes and balance traffic across the routes. ESG A and ESG
B in the diagram provide two equal-cost routes.
ESGs can do ECMP routing toward the physical network, assuming multiple physical routers are present.
For simplicity, the diagram shows a single physical router.
There is no need for ECMP to be configured on ESGs toward the DLR, because all DLR LIFs are “local”
on the same host where ESG resides. There would be no additional benefit provided by configuring
multiple uplink interfaces on a DLR.
In situations where more North-South bandwidth is required, multiple ESGs can be placed on different
ESXi hosts to scale up to ~80Gbps with 8 x ESGs.
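ECMP next-hop selection can be illustrated with a flow hash. The Python sketch below is not the hashing algorithm used by the DLR or ESG, which this guide does not describe; it uses CRC32 over the source and destination IP addresses purely to show that a given flow always maps to the same next hop while different flows spread across the available routes. The function name is hypothetical.

```python
import zlib
from typing import Sequence

MAX_ECMP_PATHS = 8  # "up to eight equal-cost routes", as stated above


def pick_next_hop(src_ip: str, dst_ip: str, next_hops: Sequence[str]) -> str:
    """Pick one of the equal-cost next hops for a flow, deterministically."""
    if not 1 <= len(next_hops) <= MAX_ECMP_PATHS:
        raise ValueError("expected between 1 and 8 equal-cost next hops")
    # Illustrative hash only: CRC32 of the address pair.
    key = f"{src_ip}->{dst_ip}".encode("ascii")
    return next_hops[zlib.crc32(key) % len(next_hops)]
```

With two next hops, as in the diagram, each flow lands on either ESG A or ESG B. The ~80 Gbps figure quoted above for 8 ESGs corresponds to roughly 10 Gbps per ESG.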
1 VM1 sends a packet to the physical server, which is sent to VM1’s IP gateway (which is a DLR LIF)
on ESXi Host A.
2 The DLR performs a route lookup for the IP of the physical server, and finds that it is not directly
connected, but matches two ECMP routes received from ESG A and ESG B.
3 The DLR calculates an ECMP hash, and decides on a next hop, which could be either ESG A or ESG
B, and sends the packet out the VXLAN 5001 LIF.
5 The ESG performs the routing lookup and finds that the physical server’s subnet is accessible via the
physical router’s IP address on VLAN 10, which is directly connected to one of ESG’s interfaces.
6 The packet is sent out through the DVS, which passes it on to the physical network after tagging it
with the correct 802.1Q tag with VLAN ID 10.
7 The packet travels through the physical switching infrastructure to reach the physical router, which
performs a lookup to find that the physical server is directly connected to an interface on VLAN 20.
1 The physical server sends the packet to VM1, with the physical router as the next hop.
2 The physical router performs a lookup for VM1’s subnet, and sees two equal-cost paths to that subnet
with the next hops, ESG A's and ESG B’s VLAN 10 interface, respectively.
3 The physical router selects one of the paths and sends the packet toward the corresponding ESG.
4 The physical network delivers the packet to the ESXi host where the ESG resides, and delivers it to
DVS, which decapsulates the packet and forwards it on the dvPortgroup associated with VLAN 10 to
the ESG.
5 The ESG performs a routing lookup and finds that VM1’s subnet is accessible via its interface
associated with VXLAN 5001 with the next hop being DLR’s uplink interface IP address.
6 The ESG sends the packet to the DLR instance on the same host as the ESG.
7 The DLR performs a routing lookup to find that VM1 is available via its VXLAN 5000 LIF.
8 The DLR sends the packet out its VXLAN 5000 LIF to the DVS, which performs the final delivery.
This means that L2 forwarding services that are connected to the DLR or ESG must be configured and
operational. In the NSX installation process, these services are provided by “Host Preparation” and
“Logical Network Preparation.”
When creating transport zones on multi-cluster DVS configurations, make sure that all clusters in the
selected DVS are included under the transport zone. This ensures that the DLR is available on all clusters
where DVS dvPortgroups are available.
When a transport zone is aligned with DVS boundary, the DLR instance is created correctly.
[Figure: transport zone aligned with the DVS boundary, showing logical switches 5001, 5002, and 5003]
When a transport zone is not aligned to the DVS boundary, the scope of logical switches (5001, 5002 and
5003) and the DLR instances that these logical switches are connected to becomes disjointed, causing
VMs in cluster Comp A to have no access to DLR LIFs.
In the diagram above, DVS “Compute_DVS” covers two clusters, “Comp A” and “Comp B”. The “Global-
Transport-Zone” includes both “Comp A” and “Comp B.”
This results in correct alignment between the scope of Logical Switches (5001, 5002, and 5003), and the
DLR instance created on all hosts in all clusters where these Logical Switches are present.
Now, let’s look at an alternative situation, where the Transport Zone was not configured to include cluster
“Comp A”:
[Figure: transport zone that excludes cluster “Comp A”, showing logical switches 5001, 5002, and 5003, with
the DLR marked “Missing!” on two hosts]
In this case, VMs running on cluster “Comp A” have full access to all logical switches. This is because
logical switches are represented by dvPortgroups on hosts, and dvPortgroups are a DVS-wide construct.
In our sample environment, “Compute_DVS” covers both “Comp A” and “Comp B.”
DLR instances, however, are created in strict alignment with the transport zone scope, which means no
DLR instance will be created on hosts in “Comp A."
As a result, VM “web1” will be able to reach VMs “web2” and “LB” because they are on the same logical
switch, but VMs “app1” and “db1” will not be able to communicate with anything.
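The alignment rule can be expressed as a simple set comparison. The following Python sketch is
illustrative only: the function name is hypothetical, and the cluster lists must be gathered by the
operator (for example, from the vSphere Web Client). It does not call any NSX or vCenter API.

```python
def clusters_missing_from_transport_zone(dvs_clusters, tz_clusters):
    """Return clusters that the DVS spans but the transport zone leaves out."""
    return sorted(set(dvs_clusters) - set(tz_clusters))

# Cluster names follow the sample environment described above.
compute_dvs = ["Comp A", "Comp B"]

aligned_tz = ["Comp A", "Comp B"]   # transport zone covers the whole DVS
misaligned_tz = ["Comp B"]          # "Comp A" was left out

print(clusters_missing_from_transport_zone(compute_dvs, aligned_tz))     # []
print(clusters_missing_from_transport_zone(compute_dvs, misaligned_tz))  # ['Comp A']
```

Any cluster in the returned list has hosts with dvPortgroups for the logical switches but no DLR instance.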
The DLR relies on the Controller Cluster to function, while the ESG does not. Make sure that the
Controller Cluster is up and available before creating or changing a DLR configuration.
If the DLR is to be connected to VLAN dvPortgroups, ensure that ESXi hosts with the DLR configured can
reach each other on UDP/6999 for DLR VLAN-based ARP proxy to work.
Considerations:
n A given DLR instance cannot be connected to logical switches that exist in different transport zones.
This is to ensure all logical switches and DLR instances are aligned.
n The DLR cannot be connected to VLAN-backed portgroups, if that DLR is connected to logical
switches spanning more than one DVS. As above, this is to ensure correct alignment of DLR
instances with logical switches and dvPortgroups across hosts.
n When selecting placement of the DLR Control VM, avoid placing it on the same host as one or more
of its upstream ESGs by using DRS anti-affinity rules if they are in the same cluster. This is to reduce
the impact of host failure on DLR forwarding.
n OSPF can be enabled only on a single Uplink (but supports multiple adjacencies). BGP, on the other
hand, can be enabled on multiple Uplink interfaces, where necessary.
NSX Routing UI
The vSphere Web Client UI provides two major sections relevant to NSX routing.
These include the L2 and control-plane infrastructure dependencies and the routing subsystem
configuration.
NSX distributed routing requires functions that are provided by the Controller Cluster. The following
screen shot shows a Controller Cluster in a healthy state.
Things to note:
Host kernel modules for distributed routing are installed and configured as part of VXLAN configuration on
the host. This means distributed routing requires that ESXi hosts are prepared and VXLAN is configured
on them.
Things to note:
n “VXLAN” is “Configured."
Things to note:
n The VLAN ID must be correct for the VTEP transport VLAN. Note that in the screen shot above it is
“0.” In most real-world deployments this would not be the case.
n MTU is configured to be 1600 or larger. Make sure that the MTU is not set to 9000 with the
expectation that the MTU on VMs would be also set to 9000. The DVS maximum MTU is 9000, and if
VMs are also at 9000, there is no space for VXLAN headers.
n VMKNics must have the correct addresses. Make sure that they are not set to 169.254.x.x addresses,
indicating that nodes have failed to get addresses from DHCP.
n The teaming policy must be consistent for all cluster members of the same DVS.
n The number of VTEPs must be the same as the number of dvUplinks. Make sure that valid/expected
IP addresses are listed.
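Two of the checks above lend themselves to simple arithmetic. The following Python sketch is a minimal
illustration, not an NSX tool. It assumes 50 bytes of VXLAN encapsulation overhead with an IPv4 outer
header, and the function names and sample addresses are hypothetical.

```python
import ipaddress

# Assumed overhead: inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20 bytes.
VXLAN_OVERHEAD_BYTES = 50

def vm_mtu_fits(vm_mtu, transport_mtu):
    """True if a VM frame plus VXLAN headers fits in the transport MTU."""
    return vm_mtu + VXLAN_OVERHEAD_BYTES <= transport_mtu

def vtep_ip_is_suspect(address):
    """True for 169.254.x.x addresses, which suggest that DHCP failed."""
    return ipaddress.ip_address(address).is_link_local

print(vm_mtu_fits(1500, 1600))               # True
print(vm_mtu_fits(9000, 9000))               # False
print(vtep_ip_is_suspect("169.254.12.7"))    # True
print(vtep_ip_is_suspect("192.168.250.51"))  # False
```

The second call shows why a DVS MTU of 9000 leaves no room for VXLAN headers when VMs are also set to 9000.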
Transport Zones have to be correctly aligned to DVS boundaries, to avoid the situation in which the DLR
is missing on some clusters.
NSX Edges UI
The NSX routing subsystem is configured and managed in the “NSX Edges” section of the UI.
All currently deployed DLRs and ESGs are shown, with the following information displayed for each:
n “Id” shows the ESG or DLR Edge appliance ID, which can be used for any API calls referring to that
ESG or DLR
n “Tenant” + “Id” forms the DLR instance name. This name is visible and used in the NSX CLI.
n “Size” is always “Compact” for DLR, and the size that was selected by the operator for ESG.
In addition to the information in the table, there is a context menu, accessible either via buttons or via
“Actions."
“Force Sync” operation clears the ESG’s or the DLR's Control VM’s configuration, reboots it, and re-pushes
the configuration.
“Redeploy” tears down the ESG or DLR, and creates a new ESG or DLR with the same configuration.
The existing ID is preserved.
“Change Auto Rule Configuration” applies to the ESG’s built-in firewall rules, created when services are
enabled on the ESG (for example, BGP which needs TCP/179).
“Download tech support logs” creates a log bundle from the ESG or DLR Control VM.
For the DLR, host logs are not included in the tech support bundle and need to be collected separately.
“Change appliance size” is only applicable to ESGs. This will perform a “redeploy” with a new appliance
(vNIC MAC addresses will change).
“Change CLI credentials” allows the operator to force-update the CLI credentials.
If the CLI is locked-out on an ESG or DLR Control VM after 5 failed logins, this will not lift the lock-out. You
will need to wait 5 minutes, or “Redeploy” your ESG/DLR to get back in with the correct credentials.
“Change Log Level” changes the level of detail to be sent to ESG/DLR syslog.
“Configure Advanced Debugging” re-deploys the ESG or DLR with core-dump enabled and additional virtual
disk attached for storing core dump files.
“Deploy” becomes available when an ESG has been created without deploying it.
This option simply executes the deployment steps (deploys the OVF, configures interfaces, and pushes
the configuration to the created appliance).
If the version of DLR/ESG is older than NSX Manager, the “Upgrade Version” option becomes available.
n “Hostname” will be used to set the DNS name of the ESG or DLR Control VM, visible on
SSH/Console session, in syslog messages, and in the vCenter “Summary” page for the ESG/DLR VM
under “DNS Name.”
n “Tenant” will be used to form the DLR Instance Name, used by the NSX CLI. It can also be used
by an external cloud management platform.
n “User Name” and “Password” set the CLI/VM console credentials to access the DLR Control VM.
NSX does not support AAA on ESG or DLR Control VMs. This account has full rights to ESG/DLR
Control VMs; however, the ESG/DLR configuration cannot be changed via the CLI/VM console.
n “Enable SSH access” starts the SSH daemon on the DLR Control VM.
n The control VM Firewall rules need to be adjusted to allow SSH network access.
n The operator can connect to the DLR Control VM from either a host on the subnet of the Control
VM’s management Interface, or without such restriction on the OSPF/BGP “Protocol Address," if
a protocol address is configured.
Note It is not possible to have network connectivity between the DLR Control VM and any IP
address that falls into any subnet configured on any of that DLR’s “Internal” interfaces. This is
because the egress interface for these subnets on DLR Control VM points to the pseudo-interface
“VDR," which is not connected to the data plane.
n “Edge Control Level Logging” sets the syslog level on the Edge appliance.
n “Datacenter” selects the vCenter datacenter in which to deploy the Control VM.
n “NSX Edge Appliances” refers to the DLR Control VM and allows definition of exactly one (as shown).
n If “HA” is enabled, the Standby Edge will be deployed on the same cluster, host, and datastore. A
DRS “Separate Virtual Machines” rule will be created for the Active and Standby DLR Control
VMs.
n “HA Interface”
n Is not created as a DLR logical interface capable of routing. It is only a vNIC on the Control VM.
n This interface does not require an IP address, because NSX manages the DLR configuration via
VMCI.
n This interface is used for HA heartbeat if the DLR "Enable High Availability” is checked on the
"Name and description" screen.
n The DLR provides L3 gateway services to VMs on the “Connected To” dvPortgroup or logical
switch with IP addresses from corresponding subnets.
n “Uplink” type LIFs are created as vNICs on the Control VM, so up to eight are supported; the last
two available vNICs are allocated to the HA interface and one reserved vNIC.
n An “Uplink” type LIF is required for dynamic routing to work on the DLR.
n “Internal” type LIFs are created as pseudo-vNICs on the Control VM, and it is possible to
have up to 991 of them.
n Configure Default Gateway, if selected, will create a static default route on the DLR. This option is
available if an “Uplink” type LIF is created in the previous screen.
n If ECMP is used on the uplink, the recommendation is to leave this option disabled, to prevent
dataplane outage in case of next-hop failure.
Note The double right-arrow in the top right corner allows for “suspending” the wizard in progress so that
it can be resumed at a later time.
For an ESG, “Configure Deployment” allows selection of the Edge size. If an ESG is used only for routing,
“Large” is a typical size that is suitable in most scenarios. Selecting a larger size will not provide more
CPU resources to the ESG’s routing processes, and will not lead to more throughput.
It is also possible to create an ESG without deploying it, which still requires configuration of an Edge
Appliance.
A “Non-deployed” Edge can be later deployed via an API call or with the “Deploy” UI action.
If Edge HA is selected, you must create at least one “Internal” interface, or HA will fail silently, leading to
the “split-brain” scenario.
The NSX UI and API allow an operator to remove the last “Internal” interface, which will cause HA to
silently fail.
These include:
n Syslog configuration
Syslog Configuration
Configure the ESG or DLR Control VM to send log entries to a remote syslog server.
Notes:
n The syslog server must be configured as an IP address, because the ESG/DLR Control VM does not
get configured with a DNS resolver.
n In the ESG’s case, it is possible to “Enable DNS Service” (DNS proxy) that the ESG itself will be able
to use to resolve DNS names, but generally specifying the syslog server as an IP address is a more
reliable method with fewer dependencies.
n There is no way to specify a syslog port in the UI (it is always 514), but protocol (UDP/TCP) can be
specified.
n Syslog messages originate from the IP address of the Edge’s interface that is selected as egress for
the syslog server’s IP by the Edge’s forwarding table.
n For the DLR, the syslog server’s IP address cannot be on any subnets configured on any of the
DLR’s “Internal” interfaces. This is because the egress interface for these subnets on the DLR
Control VM points to the pseudo-interface “VDR,” which is not connected to the data plane.
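The restriction on the DLR syslog server address can be checked before the configuration is applied. The
following Python sketch is illustrative only. The function name is hypothetical, the “Internal” LIF
addresses are sample values, and the candidate server addresses are invented for the example.

```python
import ipaddress

def syslog_server_allowed(server_ip, internal_lifs):
    """True if server_ip is outside every subnet configured on an Internal LIF."""
    server = ipaddress.ip_address(server_ip)
    return not any(
        server in ipaddress.ip_interface(lif).network for lif in internal_lifs
    )

# Sample "Internal" LIF addresses in address/prefix form.
internal_lifs = ["172.16.10.1/24", "172.16.20.1/24", "172.16.30.1/24"]

print(syslog_server_allowed("10.0.0.50", internal_lifs))     # True
print(syslog_server_allowed("172.16.20.50", internal_lifs))  # False
```

The same test applies to any address that the DLR Control VM itself needs to reach, such as an SSH client.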
By default, logging for the ESG/DLR routing engine is disabled. If required, enable it via UI by clicking
“Edit” for the “Dynamic Routing Configuration.”
You must also configure the Router ID, which will typically be the IP address of the Uplink interface.
Static Routes
Static routes must have the next hop set to an IP address on a subnet associated with one of the DLR’s LIFs
or the ESG’s interfaces. Otherwise, configuration fails.
“Interface,” if not selected, is set automatically by matching the next hop to one of the directly connected
subnets.
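The next-hop rule and the automatic interface selection amount to a subnet membership lookup. The following
Python sketch shows the idea under assumptions: the function name and the interface names (“Uplink”,
“Web”, “App”) are hypothetical, and the logic is an illustration, not the NSX implementation.

```python
import ipaddress

def resolve_interface(next_hop, interfaces):
    """Return the interface whose subnet contains next_hop, or None if there is none."""
    hop = ipaddress.ip_address(next_hop)
    for name, cidr in interfaces.items():
        if hop in ipaddress.ip_interface(cidr).network:
            return name
    return None

# Hypothetical interface names mapped to sample addresses.
lifs = {
    "Uplink": "192.168.10.2/29",
    "Web": "172.16.10.1/24",
    "App": "172.16.20.1/24",
}

print(resolve_interface("192.168.10.1", lifs))  # Uplink
print(resolve_interface("10.1.1.1", lifs))      # None
```

A result of None corresponds to the case in which the static route configuration fails.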
Route Redistribution
Adding an entry into the “Route Redistribution table” does not automatically enable redistribution for the
selected “Learner Protocol." This must be done explicitly via “Edit” for “Route Redistribution Status."
The DLR is configured with redistribution of connected routes into OSPF by default, while ESG is not.
The “Route Redistribution table” is processed in top-to-bottom order, and processing is stopped after the
first match. To exclude some prefixes from redistribution, include more specific entries at the top.
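First-match processing can be sketched as follows. This Python example is a simplified model with stated
assumptions: a table entry matches when the route falls within the entry's prefix, a prefix of None stands
for “Any”, and a route that matches no entry is not redistributed. The function name and table contents
are hypothetical.

```python
import ipaddress

def redistribution_action(route, table):
    """Return the action of the first matching entry, or 'deny' if nothing matches."""
    network = ipaddress.ip_network(route)
    for prefix, action in table:
        # None models an "Any" entry that matches every route (assumption).
        if prefix is None or network.subnet_of(ipaddress.ip_network(prefix)):
            return action
    return "deny"

# More specific "deny" entry placed above the general "permit" entry.
table = [
    ("172.16.30.0/24", "deny"),
    (None, "permit"),
]

print(redistribution_action("172.16.30.0/24", table))  # deny
print(redistribution_action("172.16.10.0/24", table))  # permit
```

If the two entries were swapped, the “Any” entry would match first and 172.16.30.0/24 would be redistributed.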
Due to the distributed nature of the NSX routing subsystem, there are a number of CLIs available,
accessible on various components of NSX. Starting in NSX version 6.2, NSX also has a centralized CLI
that helps reduce the “travel time” required to access and log in to various distributed components. It
provides access to most of the information from a single location: the NSX Manager shell.
n Find the segment ID (VXLAN VNI) for each logical switch connected to the DLR in question (for
example, 5004..5007).
n On the ESXi hosts where VMs served by this DLR are running, check the state of the VXLAN control
plane for the logical switches connected to this DLR.
n “multicast proxy” and “ARP proxy” are listed; “ARP proxy” will be listed even if you disabled IP
Discovery.
n “Port Count” looks right – there will be at least 1, even if there are no VMs on that host connected to
the logical switch in question. This one port is the vdrPort, a special dvPort connected to the DLR
kernel module on the ESXi host.
n Run the following command to make sure that the vdrPort is connected to each of the relevant
VXLANs.
~ # esxcli network vswitch dvs vmware vxlan network port list --vds-name=Compute_VDS --vxlan-id=5004
Switch Port ID VDS Port ID VMKNIC ID
-------------- ----------- ---------
50331656 53 0
50331650 vdrPort 0
~ # esxcli network vswitch dvs vmware vxlan network port list --vds-name=Compute_VDS --vxlan-id=5005
Switch Port ID VDS Port ID VMKNIC ID
-------------- ----------- ---------
50331650 vdrPort 0
n In the example above, VXLAN 5004 has one VM and one DLR connection, while VXLAN 5005 only
has a DLR connection.
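When many logical switches need to be checked, the port list output can be summarized with a small script.
The following Python sketch parses text that has the same three columns as the output above. The function
name is hypothetical, and the sample strings reproduce the values shown above with assumed column spacing.

```python
def summarize_port_list(output):
    """Return (vdr_port_present, vm_port_count) for one VXLAN port list."""
    vdr_port_present = False
    vm_port_count = 0
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly three fields and start with a numeric switch port ID.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        if fields[1] == "vdrPort":
            vdr_port_present = True
        else:
            vm_port_count += 1
    return vdr_port_present, vm_port_count

VXLAN_5004 = """\
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
50331656        53           0
50331650        vdrPort      0
"""

VXLAN_5005 = """\
Switch Port ID  VDS Port ID  VMKNIC ID
--------------  -----------  ---------
50331650        vdrPort      0
"""

print(summarize_port_list(VXLAN_5004))  # (True, 1)
print(summarize_port_list(VXLAN_5005))  # (True, 0)
```

A first value of False for a VXLAN that is connected to the DLR indicates that the vdrPort is missing.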
n Check whether the appropriate VMs have been properly wired to their corresponding VXLANs, for
example web-sv-01a on VXLAN 5004.
~ # esxcfg-vswitch -l
DVS Name Num Ports Used Ports Configured Ports MTU Uplinks
Compute_VDS 1536 10 512 1600 vmnic0
3 Run show logical-router host hostID connection to get the status information.
Connection Information:
-----------------------
n A DVS enabled with VXLAN will have one vdrPort created, shared by all DLR instances on that ESXi
host.
n “NumLifs” refers to the number that is the sum of LIFs from all DLR instances that exist on this host.
n “VdrVmac” is the vMAC that the DLR uses on all LIFs across all instances. This MAC is the same on
all hosts. It is never seen in any frames that travel the physical network outside of ESXi hosts.
n For each dvUplink of DVS enabled with VXLAN, there is a matching VTEP; except in cases where
LACP / Etherchannel teaming mode is used, when only one VTEP is created irrespective of the
number of dvUplinks.
n Traffic routed by the DLR (SRC MAC = vMAC) when leaving the host will get the SRC MAC
changed to pMAC of a corresponding dvUplink.
n Note that the original VM’s source port or source MAC is used to determine the dvUplink (it is
preserved for each packet in its DVS's metadata).
n When there are multiple VTEPs on the host and one of dvUplinks fails, the VTEP associated with
the failed dvUplink will be moved to one of the remaining dvUplinks, along with all VMs that were
pinned to that VTEP. This is done to avoid flooding control plane changes that would be
associated with moving VMs to a different VTEP.
n The number in “()” next to each “dvUplinkX” is the dvPort number. It is useful for packet capture on
the individual uplink.
n The MAC address shown for each “dvUplinkX” is a “pMAC” associated with that dvUplink. This MAC
address is used for traffic sourced from the DLR, such as ARP queries generated by the DLR and
any packets that have been routed by the DLR when these packets leave the ESXi host. This MAC
address can be seen on the physical network (directly, if DLR LIF is VLAN type, or inside VXLAN
packets for VXLAN LIFs).
n Pkt Dropped / Replaced / Skipped refer to counters related to internal implementation details of the
DLR, and are not typically used for troubleshooting or monitoring.
2 Check the routing table and determine the IP address of the next hop.
5 Build an L2 frame.
n A routing table
n An ARP table
Let’s take a sample routed topology and create a set of logical switches and a DLR to create it in NSX.
[Figure: sample routed topology. A Transit logical switch connects an interface at 192.168.10.1 (link type:
internal) to the Logical Router at 192.168.10.2 (link type: uplink, protocol address 192.168.10.3). The
Logical Router has internal links 172.16.20.1 to the App logical switch (App VM, 172.16.20.10) and
172.16.10.1 to the Web logical switch (Web VM, 172.16.10.10).]
n One DLR connected to the 4 logical switches; one logical switch is for the “Uplink,” while the rest are
Internal.
n An external gateway, which could be an ESG, serving as an upstream gateway for the DLR.
The “Ready to complete” wizard screen shows for the DLR above.
After the deployment of the DLR finishes, ESXi CLI commands can be used to view and validate the
distributed state of the DLR in question on the participating hosts.
1 From the NSX Manager shell, run show cluster all to get the cluster ID.
3 Run show logical-router host hostID dlr all verbose to get the status information.
n This command displays all DLR instances that exist on the given ESXi host.
n “Vdr Name” consists of “Tenant” + “Edge Id.” In the example, “Tenant” was not specified, so the word
“default” is used. The “Edge Id” is “edge-1,” which can be seen in the NSX UI.
n In cases where there are many DLR instances on a host, a method for finding the right instance is
to look for the “Edge ID” displayed in the UI “NSX Edges."
n “Number of Lifs” refers to the LIFs that exist on this individual DLR instance.
n “Number of Routes” is in this case 5, which consists of 4 x directly connected routes (one for each
LIF), and a default route.
n “State,” “Controller IP,” and “Control Plane Active” refer to the state of the DLR’s control plane and
must list the correct Controller IP, with Control Plane Active: Yes. Remember, the DLR function
requires working Controllers; the output above shows what is expected for a healthy DLR instance.
n “Control Plane IP” refers to the IP address that the ESXi host uses to talk to the Controller. This IP is
always the one associated with the ESXi host’s Management vmknic, which in most cases is vmk0.
n “Edge Active” shows whether or not this host is the one where the Control VM for this DLR instance is
running and in Active state.
n The placement of the Active DLR Control VM determines which ESXi host is used to perform
NSX L2 bridging, if it is enabled.
n There is also a “brief” version of the above command that produces a compressed output useful for a
quick overview. Note that “Vdr Id” is displayed in hexadecimal format here:
State Legend: [A: Active], [D: Deleting], [X: Deleted], [I: Init]
State Legend: [SF-R: Soft Flush Route], [SF-L: Soft Flush LIF]
The “Soft Flush” states are short-lived transient states of the LIF lifecycle and are not normally seen in
a healthy DLR.
1 From the NSX Manager shell, run show cluster all to get the cluster ID.
3 Run show logical-router host hostID dlr all brief to get the dlrID (Vdr Name).
4 Run show logical-router host hostID dlr dlrID interface all brief to get summarized
status information for all interfaces.
5 Run show logical-router host hostID dlr dlrID interface (all | intName) verbose to
get the status information for all interfaces or for a specific interface.
nsxmgr# show logical-router host hostID dlr dlrID interface all verbose
Name: 570d45550000000a
Mode: Routing, Distributed, Internal
Id: Vxlan:5000
Ip(Mask): 172.16.10.1(255.255.255.0)
Connected Dvs: Compute_VDS
VXLAN Control Plane: Enabled
VXLAN Multicast IP: 0.0.0.1
State: Enabled
Flags: 0x2388
DHCP Relay: Not enabled
Name: 570d45550000000c
Mode: Routing, Distributed, Internal
Id: Vxlan:5002
Ip(Mask): 172.16.30.1(255.255.255.0)
Connected Dvs: Compute_VDS
VXLAN Control Plane: Enabled
VXLAN Multicast IP: 0.0.0.1
State: Enabled
Flags: 0x2288
DHCP Relay: Not enabled
Name: 570d45550000000b
Mode: Routing, Distributed, Internal
Id: Vxlan:5001
Ip(Mask): 172.16.20.1(255.255.255.0)
Connected Dvs: Compute_VDS
VXLAN Control Plane: Enabled
VXLAN Multicast IP: 0.0.0.1
State: Enabled
Flags: 0x2388
DHCP Relay: Not enabled
Name: 570d455500000002
Mode: Routing, Distributed, Uplink
Id: Vxlan:5003
Ip(Mask): 192.168.10.2(255.255.255.248)
Connected Dvs: Compute_VDS
VXLAN Control Plane: Enabled
VXLAN Multicast IP: 0.0.0.1
State: Enabled
Flags: 0x2208
DHCP Relay: Not enabled
n LIF's “Mode” shows whether the LIF is routing or bridging, and whether it is internal or uplink.
n “Id” shows the LIF type and the corresponding service ID (VXLAN and VNI, or VLAN and VID).
n If a LIF is connected to a VXLAN in hybrid or unicast mode, “VXLAN Control Plane” is “Enabled.”
n For VXLAN LIFs where VXLAN is in unicast mode, “VXLAN Multicast IP” is shown as “0.0.0.1”;
otherwise the actual multicast IP address is displayed.
n “State” should be “Enabled” for routed LIFs. For bridging LIFs, it is “Enabled” on the host that is
performing bridging and “Init” on all other hosts.
n “Flags” is a summary representation of the LIF’s state and shows whether the LIF is:
n Routed or Bridged
n Of note is the flag 0x0100, which is set when a VXLAN VNI join was caused by the DLR (as
opposed to a host having a VM on that VXLAN)
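The 0x0100 flag can be tested with a bitwise AND. The following Python sketch applies that test to the
“Flags” values from the sample output above. Only the meaning of the 0x0100 bit is taken from this guide;
the constant and function names are hypothetical, and the other bits are not interpreted here.

```python
DLR_JOINED_VNI = 0x0100  # set when the VXLAN VNI join was caused by the DLR

def dlr_caused_vni_join(flags):
    """True if the 0x0100 bit is set in a LIF's Flags value."""
    return bool(flags & DLR_JOINED_VNI)

# (LIF Id, Flags) pairs copied from the sample verbose output.
sample_lifs = [
    ("Vxlan:5000", 0x2388),
    ("Vxlan:5002", 0x2288),
    ("Vxlan:5001", 0x2388),
    ("Vxlan:5003", 0x2208),
]

for lif_id, flags in sample_lifs:
    print(lif_id, dlr_caused_vni_join(flags))
```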
nsxmgr# show logical-router host hostID dlr dlrID interface all brief
DLR’s Routes
After you have established that a DLR is present and healthy and it has all the LIFs, the next thing to
check is the routing table.
1 From the NSX Manager shell, run show cluster all to get the cluster ID.
3 Run show logical-router host hostID dlr all brief to get the dlrID (Vdr Name).
4 Run show logical-router host hostID dlr dlrID route to display the routing table of that DLR
instance.
Points to note:
n “Interface” shows the egress LIF that will be selected for the corresponding “Destination." It is set to
the “Lif Name” of one of the DLR’s LIFs.
n For ECMP routes, there will be more than one route with the same Destination, GenMask, and
Interface, but a different Gateway. Flags will also include “E” to reflect the ECMP nature of these
routes.
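ECMP routes can be identified in a parsed routing table by grouping entries on Destination, GenMask, and
Interface. The following Python sketch is illustrative only. The route rows are hypothetical sample data,
not captured output, and the dictionary keys and function name are assumptions made for the example.

```python
from collections import defaultdict

def ecmp_groups(routes):
    """Map (destination, genmask, interface) to its gateways when more than one exists."""
    gateways = defaultdict(set)
    for route in routes:
        key = (route["destination"], route["genmask"], route["interface"])
        gateways[key].add(route["gateway"])
    return {key: sorted(gws) for key, gws in gateways.items() if len(gws) > 1}

# Hypothetical rows: two default routes through the same uplink LIF, plus one connected route.
routes = [
    {"destination": "0.0.0.0", "genmask": "0.0.0.0",
     "gateway": "192.168.10.1", "interface": "570d455500000002"},
    {"destination": "0.0.0.0", "genmask": "0.0.0.0",
     "gateway": "192.168.10.4", "interface": "570d455500000002"},
    {"destination": "172.16.10.0", "genmask": "255.255.255.0",
     "gateway": "0.0.0.0", "interface": "570d45550000000a"},
]

print(ecmp_groups(routes))
```

Only the default route appears in the result, because it is the only destination with more than one gateway.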
Controllers play no role in this process and are not used to distribute resulting ARP entries to other hosts.
Inactive cached entries are kept for 600 seconds, then removed. For more information about the DLR
ARP resolution process, see DLR ARP Resolution Process.
1 From the NSX Manager shell, run show cluster all to get the cluster ID.
3 Run show logical-router host hostID dlr all brief to get the dlrID (Vdr Name).
4 Run show logical-router host hostID dlr dlrID arp to display the ARP table of that DLR
instance.
Things to note:
n All ARP entries for the DLR’s own LIFs (“I” Flag) are the same and show the same vMAC that was
discussed in VXLAN Preparation Check.
n ARP entries with the “L” Flag correspond to the VMs running on the host where the CLI command is
run.
n “SrcPort” shows the dvPort ID where the ARP entry was originated. In cases where an ARP entry
was originated on another host, the dvUplink’s dvPort ID is shown. This dvPort ID can be cross-
referenced with the dvUplink dvPort ID discussed in VXLAN Preparation Check.
n The “Nascent” flag is not normally observed. It is set while the DLR is waiting for the ARP reply to
arrive. Any entries with that flag set might indicate that there is a problem with ARP resolution.
[Figure: hosts with a DVS each, VM 1, VM 2, VM 3, and an External Gateway VM (192.168.10.1/29)]
n Each host has an “L2 Switch” (DVS), and a “Router on a stick” (DLR kernel module), connected to
that “switch” via a “trunk” interface (vdrPort).
n Note that this “trunk” can carry both VLANs and VXLANs; however, there are no 802.1Q or
UDP/VXLAN headers present in the packets that traverse the vdrPort. Instead, the DVS uses an
internal metadata tagging method to communicate that information to the DLR kernel module.
n When the DVS sees a frame with Destination MAC = vMAC, it knows that it is for the DLR, and
forwards that frame to the vdrPort.
n After packets arrive in the DLR kernel module via the vdrPort, their metadata is examined to
determine the VXLAN VNI or VLAN ID that they belong to. This information is then used to determine
which LIF of which DLR instance that packet belongs to.
n The side effect of this system is that no more than one DLR instance can be connected to a given
VLAN or VXLAN.
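The lookup described above, and the restriction that follows from it, can be modeled with a table keyed by
segment. The following Python sketch is a conceptual model, not the DLR kernel module's data structure. The
class name, the method names, and the instance name “default+edge-2” are hypothetical.

```python
class VdrPortDemux:
    """Maps a segment, such as ("vxlan", 5000), to one (DLR instance, LIF) pair."""

    def __init__(self):
        self._lif_by_segment = {}

    def attach(self, segment, dlr_instance, lif_name):
        owner = self._lif_by_segment.get(segment)
        if owner is not None and owner[0] != dlr_instance:
            # A segment can identify only one DLR instance.
            raise ValueError(f"{segment} is already connected to {owner[0]}")
        self._lif_by_segment[segment] = (dlr_instance, lif_name)

    def lookup(self, segment):
        """Return the (DLR instance, LIF) for a packet's segment metadata, or None."""
        return self._lif_by_segment.get(segment)

demux = VdrPortDemux()
demux.attach(("vxlan", 5000), "default+edge-1", "570d45550000000a")
print(demux.lookup(("vxlan", 5000)))

try:
    demux.attach(("vxlan", 5000), "default+edge-2", "hypothetical-lif")
except ValueError as error:
    print("rejected:", error)
```

Because the segment is the only key available at the vdrPort, a second DLR instance on the same VLAN or
VXLAN would make the lookup ambiguous.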
In cases where more than one DLR instance exists, the diagram above would look like this:
[Figure: two hosts, each with a DVS carrying VXLAN B, VXLAN C, and VLAN D; VM 1 through VM 5 are shown,
each with its own IP and MAC address]
This would correspond to a network topology with two independent routing domains, operating in
complete separation from each other, potentially with overlapping IP addresses.
Figure 3‑10. Network Topology Corresponding with Two Hosts and Two DLR Instances
[Figure content: RTR A connects Subnet A (IF A) and Subnet B (IF B); RTR B connects Subnet C (IF C) and
Subnet D (IF D); servers SVR 2 and SVR 5 are shown with their IP and MAC addresses]
n MAC address to insert into egress frames to reach the next hops (ARP table)
This information is delivered to the instances distributed across multiple ESXi hosts.
When a UI wizard is submitted with the “Finish” button or an API call is made to deploy a new DLR, the
system processes through the following steps:
1 NSX Manager receives an API call to deploy a new DLR (directly or from vSphere Web Client,
invoked by the UI wizard).
2 NSX Manager calls its linked vCenter Server to deploy a DLR Control VM (or a pair, if HA was
requested).
a DLR Control VM is powered on and connects back to the NSX Manager, ready to receive
configuration.
b If an HA pair was deployed, NSX Manager configures an anti-affinity rule that will keep the HA
pair running on different hosts. DRS then takes action to move them apart.
a NSX Manager looks up the logical switches that are to be connected to the new DLR to
determine which transport zone they belong to.
b It then looks up a list of clusters that are configured in this transport zone and creates the new
DLR on each host in these clusters.
c At this point, hosts only know the new DLR ID, but they do not have any corresponding
information (LIFs or routes).
a Controller Cluster allocates one of the Controller nodes to be the master for this DLR instance.
5 NSX Manager sends the configuration, including LIFs, to the DLR Control VM.
a ESXi hosts (including the one where the DLR Control VM is running) receive slicing information
from the Controller Cluster, determine which Controller node is responsible for the new DLR
instance, and connect to the Controller node (if there was no existing connection).
6 After LIF creation on DLR Control VM, the NSX Manager creates the new DLR’s LIFs on the
Controller Cluster.
7 DLR Control VM connects to the new DLR instance’s Controller node, and sends the Controller node
the routes:
a First the DLR translates its routing table into the forwarding table (by resolving prefixes to LIFs).
b Then the DLR sends the resulting table to the Controller node.
8 Controller node pushes LIFs and routes to the other hosts where the new DLR instance exists, via the
connection established in step 5.a.
[Figure: state held by each component (LIFs, routes, default route, DLRs, per-DLR configuration) and the
external router]
1 The NSX Manager receives an API call to change the existing DLR’s configuration, in this case to add
dynamic routing.
2 The NSX Manager sends the new configuration to the DLR Control VM.
3 The DLR Control VM applies the configuration and goes through the process of establishing routing
adjacencies, exchanging routing information, and so on.
4 After the routing exchange, the DLR Control VM calculates the forwarding table and sends it to the
DLR’s master Controller node.
5 The DLR’s master Controller node then distributes the updated routes to the ESXi hosts where the
DLR instance exists.
Note that the DLR instance on the ESXi host where the DLR Control VM is running receives its LIFs and
routes only from the DLR’s master Controller node, never directly from the DLR Control VM or the NSX
Manager.
The figure shows the components and the corresponding communication channels between them.
[Figure: NSX Manager, master Controller, vsfwd, and DLR Instance A, connected by REST/SSL, RMQ/SSL,
TCP/SSL, Socket, VMCI, VMKLINK, and UDP/6999 channels]
n NSX Manager:
n Has a direct permanent connection with the message bus client (vsfwd) process running on each
host prepared for NSX
n For each DLR instance, one Controller node (out of the available 3) is elected as master
n The master function can move to a different Controller node, if the original Controller node fails
n Each ESXi host runs two User World Agents (UWA): message bus client (vsfwd) and control plane
agent (netcpa)
n netcpa requires information from the NSX Manager to function (for example, where to find
Controllers and how to authenticate to them); this information is accessed via the message bus
connection provided by vsfwd
n netcpa also communicates with the DLR kernel module to program it with the relevant information
it receives from Controllers
n For each DLR instance, there is a DLR Control VM, which is running on one of the ESXi hosts; the
DLR Control VM has two communication channels:
n VMCI channel to the NSX Manager via vsfwd, which is used for configuring the Control VM
n VMCI channel to the DLR master Controller via netcpa, which is used to send the DLR’s routing
table to the Controller
n In cases where the DLR has a VLAN LIF, one of the participating ESXi hosts is nominated by the
Controller as a designated instance (DI). The DLR kernel module on other ESXi hosts requests that
the DI perform proxy ARP queries on the associated VLAN.
n NSX Manager
n Cluster of Controllers
n ESGs
NSX Manager
NSX Manager provides the following functions relevant to NSX routing:
n Acts as a centralized management plane, providing the unified API access point for all NSX
management operations
n Installs the Distributed Routing Kernel Module and User World Agents on hosts to prepare them for
NSX functions
n Configures the Controller Cluster via a REST API and hosts via a message bus:
n Generates and distributes to hosts and controllers the certificates to secure control plane
communications
n Configures ESGs and DLR Control VMs via the message bus
n Note that ESGs can be deployed on unprepared hosts, in which case VIX will be used in lieu of
the message bus
Cluster of Controllers
NSX distributed routing requires Controllers, clustered for scale and availability, which provide the
following functions:
n Master node receives routing information from the DLR Control VM and distributes it to the hosts
n Selects designated instance for VLAN LIFs and communicates this information to hosts; monitors
DI host via control plane keepalives (timeout is 30 seconds, and detection time can be 20-40
seconds), sends hosts an update if the selected DI host disappears
n Control Plane Agent (netcpa) is a TCP (SSL) client that communicates with the Controller using the
control plane protocol. It might connect to multiple controllers. netcpa communicates with the
Message Bus Client (vsfwd) to retrieve control plane related information from NSX Manager.
n The agent is packaged into the VXLAN VIB (vSphere installation bundle)
n Installed by NSX Manager via EAM (ESX Agency Manager) during host preparation
n Can be restarted remotely via Networking and Security UI Installation -> Host Preparation ->
Installation Status, on individual hosts or on a whole cluster
n Configured by netcpa
n Connects to DVS via the special trunk called “vdrPort,” which supports both VLANs and VXLANs
n Message Bus Client (vsfwd) is used by netcpa, ESGs, and DLR Control VMs to communicate with the
NSX Manager
n vsfwd obtains NSX Manager’s IP address from /UserVars/RmqIpAddress set by vCenter via
vpxa/hostd and logs into the Message Bus server using per-host credentials stored in
other /UserVars/Rmq* variables
n Obtain host’s control plane SSL private key and certificate from NSX Manager. These are then
stored in /etc/vmware/ssl/rui-for-netcpa.*
n Get IP addresses and SSL thumbprints of Controllers from NSX Manager. These are then stored
in /etc/vmware/netcpa/config-by-vsm.xml.
n Create and delete DLR instances on its host on instruction from NSX Manager
n Can be started / stopped / queried via its startup script /etc/init.d/vShield-Stateful-Firewall
n ESGs and DLR Control VMs use VMCI channel to vsfwd to receive configuration from NSX Manager
n Can run one of two available dynamic routing protocols (BGP or OSPF) and/or use static routes
n Computes forwarding table from directly connected (LIF) subnets, static, and dynamic routes, and
sends it via its VMCI link to netcpa to the DLR instance’s master Controller
n The “instance” sub-command of the “show control-cluster logical-routers” command displays the list of
hosts that are connected to this Controller for this DLR Instance. In a correctly functioning
environment, this list includes all hosts from all clusters where the DLR exists.
n The “interface-summary” sub-command displays the LIFs that the Controller learned from the NSX Manager. This
information is sent to the hosts.
n The “routes” sub-command shows the routing table sent to this Controller by this DLR’s Control VM. Note that unlike
on the ESXi hosts, this table does not include any directly connected subnets because this
information is provided by the LIF configuration.
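The host check described for the “instance” sub-command amounts to a set comparison. The following is a minimal sketch, assuming you have already collected the host addresses that the Controller reports and the hosts you expect to participate. The function name and the sample addresses are hypothetical and are not part of any NSX tool.

```python
def missing_hosts(expected_hosts, connected_hosts):
    """Return the expected hosts that the Controller does not list for a DLR instance."""
    return sorted(set(expected_hosts) - set(connected_hosts))


# Hypothetical sample data: hosts in the clusters where the DLR exists,
# and hosts the Controller reports as connected for that DLR instance.
expected = ["192.168.210.51", "192.168.210.52", "192.168.210.56"]
connected = ["192.168.210.51", "192.168.210.56"]
print(missing_hosts(expected, connected))
```

Any host returned by this comparison is a candidate for checking the netcpa connection to the Controller.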
DLR Control VM
DLR Control VM has LIFs and routing/forwarding tables. The major output of DLR Control VM’s lifecycle
is the DLR routing table, which is a product of Interfaces and Routes.
n The purpose of the Forwarding Table is to show which DLR interface is chosen as the egress for a
given destination subnet.
n The “VDR” interface is displayed for all LIFs of “Internal” type. The “VDR” interface is a pseudo-
interface that does not correspond to a vNIC.
Notes of interest:
n Interface “VDR” does not have a VM NIC (vNIC) associated with it. It is a single “pseudo-interface”
that is configured with all IP addresses for all DLR’s “Internal” LIFs.
n The output above was taken from a DLR deployed with HA enabled, and the HA interface is
assigned an IP address. This appears as two IP addresses, 169.254.1.1/30 (auto-assigned for
HA), and 10.10.10.1/24, manually assigned to the HA interface.
n On an ESG, the operator can manually assign one of its vNICs as HA, or leave it as default for
the system to choose automatically from available “Internal” interfaces. Having the “Internal” type
is a requirement, or HA will fail.
n Note that the IP address seen on this interface is the same as the DLR’s LIF; however, the DLR
Control VM will not answer ARP queries for the LIF IP address (in this case, 192.168.10.2/29).
An ARP filter applied to this vNIC’s MAC address suppresses those replies.
n The point above holds true until a dynamic routing protocol is configured on the DLR, when the IP
address will be removed along with the ARP filter and replaced with the “Protocol IP” address
specified during the dynamic routing protocol configuration.
n This vNIC is used by the dynamic routing protocol running on the DLR Control VM to
communicate with the other routers to advertise and learn routes.
n After an edge is disconnected and an HA failover has occurred, the IP address of the disconnected edge
interface is removed from the routing information base (RIB) and forwarding information base (FIB) of the
active edge. The FIB table of the standby edge, as displayed by the show ip forwarding command, still
shows the IP address. This is expected behavior.
NSX Manager
Table 3‑2. NSX Manager Failure Modes and Effects
Failure Mode: Loss of network connectivity to NSX Manager VM
Failure Effects:
n Total outage of all NSX Manager functions, including CRUD for NSX routing/bridging
n No configuration data loss
n No data or control-plane outage
Failure Mode: Loss of network connectivity between NSX Manager and ESXi hosts or RabbitMQ server failure
Failure Effects:
n If DLR Control VM or ESG is running on affected hosts, CRUD operations on them fail
n Creation and deletion of DLR instances on affected hosts fail
n No configuration data loss
n No data or control-plane outage
n Any dynamic routing updates continue to work
Failure Mode: Loss of network connectivity between NSX Manager and Controllers
Failure Effects:
n Create, update, and delete operations for NSX distributed routing and bridging fail
n No configuration data loss
n No data or control-plane outage
Failure Mode: NSX Manager VM is destroyed (datastore failure)
Failure Effects:
n Total outage of all NSX Manager functions, including CRUD for NSX routing/bridging
n Risk of a subset of routing/bridging instances becoming orphaned if NSX Manager is restored to an
older configuration, requiring manual clean-up and reconciliation
n No data or control-plane outage, unless reconciliation is required
Controller Cluster
Table 3‑3. NSX Controller Failure Modes and Effects
Failure Mode: Controller cluster loses network connectivity with ESXi hosts
Failure Effects:
n Total outage for DLR Control Plane functions (create, update, and delete routes, including dynamic)
n Outage for DLR Management Plane functions (create, update, and delete LIFs on hosts)
n VXLAN forwarding is affected, which may cause the end-to-end (L2+L3) forwarding process to also fail
n Data plane continues working based on the last-known state
Failure Mode: One or two Controllers lose connectivity with ESXi hosts
Failure Effects:
n If the affected Controller can still reach other Controllers in the cluster, any DLR instances mastered
by this Controller experience the same effects as described above. Other Controllers do not
automatically take over
Failure Mode: One Controller loses network connectivity with other Controllers (or completely)
Failure Effects:
n Two remaining Controllers take over VXLANs and DLRs handled by the isolated Controller
n Affected Controller goes into Read-Only mode, drops its sessions to hosts, and refuses new ones
Failure Mode: Controllers lose connectivity with each other
Failure Effects:
n All Controllers will go into Read-Only mode, close connections to hosts, and refuse new ones
n Create, update, and delete operations for all DLRs’ LIFs and routes (including dynamic) fail
n NSX routing configuration (LIFs) might get out of sync between the NSX Manager and Controller
Cluster, requiring manual intervention to resync
n Hosts will continue operating on last known control plane state
Failure Mode: Two Controller VMs are lost
Failure Effects:
n Remaining Controller will go into read-only mode; effect is the same as when Controllers lose
connectivity with each other (above). Likely to require manual cluster recovery
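The Controller failure modes above are consistent with a majority rule: a node stays in normal operation only while it can reach more than half of the cluster, itself included. The following is a minimal sketch of that reasoning, assuming the three-node cluster described earlier. The function is illustrative only and is not Controller code.

```python
def controller_mode(reachable_nodes, cluster_size=3):
    """Classify a Controller node by how many cluster members it can reach,
    counting itself. More than half of the cluster means normal operation."""
    if not 1 <= reachable_nodes <= cluster_size:
        raise ValueError("reachable_nodes must be between 1 and cluster_size")
    return "normal" if reachable_nodes * 2 > cluster_size else "read-only"


# One isolated Controller sees only itself; the other two still see each other.
print(controller_mode(1))
print(controller_mode(2))
```

Under this assumption, losing two of three Controller VMs leaves the survivor with one reachable node, which matches the read-only outcome in the last failure mode.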
Host Modules
netcpa relies on host SSL key and certificate, plus SSL thumbprints, to establish secure communications
with the Controllers. These are obtained from NSX Manager via the message bus (provided by vsfwd).
If certificate exchange process fails, netcpa will not be able to successfully connect to Controllers.
Note: This section doesn’t cover failure of kernel modules, as the effect of this is severe (PSOD) and is a
rare occurrence.
Failure Mode: vsfwd uses username/password authentication to access message bus server, which can expire
Failure Effects:
n If a vsfwd on a freshly prepared ESXi host cannot reach NSX Manager within two hours, the temporary
login/password supplied during installation expires, and message bus on this host becomes inoperable
Effects of failure of the Message Bus Client (vsfwd) depend on the timing.
Failure Mode: If it fails before other parts of NSX control plane had a chance to reach steady running state
Failure Effects:
n Distributed routing on the host stops functioning, because the host is not able to talk to Controllers
n Host does not learn DLR instances from NSX Manager
Failure Mode: If it fails after host has reached steady state
Failure Effects:
n ESGs and DLR Control VMs running on the host won’t be able to receive configuration updates
n Host does not learn of new DLRs, and is not able to delete existing DLRs
n Host datapath will continue operating based on the configuration host had at the time of failure
Effects of failure of the Control Plane Agent (netcpa) depend on the timing.
Failure Mode: If it fails before NSX datapath kernel modules had a chance to reach steady running state
Failure Effects:
n Distributed routing on the host stops functioning
Failure Mode: If it fails after host has reached steady state
Failure Effects:
n DLR Control VM(s) running on the host will not be able to send their forwarding table updates to
Controller(s)
n Distributed routing datapath will not receive any LIF or route updates from Controller(s), but will
continue operating based on the state it had before the failure
DLR Control VM
Table 3‑6. DLR Control VM Failure Modes and Effects
Failure Mode: DLR Control VM is lost or powered off
Failure Effects:
n Create, update, and delete operations for this DLR’s LIFs and routes fail
n Any dynamic route updates will not be sent to hosts (including withdrawal of prefixes received via
now broken adjacencies)
Failure Mode: DLR Control VM loses connectivity with the NSX Manager and Controllers
Failure Effects:
n Same effects as above, except if DLR Control VM and its routing adjacencies are still up, traffic to
and from previously learned prefixes will not be affected
Failure Mode: DLR Control VM loses connection with the NSX Manager
Failure Effects:
n NSX Manager’s create, update, and delete operations for this DLR’s LIFs and routes fail and are not
re-tried
n Dynamic routing updates continue to propagate
Failure Mode: DLR Control VM loses connection with the Controllers
Failure Effects:
n Any routing changes (static or dynamic) for this DLR do not propagate to hosts
If necessary, you can change the log level of NSX components. For more information, see "Setting the
Logging Level of NSX Components" topic in NSX Logging and System Events.
The NSX Manager log contains information related to the management plane, which covers create, read,
update, and delete (CRUD) operations.
Controller Logs
Controllers contain multiple modules, many with their own log files. Controller logs can be accessed using
the show log <log file> [ filtered-by <string> ] command. The log files relevant to routing are
as follows:
Controller logs are verbose and in most cases are only required when the VMware engineering team is
brought in to assist with troubleshooting in more difficult cases.
In addition to the show log CLI, individual log files can be observed in real time as they are being
updated, using the watch log <logfile> [ filtered-by <string> ] command.
The logs are included in the Controller support bundle that can be generated and downloaded by
selecting a Controller node in the NSX UI and clicking the Download tech support logs icon.
The logs can also be collected as part of the VM support bundle generated from vCenter Server. The log
files are accessible only to the users or user groups having the root privilege.
n From the CLI, enter enable mode, then run the export tech-support <[ scp | ftp ]> <URI>
command.
n From the vSphere Web Client, select the Download Tech Support Logs option in the Actions
menu.
n dvUplinks on the DVS enabled with VXLAN (teaming policy, names, UUID)
All of these files are created by control plane agent using information it receives from NSX Manager via
the message bus connection provided by vsfwd.
They are configuration and control-plane issues. Management plane issues, while possible, are not
common.
Issue: Protocol and forwarding IP addresses are reversed for dynamic routing
Effect: Dynamic protocol adjacency won’t come up
Issue: Transport zone is not aligned to the DVS boundary
Effect: Distributed routing does not work on a subset of ESXi hosts (those missing from the transport zone)
Issue: Dynamic routing protocol configuration mismatch (timers, MTU, BGP ASN, passwords, interface to
OSPF area mapping)
Effect: Dynamic protocol adjacency does not come up
Issue: DLR HA interface is assigned an IP address and redistribution of connected routes is enabled
Effect: DLR Control VM might attract traffic for the HA interface subnet and blackhole the traffic
When necessary, use the debug ip ospf or debug ip bgp CLI commands and observe logs on the
DLR Control VM or on the ESG console (not via SSH session) to detect protocol configuration issues.
n Host Control Plane Agent (netcpa) being unable to connect to NSX Manager through the message
bus channel provided by vsfwd
n Controller cluster having issues with handling the master role for DLR/VXLAN instances
Controller cluster issues related to handling of master roles can often be resolved by restarting one of the
NSX Controllers (restart controller on the Controller’s CLI).
NSX Manager
Starting in NSX 6.2, commands that were formerly run from the NSX Controller and other NSX
components to troubleshoot NSX routing are now run directly from the NSX Manager.
n Interfaces
4 Run debug copy [ scp | ftp ] ... to download captures for offline analysis.
The debug packet command uses tcpdump in the background and can accept filtering modifiers
formatted like tcpdump filtering modifiers on UNIX. The only consideration is that any white spaces in
the filter expression need to be replaced with underscores ("_").
For example, the following command displays all traffic through vNic_0 except SSH, to avoid looking at
the traffic belonging to the interactive session itself.
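The underscore rule can be illustrated with a short helper. This is a minimal sketch: the function names are hypothetical, and the debug packet display interface syntax used to assemble the full command is an assumption that should be checked against the NSX Command Line Interface Reference.

```python
def edge_capture_filter(tcpdump_expression):
    """Convert a tcpdump-style filter into the form the Edge CLI accepts,
    where every run of white space is replaced with a single underscore."""
    return "_".join(tcpdump_expression.split())


def debug_packet_display(interface, tcpdump_expression):
    """Assemble an Edge CLI capture command (assumed syntax, see the lead-in)."""
    return (
        f"debug packet display interface {interface} "
        f"{edge_capture_filter(tcpdump_expression)}"
    )


# Exclude SSH so the capture does not show the interactive session itself.
print(debug_packet_display("vNic_0", "port not 22"))
```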
ESXi Hosts
Hosts are closely connected to NSX Routing. Figure 3‑14 shows visually the components participating in
the routing subsystem and the NSX Manager CLI commands used to display information about them:
show logical-switch host hostID verbose
Packets captured in the datapath can assist with identifying problems at various stages of packet
forwarding. Figure 3‑15 covers the major capture points and respective CLI command to use.
Figure 3‑15 capture points (ESXi Host A, DLR Instance A with LIF A and LIF B) and the corresponding commands:
DLR Traffic:
Leaving DLR: pktcap-uw --switchport 50331566 --dir=0
Entering DLR: pktcap-uw --switchport 50331566 --dir=1
Unencapsulated uplink traffic:
DVS->Uplink: pktcap-uw --uplink vmnic0 --dir=1 --stage=0
Uplink->DVS: pktcap-uw --uplink vmnic0 --dir=0 --stage=1
Traffic encapsulated in VXLAN:
Leaving host: pktcap-uw --uplink vmnic0 --dir=1 --stage=1
Entering host: pktcap-uw --uplink vmnic0 --dir=0 --stage=0
VM Traffic:
Leaving VM: pktcap-uw --switchport 50331660 --dir=0
Entering VM: pktcap-uw --switchport 50331660 --dir=1
Other components shown in the figure: vdrPort, VXLAN A, VXLAN B, dvPort, DVS (Compute_VDS), dvUplink,
dvFilters, SwSec (1), DFW (2), IOChain, DLR, VXLAN, pNIC.
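The capture commands above differ only in the capture point, the direction, and the stage. The following is a minimal sketch of a helper that assembles them. The function name is hypothetical, and it uses only the pktcap-uw options that appear in the figure; the port ID and uplink name are the sample values from the figure.

```python
def pktcap_command(point, target, direction, stage=None):
    """Build a pktcap-uw command line.

    point     -- "switchport" (VM or vdrPort port ID) or "uplink" (vmnic name)
    target    -- the port ID or uplink name
    direction -- 0 or 1, as used in the figure
    stage     -- optional 0 or 1, used in the figure only for uplink captures
    """
    if point not in ("switchport", "uplink"):
        raise ValueError("point must be 'switchport' or 'uplink'")
    if direction not in (0, 1):
        raise ValueError("direction must be 0 or 1")
    parts = ["pktcap-uw", f"--{point}", str(target), f"--dir={direction}"]
    if stage is not None:
        parts.append(f"--stage={stage}")
    return " ".join(parts)


# VXLAN-encapsulated traffic leaving the host, and traffic leaving the DLR.
print(pktcap_command("uplink", "vmnic0", 1, 1))
print(pktcap_command("switchport", 50331566, 0))
```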
To troubleshoot issues with an NSX Edge appliance, validate that each troubleshooting step below is true
for your environment. Each step provides instructions or a link to a document, to eliminate possible
causes and take corrective action as necessary. The steps are ordered in the most appropriate sequence
to isolate the issue and identify the proper resolution. Do not skip a step.
Check the release notes for current releases to see if the problem is resolved.
Ensure that the minimum system requirements are met when installing VMware NSX Edge. See the NSX
Installation Guide.
n If the upgrade or redeploy succeeds but there is no connectivity for the Edge interface, verify
connectivity on the back-end Layer 2 switch. See https://kb.vmware.com/kb/2135285.
OR
n If the deployment or upgrade succeeds, but there is no connectivity on the Edge interfaces:
n Running the show interface command as well as Edge Support logs displays entries similar to:
In both cases, the host switch is not ready or has some issues. To resolve, investigate the host
switch.
Configuration Issues
n Collect the NSX Edge diagnostic information. See https://kb.vmware.com/kb/2079380.
Filter the NSX Edge logs by searching for the string vse_die. The logs near this string might provide
information about the configuration error.
n https://kb.vmware.com/kb/1008205
n https://kb.vmware.com/kb/1008014
n https://kb.vmware.com/kb/1010071
n https://kb.vmware.com/kb/2096171
A high value for the ksoftirqd process indicates a high incoming packet rate. Check whether logging is
enabled on the data path, such as for firewall rules. Run the show log follow command to determine
whether a large number of log hits are being recorded.
n Interface
n Driver
n L2
n L3
n Firewall
To run the command, log in to the NSX Edge CLI and enter basic mode. For more information, see the
NSX Command Line Interface Reference. For example:
Driver Errors
=============
TX TX TX RX RX RX
Interface Dropped Error Ring Full Dropped Error Out Of Buf
vNic_0 0 0 0 0 0 0
vNic_1 0 0 0 0 0 0
vNic_2 0 0 0 0 0 2
vNic_3 0 0 0 0 0 0
vNic_4 0 0 0 0 0 0
vNic_5 0 0 0 0 0 0
Interface Drops
===============
Interface RX Dropped TX Dropped
vNic_0 4 0
vNic_1 2710 0
vNic_2 0 0
vNic_3 2 0
vNic_4 2 0
vNic_5 2 0
L2 RX Errors
============
Interface length crc frame fifo missed
vNic_0 0 0 0 0 0
vNic_1 0 0 0 0 0
vNic_2 0 0 0 0 0
vNic_3 0 0 0 0 0
vNic_4 0 0 0 0 0
vNic_5 0 0 0 0 0
L2 TX Errors
============
Interface aborted fifo window heartbeat
vNic_0 0 0 0 0
vNic_1 0 0 0 0
vNic_2 0 0 0 0
vNic_3 0 0 0 0
vNic_4 0 0 0 0
vNic_5 0 0 0 0
L3 Errors
=========
IP:
ReasmFails : 0
InHdrErrors : 0
InDiscards : 0
FragFails : 0
InAddrErrors : 0
OutDiscards : 0
OutNoRoutes : 0
ReasmTimeout : 0
ICMP:
InTimeExcds : 0
InErrors : 227
OutTimeExcds : 0
OutDestUnreachs : 152
OutParmProbs : 0
InSrcQuenchs : 0
InRedirects : 0
OutSrcQuenchs : 0
InDestUnreachs : 151
OutErrors : 0
InParmProbs : 0
Ipv4 Rules
==========
Chain - INPUT
rid pkts bytes target prot opt in out source destination
0 119 30517 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID
0 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0
Chain - POSTROUTING
rid pkts bytes target prot opt in out source destination
0 101 4040 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID
0 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0
Ipv6 Rules
==========
Chain - INPUT
rid pkts bytes target prot opt in out source destination
0 0 0 DROP all * * ::/0 ::/0 state INVALID
0 0 0 DROP all * * ::/0 ::/0
Chain - POSTROUTING
rid pkts bytes target prot opt in out source destination
0 0 0 DROP all * * ::/0 ::/0 state INVALID
0 0 0 DROP all * * ::/0 ::/0
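When the drop statistics are saved to a file, the Interface Drops table can be scanned for non-zero counters. The following is a minimal sketch, assuming the three-column layout shown in the example output above. The function name is hypothetical.

```python
def interfaces_with_drops(table_text, threshold=0):
    """Return {interface: (rx_dropped, tx_dropped)} for rows above the threshold.

    Only rows of the form '<name> <rx> <tx>' are considered, so the title,
    the separator line, and the column header line are skipped.
    """
    drops = {}
    for line in table_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[1].isdigit() and fields[2].isdigit():
            rx, tx = int(fields[1]), int(fields[2])
            if rx > threshold or tx > threshold:
                drops[fields[0]] = (rx, tx)
    return drops


sample = """Interface Drops
===============
Interface RX Dropped TX Dropped
vNic_0 4 0
vNic_1 2710 0
vNic_2 0 0
"""
print(interfaces_with_drops(sample, threshold=100))
```

Raising the threshold helps separate a handful of startup drops from an interface that is dropping continuously.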
n NSX Edge is a virtual machine (VM) and consists of several files that are stored on a storage device.
The key files are the configuration file, virtual disk file(s), NVRAM setting file, swap file, and log file.
Based upon the VM Storage Profile applied or manual placement, the virtual machine configuration
files, virtual disk file, swap file can be placed in the same location, or in separate locations on different
datastores. In the case where the virtual machine files are present in different locations,
NSX Manager shows and uses the datastore which has the VMX file for the VM deployment. During
redeployment or upgrade operations, NSX Manager deploys the NSX Edge VM(s) on the configured
datastore or the live datastore which hosts the VMX files. The datastore name and the datastore ID
(of the datastore that hosts the VMX file of the VM) are returned as part of the Appliance parameter, and
are displayed on the UI or provided in the response to the REST API. Refer to vCenter Server for details
on the exact layout of each of the NSX Manager VM files and the one or more datastores where the files are
placed. For more information, refer to the following documentation:
To run the command, log in to the NSX Edge CLI and enter basic mode. For more information, see the
NSX Command Line Interface Reference. For example:
Ipv4 Rules
==========
Chain - INPUT
rid pkts bytes target prot opt in out source destination
0 119 30517 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID
0 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0
Chain - POSTROUTING
rid pkts bytes target prot opt in out source destination
0 101 4040 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 state INVALID
0 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0
Ipv6 Rules
==========
Chain - INPUT
rid pkts bytes target prot opt in out source destination
0 0 0 DROP all * * ::/0 ::/0 state INVALID
0 0 0 DROP all * * ::/0 ::/0
Chain - POSTROUTING
rid pkts bytes target prot opt in out source destination
0 0 0 DROP all * * ::/0 ::/0 state INVALID
0 0 0 DROP all * * ::/0 ::/0
1 Check the firewall rules table with the show firewall command. The usr_rules table displays the
configured rules.
Check whether the packet counter of the DROP rule that matches state INVALID is incrementing in the
POSTROUTING chain of the show firewall output. Typical reasons include:
n TCP-based applications that have been inactive for more than one hour. If there are inactivity
time-out issues and applications are idle for an unusually long time, increase inactivity-timeout
settings using the REST API. See https://kb.vmware.com/kb/2101275
Name: 0_131074-os-v6-1
Type: bitmap:if (Interface Match)
Revision: 3
Header: range 0-64000
Size in memory: 8116
References: 1
Number of entries: 1
Members:
vse (vShield Edge Device)
Name: 1_131075-ov-v4-1
Type: hash:oservice (Match un-translated Ports)
Revision: 2
Header: family inet hashsize 64 maxelem 65536
Size in memory: 704
References: 1
Number of entries: 2
Members:
Name: 1_131075-ov-v6-1
Type: hash:oservice (Match un-translated Ports)
Revision: 2
Header: family inet hashsize 64 maxelem 65536
Size in memory: 704
References: 1
Number of entries: 2
Members:
Proto=89, DestPort=Any, SrcPort=Any (encoded: 0.89.0.0/16,0.89.0.0/16)
Proto=6, DestPort=179, SrcPort=Any (encoded: 0.6.0.179,0.6.0.0/16)
3 Enable logging on a particular firewall rule using the REST API or the Edge user interface, and
monitor the logs with the show log follow command.
If logs are not seen, enable logging on the DROP Invalid rule using the following REST API.
URL : https://NSX_Manager_IP/api/4.0/edges/{edgeId}/firewall/config/global
PUT Method
Input representation
<globalConfig> <!-- Optional -->
<tcpPickOngoingConnections>false</tcpPickOngoingConnections> <!-- Optional. Defaults to false -->
<tcpAllowOutOfWindowPackets>false</tcpAllowOutOfWindowPackets> <!-- Optional. Defaults to false -->
<tcpSendResetForClosedVsePorts>true</tcpSendResetForClosedVsePorts> <!-- Optional. Defaults to true -->
<dropInvalidTraffic>true</dropInvalidTraffic> <!-- Optional. Defaults to true -->
<logInvalidTraffic>true</logInvalidTraffic> <!-- Optional. Defaults to false -->
<tcpTimeoutOpen>30</tcpTimeoutOpen> <!-- Optional. Defaults to 30 -->
<tcpTimeoutEstablished>3600</tcpTimeoutEstablished> <!-- Optional. Defaults to 3600 -->
<tcpTimeoutClose>30</tcpTimeoutClose> <!-- Optional. Defaults to 30 -->
<udpTimeout>60</udpTimeout> <!-- Optional. Defaults to 60 -->
<icmpTimeout>10</icmpTimeout> <!-- Optional. Defaults to 10 -->
<icmp6Timeout>10</icmp6Timeout> <!-- Optional. Defaults to 10 -->
<ipGenericTimeout>120</ipGenericTimeout> <!-- Optional. Defaults to 120 -->
</globalConfig>
Output representation
No payload
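The request above can be prepared from a script. The following is a minimal sketch that builds the XML body and a PUT request object without sending it. The function names, the host name, and the use of HTTP basic authentication are assumptions for illustration. Only two of the optional elements are included; how omitted elements are treated should be confirmed in the NSX API Guide.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET


def global_config_xml(log_invalid=True, tcp_timeout_established=3600):
    """Build a globalConfig body that sets invalid-traffic logging
    and the established TCP timeout."""
    root = ET.Element("globalConfig")
    ET.SubElement(root, "logInvalidTraffic").text = "true" if log_invalid else "false"
    ET.SubElement(root, "tcpTimeoutEstablished").text = str(tcp_timeout_established)
    return ET.tostring(root, encoding="unicode")


def global_config_request(manager, edge_id, user, password, body):
    """Return a prepared PUT request. Nothing is sent until the caller
    passes it to urllib.request.urlopen."""
    url = f"https://{manager}/api/4.0/edges/{edge_id}/firewall/config/global"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={
            "Content-Type": "application/xml",
            "Authorization": f"Basic {token}",  # assumed authentication scheme
        },
    )


body = global_config_xml(log_invalid=True)
request = global_config_request("nsxmgr.example", "edge-5", "admin", "secret", body)
print(request.get_method(), request.full_url)
print(body)
```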
Use the show log follow command to look for logs similar to:
2016-04-18T20:53:31+00:00 edge-0 kernel: nf_ct_tcp: invalid TCP flag combination IN= OUT=
SRC=172.16.1.4 DST=192.168.1.4 LEN=40 TOS=0x00 PREC=0x00 TTL=64 ID=43343 PROTO=TCP
SPT=5050 DPT=80 SEQ=0 ACK=1572141176 WINDOW=512 RES=0x00 URG PSH FIN URGP=0
2016-04-18T20:53:31+00:00 edge-0 kernel: INVALID IN= OUT=vNic_1 SRC=172.16.1.4
DST=192.168.1.4 LEN=40 TOS=0x00 PREC=0x00 TTL=63 ID=43343 PROTO=TCP SPT=5050 DPT=80
WINDOW=512 RES=0x00 URG PSH FIN URGP=0
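Log entries like the ones above carry KEY=value pairs that are easy to extract when many lines must be compared. The following is a minimal sketch, assuming each log entry has been joined into a single line. The function names are hypothetical.

```python
import re

# Upper-case key, an equals sign, then a value that may be empty (as in "IN=").
FIELD_RE = re.compile(r"\b([A-Z]+)=(\S*)")


def parse_kernel_log_fields(line):
    """Return the KEY=value pairs of one netfilter log line as a dict."""
    return dict(FIELD_RE.findall(line))


def is_invalid_drop(line):
    """True for lines logged by the DROP Invalid rule."""
    return "kernel: INVALID" in line


line = (
    "2016-04-18T20:53:31+00:00 edge-0 kernel: INVALID IN= OUT=vNic_1 "
    "SRC=172.16.1.4 DST=192.168.1.4 LEN=40 TOS=0x00 PREC=0x00 TTL=63 "
    "ID=43343 PROTO=TCP SPT=5050 DPT=80 WINDOW=512 RES=0x00 URG PSH FIN URGP=0"
)
fields = parse_kernel_log_fields(line)
print(fields["SRC"], fields["DST"], fields["PROTO"], fields["DPT"])
```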
4 Check for matching connections in the Edge firewall state table with the show flowtable rule_id
command:
Compare the active connection count and the maximum allowed count with the show flowstats
command:
5 Check the Edge logs with the show log follow command, and look for any ALG drops. Search for
strings similar to tftp_alg, msrpc_alg, or oracle_tns. For additional information, see:
n https://kb.vmware.com/kb/2126674
n https://kb.vmware.com/kb/2137751
2 Capture traffic simultaneously on both interfaces, write the output to a file, and export it using SCP.
For example:
For simultaneous packet capture, use the pktcap-uw packet capture utility in ESXi. See
https://kb.vmware.com/kb/2051814.
If the packet drops are consistent, check for configuration errors related to:
n Asymmetric routing
n RP filter checks
b If there are missing routes at the data plane, run these commands:
n show ip route
c Check the routing table for needed routes by running the show ip forwarding command.
n top
VIX
n VIX is used for NSX Edge if the ESXi host is not prepared.
n The NSX Manager gets host credentials from the vCenter Server to connect to the ESXi host first.
n The NSX Manager uses the Edge credentials to log in to the Edge appliance.
VIX Debugging
Check for VIX errors VIX_E_<error> in the NSX Manager logs to narrow down the cause. Look for errors
similar to:
In general, if the same failure occurs for many Edges at the same time, the issue is not on the Edge side.
When you encounter issues, the NSX Manager logs might contain entries similar to:
GMT ERROR taskScheduler-6 PublishTask:963 - Failed to configure VSE-vm index 0, vm-id vm-117,
edge edge-5. Error: RPC request timed out
vmci_tx_err : 0
vmci_closed_by_peer: 8
vmci_tx_no_socket : 0
app_rx : 3648
app_tx : 3649
app_rx_err : 0
app_tx_err : 0
app_conn_req : 1
app_closed_by_peer : 0
app_tx_no_socket : 0
-----------------------
Forwarder Event Channel
vmci_conn : up
app_client_conn : up
vmci_rx : 1143
vmci_tx : 13924
vmci_rx_err : 0
vmci_tx_err : 0
vmci_closed_by_peer: 0
vmci_tx_no_socket : 0
app_rx : 13924
app_tx : 1143
app_rx_err : 0
app_tx_err : 0
app_conn_req : 1
app_closed_by_peer : 0
app_tx_no_socket : 0
-----------------------
cli_rx : 1
cli_tx : 1
cli_tx_err : 0
counters_reset : 0
In the example, the output vmci_closed_by_peer: 8 indicates the number of times the connection
has been closed by the host agent. If this number is increasing and vmci_conn is down, the host
agent cannot connect to the RMQ broker. In the show log follow output, look for repeated errors in the
Edge logs: VmciProxy: [daemon.debug] VMCI Socket is closed by peer
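Because the indicator is a counter that keeps increasing, it takes two snapshots of the output to evaluate it. The following is a minimal sketch, assuming each snapshot contains the counters of a single channel in the "name : value" layout shown above. The function names are hypothetical.

```python
import re

COUNTER_RE = re.compile(r"^\s*(\w+)\s*:\s*(\S+)\s*$")


def parse_counters(block):
    """Parse 'name : value' lines into a dict; numeric values become ints.
    Separator lines and channel titles do not match and are ignored."""
    counters = {}
    for line in block.splitlines():
        match = COUNTER_RE.match(line)
        if match:
            key, value = match.groups()
            counters[key] = int(value) if value.isdigit() else value
    return counters


def host_agent_unreachable(before, after):
    """True when vmci_closed_by_peer grew between the two snapshots
    and vmci_conn is not up in the later one."""
    old, new = parse_counters(before), parse_counters(after)
    grew = new.get("vmci_closed_by_peer", 0) > old.get("vmci_closed_by_peer", 0)
    return grew and new.get("vmci_conn") != "up"


before = "vmci_conn : up\nvmci_closed_by_peer: 8"
after = "vmci_conn : down\nvmci_closed_by_peer: 11"
print(host_agent_unreachable(before, after))
```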
n To check if the ESXi host connects to the RMQ broker, run this command:
Edge Diagnosis
n Check if vmtoolsd is running with this command:
Use the show eventmgr command to verify that the query command is received and processed.
status_ver : 1
status_sys : 5
status_cmd : 0
status_svr_err : 0
status_evt_err : 0
status_sys_err : 0
status_ha_err : 0
status_ver_err : 0
status_cmd_err : 0
evt_report : 1
evt_report_err : 0
hc_report : 10962
hc_report_err : 0
cli_rx : 2
cli_resp : 1
cli_resp_err : 0
counter_reset : 0
---------- Health Status -------------
system status : good
ha state : active
cfg version : 7
generation : 0
server status : 1
syslog-ng : 1
haproxy : 0
ipsec : 0
sslvpn : 0
l2vpn : 0
dns : 0
dhcp : 0
heartbeat : 0
monitor : 0
gslb : 0
---------- System Events -------------
Edge Recovery
If vmtoolsd is not running or the NSX Edge is in a bad state, reboot the edge.
To recover from a crash, a reboot should be sufficient. A redeploy should not be required.
Note: Record all logging information from the old edge when a redeploy is done.
n Either the vmss (VM suspend) or vmsn (VM snapshot) file for the edge VM while it is still in the
crashed state. If there is a vmem file, this is also needed. This can be used to extract a kernel core
dump file, which VMware Support can analyze.
n The Edge support log, generated right after the crashed edge has been rebooted (but not
redeployed). You can also check the edge logs. See https://kb.vmware.com/kb/2079380.
n A screen shot of the Edge console is also helpful, although this does not usually contain the complete
crash report.
n Identity Firewall
NSX Distributed Firewall is a hypervisor kernel-embedded firewall that provides visibility and control for
virtualized workloads and networks. You can create access control policies based on VMware vCenter
objects like datacenters and clusters, virtual machine names and tags, network constructs such as
IP/VLAN/VXLAN addresses, as well as user group identity from Active Directory. Consistent access
control policy is now enforced when a virtual machine gets vMotioned across physical hosts without the
need to rewrite firewall rules. Since Distributed Firewall is hypervisor-embedded, it delivers close to line
rate throughput to enable higher workload consolidation on physical servers. The distributed nature of the
firewall provides a scale-out architecture that automatically extends firewall capacity when additional
hosts are added to a datacenter.
The NSX Manager web application and NSX components on ESXi hosts communicate with each other
through a RabbitMQ broker process that runs on the same virtual machine as the NSX Manager web
application. The communication protocol that is used is AMQP (Advanced Message Queueing Protocol)
and the channel is secured using SSL. On an ESXi host, the VSFWD (vShield Firewall Daemon) process
establishes and maintains the SSL connection to the broker and sends and receives messages on behalf
of other components, which talk to it through IPC.
[Figure: NSX Manager, Controller Cluster, and ESXi Host. Labels shown on the host: Controller Connections,
socket, vsfwd, Core, VXLAN, Routing, User, Kernel, vmlink]
1 Log in to the NSX Manager central CLI using the admin credentials.
b Run the show cluster <clusterID> command to show hosts in a specific cluster.
d Run the show vm <vmID> command to show information for a VM, which includes filter names
and vNIC IDs:
e Note the vNIC ID and run further commands like show dfw vnic <vnicID> and show dfw
host <hostID> filter <filter ID> rules:
ruleset domain-c33_L2 {
# Filter rules
rule 1004 at 1 inout ethertype any from any to any accept;
}
n List of filters
n List of containers
n SpoofGuard details
This command also removes any temporary files on the NSX Manager.
1 Log in to the NSX Manager central CLI using the admin credentials.
Problem
Cause
Validate that each troubleshooting step below is true for your environment. Each step provides
instructions or a link to a document to eliminate possible causes and take corrective action as necessary.
The steps are ordered in the most appropriate sequence to isolate the issue and identify the proper
resolution. After each step, re-attempt to update/publish the Distributed Firewall rules.
Solution
1 Verify that the NSX VIBs are successfully installed on each of the ESXi hosts in the cluster. To do this,
run these commands on each ESXi host in the cluster.
Starting in NSX 6.3.3 with ESXi 6.0 or later, the esx-vxlan and esx-vsip VIBs are replaced with esx-
nsxv.
For example:
# /etc/init.d/vShield-Stateful-Firewall status
vShield-Stateful-Firewall is running
3 Verify that the Message Bus is communicating properly with the NSX Manager.
The vsfwd process is launched automatically by the watchdog script, which restarts the process if it
terminates for an unknown reason. Run this command on each of the ESXi hosts in the cluster.
For example:
# ps | grep vsfwd
The command output should show at least 12 vsfwd processes. If fewer processes are running (most
likely only 2), vsfwd is not running correctly.
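The process-count check in this step can be scripted. The following is a minimal sketch, assuming that the output of ps | grep vsfwd has been saved as text and that each vsfwd process appears on its own line. The function names and the sample line format are illustrative and are not part of NSX.

```python
def count_vsfwd_processes(ps_output: str) -> int:
    """Count the lines of saved `ps` output that belong to a vsfwd process."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the grep process itself, which also mentions "vsfwd".
        if "grep" in fields:
            continue
        if any(f == "vsfwd" or f.endswith("/vsfwd") for f in fields):
            count += 1
    return count


def vsfwd_looks_healthy(ps_output: str, minimum: int = 12) -> bool:
    """Apply the threshold from this step: at least 12 vsfwd processes."""
    return count_vsfwd_processes(ps_output) >= minimum


if __name__ == "__main__":
    # Hypothetical sample: only two vsfwd lines, which indicates a problem.
    sample = "36001 36001 vsfwd /usr/lib/vmware/vsfw/vsfwd\n" * 2
    print(count_vsfwd_processes(sample), vsfwd_looks_healthy(sample))
```

The threshold is a parameter so that the same check can be reused if the expected process count differs in your environment.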
4 Verify that port 5671 is open for communication in the firewall configuration.
This command shows the VSFWD connectivity to the RabbitMQ broker. Run this command on ESXi
hosts to see a list of connections from the vsfwd process on the ESXi host to the NSX Manager.
Ensure that port 5671 is open for communication in every external firewall in the
environment. There should be at least two connections on port 5671. There can be more
connections on port 5671 if NSX Edge virtual machines are deployed on the ESXi host, because they
also establish connections to the RMQ broker.
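To confirm that an intermediate firewall is not blocking port 5671, you can attempt a plain TCP connection from a machine on the same network path. The following is a minimal sketch using only the Python standard library. The function name is illustrative, and the check covers TCP reachability only. It does not validate the SSL session or the RabbitMQ credentials.

```python
import socket


def tcp_port_open(host: str, port: int = 5671, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection performs the TCP handshake and raises OSError
        # (refused, unreachable, or timed out) when the port is not reachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # "nsx-manager.example.com" is a placeholder for your NSX Manager address.
    print(tcp_port_open("nsx-manager.example.com"))
```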
For example:
# esxcfg-advcfg -g /UserVars/RmqIpAddress
6 If you are using a host profile for this ESXi host, verify that the RabbitMQ configuration is not set in
the host profile.
See:
n https://kb.vmware.com/kb/2092871
n https://kb.vmware.com/kb/2125901
7 Verify if the RabbitMQ credentials of the ESXi host are out of sync with the NSX Manager. Download
the NSX Manager Tech Support Logs. After gathering all the NSX Manager Tech Support logs,
search all the logs for entries similar to:
8 If such entries are found in the logs for the suspected ESXi host, resynchronize the message bus.
To resynchronize the message bus, use the following REST API call. To better understand the issue,
collect the logs immediately after the message bus is resynchronized.
POST https://NSX_Manager_IP/api/2.0/nwfabric/configure?action=synchronize
Request Body:
<nwFabricFeatureConfig>
<featureId>com.vmware.vshield.vsm.messagingInfra</featureId>
<resourceConfig>
<resourceId>{HOST/CLUSTER MOID}</resourceId>
</resourceConfig>
</nwFabricFeatureConfig>
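The request body above can be generated rather than typed by hand, which avoids malformed XML. The following is a minimal sketch that builds the body and the URL only and does not send anything. The function names are illustrative, and "host-21" is a hypothetical host MOID.

```python
import xml.etree.ElementTree as ET

FEATURE_ID = "com.vmware.vshield.vsm.messagingInfra"


def build_resync_body(resource_moid: str) -> bytes:
    """Build the nwFabricFeatureConfig body for a host or cluster MOID."""
    root = ET.Element("nwFabricFeatureConfig")
    ET.SubElement(root, "featureId").text = FEATURE_ID
    resource_config = ET.SubElement(root, "resourceConfig")
    ET.SubElement(resource_config, "resourceId").text = resource_moid
    return ET.tostring(root, encoding="utf-8")


def resync_url(nsx_manager: str) -> str:
    """Return the synchronize URL shown in this step."""
    return f"https://{nsx_manager}/api/2.0/nwfabric/configure?action=synchronize"


if __name__ == "__main__":
    print(resync_url("NSX_Manager_IP"))
    print(build_resync_body("host-21").decode("utf-8"))
```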
9 Use the export host-tech-support <host-id> scp <uid@ip:/path> command to gather host-
specific firewall logs.
For example:
10 Use the show dfw host host-id summarize-dvfilter command to verify that the firewall rules
are deployed on a host and are applied to virtual machines.
In the output, module: vsip shows that the DFW module is loaded and running. The name shows
the firewall that is running on each vNic.
You can get the host IDs by running the show dfw cluster all command to get the cluster domain
IDs, followed by the show dfw cluster domain-id command to get the host IDs.
For example:
Fastpaths:
agent: dvfilter-faulter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter
agent: ESXi-Firewall, refCount: 5, rev: 0x1010000, apiRev: 0x1010000, module: esxfw
agent: dvfilter-generic-vmware, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-generic-fastpath
agent: dvfilter-generic-vmware-swsec, refCount: 4, rev: 0x1010000, apiRev: 0x1010000, module: dvfilter-switch-security
agent: bridgelearningfilter, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: vdrb
agent: dvfg-igmp, refCount: 1, rev: 0x1010000, apiRev: 0x1010000, module: dvfg-igmp
agent: vmware-sfw, refCount: 4, rev: 0x1010000, apiRev: 0x1010000, module: vsip
Slowpaths:
Filters:
world 342296 vmm0:2-vm_RHEL63_srv_64-shared-846-3f435476-8f54-4e5a-8d01-59654a4e9979 vcUuid:'3f 43 54 76 8f 54 4e 5a-8d 01 59 65 4a 4e 99 79'
port 50331660 2-vm_RHEL63_srv_64-shared-846-3f435476-8f54-4e5a-8d01-59654a4e9979.eth1
vNic slot 2
name: nic-342296-eth1-vmware-sfw.2
agentName: vmware-sfw
state: IOChain Attached
vmState: Detached
failurePolicy: failClosed
slowPathID: none
filter source: Dynamic Filter Creation
vNic slot 1
name: nic-342296-eth1-dvfilter-generic-vmware-swsec.1
agentName: dvfilter-generic-vmware-swsec
state: IOChain Attached
vmState: Detached
failurePolicy: failClosed
slowPathID: none
filter source: Alternate Opaque Channel
port 50331661 (disconnected)
vNic slot 2
name: nic-342296-eth2-vmware-sfw.2 <======= DFW filter
agentName: vmware-sfw
state: IOChain Detached
vmState: Detached
failurePolicy: failClosed
slowPathID: none
filter source: Dynamic Filter Creation
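On hosts with many vNICs, the summarize-dvfilter output is long, and DFW filters that are not attached are easy to miss. The following is a minimal sketch that extracts each filter's name, agent, and state from saved output. It assumes the "key: value" layout shown in the example above, and the function names are illustrative.

```python
from typing import Dict, List

WANTED_KEYS = ("agentName", "state", "vmState", "failurePolicy")


def parse_dvfilters(output: str) -> List[Dict[str, str]]:
    """Return one dictionary per filter found in summarize-dvfilter output."""
    filters: List[Dict[str, str]] = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("name:"):
            # Drop trailing annotations such as "<======= DFW filter".
            name = line[len("name:"):].split("<")[0].strip()
            current = {"name": name}
            filters.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if key in WANTED_KEYS:
                current[key] = value.strip()
    return filters


def detached_dfw_filters(output: str) -> List[str]:
    """List vmware-sfw filters whose state is not 'IOChain Attached'."""
    return [
        f["name"]
        for f in parse_dvfilters(output)
        if f.get("agentName") == "vmware-sfw"
        and f.get("state") != "IOChain Attached"
    ]
```

A detached filter is not necessarily an error. In the example above, the detached filter belongs to a disconnected port.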
11 Run the show dfw host hostID filter filterID rules command.
For example:
ruleset domain-c33 {
# Filter rules
rule 1012 at 1 inout protocol any from addrset ip-securitygroup-10 to addrset ip-securitygroup-10 drop with log;
rule 1013 at 2 inout protocol any from addrset src1013 to addrset src1013 drop;
rule 1011 at 3 inout protocol tcp from any to addrset dst1011 port 443 accept;
rule 1011 at 4 inout protocol icmp icmptype 8 from any to addrset dst1011 accept;
rule 1010 at 5 inout protocol tcp from addrset ip-securitygroup-10 to addrset ip-securitygroup-11 port 8443 accept;
rule 1010 at 6 inout protocol icmp icmptype 8 from addrset ip-securitygroup-10 to addrset ip-securitygroup-11 accept;
rule 1009 at 7 inout protocol tcp from addrset ip-securitygroup-11 to addrset ip-securitygroup-12 port 3306 accept;
rule 1009 at 8 inout protocol icmp icmptype 8 from addrset ip-securitygroup-11 to addrset ip-securitygroup-12 accept;
rule 1003 at 9 inout protocol ipv6-icmp icmptype 136 from any to any accept;
rule 1003 at 10 inout protocol ipv6-icmp icmptype 135 from any to any accept;
rule 1002 at 11 inout protocol udp from any to any port 67 accept;
rule 1002 at 12 inout protocol udp from any to any port 68 accept;
rule 1001 at 13 inout protocol any from any to any accept;
}
ruleset domain-c33_L2 {
# Filter rules
rule 1004 at 1 inout ethertype any from any to any accept;
}
For example:
addrset dst1011 {
ip 172.16.10.10,
ip 172.16.10.11,
ip 172.16.10.12,
ip fe80::250:56ff:feae:3e3d,
ip fe80::250:56ff:feae:f86b,
}
addrset ip-securitygroup-10 {
ip 172.16.10.11,
ip 172.16.10.12,
ip fe80::250:56ff:feae:3e3d,
ip fe80::250:56ff:feae:f86b,
}
addrset ip-securitygroup-11 {
ip 172.16.20.11,
ip fe80::250:56ff:feae:23b9,
}
addrset ip-securitygroup-12 {
ip 172.16.30.11,
ip fe80::250:56ff:feae:d42b,
}
addrset src1013 {
ip 172.16.10.12,
ip 172.17.10.11,
ip fe80::250:56ff:feae:cf88,
ip fe80::250:56ff:feae:f86b,
}
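To check whether a virtual machine's IP address is present in the address sets that a rule references, you can parse saved address set output. The following is a minimal sketch, assuming the "addrset name { ip address, }" layout shown above. The function names are illustrative.

```python
import ipaddress
import re
from typing import Dict, List, Set

ADDRSET_START = re.compile(r"addrset\s+(\S+)\s*\{")


def parse_addrsets(output: str) -> Dict[str, Set]:
    """Map each address set name to the set of IP addresses it contains."""
    addrsets: Dict[str, Set] = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = ADDRSET_START.match(line)
        if match:
            current = addrsets.setdefault(match.group(1), set())
        elif line.startswith("}"):
            current = None
        elif current is not None and line.startswith("ip "):
            # Entries look like "ip 172.16.10.11," with a trailing comma.
            current.add(ipaddress.ip_address(line[3:].rstrip(",").strip()))
    return addrsets


def sets_containing(addrsets: Dict[str, Set], ip: str) -> List[str]:
    """Return the sorted names of the address sets that contain the IP."""
    target = ipaddress.ip_address(ip)
    return sorted(name for name, members in addrsets.items() if target in members)
```

Comparing parsed ipaddress objects instead of strings avoids false mismatches between different textual forms of the same IPv6 address.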
13 If you have validated each of the above troubleshooting steps and cannot publish firewall rules to the
host virtual machines, execute a host-level force synchronization via the NSX Manager UI or via the
following REST API call.
URL : https://<nsx-mgr-ip>/api/4.0/firewall/forceSync/<host-id>
HTTP Method : POST
Headers:
Authorization : Basic <Base64-encoded username:password>
Accept : application/xml
Content-Type : application/xml
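The Authorization header uses HTTP Basic authentication, which is the Base64 encoding of the user name and password joined by a colon. The following is a minimal sketch that builds the header and the request object with the Python standard library and does not send the request. The function names, "host-21", and the credentials are placeholders.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value for an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_force_sync_request(nsx_manager: str, host_id: str,
                             username: str, password: str) -> urllib.request.Request:
    """Build (but do not send) the host-level force sync POST request."""
    url = f"https://{nsx_manager}/api/4.0/firewall/forceSync/{host_id}"
    headers = {
        "Authorization": basic_auth_header(username, password),
        "Accept": "application/xml",
        "Content-Type": "application/xml",
    }
    return urllib.request.Request(url, data=b"", headers=headers, method="POST")


if __name__ == "__main__":
    request = build_force_sync_request("nsx-mgr-ip", "host-21", "admin", "secret")
    print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen also requires that the NSX Manager certificate is trusted by the client. That part is left out of the sketch.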
Notes:
n Ensure that VMware Tools is running on the virtual machines if firewall rules do not use IP addresses.
For more information, see https://kb.vmware.com/kb/2084048.
VMware NSX 6.2.0 introduced the option to discover the virtual machine IP address using DHCP
snooping or ARP snooping. These new discovery mechanisms enable NSX to enforce IP address-
based security rules on virtual machines that do not have VMware Tools installed. For more
information, see the NSX 6.2.0 Release Notes.
DFW is activated as soon as the host preparation process is completed. If a virtual machine needs no
DFW service at all, add it to the exclusion list (by default, NSX Manager, NSX
Controllers, and Edge Services Gateways are automatically excluded from DFW). vCenter Server
access can be blocked after you create a Deny All rule in DFW. For
more information, see https://kb.vmware.com/kb/2079620.
n When troubleshooting VMware NSX 6.x Distributed Firewall (DFW) with VMware Technical Support,
provide the following:
n Output of the show dfw host hostID summarize-dvfilter command from each ESXi
host in the cluster.
n The Distributed Firewall configuration. On the Networking and Security > Firewall > General tab,
click Export Configuration. This exports the Distributed Firewall configuration in XML
format.
Identity Firewall
Problem
Cause
User-based distributed firewall rules are determined by membership in an Active Directory (AD) group.
IDFW monitors where Active Directory users are logged in and maps each login to an IP
address, which DFW uses to apply firewall rules. IDFW requires the Guest Introspection
framework, Active Directory event log scraping, or both.
Solution
1 Make sure that the Active Directory server full/delta sync is working on the NSX Manager.
a In the vSphere Web Client, log in to the vCenter linked to the NSX Manager.
b Navigate to Home > Networking & Security > NSX Managers, and then select your NSX
Manager from the list.
c Choose the Manage tab, then the Domains tab. Select your domain from the list. Verify that the
Last Synchronization Status column displays SUCCESS and the Last Synchronization Time
is current.
2 If your firewall environment uses the event log scraping method of login detection, follow these steps
to verify that you have configured an event log server for your domain:
a In the vSphere Web Client, log in to the vCenter linked to the NSX Manager.
b Navigate to Home > Networking & Security > NSX Managers, and then select your NSX
Manager from the list.
c Choose the Manage tab and then the Domains tab. Select your domain from the list. Here you
can view and edit the detailed domain configuration.
d Select Event Log Servers from the domain details and verify that your Event Log Server is
added.
e Select your Event Log Server, and verify that the Last Sync Status column displays SUCCESS
and the Last Sync Time is current.
3 If your firewall environment uses Guest Introspection, the framework must be deployed to the
compute clusters where your IDFW-protected VMs reside. The Service Health Status in the UI
should be green. Guest Introspection diagnostic information is available in the following Knowledge
Base articles: Troubleshooting vShield Endpoint / NSX Guest Introspection
(https://kb.vmware.com/kb/2094261) and Collecting logs in VMware NSX for vSphere 6.x Guest
Introspection Universal Service Virtual Machine (https://kb.vmware.com/kb/2144624).
4 After verifying the correct configuration of your logon detection method, ensure that the NSX Manager
is receiving logon events:
b Run the following command to query for login events. Verify your user is returned in the results.
GET https://<nsxmgr-ip>/1.0/identity/userIpMapping.
Example output:
<UserIpMappings>
<UserIpMapping>
<ip>50.1.111.192</ip>
<userName>user1_group20</userName>
<displayName>user1_group20</displayName>
<domainName>cd.ad1.db.com</domainName>
<startTime class="sql-timestamp">2017-05-11 22:30:51.0</startTime>
<startType>EVENTLOG</startType>
<lastSeenTime class="sql-timestamp">2017-05-11 22:30:52.0</lastSeenTime>
<lastSeenType>EVENTLOG</lastSeenType>
</UserIpMapping>
</UserIpMappings>
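When the response contains many mappings, you can filter it for a single user. The following is a minimal sketch that parses a saved response in the format of the example output above. The function name is illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List


def ips_for_user(xml_text: str, user_name: str) -> List[str]:
    """Return the IP addresses mapped to user_name in a UserIpMappings document."""
    root = ET.fromstring(xml_text)
    return [
        mapping.findtext("ip")
        for mapping in root.iter("UserIpMapping")
        if mapping.findtext("userName") == user_name
    ]
```

An empty list for a user who is logged in suggests that the NSX Manager is not receiving logon events for that user.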
5 Verify that your security group is used in a firewall rule, or has an assigned security policy. Security
group processing in IDFW will not take place unless one of these conditions is true.
6 After verifying that IDFW is detecting logons correctly, verify that the ESXi host where your desktop
VM resides is receiving the correct configuration. These steps will use the NSX Manager central CLI.
To check the desktop VM IP address populated in the ip-securitygroup list:
a See CLI Commands for DFW to retrieve the filter name applied on the desktop VM.
b Run the show dfw host hostID filter filterID rules command to view the DFW
rules for that filter.
c Run the show dfw host hostID filter filterID addrsets command to view the IP
address populated in the ip-securitygroup list. Verify that your IP is displayed in the list.
Note: When troubleshooting Identity Firewall (IDFW) with VMware Technical Support, this data is helpful:
n Number of forests
n Number of users per forest
n Number of users per domain
n Number of domain controllers
n Number of VDI desktops per vCenter Server
n Number of hosts per vCenter Server
Before you begin troubleshooting and configuration verification, get an accurate description of the error,
create a topology map that shows the client, virtual server, and backend server, and understand the
application requirements. For example, a client that cannot connect is a different problem from random
session errors after connection. When troubleshooting the load balancer, always start by verifying connectivity.
[Figure: one-armed load balancer packet flow. Phase 3: IP DST 192.168.1.20, SRC 192.168.1.3; TCP DST 4099, SRC 80. Phase 4: IP DST 172.30.40.7, SRC 192.168.1.20; TCP DST 1025, SRC 80. Labels: SLB 192.168.1.20 (TCP 80), selected server 192.168.1.3, VLAN, logical switch (VXLAN), 192.168.1.1.]
In proxy mode, the load balancer uses its own IP address as the source address to send requests to a
backend server. The backend server views all traffic as being sent from the load balancer and responds
to the load balancer directly. This mode is also called SNAT mode or non-transparent mode. For more
information, refer to NSX Administration Guide.
A typical NSX one-armed load balancer is deployed on the same subnet as its backend servers, apart
from the logical router. The NSX load balancer virtual server listens on a virtual IP for incoming requests
from clients and dispatches the requests to backend servers. For the return traffic, reverse NAT is required
to change the source IP address from the backend server address to the virtual IP (VIP) address before the
response is sent to the client. Without this operation, the connection to the client would break.
After the ESG receives the traffic, it performs two operations: Destination Network Address Translation
(DNAT) to change the VIP address to the IP address of one of the load balanced machines, and Source
Network Address Translation (SNAT) to exchange the client IP address with the ESG IP address.
The ESG then sends the traffic to the load-balanced server, and the load-balanced server sends
the response back to the ESG, which returns it to the client. This option is much easier to configure than
inline mode, but has two potential caveats. The first is that this mode requires a dedicated ESG,
and the second is that the load-balanced servers are not aware of the original client IP address. One
workaround for HTTP/HTTPS applications is to enable Insert X-Forwarded-For in the HTTP application
profile so that the client IP address is carried in the X-Forwarded-For HTTP header in the request
sent to the backend server.
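The address rewriting in proxy mode can be illustrated with a small model. The following is a conceptual sketch only and is not how the ESG implements NAT. The Packet class, the function names, and all addresses and ports are example values chosen for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def to_backend(request: Packet, server_ip: str, lb_ip: str, lb_port: int) -> Packet:
    """DNAT the VIP to the selected server and SNAT the client to the load balancer."""
    return replace(request, src_ip=lb_ip, src_port=lb_port, dst_ip=server_ip)


def to_client(reply: Packet, vip: str, client_ip: str, client_port: int) -> Packet:
    """Reverse NAT: the reply leaves with the VIP as source, addressed to the client."""
    return replace(reply, src_ip=vip, dst_ip=client_ip, dst_port=client_port)


if __name__ == "__main__":
    # The client sends a request to the VIP.
    request = Packet("172.30.40.7", 1025, "192.168.1.20", 80)
    # The backend server sees the load balancer, not the client, as the source.
    forwarded = to_backend(request, "192.168.1.3", "192.168.1.20", 4099)
    # The server replies to the load balancer, which restores the original addresses.
    reply = Packet(forwarded.dst_ip, forwarded.dst_port,
                   forwarded.src_ip, forwarded.src_port)
    print(forwarded)
    print(to_client(reply, "192.168.1.20", request.src_ip, request.src_port))
```

The model shows why the backend server cannot see the client address in this mode: the client address is replaced before the request reaches the server and is restored only on the return path.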
If client IP address visibility is required on the backend server for applications other than HTTP/HTTPS,
you can configure the IP pool to be transparent. If clients are not on the same subnet as the
backend server, inline mode is recommended. Otherwise, you must use the load balancer IP address as
the default gateway of the backend server.
n Inline/transparent mode
In DSR mode, the backend server responds directly to the client. Currently, NSX load balancer does not
support DSR.
Procedure
1 As an example, configure a one-armed virtual server with SSL offload. Create a certificate by
double-clicking the Edge and then selecting Manage > Settings > Certificate.
2 Enable the load balancer service by selecting Manage > Load Balancer > Global Configuration >
Edit.
3 Create an HTTPS application profile by selecting Manage > Load Balancer > Application Profiles.
Note The screenshot above uses self-signed certificates for documentation purposes only.
4 Optionally, click Manage > Load Balancer > Service Monitoring and edit the default service
monitoring to change it from basic HTTP/HTTPS to specific URL/URIs, as required.
5 Create server pools by selecting Manage > Load Balancer > Pools.
To use SNAT mode, leave the Transparent check box unchecked in the pool configuration.
6 Optionally, click Manage > Load Balancer > Pools > Show Pool Statistics to check the status.
7 Create a virtual server by selecting Manage > Load Balancer > Virtual Servers.
If you would like to use the L4 load balancer for UDP or higher-performance TCP, check Enable
Acceleration. If you check Enable Acceleration, make sure that the firewall status is Enabled on
the load balancer NSX Edge, because a firewall is required for L4 SNAT.
8 Optionally, if using an application rule, check the configuration in Manage > Load Balancer >
Application Rules.
9 If using an application rule, ensure that the application rule is associated with the virtual server in
Manage > Load Balancer > Virtual Servers > Advanced.
In non-transparent mode, the backend server cannot see the client IP, but can see the load balancer
internal IP address. As a workaround for HTTP/HTTPS traffic, check Insert X-Forwarded-For HTTP
header. With this option checked, the Edge load balancer adds the header "X-Forwarded-For" with
the value of the client source IP address.
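On the backend server, the application must read the header to recover the client address. The following is a minimal sketch of that lookup, assuming the request headers are available as a dictionary. The function name and the addresses are illustrative.

```python
from typing import Mapping


def client_ip(headers: Mapping[str, str], peer_ip: str) -> str:
    """Return the client address from X-Forwarded-For, or the peer address if absent."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    forwarded = normalized.get("x-forwarded-for", "")
    # The header can carry a comma-separated list; the first entry is the client.
    first_hop = forwarded.split(",")[0].strip()
    return first_hop or peer_ip


if __name__ == "__main__":
    print(client_ip({"X-Forwarded-For": "172.30.40.7"}, "192.168.1.20"))
```

Because a client can send its own X-Forwarded-For header, the value should be trusted only when the request arrives from the load balancer address.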
[Figure: load balancer troubleshooting flowchart. Boxes: Cannot connect to virtual server; Verify topology meets requirements; Can you ping the virtual server? (Yes/No); Check the status of the pool members (Down); Verify pool monitors and ensure members are up; Fix issue affecting gateway and Edge Services Gateway topology (toggle transparent); Verify application rules; If the issue persists without application rules, switch to accelerated; Simplify the application virtual server.]
Problem
Solution
After understanding what should be functioning and defining a problem, verify the configuration through
the UI as follows.
Prerequisites
n The topology that was intended - inline or one-armed. For details, refer to the Logical Load
Balancer topic in NSX Administration Guide.
n Verify the trace route and use other network connectivity tools to see that the packets are going to
the correct location (edge services gateway).
n Define the problem that you are facing. For example, DNS records for the virtual server are
correct, but you are not getting back any content, or incorrect content, and so on.
Procedure
1 Verify the following application requirements: protocols required to be supported on the load
balancer (TCP, UDP, HTTP, HTTPS), ports, persistence requirements, and pool members.
n Are the load balancer and firewall enabled, and does the edge services gateway have proper
routes?
n What IP address, port and protocol should the virtual server be listening to?
n Is SSL offload being used? Do you need to use SSL when communicating with the backend
servers?
n What is the topology? The NSX load balancer needs to parse all the traffic from the client and
the server.
n Is the NSX load balancer inline or is the client source address translated to ensure return
traffic travels back to the load balancer?
2 Navigate to the NSX Edge, and verify the configurations that are required to enable load
balancing and allow traffic to flow as follows:
b Verify that the firewall is Enabled. The firewall must be enabled for accelerated virtual servers.
Non-accelerated TCP and L7 HTTP/HTTPS VIPs must have a policy that allows traffic. Note
that the firewall filters do not impact accelerated virtual servers.
c Verify that the NAT rules are created for the virtual server. On the NAT tab, click the Hide
internal rules or Unhide internal rules link to verify.
Note If you have load balancing enabled and services configured, but have not configured
any NAT rules, it means that the auto rule configuration was not enabled.
d You can change the auto rule configurations. For details, refer to Change Auto Rule
Configuration topic in the NSX Administration Guide. When an NSX edge services gateway is
deployed, you have the option to configure auto rule configuration. If this option was not
selected while deploying the edge services gateway, you must enable it for the load balancer
to function correctly. Check the pool member status through the UI.
e Verify routing, and verify that the edge services gateway has a default route or a static route
to your client systems and the backend servers. If there is no route to the servers, health
check will not pass. If you are using a dynamic routing protocol you may have to use the CLI.
For more information, refer to NSX Routing CLI.
a Verify default route.
interface in the subnet. Many times the application servers are connected to these servers.
c Verify static routes from the Routing tab > Static Routes.
a Double-click an NSX Edge and navigate to Manage > Settings > Interfaces. Verify that the IP
address for the virtual server is added to an interface.
b Verify the virtual server has the proper IP address, port(s) and protocols configured to support
the application.
a Verify the application profile used by the virtual server.
c Verify that the application profile uses a supported persistence method, type (protocol), and SSL
(if necessary). If using SSL, ensure that you are using a certificate with the correct name and
expiration date.
e Check whether a client certificate is required but the clients are not configured with one. Also, check
whether you have selected a cipher list that is too narrow (for example, for clients using older
browsers).
a Verify the pool status. At least one member must be up to serve traffic, but one member may
not be enough to serve all the traffic. If zero or only a limited number of pool members are up, try
to rectify the problem as described in the next steps.
b Verify that the topology is correct. SNAT of client traffic is controlled in the pool configuration. If the
edge services gateway hosting the load balancer function is not inline to see all the traffic,
load balancing fails. To preserve the client source IP address, select Transparent mode. For
information, refer to the NSX Administration Guide.
5 If you are using application rules, verify the rules. Remove the rules if necessary to see if traffic
flows.
a Reorder the rules to see if the order of the rules is causing the logic to interrupt the traffic
flow. For information on how to add an application rule and view application rule examples,
see the Add an Application Rule topic in NSX Administration Guide.
What to do next
If you could not find the problem, you may need to use the CLI (Command Line Interface) to find out what
is happening. For more information, refer to Load Balancer Troubleshooting Using the CLI.
Problem
Solution
1 Enable or verify you can SSH to the virtual appliance. The edge services gateway is a virtual
appliance that has the option to enable SSH while deploying. If you need to enable SSH, select the
required appliance, and in the Actions menu, click Change CLI Credentials.
2 The edge services gateway has multiple show commands to look at the run time state, and the
configuration state. Use the commands to show configuration and statistics information.
3 For load balancing and NAT to function correctly, the firewall must be enabled. Use the show
firewall command. If you do not see any meaningful output using the command, refer to the Load
Balancer Configuration Verification and Troubleshooting Using the UI section.
4 Load balancer requires NAT to function correctly. Use the show nat command. If you do not see any
meaningful output using the command, refer to the Load Balancer Configuration Verification and
Troubleshooting Using the UI section.
5 In addition to the firewall being enabled and the load balancer having NAT rules, you should also
make sure the load balancing process is enabled. Use the show service loadbalancer command
to check the load balancer engine status (L4/L7).
L7 Loadbalancer : running
-----------------------------------------------------------------------
L7 Loadbalancer Statistics:
STATUS PID MAX_MEM_MB MAX_SOCK MAX_CONN MAX_PIPE CUR_CONN CONN_RATE CONN_RATE_LIMIT MAX_CONN_RATE
running 1580 0 2081 1024 0 0 0 0 0
-----------------------------------------------------------------------
L4 Loadbalancer Statistics:
MAX_CONN ACT_CONN INACT_CONN TOTAL_CONN
0 0 0 0
a Use the show service loadbalancer session command to view the load balancer session
table. You will see sessions if there is traffic on the system.
-----------------------------------------------------------------------
L4 Loadbalancer Statistics:
MAX_CONN ACT_CONN INACT_CONN TOTAL_CONN
0 0 0 0
b Check the show service loadbalancer command to view the load balancer Layer 7 sticky-
table status. Note that this table does not display information on accelerated virtual servers.
6 If all the required services are running properly, look at the routing table. You need a route
to the client and to the servers. Use the show ip route and show ip forwarding commands,
which map routes to the interfaces.
7 Make sure that you have an ARP entry for the systems, such as the gateway or next hop, and the
backend servers using the show arp command.
8 The logs can help you find the traffic and diagnose issues. Use the show
log or show log follow commands to tail the log. Note that you must
be running the load balancer with Logging enabled, and set to Info or Debug.
9 After verifying that the basic services are running with proper paths to the clients, look at what is
happening in the application layer. Use the show service loadbalancer pool command to view
the load balancer pool status (L4/L7). One pool member must be up to serve content, and usually
more than one is needed because the volume of requests exceeds the capacity of a single workload. If the
health monitor is provided by the built-in health check, the output displays the last state change time and
the failure reason when a health check fails. If the health monitor is provided by the monitor service, the
last check time is also displayed in addition to these two outputs.
POOL Web-Tier-Pool-01
| LB METHOD round-robin
| LB PROTOCOL L7
| Transparent disabled
| SESSION (cur, max, total) = (0, 0, 0)
| BYTES in = (0), out = (0)
+->POOL MEMBER: Web-Tier-Pool-01/web-01a, STATUS: UP
| | HEALTH MONITOR = BUILT-IN, default_https_monitor:L7OK
| | | LAST STATE CHANGE: 2016-05-16 07:02:00
| | SESSION (cur, max, total) = (0, 0, 0)
| | BYTES in = (0), out = (0)
+->POOL MEMBER: Web-Tier-Pool-01/web-02a, STATUS: UP
| | HEALTH MONITOR = BUILT-IN, default_https_monitor:L7OK
| | | LAST STATE CHANGE: 2016-05-16 07:02:01
| | SESSION (cur, max, total) = (0, 0, 0)
| | BYTES in = (0), out = (0)
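For pools with many members, you can extract the member status from saved pool output instead of reading it line by line. The following is a minimal sketch, assuming the "POOL MEMBER: pool/member, STATUS: value" line format shown in the example above. The names are illustrative.

```python
import re
from typing import Dict, Tuple

MEMBER_LINE = re.compile(
    r"POOL MEMBER:\s*(?P<pool>[^/\s]+)/(?P<member>[^,\s]+),\s*STATUS:\s*(?P<status>\w+)"
)


def member_status(output: str) -> Dict[Tuple[str, str], str]:
    """Map (pool name, member name) to the STATUS value reported for that member."""
    return {
        (m.group("pool"), m.group("member")): m.group("status")
        for m in MEMBER_LINE.finditer(output)
    }


def members_not_up(output: str) -> list:
    """Return the sorted (pool, member) pairs whose status is anything other than UP."""
    return sorted(key for key, status in member_status(output).items() if status != "UP")
```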
10 Check the service monitor status (OK, WARNING, CRITICAL) to see the health of all the configured
backend servers.
For the show service loadbalancer monitor command, three types of health monitor values
are displayed in the CLI output:
n Built-in: Health check is enabled and is performed by L7 engine (HAProxy).
n Monitor Service: Health check is enabled and is performed by monitor service engine (NAGIOS).
The monitor service running status can be checked with show service monitor and show
service monitor service CLI commands. The Status field should be OK, WARNING or
CRITICAL.
The last column of the output is the health status of the pool member. The following statuses are displayed:
11 When the error code is L4TOUT/L4CON, the cause is usually a connectivity issue on the underlying
network. A duplicate IP address is often the root cause. When this error happens,
troubleshoot as follows:
a When HA is enabled, check the High Availability (HA) status of the edges by using the show
service highavailability command on both edges. Check whether the HA link is DOWN and
all the edges are Active, to make sure that there is no duplicate edge IP address on the network.
b Check the edge ARP table by using the show arp command, and verify whether the backend server's ARP
entry changes between two MAC addresses.
c Check the backend server ARP table or use the arp-ping command, and check whether any other
machine has the same IP address as the edge.
12 Check the load balancer object statistics (VIPs, pools, members). Look at the specific pool and verify
that the members are up and running. Check if the transparent mode is enabled. If yes, the edge
services gateway should be inline between the client and the server. Verify if the servers are showing
session counter increments.
13 Look at the virtual server and verify that it has a default pool and that the pool is bound to it.
If you use pools via application rules, you need to look at the specific pools as shown in the show
service loadbalancer pool command. Specify the name of the virtual server.
-----------------------------------------------------------------------
Loadbalancer VirtualServer Statistics:
VIRTUAL Web-Tier-VIP-01
| ADDRESS [172.16.10.10]:443
| SESSION (cur, max, total) = (0, 0, 0)
| RATE (cur, max, limit) = (0, 0, 0)
| BYTES in = (0), out = (0)
+->POOL Web-Tier-Pool-01
| LB METHOD round-robin
| LB PROTOCOL L7
| Transparent disabled
| SESSION (cur, max, total) = (0, 0, 0)
| BYTES in = (0), out = (0)
+->POOL MEMBER: Web-Tier-Pool-01/web-01a, STATUS: UP
| | HEALTH MONITOR = BUILT-IN, default_https_monitor:L7OK
| | | LAST STATE CHANGE: 2016-05-16 07:02:00
14 If everything appears to be configured correctly and you still have an error, capture traffic to
understand what is going on. There are two connections: the client to the virtual server, and the edge
services gateway to the backend pool (with or without the transparent configuration at the pool level).
The show ip forwarding command lists the vNic interfaces, and you can use that data.
For example, assume the client computer is on vNic_0 and the server on vNic_1. The client IP
address is 192.168.1.2, and the VIP is 192.168.2.2 running on port 80. The load balancer interface IP is
192.168.3.1, and the backend server IP is 192.168.3.3. There are two different packet capture
commands: one displays the packets, whereas the other captures the packets to a file that you can
download. Capture the packets to detect abnormal load balancer failures. You can capture packets
from two directions:
#debug packet capture interface interface-name [filter using _ for space] - creates a packet
capture file that you can download
#debug packet display interface interface-name [filter using _ for space] - outputs packet data to
the console
#debug show files - lists the packet capture files
#debug copy scp user@url:path file-name/all - downloads the packet capture
For example:
n Capture on vNIC_0: debug packet display interface vNic_0
n Capture on all interfaces: debug packet display interface any
n Capture on vNIC_0 with a filter: debug packet display interface vNic_0
host_192.168.11.3_and_host_192.168.11.41
n A packet capture of the client to virtual server traffic: #debug packet display|capture
interface vNic_0 host_192.168.1.2_and_host_192.168.2.2_and_port_80
n A packet capture between the edge services gateway and the server where the pool is in
transparent mode: #debug packet display|capture interface vNic_1
host_192.168.1.2_and_host_192.168.3.3_and_port_80
n A packet capture between the edge services gateway and the server where the pool is not in
transparent mode: #debug packet display|capture interface vNic_1
host_192.168.3.1_and_host_192.168.3.3_and_port_80
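Because the filter must use underscores in place of spaces, it is easy to mistype by hand. The following is a minimal sketch that converts an ordinary filter expression into that form and assembles the command string. The function names are illustrative, and the sketch only builds text. It does not run anything on the edge.

```python
def capture_filter(expression: str) -> str:
    """Replace each run of whitespace in a filter expression with one underscore."""
    return "_".join(expression.split())


def capture_command(interface: str, expression: str, to_file: bool = False) -> str:
    """Build a 'debug packet display' or 'debug packet capture' command line."""
    verb = "capture" if to_file else "display"
    return f"debug packet {verb} interface {interface} {capture_filter(expression)}"


if __name__ == "__main__":
    print(capture_command("vNic_0", "host 192.168.1.2 and host 192.168.2.2 and port 80"))
```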
The following issues are common when using NSX load balancing:
n Load balancing on TCP port (for example, port 443) does not work.
n Verify the virtual server IP address is reachable with ping, or look at the upstream router to
ensure the ARP table is populated.
n Capture packets.
n Verify the server is in the pool, enabled, and monitor health status.
n Verify the pool and persistence configuration. If you have persistence configured and you are
using a small number of clients, you may not see even distribution of connections to backend
pool members.
n Ensure the application server is able to respond to the specified health probe.
n If the application works with only one server in the pool (and not two), it is most likely a
persistence problem.
Basic Troubleshooting
1 Check the load balancer configuration status in the vSphere Web Client:
2 Before troubleshooting the load balancer service, run the following command on the NSX Manager to
ensure that the service is up and running:
L7 Loadbalancer : running
-----------------------------------------------------------------------
L7 Loadbalancer Statistics:
STATUS PID MAX_MEM_MB MAX_SOCK MAX_CONN MAX_PIPE CUR_CONN CONN_RATE CONN_RATE_LIMIT MAX_CONN_RATE
running 1580 0 2081 1024 0 0 0 0 0
-----------------------------------------------------------------------
L4 Loadbalancer Statistics:
MAX_CONN ACT_CONN INACT_CONN TOTAL_CONN
0 0 0 0
Note You can run show edge all to look up the names of the NSX Edges.
1 Change the Edge logging level in NSX Manager from INFO to TRACE or DEBUG using this REST
API call.
URL: https://NSX_Manager_IP/api/1.0/services/debug/loglevel/com.vmware.vshield.edge?level=TRACE
Method: POST
e Select your load balancer pool, click Show Pool Statistics, and verify that the pool state is UP.
3 You can get more detailed load balancer pool configuration statistics from the NSX Manager using the
following REST API call:
URL: https://NSX_Manager_IP/api/4.0/edges/{edgeId}/loadbalancer/statistics
Method: GET
<member>
<memberId>member-2</memberId>
<name>web-02a</name>
<ipAddress>172.16.10.12</ipAddress>
<status>UP</status>
<lastStateChangeTime>2016-05-16 07:02:01</lastStateChangeTime>
<bytesIn>0</bytesIn>
<bytesOut>0</bytesOut>
<curSessions>0</curSessions>
<httpReqTotal>0</httpReqTotal>
<httpReqRate>0</httpReqRate>
<httpReqRateMax>0</httpReqRateMax>
<maxSessions>0</maxSessions>
<rate>0</rate>
<rateLimit>0</rateLimit>
<rateMax>0</rateMax>
<totalSessions>0</totalSessions>
</member>
<status>UP</status>
<bytesIn>0</bytesIn>
<bytesOut>0</bytesOut>
<curSessions>0</curSessions>
<httpReqTotal>0</httpReqTotal>
<httpReqRate>0</httpReqRate>
<httpReqRateMax>0</httpReqRateMax>
<maxSessions>0</maxSessions>
<rate>0</rate>
<rateLimit>0</rateLimit>
<rateMax>0</rateMax>
<totalSessions>0</totalSessions>
</pool>
<virtualServer>
<virtualServerId>virtualServer-1</virtualServerId>
<name>Web-Tier-VIP-01</name>
<ipAddress>172.16.10.10</ipAddress>
<status>OPEN</status>
<bytesIn>0</bytesIn>
<bytesOut>0</bytesOut>
<curSessions>0</curSessions>
<httpReqTotal>0</httpReqTotal>
<httpReqRate>0</httpReqRate>
<httpReqRateMax>0</httpReqRateMax>
<maxSessions>0</maxSessions>
<rate>0</rate>
<rateLimit>0</rateLimit>
<rateMax>0</rateMax>
<totalSessions>0</totalSessions>
</virtualServer>
</loadBalancerStatusAndStats>
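When the response is long, it can help to list only the pool members that are not UP. The following Python is a minimal sketch that uses the element names visible in the response above (pool, member, name, ipAddress, status). The sample document, the pool name, and the DOWN member are hypothetical, and the sketch assumes that each pool element carries a name child.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened response used only to exercise the parser.
SAMPLE = """<loadBalancerStatusAndStats>
  <pool>
    <name>Web-Tier-Pool-01</name>
    <member>
      <name>web-01a</name>
      <ipAddress>172.16.10.11</ipAddress>
      <status>DOWN</status>
    </member>
    <member>
      <name>web-02a</name>
      <ipAddress>172.16.10.12</ipAddress>
      <status>UP</status>
    </member>
    <status>UP</status>
  </pool>
</loadBalancerStatusAndStats>"""


def unhealthy_members(xml_text):
    """Return (pool, member, ip, status) for every member whose status is not UP."""
    root = ET.fromstring(xml_text)
    problems = []
    for pool in root.iter("pool"):
        pool_name = pool.findtext("name", default="(unnamed pool)")
        for member in pool.findall("member"):
            status = member.findtext("status", default="UNKNOWN")
            if status != "UP":
                problems.append(
                    (pool_name, member.findtext("name"), member.findtext("ipAddress"), status)
                )
    return problems


for row in unhealthy_members(SAMPLE):
    print(row)
```

The function takes the response body as text, so it can be fed from any HTTP client or from a saved file.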
4 To check load balancer statistics from the command line, run the following commands on the NSX
Edge.
For a particular virtual server: First run show service loadbalancer virtual to get the virtual
server name. Then run show statistics loadbalancer virtual <virtual-server-name>.
For a particular TCP pool: First run show service loadbalancer pool to get the pool name. Then
run show statistics loadbalancer pool <pool-name>.
n L2 VPN
n SSL VPN
n IPSec VPN
L2 VPN
With L2 VPN, you can stretch multiple logical L2 networks (both VLAN and VXLAN) across L3
boundaries, tunneled within an SSL VPN. In addition, you can configure multiple sites on an L2 VPN
server. Virtual machines remain on the same subnet when they are moved between sites and their IP
addresses do not change. You also have the option to deploy a standalone edge on a remote site without
that site being “NSX Enabled”. Egress optimization enables the edge to route any packets sent towards
the Egress Optimization IP address locally, and bridge everything else.
L2 VPN thus allows enterprises to seamlessly migrate workloads backed by VXLAN or VLAN between
physically separated locations. For cloud providers, L2 VPN provides a mechanism to on-board tenants
without modifying existing IP addresses for workloads and applications.
Problem
n L2 VPN client is configured to validate server certificate, but it is not configured with correct CA
certificate or FQDN.
n L2 VPN server is configured, but NAT / firewall rule is not created on internet facing firewall.
n Trunk interface is not backed by either a distributed port group or a standard port group.
Note L2 VPN server listens on port 443 by default. This port is configurable from L2 VPN server
settings.
L2 VPN client makes an outgoing connection to port 443 by default. This port is configurable from L2 VPN
client settings.
Solution
b Run the show process monitor command, and verify if you can find a process with name
l2vpn.
c Run the show service network-connections command, and verify if l2vpn process is
listening on port 443.
b Run the show process monitor command, and verify if you can find a process with name
naclientd.
c Run the show service network-connections command, and verify if naclientd process is
listening on port 443.
b A portal login page should be displayed. If portal page is displayed, it means that L2 VPN server
is reachable over internet.
4 Check if trunk interface is backed by a distributed port group or a standard port group.
a If the trunk interface is backed by a distributed port group, a sink port is automatically set.
b If the trunk interface is backed by a standard port group, you should manually configure the
vSphere Distributed Switch as follows:
a Two major issues are observed if NIC teaming is not configured correctly — MAC flapping, and
duplicate packets. Verify configuration as described in L2VPN Options to Mitigate Looping.
a Log in to the L2 VPN server CLI, and capture packets on the corresponding tap interface by running
debug packet capture interface name.
b Log in to the L2 VPN client, and capture packets on the corresponding tap interface by running
debug packet capture interface name.
c Analyze these captures to check whether ARP is being resolved and data traffic is flowing.
d Check if Allow Forged Transmits: dvSwitch property is set to L2 VPN trunk port.
e Check if sink port is set to L2 VPN trunk port. To do so, log in to the host and run the
net-dvs -l command. Check that the sink property is set for the L2 VPN edge internal port
(com.vmware.etherswitch.port.extraEthFRP = SINK). Internal port refers to the dvPort
to which the NSX Edge trunk is connected.
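The listening-port checks in the steps above are run on the Edge itself. From an outside machine, a plain TCP connection test gives a coarse answer to whether the L2 VPN server port is reachable through any NAT or firewall in the path. The following Python is a minimal sketch: the function name is hypothetical, and a successful TCP connection shows only that something accepts connections on the port, not that the L2 VPN service is healthy.

```python
import socket


def tcp_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (replace the address with the public IP of your L2 VPN server):
# tcp_port_open("203.0.113.10", 443)
```

If the server is configured to listen on a port other than the default 443, pass that port instead.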
Option 1: Separate ESXi hosts for the L2VPN Edges and the VMs
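[Figure: the Edge trunk interface connects to a SINK dvPortGroup and the VMs connect to a separate
dvPortGroup on the vDS, with one uplink Active and one Standby. Teaming Active/Active: Not Supported.]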
2 Configure the Teaming and Failover Policy for the Distributed Port Group associated with the Edge’s
Trunk vNic as follows:
b Configure only one uplink as Active and the other uplink as Standby.
3 Configure the teaming and failover policy for the distributed port group associated with the VMs as
follows:
4 Configure Edges to use sink port mode and disable promiscuous mode on the trunk vNic.
Note
n Disable promiscuous mode: If you are using vSphere Distributed Switch.
n Enable promiscuous mode: If you are using virtual switch to configure trunk interface.
If a virtual switch has promiscuous mode enabled, some of the packets that come in from the uplinks that
are not currently used by the promiscuous port, are not discarded. You should enable and then disable
ReversePathFwdCheckPromisc that will explicitly discard all the packets coming in from the currently
unused uplinks, for the promiscuous port.
To block the duplicate packets, activate RPF check for the promiscuous mode from the ESXi CLI where
NSX Edge is present:
In PortGroup security policy, set PromiscousMode from Accept to Reject and back to Accept to
activate the configured change.
a Configure the teaming and failover policy for the distributed port group associated with Edge’s
trunk vNic as follows:
b Configure the teaming and failover policy for the distributed port group associated with the VMs
as follows:
3 The order of the active/standby uplinks must be the same for the VMs' distributed port group
and the Edge’s trunk vNic distributed port group.
c Configure the client-side standalone edge to use sink port mode and disable promiscuous mode
on the trunk vNic.
Problem
Solution
2 Use the following commands on both the client and server edge:
n show configuration l2vpn - Check the following four key values to verify the server.
n show service l2vpn bridge - The number of interfaces depends on the number of L2 VPN
clients. In the following output, a single L2 VPN client (na1) is configured. Port1 refers to vNic_2. The
MAC address 02:50:56:56:44:52 has been learned on the vNic_2 interface, and is not local to
the edge (L2 VPN server). Row 3 in the following example refers to the na1 interface.
n show service l2vpn conversion table - In the following example, an Ethernet frame which
arrives on tunnel #1 will have its VLAN ID #1 converted to VXLAN with a VLAN # of 5001 before
the packet is passed to the VDS.
n show process monitor - Identify if the l2vpn (server) and naclientd (client) processes are
running.
n show service network-connections - Identify if the l2vpn (server) and naclientd (client)
processes are listening on port 443.
SSL VPN
You can use this information to troubleshoot problems with your setup.
Problem
n Driver installation failed for reason E000024B: please try rebooting the
machine.
n The installation failed. The installer encountered an error that cause the
installation to fail. Contact the software manufacturer for assistance.
Solution
1 Ensure that the operating system of SSL VPN client is supported. For more information, see the SSL
VPN-Plus section of the NSX Administration Guide.
2 For Windows 8.1 - The automatically downloaded installer is blocked by default. This issue occurs if you
use the Hide SSL client network adapter option while adding the installation package of the SSL VPN-Plus
client for the remote user. To resolve this issue, perform the following steps:
d Select the installation package that you want to edit, and then click the Edit icon.
Go to your Windows 8.1 machine and install the client without any errors.
3 For SSL VPN Client - Install the SSL VPN client on the end user's machine. Installation requires
administrator rights.
4 For SSL VPN Portal - You should be able to access the portal from any browser with cookies and
JavaScript enabled.
5 If SSL VPN client installation fails on Mac OS High Sierra, perform the following steps:
a The SSL VPN OS X client installation requires explicit user approval for loading a kernel
extension (or kext). To do this, go to your Mac OS machine and open the System Preferences >
Security & Privacy window.
b At the bottom of the window, you can see a message similar to "Some system software was
blocked from loading." Click the "Allow" button.
c To proceed with your installation, click the Allow button. For more details, refer to
https://developer.apple.com/library/content/technotes/tn2459/_index.html.
Problem
n Communication issues.
n The SSL VPN-Plus Client - Statistics screen on client machine shows virtual IP address as Not
yet assigned.
Solution
a Log in to the Edge appliance from the CLI. For more information, see the NSX Command Line
Interface Reference.
b Run the show process monitor command, and locate the sslvpn process.
c Run the show service network-connections command, and see if the sslvpn process is
listed on port 443.
2 The SSL VPN Portal/SSL VPN-Plus Client displays Maximum users reached/Maximum count of
logged in user reached as per SSL VPN license. Please try after some time or SSL
read has failed.
a To resolve this issue, increase the concurrent users (CCU) further by increasing the NSX Edge
form factor. For more information, see the NSX Administration Guide. Note that the connected
users get disconnected from VPN when you perform this operation.
a The back end (Private Network) and IP Pool should not be in same subnet.
4 Add a static IP pool as explained in the Add an IP Pool topic in the NSX Administration Guide.
Make sure you add the IP address in the Gateway field. The gateway IP address is assigned
to na0 interface. All non-TCP traffic flows through virtual adapter named as na0 interface. You
can create multiple IP pools with different gateway IP addresses assigned to same na0
interface.
5 Use the ifconfig command to verify the provided IP address and see if all IP pools are
assigned to the same na0 interface.
6 Log in to the client machine, go to the SSL VPN-Plus Client - Statistics screen and verify the
assigned virtual IP address.
c Log in to the Edge Command Line Interface (CLI), and take a packet capture on na0 interface by
running the debug packet capture interface na0 command.
Note Packet capture continues to run in the background until you stop the capture by running
the no debug packet capture interface na0 command.
e For non-TCP traffic, make sure back end network has default gateway set as internal interface of
the edge.
f For Linux client, log in to the Linux system on which SSL VPN client is installed and take packet
capture on tap0 interface or virtual adapter by running the tcpdump -i tap0 -s 1500 -w
filepath command.
a If language is not set to English, set the language to English and see if issue persists.
b Check if AES cipher is selected on SSL VPN server. Some browsers like Internet Explorer 8 do
not support AES encryption.
5 If the above steps do not resolve the issue, use the following commands to troubleshoot further.
n To check status of SSL VPN, run the show service sslvpn-plus command.
n To check statistics for SSL VPN, run the show service sslvpn-plus stats command.
n To check VPN clients that are connected, run the show service sslvpn-plus tunnels
command.
Problem
Solution
a Ensure that the external authentication server is reachable from the NSX Edge. From the NSX
Edge, ping the authentication server and verify if the server is reachable.
b Check the external authentication server configuration using tools such as the LDAP browser and
see if the configuration works. Only LDAP and AD authentication servers can be checked using
the LDAP browser.
c Ensure that the local authentication server is set to lowest priority if configured in authentication
process.
d If using Active Directory (AD), set it to no-ssl mode and take packet capture on the interface
from which AD Server is reachable.
e If authentication is successful, the syslog server shows a message similar to: Log Output -
SVP_LOG_NOTICE, 10-28-2013,09:28:39,Authentication,a,-,-,
10.112.243.61,-,PHAT,,SUCCESS,,,10-28-2013,09:28:39,-,-,,,,,,,,,,-,,-,
f If authentication fails, the syslog server shows a message similar to: Log Output -
SVP_LOG_NOTICE, 10-28-2013,09:28:39,Authentication,a,-,-,
10.112.243.61,-,PHAT,,FAILURE,,,10-28-2013,09:28:39,-,-,,,,,,,,,,-,,-,
n The following log output shows that the user a is successfully authenticated with Network Access
Client on 28th of October 2016 at 0928 hour.
SVP_LOG_NOTICE,10-28-2016,09:28:39,Authentication,a,-,-,
10.112.243.61,-,PHAT,,SUCCESS,,,10-28-2016,09:28:39,-,-,,,,,,,,,,,,-,,-,-
Authentication Failure
n The following log output shows that the user a failed to authenticate with Network Access Client on
28th of October 2016 at 0928 hour.
SVP_LOG_NOTICE,10-28-2016,09:28:39,Authentication,a,-,-,
10.112.243.61,-,PHAT,,FAILURE,,,10-28-2016,09:28:39,-,-,,,,,,,,,,,,-,,-,-
n The following log output shows that the user a is successfully connected with Network Access Client
over TCP on 28th of October 2016 at 0941 hour to the back end web server 192.168.10.8 .
SVP_LOG_INFO,10-28-2016,09:41:03,TCP Connect,a,-,-,10.112.243.61,-,PHAT,,SUCCESS,,,
10-28-2013,09:41:03,-,-,192.168.10.8,80,,,,,,,,,,-,,-,-
n The following log output shows that the user a failed to connect with Network Access Client over TCP
on 28th of October 2016 at 0941 hour to the back end web server 192.168.10.8 .
SVP_LOG_INFO,10-28-2016,09:41:03,TCP Connect,a,-,-,10.112.243.61,-,PHAT,,FAILURE,,,
10-28-2013,09:41:03,-,-,192.168.10.8,80,,,,,,,,,,-,,-,-
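The log lines above are comma-separated, so they can be split into fields when you need to search many entries. The following Python is a minimal sketch. The field positions (user in the fifth field, client IP in the eighth, result in the twelfth, back end address and port in the nineteenth and twentieth) are inferred from the sample lines and their descriptions in this section, not from a published format, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SvpLogEntry:
    level: str
    date: str
    time: str
    event: str
    user: str
    client_ip: str
    result: str
    backend: str  # "ip:port" for TCP Connect events, empty otherwise


def parse_svp_log(line):
    """Split one SSL VPN-Plus log line into the fields used in this section."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 12 or not fields[0].startswith("SVP_LOG_"):
        raise ValueError("not an SSL VPN-Plus log line")
    backend = ""
    if len(fields) > 19 and fields[18]:
        backend = f"{fields[18]}:{fields[19]}"
    return SvpLogEntry(
        level=fields[0], date=fields[1], time=fields[2], event=fields[3],
        user=fields[4], client_ip=fields[7], result=fields[11], backend=backend,
    )


# The wrapped sample line from above, joined back into one line.
sample = ("SVP_LOG_INFO,10-28-2016,09:41:03,TCP Connect,a,-,-,10.112.243.61,-,PHAT,,FAILURE,,,"
          "10-28-2013,09:41:03,-,-,192.168.10.8,80,,,,,,,,,,-,,-,-")
print(parse_svp_log(sample))
```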
IPSec VPN
Use this information to help you troubleshoot negotiation problems with your setup.
NSX Edge
From the NSX Edge command line interface (ipsec auto -status, part of show service ipsec command):
Cisco
Active SA: 1
Rekey SA: 0 (A tunnel will report 1 Active and 1 Rekey SA during rekey)
Total IKE SA: 1
NSX Edge
NSX Edge hangs in STATE_MAIN_I1 state. Look in /var/log/messages for information showing that the
peer sent back an IKE message with "NO_PROPOSAL_CHOSEN" set.
Cisco
If debug crypto is enabled, an error message is printed to show that no proposals were accepted.
NSX Edge
NSX Edge hangs at STATE_QUICK_I1. A log message shows that the peer sent a
NO_PROPOSAL_CHOSEN message.
type: ISAKMP_NEXT_NONE
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: | length: 32
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: | DOI: ISAKMP_DOI_IPSEC
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: | protocol ID: 3
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: | SPI size: 16
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: | Notify Message Type: NO_PROPOSAL_CHOSEN
Aug 26 12:33:54 weiqing-desktop ipsec[6933]: "s1-c1" #3: ignoring informational payload, type NO_PROPOSAL_CHOSEN msgid=00000000
Cisco
Debug messages show that Phase 1 is completed, but Phase 2 failed because of policy negotiation failure.
PFS Mismatch
The following lists PFS Mismatch Error logs.
NSX Edge
PFS is negotiated as part of Phase 2. If PFS does not match, the behavior is similar to the failure case
described in Phase 2 Not Matching.
| DOI: ISAKMP_DOI_IPSEC
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | protocol ID: 3
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | SPI size: 16
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | Notify Message Type: NO_PROPOSAL_CHOSEN
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: "s1-c1" #1: ignoring informational payload, type NO_PROPOSAL_CHOSEN msgid=00000000
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | info: fa 16 b3 e5 91 a9 b0 02 a3 30 e1 d9 6e 5a 13 d4
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | info: 93 e5 e4 d7
Aug 26 12:35:52 weiqing-desktop ipsec[7312]: | processing informational NO_PROPOSAL_CHOSEN (14)
Cisco
NSX Edge
PSK is negotiated in the last round of Phase 1. If PSK negotiation fails, NSX Edge state is
STATE_MAIN_I4. The peer sends a message containing INVALID_ID_INFORMATION.
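The NSX Edge symptoms described in this section pair a negotiation state with a notify type. The following Python is a minimal sketch that records those pairings as a lookup table and pulls notify types out of log text. The function names are hypothetical, and the Phase 1 label for STATE_MAIN_I1 is inferred from the state name and the Cisco note about rejected proposals. Treat the result as a hint about where to look, not as a diagnosis.

```python
import re

# (Edge state, notify type sent by the peer) -> likely cause, as described above.
FAILURE_HINTS = {
    ("STATE_MAIN_I1", "NO_PROPOSAL_CHOSEN"): "Phase 1 policy does not match the peer",
    ("STATE_QUICK_I1", "NO_PROPOSAL_CHOSEN"): "Phase 2 policy or PFS setting does not match the peer",
    ("STATE_MAIN_I4", "INVALID_ID_INFORMATION"): "PSK does not match the peer",
}

NOTIFY_RE = re.compile(r"\b(NO_PROPOSAL_CHOSEN|INVALID_ID_INFORMATION)\b")


def diagnose(state, notify_type):
    """Map a state and notify type to the mismatch this section associates with it."""
    return FAILURE_HINTS.get((state, notify_type), "no matching hint; compare both peers' settings")


def notify_types(log_text):
    """Return the distinct notify types found in the log text, in first-seen order."""
    seen = []
    for match in NOTIFY_RE.finditer(log_text):
        if match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


line = '"s1-c1" #3: ignoring informational payload, type NO_PROPOSAL_CHOSEN msgid=00000000'
for notify in notify_types(line):
    print(diagnose("STATE_QUICK_I1", notify))
```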
Cisco
Sharding is used to distribute workloads across NSX Controller cluster nodes. Sharding is the action of
dividing NSX Controller workloads into different shards so that each NSX Controller instance has an
equal portion of the work.
[Figure: Objects 1 to 9 of each role are divided into shards across three NSX Controller nodes, for
example 1 4 7, 2 5 8, and 3 6 9.]
This demonstrates how distinct controller nodes act as master for given entities such as logical switching,
logical routing and other services. After a master NSX Controller instance is chosen for a role, that
NSX Controller divides the different logical switches and routers among all available NSX Controller
instances in a cluster.
Each numbered box represents a shard that the master uses to divide the workloads. The
logical switch master divides the logical switches into shards and assigns these shards to different
NSX Controller instances. The master for the logical routers also divides the logical routers into shards
and assigns these shards to different NSX Controller instances.
These shards are assigned to the different NSX Controller instances in that cluster. The master for a role
decides which NSX Controller instances are assigned to which shard. If a request comes in on router
shard 3, the shard is told to connect to the third NSX Controller instance. If a request comes in on logical
switch shard 2, that request is processed by the second NSX Controller instance.
When one of the NSX Controller instances in a cluster fails, the masters for the roles redistribute the
shards to the remaining available controllers. One of the controller nodes is elected as a master for each
role. The master is responsible for allocating shards to individual controller nodes, determining when a
node has failed, and reallocating the shards to the other nodes. The master also informs the ESXi hosts
about the failure of the cluster node.
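The allocation and reallocation described above can be pictured with a simple round-robin assignment. The following Python is an illustrative sketch only: the function name and controller names are hypothetical, and round-robin is an assumption chosen for clarity, not the algorithm that NSX Controller actually uses.

```python
def assign_shards(shard_ids, controllers):
    """Distribute shards across the available controllers in round-robin order."""
    if not controllers:
        raise ValueError("at least one controller is required")
    assignment = {c: [] for c in controllers}
    for index, shard in enumerate(shard_ids):
        assignment[controllers[index % len(controllers)]].append(shard)
    return assignment


shards = list(range(1, 10))
nodes = ["controller-1", "controller-2", "controller-3"]

# Healthy cluster: every node owns an equal portion of the shards.
print(assign_shards(shards, nodes))

# controller-2 fails: the master reallocates all shards to the remaining nodes.
survivors = [n for n in nodes if n != "controller-2"]
print(assign_shards(shards, survivors))
```

With three nodes, each node owns three of the nine shards. After one node fails, the two remaining nodes own five and four shards respectively.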
The election of the master for each role requires a majority vote of all active and inactive nodes in the
cluster. This is the primary reason why a controller cluster must always be deployed with an odd number
of nodes.
ZooKeeper
ZooKeeper is a client-server architecture that is responsible for the NSX Controller cluster mechanism. The
controller cluster is discovered and created using ZooKeeper. When the cluster is coming up, it means that
ZooKeeper is coming up between all the nodes. ZooKeeper nodes go through an election process to form
the control cluster. There must be a ZooKeeper master node in the cluster, which is chosen through
inter-node election.
When a new controller node is created, NSX Manager propagates the node information to the current
cluster, with node IP and ID. As such, each node knows the total number of nodes available for clustering.
During ZooKeeper master election, each node casts one vote to elect a master node. The election is
triggered again until one node has a majority of the votes. For example, in a three node cluster, the
master must have received at least two of the votes.
Note To prevent a scenario in which a ZooKeeper master cannot be elected, the number of nodes in the
cluster MUST be three.
n When the first controller is deployed, it’s a special case and the first controller becomes master. As
such, when deploying controllers, the first node must complete deployment before any other nodes
are added.
n When adding the second controller, it’s also a special case, because the number of nodes at this time
is even.
n When the third node is added, the cluster reaches a supported stable state.
ZooKeeper can sustain only one failure at a time. This means that if one controller node goes down, it
must be recovered before any other failures. Otherwise, there can be problems with the cluster breaking.
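The majority rule explains both the three-node requirement and the single-failure limit. The following Python is a minimal sketch of the arithmetic, with hypothetical function names. It assumes a strict majority of all cluster members, which matches the example above of two votes in a three-node cluster.

```python
def votes_needed(cluster_size):
    """Smallest number of votes that is a strict majority of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """Number of nodes that can fail while a majority can still be formed."""
    return cluster_size - votes_needed(cluster_size)


for size in (1, 2, 3):
    print(size, votes_needed(size), tolerated_failures(size))
```

A two-node cluster needs both votes and therefore tolerates no failure, which is why the second controller is described as a special case. A three-node cluster needs two votes and tolerates one failure.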
Domain manager is responsible for starting all domains. To join the cluster, the CCP domain talks to the
CCP domain on other machines. The component of the CCP domain that helps with cluster initialization is
zk-cluster-bootstrap.
When a logical switch is created, the controller nodes within the cluster determine which node will be
master or owner for that logical switch. The same applies when a logical router is added.
Once ownership is established for a logical switch or logical router, the node sends that ownership to the
ESXi hosts that belong to that switch or router’s transport zone. The entire election of ownership and
propagation of the ownership information to the hosts is called ‘sharding’. Note that ownership means that
node is responsible for all NSX related operations for that logical switch or logical router. The other nodes
will not perform any operation for that logical switch.
Because only one owner must be the source of truth for a logical switch or logical router, any time the
controller cluster breaks in such a way that two or more nodes are elected as owner for a logical switch or
logical router, each host in the network may have different information regarding the source of truth for
that logical switch or logical router. If this happens, there will be a network outage because network control
and data plane operations can only have one source of truth.
If a controller node goes down, the remaining nodes in the cluster rerun sharding to determine
ownership of the logical switches and logical routers.
It is recommended that you implement DRS anti-affinity rules to keep NSX Controllers on separate hosts.
You must deploy THREE NSX Controllers.
n Running the show control-cluster status command shows the Majority status flapping
between Connected to cluster majority and Interrupted connection to cluster
majority.
Note When a controller node is disconnected, do NOT use the join cluster or force
join command. These commands are not designed to rejoin a node to the cluster, and using them might
leave the cluster in an uncertain state.
Cluster startup nodes are just a hint to the cluster members on where to look when the members
start up. Do not be alarmed if this list contains cluster members no longer in service. This will not
impact cluster functionality.
All cluster members should have the same cluster ID. If they do not, then the cluster is in a
broken status and you should work with VMware technical support to repair it.
n The show control-cluster startup-nodes command was not designed to display all nodes
currently in the cluster. Instead, it shows which other controller nodes are used by this node to
bootstrap membership into the cluster when the controller process restarts. Accordingly, the
command output may show some nodes which are shut down or have otherwise been pruned
from the cluster.
n In addition, the show control-cluster network ipsec status command lets you inspect
the Internet Protocol Security (IPsec) state. If controllers are unable to communicate
with each other for a few minutes to hours, run the cat /var/log/syslog | egrep
"sending DPD request|IKE_SA" command and see if the log messages indicate absence of
traffic. You can also run the ipsec statusall | egrep "bytes_i|bytes_o" command and
verify that there are no two IPsec tunnels established. Provide the output of these commands and
the controller logs when reporting a suspected control cluster issue to your VMware technical
support representative.
n IP connectivity issues between the NSX Manager and the NSX controllers. This is generally caused
by physical network connectivity issues or a firewall blocking communication.
n Insufficient resources such as storage available on vSphere to host the controllers. Viewing the
vCenter events and tasks log during controller deployment can identify such issues.
n DNS on ESXi hosts and NSX Manager have not been configured properly.
n When newly connected VMs have no network access, this is likely caused by a control-plane issue.
Check the controller status.
Also try running the esxcli network vswitch dvs vmware vxlan network list --vds-name
<name> command on ESXi hosts to check the control-plane status. Note that the Controller
connection is down.
n Running the show log manager follow NSX Manager CLI command can identify any other
reasons for a failure to deploy controllers.
n Check for any abnormal error statistics using the show log cloudnet/cloudnet_java-vnet-
controller*.log filtered-by host_IP command.
n Verify the logical switch/router message statistics or high message rate using the following
commands:
n You can use the show host hostID health-status command to check the health status of hosts
in your prepared clusters. For controller troubleshooting, the following health checks are supported:
n Check whether the VXLAN Network Identifier (VNI) is created and whether the configuration is
correct.
n Verify that all NSX Controllers display a Connected status. If any of the controller nodes display a
Disconnected status, ensure that the following information is consistent by running the show
control-cluster status command on all controller nodes:
Type Status
n Verify that vnet-controller process is running. Run the show process command on all controller
nodes and ensure that java-dir-server service is running.
n Verify the cluster history and ensure there is no sign of host connection flapping, VNI join failures,
or abnormal cluster membership changes. To verify this, run the show control-cluster history
command. The command also shows if the node is frequently restarted. Verify that there are not
many log files with zero (0) size and with different process IDs.
n Verify that VXLAN Network Identifier (VNI) is configured. For more information, see the VXLAN
Preparation Steps section of the VMware VXLAN Deployment Guide.
n Verify that SSL is enabled on the controller cluster. Run the show log cloudnet/cloudnet_java-
vnet-controller*.log filtered-by sslEnabled command on each of the controller nodes.
To view the disk latency alerts for NSX Controller, perform the following procedure:
Prerequisites
Procedure
3 Under Management, go to the required controller, and click the Disk Alert link.
You can view the latency details for the selected controller. The alert logs are stored for seven days in the
cloudnet/run/iostat/iostat_alert.log file. You can use the show log
cloudnet/run/iostat/iostat_alert.log command to display the log file.
What to do next
For more troubleshooting information on disk latency, refer to Disk Latency Issues.
For more information about log messages, refer to NSX Logging and System Events.
Problem
n If the storage system does not meet these requirements, the cluster can become unstable and cause
system downtime.
n TCP listeners applicable to a functioning NSX Controller, no longer appear in the output of the show
network connections of-type tcp command.
n The disconnected controller attempts to join the cluster using an all-zeroes UUID, which is not valid.
cloudnet_java-zookeeper.20150530-000550.1806.log-2015-05-30
13:25:07,382 47956539 [SyncThread:1] WARN
org.apache.zookeeper.server.persistence.FileTxnLog - fsync-ing the write ahead
log in SyncThread:1 took 3219ms which will adversely effect operation latency.
See the ZooKeeper troubleshooting guide
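A warning like the one above can be found automatically when you review a large log. The following Python is a minimal sketch that extracts the fsync duration from ZooKeeper log text and reports entries above a threshold. The function name is hypothetical, and the pattern is based only on the sample message shown here.

```python
import re

FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(log_text, threshold_ms=1000):
    """Return (thread, milliseconds) for each fsync warning above the threshold."""
    # Collapse whitespace first so that messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    results = []
    for match in FSYNC_RE.finditer(flat):
        duration = int(match.group(2))
        if duration > threshold_ms:
            results.append((match.group(1), duration))
    return results


sample = """2015-05-30 13:25:07,382 47956539 [SyncThread:1] WARN
org.apache.zookeeper.server.persistence.FileTxnLog - fsync-ing the write ahead
log in SyncThread:1 took 3219ms which will adversely effect operation latency."""
print(slow_fsyncs(sample))
```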
Cause
This issue occurs due to slow disk performance, which adversely impacts the NSX Controller cluster.
n Check for slow disks by looking for fsync messages in the
/var/log/cloudnet/cloudnet_java-zookeeper log file. If fsync takes more than one second, ZooKeeper
displays an fsync warning message, which is a good indication that the disk is too slow. VMware
recommends dedicating a Logical Unit Number (LUN) specifically to the control-cluster and/or moving
the storage array closer to the control-cluster in terms of latencies.
n You can view the read latency and write latency calculations. These values feed a moving average
(5 seconds by default), which triggers an alert when it breaches the latency limit. The
alert is turned off after the average comes down to the low watermark. By default, the high watermark
is set to 200 ms, and the low watermark is set to 100 ms. To view these settings, use the
show disk-latency-alert config command. The output is displayed as follows:
n Use the GET /api/2.0/vdn/controller REST API to check whether a disk latency alert is
detected on a controller node.
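The high and low watermarks described above form a hysteresis: the alert turns on when the moving average rises above 200 ms and turns off only after the average falls to 100 ms. The following Python is a minimal sketch of that behavior. The class name is hypothetical, and it assumes one latency sample per second so that a five-sample window approximates the 5-second default. How the controller actually samples latency is not described here.

```python
from collections import deque


class LatencyAlert:
    """Moving-average alert with separate high and low watermarks."""

    def __init__(self, window=5, high_ms=200.0, low_ms=100.0):
        self.samples = deque(maxlen=window)  # keeps only the newest samples
        self.high_ms = high_ms
        self.low_ms = low_ms
        self.active = False

    def add_sample(self, latency_ms):
        """Record one sample and return whether the alert is currently active."""
        self.samples.append(latency_ms)
        average = sum(self.samples) / len(self.samples)
        if not self.active and average > self.high_ms:
            self.active = True
        elif self.active and average <= self.low_ms:
            self.active = False
        return self.active


alert = LatencyAlert()
for latency in [300] * 5 + [150] * 5 + [50] * 5:
    state = alert.add_sample(latency)
print(state)
```

An average between the two watermarks leaves the alert in its previous state, which keeps it from flapping when latency hovers near a single limit.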
Solution
2 Each controller should use its own disk storage server. Do not share the same disk storage server
between two controllers.
What to do next
For more information on how to view alerts, refer to View Disk Latency Alerts.
Prerequisites
Procedure
1 In the vSphere Web Client, navigate to Networking & Security > Installation > Management.
2 Click the Error link to see the detailed reason for this out of sync state.
If the controller VM is powered off, Management Plane triggers a power on command for the controller.
If the controller VM is deleted, the entries of the controller are deleted from the Management Plane and
Management Plane communicates the controller deletion to the Central Control Plane.
What to do next
Create a new controller using the Add Node option. For details, refer to the NSX Administration Guide.
Problem
Solution
3 In the NSX Controller nodes section, go to the Peers column. Green boxes in the Peers column
indicate no error in the peer controller connectivity in the cluster. A red box indicates an
error with a peer. Click the box to view details.
4 If the Peers column indicates a problem in the controller cluster, log in to each NSX Controller CLI to
perform detailed diagnosis. Run the show control-cluster status command to diagnose the
state of each controller. All controllers in the cluster should have the same cluster UUID. However, the
cluster UUID may not be the same as the UUID of the master controller. You can find information about
deployment issues as described in NSX Controller Deployment Issues.
5 You can try the following methods to resolve the issue before redeploying the controller node:
b Ping to and from the affected controller, the other nodes, and the manager to check network paths. If
you find any network issues, address them as described in NSX Controller Deployment Issues.
c Check the Internet Protocol Security (IPSec) status using the following CLI commands.
u Verify if IPSec is enabled using the show control-cluster network ipsec status
command.
u Verify the status of the IPSec tunnels using the show control-cluster network ipsec
tunnels command.
You can also include the IPSec status information when you report the issue to VMware technical support.
d If the issue is not a network issue, then you can choose whether to reboot or redeploy.
If you want to reboot a node, note that only one controller reboot should be done at a time. However,
if the controller cluster is in a state where more than one node has failed, reboot all of them at the
same time. When rebooting a node from a healthy cluster, always confirm that the cluster is reformed
properly afterwards, then confirm that resharding has been done properly.
6 If you decide to redeploy, you must first delete the broken controller and then deploy a new controller.
What to do next
n The controller cluster is healthy, and a controller cluster API request can be processed.
n The host state, as obtained from the vCenter Server inventory, shows connected and powered on.
The forceful removal procedure does not check these conditions before removing the
controller node.
n Do not attempt to delete the controller VM before deleting it through the vSphere Web Client UI or
API. When the UI is not usable, use the DELETE /2.0/vdn/controller/{controllerId} API
to delete the controller.
n After deletion of a node, ensure that the existing cluster stays stable.
n When deleting all the nodes in a cluster, the last remaining node must be deleted using the
Forcefully remove the controller option. Always verify that the controller VM is deleted
successfully. If not, manually power down the VM and delete the controller VM using the UI.
n If the delete operation fails, the VM could not be deleted. In that case, delete the
controller through the UI with the Forcefully remove the controller option. For the API, set the
forceRemoval parameter to true. After forceful removal, manually power down the VM and
delete the controller VM using the UI.
n A multi-node cluster can sustain only one failure, and a deletion counts as a failure. The deleted
node must be redeployed before another failure occurs.
n Deleting the controller VM or powering it off directly in vCenter Server is not a supported
operation. The Status column displays Out of sync status.
n If controller deletion succeeds only partially, and an entry is left behind in the NSX Manager
database in a Cross-vCenter NSX environment, use the DELETE
api/2.0/vdn/controller/external API.
n If the controller was imported through the NSX Manager API, use
the removeExternalControllerReference API with the forceRemoval option.
n When deleting a controller, NSX requests vCenter Server to delete the controller VM using the
Managed Object ID (MOID) of the VM. If vCenter Server cannot find the VM by its MOID, NSX
reports a failure for the controller delete request and aborts the operation.
If the Forcefully Delete option is selected, NSX does not abort the controller delete operation and
clears the controller's information. NSX also updates all the hosts to no longer trust the deleted
controller. However, if the controller VM is still active and running with a different MOID, it still has
credentials to participate as a member of the controller cluster. Under this scenario, any logical
switch or router that is assigned to this controller node will not function properly because the ESXi
hosts no longer trust the deleted controller.
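The API-based deletion mentioned in the list above can be sketched as a small URL builder. This is a minimal illustration, not an official client: the helper name controller_delete_url and the host name nsxmgr.example.com are hypothetical, and passing forceRemoval as a query parameter is an assumption based on the description above. The sketch only builds the URL; it does not send the request.

```python
from urllib.parse import quote, urlencode


def controller_delete_url(manager: str, controller_id: str, force: bool = False) -> str:
    """Build the URL for DELETE /2.0/vdn/controller/{controllerId}.

    manager is the NSX Manager host name or IP address (hypothetical value
    in the demo below). Nothing is sent over the network.
    """
    base = f"https://{manager}/api/2.0/vdn/controller/{quote(controller_id, safe='')}"
    if force:
        # Assumption: forceRemoval is supplied as a query parameter.
        return f"{base}?{urlencode({'forceRemoval': 'true'})}"
    return base


if __name__ == "__main__":
    # Normal delete, then forceful removal of the same controller.
    print(controller_delete_url("nsxmgr.example.com", "controller-1"))
    print(controller_delete_url("nsxmgr.example.com", "controller-1", force=True))
```

Use the forced form only after a normal delete has failed, and verify afterwards that the controller VM no longer exists.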
Procedure
u When you select the Forcefully Delete option, the controller gets deleted forcefully and not
gracefully. This option ignores any failures and clears the data from database. You should verify
that any possible failures are taken care of manually. You must confirm that the controller VM is
successfully deleted. If not, you must delete it through vCenter Server.
Note If you are deleting the last controller in the cluster, you must select the Forcefully Delete
option to remove the last controller node. When there are no controllers in the system, the hosts are
operating in what is called "headless" mode. New VMs or vMotioned VMs will have networking issues
until new controllers are deployed and the synchronization is completed.
c If the cluster is not healthy, power on the controller, and fail the removal request.
d If the cluster is healthy, remove the controller VM, and release the IP address of the node.
7 Re-synchronize the controller state by clicking Actions > Update Controller State.
What to do next
In case of an NSX Controller failure, you may still have two controllers that are working. The cluster
majority is maintained, and the control plane continues to function. For more information, refer to
Redeploy an NSX Controller.
For more information about the controller Out of sync status, refer to NSX Controller Is Disconnected.
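The majority reasoning above can be checked with simple arithmetic. The sketch below assumes a plain majority rule (more than half of the nodes must be working) and a three-node cluster; both are assumptions for illustration, and the function names are hypothetical.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of working nodes that is more than half the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """Number of nodes that can fail while a majority is still working."""
    return cluster_size - majority(cluster_size)


def has_majority(cluster_size: int, working_nodes: int) -> bool:
    """True if the working nodes still form a majority."""
    return working_nodes >= majority(cluster_size)


if __name__ == "__main__":
    size = 3  # assumed cluster size
    print("majority:", majority(size))
    print("tolerated failures:", tolerated_failures(size))
    print("two of three working:", has_majority(size, 2))
    print("one of three working:", has_majority(size, 1))
```

Under these assumptions a three-node cluster keeps its majority with two working nodes and loses it with one, which is why a deleted or failed node should be redeployed before another failure occurs.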
Prerequisites
Verify if you can resolve the issue as described in NSX Controller Cluster Failures.
Procedure
3 In the NSX Controller nodes section, click the affected controller and take screen shots/print-screens
of the NSX Controller Details screen or write down the configuration information for later reference.
For example:
4 Deploy a new NSX Controller node by clicking the Add Node (+) icon.
5 In the Add Controller dialog box, select the datacenter on which you are adding the nodes, and
configure the controller settings.
d Select the IP pool from which IP addresses are to be assigned to the node.
e Click OK, wait for the installation to complete, and verify that the node has a status of Normal.
6 Resynchronize the controller state by clicking Actions > Update Controller State.
Update Controller State pushes the current VXLAN and Distributed Logical Router configuration
(including Universal Objects in a Cross-vCenter NSX deployment) from NSX Manager to the
controller cluster.
What to do next
For more information about deploying a controller cluster, refer to the NSX Administration Guide.
For more information about how to delete the affected controller, refer to Delete an NSX Controller.
Phantom Controller
A phantom controller can be a live controller virtual machine (VM) or a non-existent VM that may or
may not be participating in the cluster. NSX Manager synchronizes the list of all VMs from the
vCenter Server inventory. A phantom controller is created when the vCenter Server or host deletes a
controller VM without a request from NSX Manager, or when vCenter Server inventory changes the
reference MOID of the controller VMs.
When a controller is created from NSX, the configuration information is stored inside the NSX Manager.
NSX Manager deploys the new controller VM through the vCenter Server.
The NSX administrator provides the configuration, including an IP address pool, to the NSX Manager to
create a controller. NSX Manager takes an IP address from the pool, and pushes that IP with the rest of the
controller configuration as a VM creation request to the vCenter Server. NSX Manager waits for
vCenter Server to confirm the status of the request.
n The controller creation process was successful: If the controller VM is created successfully,
vCenter Server starts the controller VM. NSX Manager stores the Managed Object ID (MOID) of the
VM with the rest of the controller’s configuration information. The MOID (or MO-REF) is a unique
identifier that vCenter assigns to every object in its inventory. vCenter Server also uses this MOID to
track the VM for as long as it remains part of the vCenter Server inventory.
n The controller creation process was not successful: If the IP and network connection
configurations were incorrect, then NSX Manager might not be able to contact vCenter Server.
NSX Manager waits for a preset amount of time for a single-node controller cluster to be created (for
the first controller) or for the new controller to join the active cluster. If the timer expires, NSX Manager
requests vCenter Server to delete the VM. The IP address is returned to the pool and NSX declares a
controller creation failure.
However, if any vCenter activities result in removal of the controller VM from the vCenter Server inventory,
vCenter removes the MOID from its database. Note that the controller VM can still be alive and active on
the NSX Manager even after it is removed from the vCenter inventory, but for vCenter Server, the
controller VM no longer exists. Even though vCenter Server has removed the VM from its inventory, the
VM may not be deleted. If the VM is still active, then it is still participating or attempting to participate in
the NSX controller cluster.
The following are the most common examples of how a phantom controller is created:
n The vCenter Server administrator removes the host that contains the controller VM from the
inventory and later adds the host back. When the host is removed, vCenter Server deletes all the MOIDs
associated with the host and the VMs within it. When the host is added back later, vCenter Server
assigns brand new MOIDs to the host and the VMs. From the vCenter Server's perspective, the host
and VMs are brand new objects. For NSX users, and for all practical purposes, the host and VMs are
still the same. The applications that run within the host and VMs do not change.
n The vCenter Server administrator deletes the controller VM through vCenter Server or using Host
Management. The deletion was not initiated by NSX Manager.
n Deletion in this case also includes any host or storage failures that result in the loss of the VM. In this
case, the VM is lost to vCenter Server and also lost to the cluster and NSX Manager. But because the
deletion was not initiated by NSX Manager, both NSX Manager and the controller cluster think that
the controller is still valid. The controller status returned to NSX Manager and displayed on the UI
indicates that this controller node is down and not part of the cluster. NSX Manager also has
logs indicating that the controller is no longer reachable.
2 See the log entries. For cases where the controller VM got deleted accidentally or got corrupted, you
must use the Forcefully Delete option to clear the entry from the NSX Manager database. For
details, refer to Delete an NSX Controller.
n The syslog entries for the NSX Manager no longer show an extra controller.
In NSX 6.2.7 and later, NSX Manager checks the vCenter inventory to ensure that the controller
VM still exists in the inventory based on the original MOID. If NSX Manager cannot find the controller VM
in the inventory, NSX Manager searches for the VM using the VM's instance UUID. The instance UUID is stored
within the VM, so it does not change even when the VM is added back to the vCenter inventory. If
NSX Manager is able to find the VM with the instance UUID, NSX Manager updates its database with the
new MOID.
However, if you clone the controller VM, the cloned VM has the same properties as the original VM but
a new instance UUID. NSX Manager cannot detect the MOID of the cloned VM.
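The lookup order described above (original MOID first, then instance UUID) can be illustrated with a small, self-contained sketch. It is a simplified model for reasoning about the behavior, not NSX Manager code: the classes, field names, and sample identifiers are all hypothetical, and the inventory is modeled as a plain dictionary keyed by MOID.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class InventoryVm:
    """A VM as seen in a (modeled) vCenter inventory."""
    moid: str
    instance_uuid: str


@dataclass
class ControllerRecord:
    """What the manager has stored for a controller (hypothetical fields)."""
    controller_id: str
    moid: str
    instance_uuid: str


def reconcile_moid(record: ControllerRecord,
                   inventory: Dict[str, InventoryVm]) -> Optional[str]:
    """Return the MOID that currently identifies the controller VM.

    1. If the stored MOID is still in the inventory, keep it.
    2. Otherwise search by instance UUID and update the stored MOID.
    3. If neither matches (for example, a clone with a new instance
       UUID), return None and leave the record unchanged.
    """
    if record.moid in inventory:
        return record.moid
    for vm in inventory.values():
        if vm.instance_uuid == record.instance_uuid:
            record.moid = vm.moid
            return vm.moid
    return None


if __name__ == "__main__":
    # Host removed and re-added: same instance UUID, new MOID.
    rec = ControllerRecord("controller-1", "vm-10", "uuid-a")
    print(reconcile_moid(rec, {"vm-99": InventoryVm("vm-99", "uuid-a")}))
    # Cloned VM: new MOID and new instance UUID, so no match.
    clone = ControllerRecord("controller-2", "vm-11", "uuid-b")
    print(reconcile_moid(clone, {"vm-50": InventoryVm("vm-50", "uuid-c")}))
```

The second case shows why a cloned controller VM cannot be matched: neither identifier that the lookup relies on survives the clone.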
Problem
Cause
If there is any missing connection, then control plane agent may not be working properly.
Solution
1 Validate the connection status when the channel goes into a wrong state using the following
command:
GET https://<NSX_Manager_IP>/api/2.0/vdn/inventory/host/{hostId}/connection/status
<errorMessage>Connection Refused</errorMessage>
</hostToControllerConnectionError>
<hostToControllerConnectionError>
<controllerIp>10.160.203.237</controllerIp>
<errorCode>1255603</errorCode>
<errorMessage>SSL Handshake Failure</errorMessage>
</hostToControllerConnectionError>
</hostToControllerConnectionErrors>
</hostConnStatus>
2 Determine the reason for the control plane agent being down as follows:
a Check the control plane agent status on hosts by running the /etc/init.d/netcpad status
command on ESXi hosts.
b Check the control plane agent configurations using the more /etc/vmware/netcpa/config-by-vsm.xml
command. The IP addresses of the NSX Controllers should be listed.
<thumbprint>BD:DB:BA:B0:DC:61:AD:94:C6:0F:7E:F5:80:19:44:51:BA:90:2C:8D</thumbprint>
</connection>
</connectionList>
...
3 Validate connections to the controllers from the control plane agent using the following command.
The output is one connection for each controller.
4 Check whether any connections from the control plane agent to the controllers are in the CLOSED or
CLOSE_WAIT state by running the following command:
esxcli network ip connection list |grep "1234.*netcpa*" | egrep "CLOSED|CLOSE_WAIT"
5 If the control plane agent has been down for a significantly long time, the connections may not be
present at all. To validate this, run the following command. The output is one connection for each
controller.
esxcli network ip connection list |grep "1234.*netcpa*" |grep ESTABLISHED
6 Control plane agent (netcpa) auto-recovery mechanism: The automatic control plane agent
monitoring process detects when the control plane agent is in an incorrect state. When the control
plane agent is in an incorrect state, it stops responding and then automatically tries to recover.
a When the control plane agent stops responding, a live core file is generated. You can find the core
file as follows:
ls /var/core
netcpa-worker-zdump.000
Note If the control plane agent monitor experiences a temporary failure due to a delayed response
to the status check, a warning message similar to the following may be reported in the VMKernel
logs.
7 If the issue is not recovered automatically, restart the control plane agent as follows:
a Log in as root to the ESXi host through SSH or through the console.
b Run the /etc/init.d/netcpad restart command to restart the control plane agent on the
ESXi host.
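The connection status response from step 1 can be summarized with a short script. The sketch below assumes that the response is a hostConnStatus document containing hostToControllerConnectionError entries, as in the fragment shown in step 1. The sample is partly hypothetical: the opening elements and the first controller IP address (192.0.2.10) are assumptions, because the output above is truncated, and the missing error code of the first entry is left out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Sample response. The second entry comes from the output in step 1; the
# root elements and the first controllerIp are assumptions for illustration.
SAMPLE = """<hostConnStatus>
  <hostToControllerConnectionErrors>
    <hostToControllerConnectionError>
      <controllerIp>192.0.2.10</controllerIp>
      <errorMessage>Connection Refused</errorMessage>
    </hostToControllerConnectionError>
    <hostToControllerConnectionError>
      <controllerIp>10.160.203.237</controllerIp>
      <errorCode>1255603</errorCode>
      <errorMessage>SSL Handshake Failure</errorMessage>
    </hostToControllerConnectionError>
  </hostToControllerConnectionErrors>
</hostConnStatus>"""


def connection_errors(xml_text):
    """Return a list of (controller IP, error code, error message) tuples.

    Fields that are missing from an entry are returned as empty strings.
    """
    root = ET.fromstring(xml_text)
    errors = []
    for err in root.iter("hostToControllerConnectionError"):
        errors.append((
            err.findtext("controllerIp", default=""),
            err.findtext("errorCode", default=""),
            err.findtext("errorMessage", default=""),
        ))
    return errors


if __name__ == "__main__":
    for ip, code, message in connection_errors(SAMPLE):
        print(f"controller {ip}: {message} (code {code or 'n/a'})")
```

An empty result means the response contains no connection errors; any entry points to the controller whose connection should be checked with the esxcli commands in steps 3 to 5.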
n Troubleshooting EPSecLib
[Figure: Guest Introspection architecture. Labels recoverable from the figure: NSX Manager, Service
Insertion, ESXi Agent Manager (EAM), partner-defined manager, configuration and health monitoring,
configuration data (RMQ), health events (RMQ), GI SVM deployment, partner SVM deployment, ESXi
hypervisor, GI SVM, partner SVM, guest VM, VMware Tools, GI ESXi Module.]
/var/log/syslog
var/run/syslog.log
For example:
To turn on full logging, perform these steps on the ESXi host command shell:
1 Run the ps -c |grep Mux command to find the ESX GI Module processes that are currently running.
For example:
~ # ps -c | grep Mux
192223 192223 sh /bin/sh /sbin/watchdog.sh -s vShield-Endpoint-Mux -q 100 -t
1000000 /usr/lib/vmware/vShield-Endpoint-Mux 900 -c 910
192233 192233 vShield-Endpoint-Mux /usr/lib/vmware/vShield-Endpoint-Mux 900 -c 910
192236 192233 vShield-Endpoint-Mux /usr/lib/vmware/vShield-Endpoint-Mux 900 -c 910
2 If the service is not running, you can start or restart it with one of these commands:
/etc/init.d/vShield-Endpoint-Mux start or /etc/init.d/vShield-Endpoint-Mux restart.
3 To stop the running ESX GI Module processes, including the watchdog.sh process, run the kill -9
command with the process IDs found in step 1. For example: ~ # kill -9 192223 192233 192236
4 Start the ESX GI Module with the -d option: ~ # /usr/lib/vmware/vShield-Endpoint-Mux -d 900 -c 910
Note that the -d option does not exist for epsec-mux builds 5.1.0-01255202 and 5.1.0-01814505.
5 View the ESX GI Module log messages in the /var/log/syslog.log file on the ESXi host. Check
that the entries corresponding to the global solutions, solution ID, and port number are specified
correctly.
<EndpointConfig>
<InstalledSolutions>
<Solution>
<id>100</id>
<ipAddress>xxx.xxx.xxx.xxx</ipAddress>
<listenOn>ip</listenOn>
<port>48655</port>
<uuid>42383371-3630-47b0-8796-f1d9c52ab1d0</uuid>
<vmxPath>/vmfs/volumes/7adf9e00-609186d9/EndpointService (216)/EndpointService (216).vmx</vmxPath>
</Solution>
<Solution>
<id>102</id>
<ipAddress>xxx.xxx.xxx.xxx</ipAddress>
<listenOn>ip</listenOn>
<port>48651</port>
<uuid>423839c4-c7d6-e92e-b552-79870da05291</uuid>
<vmxPath>/vmfs/volumes/7adf9e00-609186d9/apoon/EndpointSVM-alpha-01/EndpointSVM-alpha-01.vmx</vmxPath>
</Solution>
<Solution>
<id>6341068275337723904</id>
<ipAddress>xxx.xxx.xxx.xxx</ipAddress>
<listenOn>ip</listenOn>
<port>48655</port>
<uuid>42388025-314f-829f-2770-a143b9cbd1ee</uuid>
</Solution>
</InstalledSolutions>
<DefaultSolutions/>
<GlobalSolutions>
<solution>
<id>100</id>
<tag></tag>
<order>0</order>
</solution>
<solution>
<id>102</id>
<tag></tag>
<order>10000</order>
</solution>
<solution>
<id>6341068275337723904</id>
<tag></tag>
<order>10001</order>
</solution>
</GlobalSolutions>
</EndpointConfig>
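The check in step 5 (global solutions, solution IDs, and port numbers) can be partly automated once the configuration has been copied out of the log. The sketch below assumes the EndpointConfig structure shown above and uses a shortened sample with the same element names. The function name and the two rules it applies (every global solution must have an installed entry, and each such entry must have a valid port) are assumptions about what "specified correctly" means here.

```python
import xml.etree.ElementTree as ET

# Shortened sample using the element names from the configuration above.
SAMPLE = """<EndpointConfig>
  <InstalledSolutions>
    <Solution><id>100</id><port>48655</port></Solution>
    <Solution><id>102</id><port>48651</port></Solution>
  </InstalledSolutions>
  <GlobalSolutions>
    <solution><id>100</id><order>0</order></solution>
    <solution><id>102</id><order>10000</order></solution>
  </GlobalSolutions>
</EndpointConfig>"""


def check_endpoint_config(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    # Map each installed solution ID to its port text.
    installed = {}
    for sol in root.findall("./InstalledSolutions/Solution"):
        installed[sol.findtext("id", "").strip()] = sol.findtext("port", "").strip()

    problems = []
    for sol in root.findall("./GlobalSolutions/solution"):
        sid = sol.findtext("id", "").strip()
        if sid not in installed:
            problems.append(f"global solution {sid} has no installed solution entry")
            continue
        port = installed[sid]
        if not (port.isdigit() and 0 < int(port) < 65536):
            problems.append(f"solution {sid} has an invalid port: {port!r}")
    return problems


if __name__ == "__main__":
    print(check_endpoint_config(SAMPLE) or "no problems found")
```

The script only flags structural mismatches; whether a given port or solution ID is the right one for the partner solution still has to be confirmed against the partner's documentation.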
The thin agent logs are on the ESXi host, as part of the vCenter log bundle. The log path
is /vmfs/volumes/<datastore>/<vmname>/vmware.log. For
example: /vmfs/volumes/5978d759-56c31014-53b6-1866abaace386/Windows10-(64-bit)/vmware.log
Thin agent messages follow the format of <timestamp> <VM Name><Process Name><[PID]>:
<message>.
In the log example below, Guest: vnet or Guest: vsep indicates log messages related to the respective
GI drivers, followed by debug messages.
For example:
This procedure requires you to modify the Windows registry. Before you modify the registry, make sure
that you back it up. For more information on backing up and restoring the registry, see the
Microsoft Knowledge Base article 136393.
1 Click Start > Run. Enter regedit, and click OK. The Registry Editor window opens. For more
information, see the Microsoft Knowledge Base article 256986.
3 Under the newly created parameters key, create these DWORDs. Ensure that hexadecimal is
selected when putting in these values:
Name: log_dest
Type: DWORD
Value: 0x2
Name: log_level
Type: DWORD
Value: 0x10
Audit 0x1
Error 0x2
Warn 0x4
Info 0x8
Debug 0x10
4 Open a command prompt as an administrator. Run these commands to unload and reload the vShield
Endpoint filesystem mini driver:
You can find the log entries in the vmware.log file located in the virtual machine.
This procedure requires you to modify the Windows registry. Before you modify the registry, make sure
that you back it up. For more information on backing up and restoring the registry, see the
Microsoft Knowledge Base article 136393.
1 Click Start > Run. Enter regedit, and click OK. The Registry Editor window opens. For more
information, see the Microsoft Knowledge Base article 256986.
Alternatively, you can set the log_dest registry setting to DWORD:0x000000002, in which case the driver
logs are written to the vmware.log file, which is located in the corresponding virtual machine folder on the
ESXi host.
1 On Windows XP and Windows Server 2003, create a tools config file if it doesn’t exist in the
following path: C:\Documents and Settings\All Users\Application Data\VMware\VMware
Tools\tools.conf.
2 On Windows Vista, Windows 7, and Windows Server 2008, create a tools config file if it doesn’t
exist in the following path: C:\ProgramData\VMware\VMware Tools\tools.conf
3 Add these lines in the tools.conf file to enable UMC component logging.
[logging]
log = true
vsep.level = debug
vsep.handler = vmx
With the vsep.handler = vmx setting, the UMC component logs into the vmware.log file, which is
located in the corresponding virtual machine folder on the ESXi host.
With the following settings, the UMC component logs are written to the specified log file.
vsep.handler = file
vsep.data = c:/path/to/vsep.log
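The two logging variants above differ only in the handler and the optional file path, so the [logging] section can be generated rather than typed by hand. The following sketch uses Python's standard configparser module to render that section as text. The function name render_tools_conf is hypothetical, and the sketch assumes that the tools.conf file contains no other sections that must be preserved; it returns the text and does not write any file.

```python
import configparser
import io


def render_tools_conf(handler="vmx", log_file=None):
    """Render the [logging] section for UMC component logging.

    handler is "vmx" (log to vmware.log) or "file" (log to log_file).
    """
    if handler not in ("vmx", "file"):
        raise ValueError("handler must be 'vmx' or 'file'")
    if handler == "file" and not log_file:
        raise ValueError("log_file is required when handler is 'file'")

    # interpolation=None keeps any '%' characters in paths literal.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # preserve the key spelling exactly
    cfg["logging"] = {
        "log": "true",
        "vsep.level": "debug",
        "vsep.handler": handler,
    }
    if handler == "file":
        cfg["logging"]["vsep.data"] = log_file

    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(render_tools_conf())
    print(render_tools_conf("file", "c:/path/to/vsep.log"))
```

If the tools.conf file already exists with other settings, merge the generated lines into it manually instead of overwriting the file.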
/var/log/syslog
var/run/syslog
EPSecLib messages follow the format of <timestamp> <VM Name><Process Name><[PID]>: <message>
In the example below, [ERROR] is the type of message and (EPSEC) marks the messages that are
specific to Guest Introspection.
For example:
Collecting Logs
To enable debug logging for the EPSec library, which is a component inside GI SVM:
1 Log in to the GI SVM by obtaining the console password from NSX Manager.
ENABLE_DEBUG=TRUE
ENABLE_SUPPORT=TRUE
This enables debug logging for EPSecLib on the GI SVM. The debug logs can be found
in /var/log/messages. This applies to NSX for vSphere 6.2.x and 6.3.x. Because the debug
setting can flood the vmware.log file to the point that logging is throttled, we recommend that you
disable the debug mode as soon as you have collected all the required information.
GI SVM Logs
Before you capture logs, determine the Host ID, or Host MOID:
n Run the show cluster all and show cluster <cluster ID> commands in the NSX Manager.
For example:
Datacenter: RegionA01
Cluster: RegionA01-COMP01
No. Host Name Host Id Installation Status
1 esx-01a.corp.local host-29 Ready
2 esx-02a.corp.local host-31 Ready
GET https://nsxmanager/api/1.0/usvmlogging/host-##/com.vmware.vshield.usvm
GET https://nsxmanager/api/1.0/usvmlogging/host-##/root
POST https://nsxmanager/api/1.0/usvmlogging/host-##/changelevel
GET https://NSXMGR_IP/api/1.0/hosts/host.###/techsupportlogs
Note that this command generates GI SVM logs and saves them as the techsupportlogs.log.gz file.
Because the debug setting can flood the vmware.log file to the point that logging is throttled, we
recommend that you disable the debug mode as soon as you have collected all the required information.
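The host ID lookup and the logging URLs above can be combined in a short helper. The sketch below parses output shaped like the show cluster example in this section and fills the host ID into the three usvmlogging URLs. The function names are hypothetical, the row pattern is an assumption based on that one example, and the script only builds URLs; it does not send requests or define a request body for the POST call.

```python
import re

# Output shaped like the "show cluster <cluster ID>" example above.
SAMPLE = """\
Datacenter: RegionA01
Cluster: RegionA01-COMP01
No.  Host Name            Host Id   Installation Status
1    esx-01a.corp.local   host-29   Ready
2    esx-02a.corp.local   host-31   Ready
"""

# Assumed row layout: number, host name, host ID, installation status.
ROW = re.compile(r"^\s*\d+\s+(\S+)\s+(host-\d+)\s+\S")


def parse_hosts(text):
    """Return a dict that maps host name to host ID."""
    hosts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            hosts[match.group(1)] = match.group(2)
    return hosts


def usvm_logging_urls(manager, host_id):
    """Return (method, URL) pairs for the usvmlogging calls listed above."""
    base = f"https://{manager}/api/1.0/usvmlogging/{host_id}"
    return [
        ("GET", f"{base}/com.vmware.vshield.usvm"),
        ("GET", f"{base}/root"),
        ("POST", f"{base}/changelevel"),
    ]


if __name__ == "__main__":
    for name, host_id in parse_hosts(SAMPLE).items():
        print(name, host_id)
        for method, url in usvm_logging_urls("nsxmanager", host_id):
            print("  ", method, url)
```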
1 Determine if NSX Guest Introspection is used in the customer environment. If it is not, remove the
Guest Introspection service for the virtual machine, and confirm the issue is resolved.
a ESXi build version - Run the command uname -a on the ESXi host or click on a host in the
vSphere Web Client and look for the build number at the top of the right-hand pane.
Build number
------------------
Ubuntu
dpkg -l | grep vmware-nsx-gi-file
SLES12 and RHEL7
rpm -qa | grep vmware-nsx-gi-file
3 VMware NSX® for vSphere® version, and the following:
n EPSec Library version number used by the partner solution: Log in to the GI SVM and run
# strings <path to EPSec library>/libEPSec.so | grep BUILD
4 ESX GI Module (MUX) version - run the command esxcli software vib list | grep epsec-mux.
6 Collect ESXi host logs. For more information, see Collecting diagnostic information for VMware
ESX/ESXi (653).
7 Collect service virtual machine (GI SVM) logs from the partner solution. Reach out to your partner for
more details on GI SVM log collection.
8 Collect a suspend state file while the problem is occurring. See Suspending a virtual machine on
ESX/ESXi to collect diagnostic information (2005831).
9 After collecting the data, compare the compatibility of the vSphere components. For more information,
see the VMware Product Interoperability Matrices.
1 Check the compatibility of all the components involved. Compatibility is one of the main issues with
Endpoint. You need the build numbers for ESXi, vCenter Server, NSX Manager, and whichever
security solution you have chosen (Trend Micro, McAfee, Kaspersky, Symantec, and so on). After this data
has been collected, compare the compatibility of the vSphere components. For more information, see
the VMware Product Interoperability Matrices.
3 Verify that the thin agent is running by using the service vsepd status command. After you run this
command, you should see the vsep service in the running state.
4 If you believe that the thin agent is causing a performance issue with the system, stop the service by
running the service vsepd stop command.
5 Then perform a test to get a baseline. You can then start the vsep service by running the
service vsepd start command and perform another test.
Note Enabling full logging may result in heavy log activity flooding the vmware.log file, causing
it to potentially grow to be very large. Disable full logging as soon as possible.
2 Ensure that VMware Tools™ is up-to-date. If you see that only a particular virtual machine is
affected, see Installing and upgrading VMware Tools in vSphere (2004754).
3 Verify that the thin agent is loaded by running the PowerShell command fltmc.
After you run this command, you should see the name vsepflt in the list of drivers. If the driver is
not loaded, you should be able to load the driver with the fltmc load vsepflt command.
4 If the thin agent is causing a performance issue with the system, unload the driver with this
command: fltmc unload vsepflt.
Next, perform a test to get a baseline. You can then load the driver and perform another test by
running this command: fltmc load vsepflt
If you do verify that there is a performance problem with the Thin agent, see Slow VMs after
upgrading VMware tools in NSX and vCloud Networking and Security (2144236).
5 If you are not using Network Introspection, remove or disable this driver.
Network Introspection can also be removed through the Modify VMware Tools installer:
e Find NSX File Introspection. There should be a sub folder just for Network Introspection.
6 Enable debug logging for the thin agent. For more information, see Guest Introspection Logs. All
debugging information is configured to log to the vmware.log file for that virtual machine.
7 Review the file scans of the thin agent by reviewing the procmon logs. For more information, see
Troubleshooting vShield Endpoint performance issues with anti-virus software (2094239).
a ESXi build version - Run the command uname -a on the ESXi host or click on a host in the
vSphere Web Client and look for the build number at the top of the right-hand pane.
Build number
------------------
Ubuntu
dpkg -l | grep vmware-nsx-gi-file
SLES12 and RHEL7
rpm -qa | grep vmware-nsx-gi-file
3 VMware NSX® for vSphere® version, and the following:
n EPSec Library version number used by the partner solution: Log in to the SVM and run
# strings <path to EPSec library>/libEPSec.so | grep BUILD
4 ESX GI Module (MUX) version - run the command esxcli software vib list | grep epsec-mux.
6 Collect ESXi host logs. For more information, see Collecting diagnostic information for VMware
ESX/ESXi (653).
7 Collect service virtual machine (SVM) logs from the partner solution. Reach out to your partner for
more details on SVM log collection.
8 Collect a suspend state file while the problem is occurring. See Suspending a virtual machine on
ESX/ESXi to collect diagnostic information (2005831).
# file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/sbin/vsep'
1 Check whether the service is running on the ESXi host by running the
# /etc/init.d/vShield-Endpoint-Mux status command:
For example:
# /etc/init.d/vShield-Endpoint-Mux status
vShield-Endpoint-Mux is running
2 If you see that the service is not running, restart it or start it with this command:
/etc/init.d/vShield-Endpoint-Mux start
or
/etc/init.d/vShield-Endpoint-Mux restart
Note that it is safe to restart this service during production hours as it does not have any great impact,
and restarts in a couple of seconds.
3 To get a better idea of what the ESX GI Module is doing or check the communication status, you can
check the logs on the ESXi host. ESX GI Module logs are written to the host /var/log/syslog file.
This is also included in the ESXi host support logs.
For more information, see Collecting diagnostic information for ESX/ESXi hosts and vCenter Server
using the vSphere Web Client (2032892)
4 The default logging option for ESX GI Module is info and can be raised to debug to gather more
information:
5 Reinstalling the ESX GI Module can also fix many issues, especially if the wrong version is
installed, or if the ESXi host already had Endpoint installed on it before it was brought into the
environment. In these cases, the module must be removed and reinstalled.
To remove the VIB, run this command: esxcli software vib remove -n epsec-mux
6 If you run into issues with the VIB installation, check the /var/log/esxupdate.log file on the ESXi host.
This log shows the clearest information about why the driver was not installed successfully. This
is a common issue with ESX GI Module installation. For more information, see Installing NSX
Guest Introspection services (ESX GI Module VIB) on the ESXi host fails in VMware NSX for vSphere
6.x (2135278).
7 To check for a corrupt ESXi image look for a message similar to this:
8 To verify that the image is corrupted run the command cd /vmfs/volumes on the ESXi host.
a Search for the imgdb.tgz file by running this command: find * | grep imgdb.tgz.
0ca01e7f-cc1ea1af-bda0-1fe646c5ceea/imgdb.tgz or edbf587b-da2add08-3185-3113649d5262/imgdb.tgz
For example:
If the size of one imgdb.tgz file is far greater than the other, or if one of the files is only a
couple of bytes, it indicates that the file is corrupt. The only supported way to resolve this is to
reinstall ESXi on that particular ESXi host.
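The size comparison in step 8 can be sketched as a small script that walks a directory tree, collects the size of every imgdb.tgz file, and flags a suspicious combination. The thresholds are assumptions chosen only for illustration: "a couple of bytes" is treated as fewer than 64 bytes, and "far greater" as more than ten times the smallest file. The function names are hypothetical, and the script is meant for a copy of the directory listing on a workstation, not as a supported diagnostic tool.

```python
import os


def find_imgdb(root):
    """Return a dict that maps each imgdb.tgz path under root to its size in bytes."""
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if "imgdb.tgz" in filenames:
            path = os.path.join(dirpath, "imgdb.tgz")
            found[path] = os.path.getsize(path)
    return found


def looks_corrupt(sizes, tiny_bytes=64, max_ratio=10.0):
    """Return True if the sizes match either symptom described above.

    tiny_bytes and max_ratio are assumed thresholds, not documented values.
    """
    if not sizes:
        return False
    values = list(sizes.values())
    smallest, largest = min(values), max(values)
    if smallest < tiny_bytes:
        return True  # one file is only a few bytes
    return largest > smallest * max_ratio  # one file is far larger than the other


if __name__ == "__main__":
    sizes = find_imgdb("/vmfs/volumes")
    for path, size in sorted(sizes.items()):
        print(f"{size:>12}  {path}")
    print("possible corruption:", looks_corrupt(sizes))
```

A True result is only a hint to inspect the files more closely; the decision to reinstall ESXi should be based on the log messages and the actual file listing.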
Troubleshooting EPSecLib
EPSecLib
The NSX Manager handles the deployment of this virtual machine. In the past (with vShield), the
third-party SVA solution handled the deployment. That solution now connects to the NSX Manager, and
the NSX Manager handles the deployment of this SVA. If there are alarms on the SVAs in the environment,
redeploy them through the NSX Manager.
n No configuration is lost, because it is all stored inside the NSX Manager.
n NSX relies on EAM for deploying and monitoring VIBs and SVMs, such as the SVA, on the host.
n The Install status in the NSX user interface (UI) can only tell if the VIBs are installed, or if the SVM is
powered on.
n The Service status in the NSX UI indicates if the functionality in the virtual machine is working.
SVA deployment and relationship between NSX and vCenter Server Process
1 When the Cluster is selected to be prepared for Endpoint, an Agency is created on EAM to deploy the
SVA.
2 EAM then deploys the OVF to the ESXi host with the agency information it created.
5 NSX Manager communicates to the Partner SVA Solution Manager that the virtual machine was
powered on and registered.
7 Partner SVA Solution Manager sends an event to NSX to indicate that the service inside the SVA
virtual machine is up and running.
8 If you are having an issue with the SVA, there are two places you can look at the logs. You can check
the EAM logs, as EAM handles the deployment of these virtual machines. For more information, see
Collecting diagnostic information for VMware vCenter Server 4.x, 5.x and 6.0 (1011641). Alternatively,
look at the SVA logs.
9 If there is a problem with the SVA deployment, it is possible that there is an issue with EAM and the
communication to NSX Manager. You can check the EAM logs, and the simplest thing to do is to
restart the EAM Service. For more information, see Host Preparation.
10 If all of the above seems to be working, but you want to test the Endpoint functionality, you can test
this with an Eicar Test file:
n Create a new text file with any name. For example: eicar.test.
X5O!P%@AP[4\PZX54(P^)7CC)7}$EICAR-STANDARD-ANTIVIRUS-TEST-FILE!$H+H*
n Save the file. After you save it, you should see that the file is deleted. This verifies that the Endpoint
solution is working. For more information, see Eicar.