Intel 40G Card Controller Internals


INTEL 40GBE ETHERNET CONTROLLER

M Jay & Helin Zhang - Intel


DPDK US Summit - San Jose - 2016

About this Document

The performance measurement and analysis of an embedded platform for
communication and security processing can be very challenging due to the
diverse applications and workload inherent in the platform. The Internet of
Things Group (IoTG) and Network Platform Group (NPG) are dedicated to
performing lab measurements which will assist customers in understanding
the performance of combinations of Intel architecture microprocessors and
chipsets.

This document publishes a set of indicative performance data for selected
Intel processors and chipsets. However, the data should be regarded as
reference material only and the reader is reminded of the important
Disclaimers that appear in this document.

Intel, Intel Core and the Intel logo are trademarks or registered trademarks of
Intel Corporation or its subsidiaries in the United States and other countries.

Copyright Intel Corporation 2016. All rights reserved.


* Other names and brands may be claimed as the property of
others.

Disclaimers

By using this document, in addition to any agreements you have with Intel, you accept the terms set forth below.

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim
thereafter drafted which includes subject matter disclosed herein.

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU
PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES,
SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES
AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY
WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR
WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any
features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or
incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications.
Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go
to: http://www.intel.com/design/literature.htm


Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel
microprocessors for optimizations that are not unique to Intel microprocessors. These
optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations.
Intel does not guarantee the availability, functionality, or effectiveness of any
optimization on microprocessors not manufactured by Intel.
Microprocessor-dependent optimizations in this product are intended for use with
Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are
reserved for Intel microprocessors. Please refer to the applicable product User and
Reference Guides for more information regarding the specific instruction sets covered
by this notice.

Notice revision #20110804


Flexible Packet Processing XL710

Server virtualization: VMDq for emulated path; SR-IOV for direct assignment

Network virtualization: Overlay stateless offloads for VXLAN, NVGRE, VXLAN GRE

Flexible: Add new features after production by upgrading firmware

Intelligent load distribution for high performance traffic flows: Flow Director

Virtual Bridging: Support that delivers control & management of virtual I/O, both host-side and switch-side

XL710 Internals

Helin

Block Diagram - XL710/X710

Classification XL710 Vs 82599

NIC Anatomy

Classification

Customer Usage Models- Requirements


M Jay

What Issues Do You See With the Traditional 3-Tier Data Center Network?

What Scaling Problems Do You See?


Issues With Traditional 3-Tier Enterprise Data Center Network


2) Frame header processing at very high bandwidth adds more congestion to the network.

Larger Switches @ high bandwidth + L3 features => Expensive

40 Gig Advantages - Flat Data Center Networks

2) Smaller Table Sizes with Tunneling Label (Cost effective): Compared to ToR switches, core switches need smaller tables, since they can simply use tables containing the addresses of the ToR switches in the network, based on some sort of tunneling label.

3) Simplify Frame Processing (Cost effective): Frame processing can be simplified, since tunneling functions can be moved to the edge of the network (i.e., the ToR switch or vSwitch) and need not be done by the core switch.

Low Latency, High Quality Network + Simplified Core Switch => Cost Effective
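
As a rough illustration of the table-size argument above, the sketch below compares how many addresses a core switch would have to hold with and without tunneling at the edge. All counts (ToR switches, servers per rack, VMs per server) are hypothetical numbers chosen for the example, not figures from this deck.

```python
def core_table_entries(num_tor, servers_per_tor, vms_per_server, tunneled):
    """Return the number of addresses a core switch must hold.

    With tunneling at the edge, the core forwards on the outer header only,
    so one entry per ToR switch is enough. Without tunneling, the core must
    learn every VM address behind every ToR switch.
    """
    if tunneled:
        return num_tor
    return num_tor * servers_per_tor * vms_per_server


# Hypothetical data center: 100 racks, 40 servers per rack, 20 VMs per server.
flat = core_table_entries(100, 40, 20, tunneled=True)
traditional = core_table_entries(100, 40, 20, tunneled=False)
print("entries with tunneling:   ", flat)
print("entries without tunneling:", traditional)
```

The gap grows with VM density, which is why the simplified core switch is described as the cost-effective option.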

Possible Tunneling Locations

Tunneling at vSwitch

Tunneling at NIC

Tunneling at ToR Switch

VEPA Consistent Treatment Of All Network Traffic

VEPA An Overview

VEPA XL710

VXLAN Packet Flow

Question: What is UDP Source Port Used For?

NVGRE

VXLAN: UDP + VXLAN header. NVGRE: only GRE header.

VXLAN: inner L2 header contains VLAN tag. NVGRE: no VLAN tag in inner L2 header.

VXLAN: UDP port for hash. NVGRE: reserved 8 bits (random for uniform distribution) + VSID for hash.
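
One way to picture the "UDP port for hash" entry: the encapsulating endpoint derives the outer UDP source port from a hash of the inner flow, so that switches hashing on the outer header spread different inner flows across paths while keeping each flow on one path. The sketch below is a minimal illustration only. CRC32 is a stand-in hash, not what the XL710 or any particular vSwitch uses, and the function name and field choice are assumptions. The 49152-65535 range follows the recommendation in RFC 7348.

```python
import zlib

PORT_BASE = 49152          # start of the dynamic/private port range
PORT_SPAN = 65536 - 49152  # 16384 usable source ports


def vxlan_source_port(src_ip, dst_ip, proto, src_port, dst_port):
    """Map an inner flow 5-tuple to an outer UDP source port.

    The same inner flow always yields the same port, so packet order
    within a flow is preserved across equal-cost paths.
    """
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return PORT_BASE + (zlib.crc32(key) % PORT_SPAN)


# Hypothetical inner flows between two VMs.
flow_a = ("10.0.0.1", "10.0.0.2", 6, 33000, 80)
flow_b = ("10.0.0.1", "10.0.0.2", 6, 33001, 80)
print(vxlan_source_port(*flow_a), vxlan_source_port(*flow_b))
```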

40 GbE - Step by Step Walk Through

1. What is important in my h/w platform?
   Requirement: Ensure all 4 memory channels are populated, and also use -n 4 on the command line. Use "dmidecode -t memory" to check the memory status.
   * Note: This is one important element affecting performance.
   Reference: Since this is very important, please procure additional memory and populate all the memory channels.

2. Where should the NIC be plugged in, and why?
   Requirement: Use PCIe Gen3 slots, such as Gen3 x8 or Gen3 x16.
   Reference: Because PCIe Gen2 slots can't provide enough bandwidth for 2x10G and above.

NUMA considerations

4. What needs to be updated in the NIC?
   Requirement: Make sure each NIC has been flashed with the latest version of NVM/firmware.
   Reference: Go to downloadcenter.intel.com and search for XL710 NVM Update. It takes you here:
   https://downloadcenter.intel.com/search?keyword=NVM+Update+Utility+for+Intel%C2%AE+Ethernet+Converged+Network+Adapter+XL710+%26+X710+Series
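
To put rough numbers on the PCIe slot requirement, the sketch below computes the raw per-direction link rate for a slot from its generation and lane count. The transfer rates and encodings are the standard PCIe values (Gen2: 5 GT/s with 8b/10b, Gen3: 8 GT/s with 128b/130b). These are raw link rates only; packet (TLP) headers and descriptor traffic reduce the usable throughput further, so treat the results as upper bounds rather than achievable figures.

```python
# (transfer rate in GT/s per lane, encoding efficiency) for each generation
PCIE_GEN = {
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_raw_gbps(gen, lanes):
    """Raw link rate in Gb/s, per direction, before protocol overhead."""
    rate, efficiency = PCIE_GEN[gen]
    return rate * efficiency * lanes


gen2_x8 = pcie_raw_gbps(2, 8)
gen3_x8 = pcie_raw_gbps(3, 8)
for name, rate in (("Gen2 x8", gen2_x8), ("Gen3 x8", gen3_x8)):
    verdict = "above" if rate > 40 else "below"
    print(f"{name}: {rate:.1f} Gb/s raw, {verdict} one 40 GbE port")
```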

40 GbE - Step by Step Walk Through

5. BIOS settings
   Requirement: Refer to BIOS Settings.

6. Linux System Essentials
   Requirement: Real Time Nature of the Process, cgroup

7. Huge Page
   Requirement: 1) Size of the FIB Table, 2) Locality challenges of packets
   Reference: TLB Miss, Page Walk

8. Scheduler
   Requirement: Isolcpus option under title Grub Parameters - Essential Requirement
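
The link between huge pages, TLB misses and page walks comes down to how many page mappings a packet-buffer pool needs: fewer, larger pages mean fewer TLB entries to cover the same memory. The sketch below works through that arithmetic. The 1 GiB pool size is a hypothetical example, not a value from this deck.

```python
KIB, MIB, GIB = 1024, 1024 ** 2, 1024 ** 3


def pages_needed(region_bytes, page_bytes):
    """Number of pages (and so page-table mappings) covering a region."""
    return -(-region_bytes // page_bytes)  # ceiling division


pool = 1 * GIB  # hypothetical packet-buffer pool size
for label, page in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label} pages: {pages_needed(pool, page)} mappings")
```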

BIOS Settings (Menu: Advanced)

Each entry gives the BIOS setting, the required setting, and the BIOS default.

CPU Configuration -> Advanced Power Management Configuration
  Power Technology: required Disable; default Custom
  -> CPU P State Control
    EIST (P-States): required Disable; default Enable
    Turbo Mode: required Disable; default Enable
    P-State Coordination: required HW_ALL; default HW_ALL
  -> CPU C State Control
    Turbo Mode: required Disable; default Enable
    CPU C3 Report: required Disable; default Enable
    CPU C6 Report: required Disable; default Enable
    Package C State Limit: required [C6 (Retention)]; default [C6 (Retention)]
    Enhanced Halt State (C1E): required Disable; default Enable

Chipset Configuration
  -> North Bridge -> QPI Configuration
    Isoc Mode: required Disable; default Disable
    COD Enable: required Disable; default Auto
    Early Snoop: required Disable; default Auto
  -> North Bridge -> Memory Configuration
    Enforce POR: required Disable; default Auto
    Memory Frequency: required 2133; default Auto
    DRAM RAPL Baseline: required Disable; default Auto
  -> North Bridge -> IIO Configuration
    Intel VT for Directed I/O (VT-d): required Disable; default Enable

PCIe/PCI/PnP Configuration
  ASPM: required Disable; default Disable

40 GbE - Step by Step Walk Through

9. For Intel 40 Gig NICs, special configurations should be set before compiling. This is very important.
   Requirement: For at least DPDK release 1.8, 2.0 and 2.1, in <dpdk_folder>/config/common_linuxapp set CONFIG_RTE_PCI_CONFIG=y and CONFIG_RTE_PCI_EXTENDED_TAG=on [this step is not needed from R16.07].
   Reference: This helps increase the efficiency of PCIe by increasing the number of outstanding transactions from 36 to 256.

10. Phase 1: Running the l3fwd application & command to run for testing 2 x 10 G. Please run only l3fwd to start with, to have a baseline performance for comparison purposes.
   Requirement: With 2 cores, 2 threads, 2 ports (with only 1 queue/port).
   Note: Please do not run the full application. Run l3fwd to benchmark your platform and configuration.
   Reference:
   ./l3fwd -c 0x3fc00 -n 4 -w 05:00.0 -w 05:00.1 -- -p 0x3 --config '(0,0,10),(1,0,11)'
   * Note: config (port, queue, core ID) is the format above.

11. In l2fwd, #define NB_MBUF 16384 [this increases the buffer count to 16K from 8K] in the file examples/l2fwd/main.c
   Requirement: In examples/l2fwd/main.c, change the values of RTE_TEST_RX_DESC_DEFAULT and RTE_TEST_TX_DESC_DEFAULT both to 1024.
   Reference: L2fwd. After making the changes, save. Build l2fwd with make.

12. Phase 2: Running the l3fwd application & command to run for testing 4 x 10 G
   Requirement: With 4 cores, 4 threads, 4 ports (with only 1 queue/port).
   Reference:
   ./l3fwd -c 0x3fc00 -n 4 -- -p 0xf --config '(0,0,10),(1,0,11),(2,0,12),(3,0,13)'
   * Note: config (port, queue, core ID) is the format above.

Single port x 40 Gig configuration: Use 2 Cores

[Figure: one port with Queue 1 and Queue 2, served by Core 1 and Core 2.]
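
The hex masks in the l3fwd commands above are easier to sanity-check once decoded. The sketch below expands a -c core mask and a -p port mask into ID lists, parses a --config string in the (port, queue, core ID) format, and checks that every tuple refers to an enabled port and core. The helper names are made up for this illustration and are not part of DPDK.

```python
import re


def mask_to_ids(mask):
    """Return the set bit positions of a hex mask such as a -c or -p value."""
    value = int(mask, 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


def parse_config(config):
    """Parse '(port,queue,core),...' into a list of integer tuples."""
    triples = re.findall(r"\((\d+),(\d+),(\d+)\)", config)
    return [tuple(int(field) for field in triple) for triple in triples]


def config_is_consistent(core_mask, port_mask, config):
    """True if every (port, queue, core) uses an enabled port and core."""
    cores, ports = mask_to_ids(core_mask), mask_to_ids(port_mask)
    return all(p in ports and c in cores for p, _, c in parse_config(config))


cores = mask_to_ids("0x3fc00")
ports = mask_to_ids("0x3")
mapping = parse_config("(0,0,10),(1,0,11)")
print("cores:", cores)
print("ports:", ports)
print("consistent:", config_is_consistent("0x3fc00", "0x3", "(0,0,10),(1,0,11)"))
```

A check like this catches the common mistake of naming a core in --config that the -c mask does not enable.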

System Configuration


BIOS Tuning Settings


Latency & Throughput How To Improve?

Latency Hiding Prefetch

Throughput - Bulk

Admin Queues DOs and Don'ts

XL710 Admin Queue Versus 82599 Mailbox
Run time changing MTU? - Think Again. Why?
Run time Resetting VFs from PF?

Functional Performance Measurement for Communications: Layer 3 Forwarding using 10GbE and 40GbE

Test Setup for 10G Cards

[Figure: Device Under Test (DUT) with a 14 core Intel Xeon E5-2658 v4 processor, DDR4-2400 ECC 1Rx8 memory, Lynx point, and an X710-DA4 adapter whose 4x 10GbE ports connect to an Ixia* 10 Gigabit Ethernet Traffic Generator.]

Test Setup for 40G Cards

[Figure: Device Under Test (DUT) with a 14 core Intel Xeon E5-2658 v4 processor, DDR4-2400 ECC 1Rx8 memory, Lynx point, and two XL710-DA2 adapters in PCI-E Gen3 x8 Slot 0 and Slot 1, whose 40GbE ports connect (10G/40G) to an Ixia* 10/40 Gigabit Ethernet Traffic Generator.]

Test Setup -Cont.

DUT:

Intel Xeon E5-2658 v4 processor, 35MB L3 cache
Super Micro* Platform (X10DRX)
DDR4 2400 MHz, 4 x 1Rx4 registered ECC 16GB (total 64GB), 4 memory channels per socket configuration, 1 DIMM per channel
1 x Intel X710-DA4-FH PCI-E Gen3 x8 Quad Port Ethernet Controller (NVM: 5p04)
2 x Intel XL710-DA2 PCI-E Gen3 x8 Dual Port 40GbE Ethernet Controller (NVM: 5p04)

IXIA* Traffic Parameters:

Acceptable Frame Loss: 0.00001%
Resolution: 0.1
Traffic Duration: 20 Seconds

Software:

BIOS version: Version: 2.0 & Date: 12/17/2015
Operating system: Fedora 23
Kernel version: 4.2.3-300.fc23.x86_64
IxNetwork*: 7.40 EA
DPDK version: 16.04
DPDK L3fwd example application on Linux user space (LPM for route lookup)
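
To get a feel for what the traffic parameters above mean in frames, the sketch below computes the theoretical line rate in packets per second and how many frames the acceptable-loss threshold allows over one test run. The 64-byte frame size is an assumption for the example, since the deck does not state the frame sizes used. The 20 bytes of per-frame wire overhead are the standard Ethernet preamble, start-of-frame delimiter and inter-frame gap.

```python
WIRE_OVERHEAD_BYTES = 20  # 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap


def line_rate_pps(link_gbps, frame_bytes):
    """Theoretical maximum frames per second on an Ethernet link."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire


def allowed_lost_frames(link_gbps, frame_bytes, seconds, loss_percent):
    """Frames that may be dropped while staying within the loss threshold."""
    sent = line_rate_pps(link_gbps, frame_bytes) * seconds
    return sent * loss_percent / 100


for gbps in (10, 40):
    pps = line_rate_pps(gbps, 64)
    lost = allowed_lost_frames(gbps, 64, seconds=20, loss_percent=0.00001)
    print(f"{gbps} GbE, 64B frames: {pps / 1e6:.2f} Mpps, "
          f"up to {lost:.0f} lost frames in 20 s")
```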


.hw_ip_checksum = 0, /**< IP checksum offload disabled */

#define RTE_TEST_RX_DESC_DEFAULT 1024
#define RTE_TEST_TX_DESC_DEFAULT 1024

Flow Traffic Configuration - 4 x 10G Ports

[Figure: IA platform (device under test), Socket 0, with a 10 Gigabit Ethernet X710 in Slot 0; 256 flows on each of the four ports.]

4 x 10G ports with 256 bidirectional flows

2 port configuration with 256 bi-directional flows per port:

Port 0 -> Port 1
Port 1 -> Port 0
Port 2 -> Port 3
Port 3 -> Port 2

Flow Traffic Configuration - 2 x 40G Ports

[Figure: IA platform (device under test), Socket 0, with one 40 Gigabit Ethernet XL710 in Slot 0 and one in Slot 1; 256 flows on each port.]

2 dual port 40G with 256 bidirectional flows

2 port configuration with 256 bi-directional flows per port:

Port 0 -> Port 1
Port 1 -> Port 0

Polling Affinity for Ethernet Queues- 4x10G ports

2 ports (1 Core / 1 Thread / 1 Queue)

CPU1 (Core 1 SMT 0) polls port 0
CPU1 (Core 1 SMT 0) polls port 1
CPU1 (Core 1 SMT 0) polls port 2
CPU1 (Core 1 SMT 0) polls port 3

2 ports (2 Core / 2 Threads / 1 Queue)

CPU1 (Core 1 SMT 0) polls port 0
CPU1 (Core 2 SMT 0) polls port 1
CPU1 (Core 1 SMT 0) polls port 2
CPU1 (Core 2 SMT 0) polls port 3

2 ports (1 Core / 2 Threads / 1 Queue)

CPU1 (Core 1 SMT 0) polls port 0
CPU2 (Core 15 SMT 1) polls port 1
CPU1 (Core 1 SMT 0) polls port 2
CPU2 (Core 15 SMT 1) polls port 3

Each polling core has 100% CPU utilization.
Remaining cores are idle.

Polling Affinity for Ethernet Queues-2x40G ports

2 ports (1 Core / 1 Thread / 2 Queues)

CPU1 (Core 1 SMT 0) polls port 0 queue 0
CPU1 (Core 1 SMT 0) polls port 0 queue 1
CPU1 (Core 1 SMT 0) polls port 1 queue 0
CPU1 (Core 1 SMT 0) polls port 1 queue 1

2 ports (1 Core / 2 Threads / 2 Queues)

CPU1 (Core 1 SMT 0) polls port 0 queue 0
CPU2 (Core 15 SMT 1) polls port 0 queue 1
CPU1 (Core 1 SMT 0) polls port 1 queue 0
CPU2 (Core 15 SMT 1) polls port 1 queue 1

2 ports (2 Core / 4 Threads / 2 Queues)

CPU1 (Core 1 SMT 0) polls port 0 queue 0
CPU2 (Core 15 SMT 1) polls port 0 queue 1
CPU1 (Core 2 SMT 0) polls port 1 queue 0
CPU2 (Core 16 SMT 1) polls port 1 queue 1

2 ports (4 Core / 4 Threads / 2 Queues)

CPU1 (Core 1 SMT 0) polls port 0 queue 0
CPU1 (Core 2 SMT 0) polls port 0 queue 1
CPU1 (Core 3 SMT 0) polls port 1 queue 0
CPU1 (Core 4 SMT 0) polls port 1 queue 1

2 ports (2 Core / 2 Threads / 2 Queues)

CPU1 (Core 1 SMT 0) polls port 0 queue 0
CPU1 (Core 2 SMT 0) polls port 0 queue 1
CPU1 (Core 1 SMT 0) polls port 1 queue 0
CPU1 (Core 2 SMT 0) polls port 1 queue 1

Each polling core has 100% CPU utilization.
Remaining cores are idle.
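
The affinity lists above map directly onto l3fwd's --config argument, whose format is (port, queue, core ID). The sketch below builds that string from a list of assignments and counts how many queues each core ends up polling. Using the "Core N" labels directly as logical core IDs is an assumption made for illustration; the real lcore numbering depends on the platform's CPU topology.

```python
from collections import Counter


def build_config(assignments):
    """Format (port, queue, core) tuples as an l3fwd --config string."""
    return ",".join(f"({port},{queue},{core})" for port, queue, core in assignments)


# 2 ports, 2 queues per port, 4 cores: every queue gets its own core.
four_cores = [(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)]
# 2 ports, 2 queues per port, 2 cores: each core polls one queue on each port.
two_cores = [(0, 0, 1), (0, 1, 2), (1, 0, 1), (1, 1, 2)]

queues_per_core = Counter(core for _, _, core in two_cores)
print("--config '" + build_config(four_cores) + "'")
print("--config '" + build_config(two_cores) + "'")
print("queues per core (2-core case):", dict(queues_per_core))
```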

Reference & Acknowledgements

Cloud Networking: Understanding Cloud-Based Data Center Networks, Gary Lee.
http://cat.intel.com - Get NDA performance foils here.
DPDK Cook Book on VTune, M Jay: https://software.intel.com/en-us/articles/profile-dpdk-code-with-intel-vtune-amplifier
DST 2016: v-ISG-Fortville: Explaining Fortville Features Enabled with DPDK Rel 16.04 - Hash and Flow Director Filters, Native MPLS (Virtual), Andrey Chilikin, Eoin Walsh.
CISCO White Paper, January 2016: VXLAN Best Practices
Intel XL710/X710 Data Sheet
George for Performance setup
Rashmin foils for Virtualization
http://blog.jgriffiths.org/?p=929 - Deep Dive: How does NSX Distributed Router Work


M Jay
[email protected]

Questions?

Helin Zhang
[email protected]

Comparing XL710/X710 to Prior NIC 82599

GENEVE & NSH added after the chip is released - Flexibility !
