Lecture 3: P4→NetFPGA
[Figure: the BMv2 behavioral-model workflow. p4c-bm2-ss compiles the P4 program into test.json, which is loaded into simple_switch (BMv2). The switch pipeline consists of Parser, Ingress, Egress, and Deparser blocks, plus a packet replication engine (PRE), logging, and a debugger. Packets enter and leave through a port interface bound to veth0..n virtual interfaces in the Linux kernel, where a packet generator and a packet sniffer are attached.]
Step 1: P4 Program Compilation
[Figure: p4c-bm2-ss compiles the P4 source into test.json, which is loaded into BMv2 (Parser, Ingress, Egress, Deparser, PRE, logging). The switch ports are bound to one end of each virtual Ethernet pair (veth0/veth1, veth2/veth3, ..., veth2n/veth2n+1) created in the Linux kernel.]
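These steps can be scripted end to end. A minimal sketch, assuming the program is named test.p4 and the veth pairs (veth0/veth1, veth2/veth3, ...) have already been created:

# Sketch: compile test.p4 to test.json and launch BMv2 simple_switch with two
# ports bound to the even-numbered veth endpoints (assumed to exist already).
import subprocess

subprocess.run(["p4c-bm2-ss", "-o", "test.json", "test.p4"], check=True)

# Switch port 0 on veth0, port 1 on veth2; the odd-numbered peers stay on the
# host side for the packet generator and sniffer.
switch = subprocess.Popen(
    ["sudo", "simple_switch", "-i", "0@veth0", "-i", "1@veth2", "test.json"]
)
switch.wait()      # simple_switch runs until interrupted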
Step 4: Starting the CLI
$ simple_switch_CLI

[Figure: the BMv2 CLI. The program-independent CLI and client connect over a TCP socket (Thrift) to the program-independent control server inside the running switch, which has test.json (compiled from test.p4) loaded into its Parser / Ingress / Egress / Deparser pipeline attached to veth0..n.]
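The CLI is also scriptable: commands can be piped to simple_switch_CLI on stdin. A minimal sketch; the table and action names are hypothetical placeholders (the real names come from your test.p4), and 9090 is the default Thrift port:

# Sketch: drive the program-independent CLI from Python by piping commands on
# stdin. "my_table" / "my_action" are hypothetical placeholders.
import subprocess

commands = "\n".join([
    "table_add my_table my_action 10.0.0.1 => 1",  # match key => action params
    "table_dump my_table",
])

subprocess.run(
    ["simple_switch_CLI", "--thrift-port", "9090"],
    input=commands + "\n",
    text=True,
    check=True,
)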
Step 5: Sending and Receiving Packets
Send packets with:
• scapy
  p = Ether()/IP()/UDP()/"Payload"
  sendp(p, iface="veth0")
• Ethereal, etc.

Receive packets with:
• scapy
  sniff(iface="veth9", prn=lambda x: x.show())
• Wireshark, tshark, tcpdump

[Figure: the packet generator and packet sniffer attach to the host-side veth endpoints (veth0/veth1, veth2/veth3, ..., veth2n/veth2n+1); packets traverse the BMv2 pipeline between ports while the program-independent control server remains reachable for table updates.]
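Putting the sending and receiving sides into one runnable sketch (veth0 and veth9 mirror the slide; any host-side veth endpoints work, and the script needs privileges to open raw sockets):

#!/usr/bin/env python3
# Sketch: inject one packet into the switch and print whatever comes out on
# another port. veth0/veth9 follow the slide; adjust to your veth setup.
import time
from scapy.all import Ether, IP, UDP, sendp, AsyncSniffer

# Start capturing on the output interface before sending.
sniffer = AsyncSniffer(iface="veth9", prn=lambda x: x.show(), count=1)
sniffer.start()
time.sleep(0.5)                      # give the sniffer time to attach

p = Ether() / IP() / UDP() / "Payload"
sendp(p, iface="veth0")

sniffer.join(timeout=5)              # wait for the capture (or give up after 5 s)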
NetFPGA generations: NetFPGA-1G (2006), NetFPGA-10G (2010)

[Figure: a PC with NetFPGA. Networking software runs on a standard PC (CPU + memory); over PCI-Express it drives a hardware accelerator built with an FPGA (plus on-board memory) whose network interfaces provide 1/10/100 Gb/s links; in the figure, four 10GbE ports.]
Four elements:
• NetFPGA board
• Tools & reference designs
• Contributed projects
• Community
NetFPGA SUME board:
• FPGA: 52Mb RAM, 3 PCIe Gen. 3 hard cores
• SRAM: 3 x 9MB QDRII+, 500MHz
• PCIe Gen. 3: x8 (only), hardcore IP
• QTH-DP: 8 x 12.5Gbps serial links
• 2 x SATA connectors
• Micro-SD slot: enables standalone operation
NetFPGA reference switch design:
• Packet-based module interface
• Pluggable design
• Pipeline stages: 10GE Rx ports, Output Port Lookup, Output Queues, 10GE Tx ports
• Software side: nf0..nf3 network interfaces plus ioctl register access
NetFPGA – Host Interaction
Packet reception (NetFPGA to host):
1. Packet arrives; the forwarding table sends it to a DMA queue
2. Interrupt notifies the driver of packet arrival
3. Driver sets up and initiates a DMA transfer over the PCIe bus
4. NetFPGA transfers the packet via DMA
5. Interrupt signals completion of the DMA

Packet transmission (host to NetFPGA):
1. Software sends a packet on a network socket; the packet is passed to the driver
2. Driver sets up and initiates a DMA transfer over the PCIe bus
3. Interrupt signals completion of the DMA

Register access:
1. Software makes an ioctl call on a network socket; the ioctl is passed to the driver
2. Driver performs a PCIe memory read/write
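For the register-access path, the following is a purely illustrative Python sketch of "software makes an ioctl call on a network socket"; the ioctl request code and the address/value layout are hypothetical stand-ins, not the actual NetFPGA driver interface:

# Hypothetical sketch only: the real NetFPGA driver defines its own private
# ioctl numbers and argument layout; SIOCDEVPRIVATE and the 16s+I+I packing
# below are illustrative assumptions.
import fcntl
import socket
import struct

SIOCDEVPRIVATE = 0x89F0          # first "private" ioctl slot (driver-specific meaning)
REG_ADDR = 0x44020000            # hypothetical register address on the card

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    # Interface name plus a driver-defined payload: here a 32-bit address and a
    # 32-bit value field for the driver to fill in.
    req = struct.pack("16sII", b"nf0", REG_ADDR, 0)
    resp = fcntl.ioctl(s.fileno(), SIOCDEVPRIVATE, req)
    _, addr, value = struct.unpack("16sII", resp)
    print(f"register 0x{addr:08x} = 0x{value:08x}")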
P4→NetFPGA tools

[Figure: the generic P4 workflow applied to the NetFPGA target. A P4 program is compiled by the P4 compiler, against the P4 architecture model, into a target-specific configuration binary that is loaded into the data plane (tables and extern objects). At runtime the control plane adds/removes table entries, controls externs, and exchanges packet-in/packet-out messages over a CPU port. Here the architecture is SimpleSumeSwitch and the target is the NetFPGA SUME board.]
P4→NetFPGA Compilation Overview

[Figure: the P4 program implements the SimpleSumeSwitch architecture, which slots into the NetFPGA reference switch pipeline in front of the Output Queues.]

Flow: .p4 → Xilinx P4₁₆ compiler → .sdnet → SDNet compiler → firmware + verification environment

$ p4c-sdnet switch.p4

Key considerations:
• Throughput & latency
• Resources
• Programmability

[Figure: inside SDNet, the Parser, Ingress Match+Action, Egress Match+Action, and Deparser stages are built from parsing engines, lookup engines, and editing engines.]
SimpleSumeSwitch metadata (carried alongside the packet in the AXI-Stream tuser signal; tables and externs are configured over AXI-Lite):
• *_q_size – size of each output queue, measured in 32-byte words, sampled when the packet starts being processed by the P4 program
• src_port/dst_port – one-hot encoded
• user_metadata/digest_data – structs defined by the user
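To make the one-hot encoding of src_port/dst_port concrete, here is a small helper. The particular bit-to-port mapping (physical ports on even bits, DMA queues on odd bits) is an assumption for illustration only:

# Sketch of one-hot port encoding. The bit assignment (nf0=bit0, dma0=bit1,
# nf1=bit2, ...) is an assumed example layout, not a normative definition.
PORT_BITS = {"nf0": 0, "dma0": 1, "nf1": 2, "dma1": 3,
             "nf2": 4, "dma2": 5, "nf3": 6, "dma3": 7}

def one_hot(*ports):
    """Return the 8-bit one-hot value selecting the given ports."""
    value = 0
    for p in ports:
        value |= 1 << PORT_BITS[p]
    return value

print(f"{one_hot('nf1'):08b}")          # 00000100 -> send out nf1 only
print(f"{one_hot('nf0', 'nf2'):08b}")   # 00010001 -> multicast to nf0 and nf2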
Registers and externs:
• Implemented in HDL
• Stateless – reinitialized for each packet
• Stateful – keep state between packets
• Xilinx annotations:
  ◦ @Xilinx_MaxLatency() – maximum number of clock cycles an extern function needs to complete
  ◦ @Xilinx_ControlWidth() – size in bits of the address space to allocate to an extern function
[Figure: pipelining stateless operations. Stage 1 computes pkt.tmp = pkt.f1 + pkt.f2 and stage 2 computes pkt.f4 = pkt.tmp - pkt.f3. Because each packet carries its own copy of the fields (f1..f4, tmp), consecutive packets can occupy different stages in the same cycle: stateless operations can be pipelined.]
Stateless vs. stateful operations

Stateful operation: x = x + 1
[Figure: splitting the update across pipeline stages (pkt.tmp = x; pkt.tmp++; x = pkt.tmp) breaks under pipelining: with two back-to-back packets, both read x = 0 and both write x = 1, so after two packets x is 1 when it should be 2.]
[Figure: the fix is to perform the whole read-modify-write (x++) as a single atomic operation in one stage.]
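The failure above is easy to reproduce in a few lines of Python by replaying the per-stage operations for two back-to-back packets and comparing against an atomic update:

# Sketch: why x = x + 1 cannot be split across pipeline stages.
# Packets A and B are back to back, so B reads the state before A writes it back.

x = 0                      # switch state shared by all packets
tmp = {}                   # per-packet metadata (pkt.tmp)

tmp["A"] = x               # A: pkt.tmp = x   (reads 0)
tmp["B"] = x               # B: pkt.tmp = x   (also reads 0: A has not written back yet)
tmp["A"] += 1              # A: pkt.tmp++
tmp["B"] += 1              # B: pkt.tmp++
x = tmp["A"]               # A: x = pkt.tmp   (writes 1)
x = tmp["B"]               # B: x = pkt.tmp   (writes 1 again)
print("pipelined read/modify/write:", x)   # 1, but it should be 2

# Fix: perform the whole read-modify-write as one atomic operation per packet.
x = 0
for _ in ("A", "B"):
    x = x + 1              # x++ as a single atom
print("atomic update:", x)                 # 2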
Stateless externs (add your own!):

Atom          Description
IP Checksum   Given an IP header, compute the IP checksum
LRC           Longitudinal redundancy check, a simple hash function
timestamp     Generate a timestamp (granularity of 5 ns)
[1] Sivaraman, Anirudh, et al. "Packet transactions: High-level programming for line-rate switches." Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016.
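Software reference versions of the two hash-style atoms are short enough to sketch, which is handy when checking extern outputs in simulation (the function names are mine):

# Sketch: software reference for the "IP Checksum" and "LRC" atoms, e.g. for
# checking simulation outputs. Both operate on the raw header bytes.

def ip_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words, then complemented (RFC 1071)."""
    if len(header) % 2:
        header += b"\x00"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def lrc(data: bytes) -> int:
    """Longitudinal redundancy check: XOR of all bytes (a simple hash)."""
    out = 0
    for b in data:
        out ^= b
    return out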
Adding Custom Externs
Pseudocode for the extern's behavior (a per-index packet counter):

count[NUM_ENTRIES]

if (pkt.hdr.reset == 1):
    count[pkt.hdr.index] = 0
else:
    count[pkt.hdr.index]++
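When adding a custom extern it helps to keep a small software model of its behavior alongside the HDL, for example as a golden model in a testbench. A minimal sketch of the counter above; the class and method names are mine, not part of the P4→NetFPGA API:

# Sketch: behavioral model of the per-index counter extern, mirroring the
# pseudocode above. Useful as a golden model when writing HDL testbenches.
class CounterExtern:
    def __init__(self, num_entries: int):
        self.count = [0] * num_entries

    def apply(self, index: int, reset: bool) -> int:
        """One invocation per packet: reset or increment the selected entry."""
        if reset:
            self.count[index] = 0
        else:
            self.count[index] += 1
        return self.count[index]

# Example: two packets hit index 3, then a reset packet clears it.
c = CounterExtern(num_entries=64)
c.apply(3, reset=False)
c.apply(3, reset=False)
print(c.count[3])          # 2
c.apply(3, reset=True)
print(c.count[3])          # 0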
6. Build bitstream
7. Check implementation results
8. Test the hardware
Debugging P4 Programs
header Calc_h {
    bit<32> op1;
    bit<8>  opCode;
    bit<32> op2;
    bit<32> result;
}
Switch as a Calculator
[Figure: the user PC sends packets carrying the Calc header to the NetFPGA SUME board; the switch performs the requested operation (addition, subtraction, or a const[op1] register lookup) on the operands and returns each packet with the result field filled in.]
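From the user PC, the Calc header can be declared as a scapy layer and exercised directly. Only the field widths come from the header definition above; the EtherType, opcode value, and interface name are assumptions for illustration:

# Sketch: build and send a Calc packet matching the Calc_h layout
# (op1: 32 bits, opCode: 8 bits, op2: 32 bits, result: 32 bits).
# CALC_ETHERTYPE and the '+' opcode value are assumed, not taken from the slides.
import time
from scapy.all import Ether, Packet, BitField, ByteField, bind_layers, sendp, AsyncSniffer

CALC_ETHERTYPE = 0x1234          # hypothetical EtherType for Calc packets
IFACE = "eth1"                   # interface facing the NetFPGA, an assumption

class Calc(Packet):
    name = "Calc"
    fields_desc = [
        BitField("op1", 0, 32),
        ByteField("opCode", 0),
        BitField("op2", 0, 32),
        BitField("result", 0, 32),
    ]

bind_layers(Ether, Calc, type=CALC_ETHERTYPE)

# Capture the bounced-back packet (our own outgoing packet has result == 0).
sniffer = AsyncSniffer(iface=IFACE, count=1,
                       lfilter=lambda p: Calc in p and p[Calc].result != 0)
sniffer.start()
time.sleep(0.5)                  # give the sniffer time to attach

# Ask the switch to compute 2 + 3.
sendp(Ether(type=CALC_ETHERTYPE) / Calc(op1=2, opCode=ord("+"), op2=3), iface=IFACE)

sniffer.join(timeout=2)
for pkt in sniffer.results or []:
    print("result =", pkt[Calc].result)   # expect 5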
FIN
Sivaraman, Anirudh, et al. "Programmable packet scheduling at line rate." Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016.
Key observation:
● For many algorithms, the relative order in which packets are sent does not change with future arrivals
  ○ i.e., the scheduling order can be determined before enqueue
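The data structure that exploits this observation is the PIFO (push-in, first-out) queue of the paper cited above: packets are inserted at a rank-determined position and always dequeued from the head. A minimal sketch, with arbitrary ranks:

# Sketch: a PIFO (push-in first-out) queue. Packets are inserted in rank order
# at enqueue time and always dequeued from the head, so the scheduling decision
# is fixed before enqueue, as the observation above requires.
import bisect
from itertools import count

class PIFO:
    def __init__(self):
        self._items = []          # sorted list of (rank, seq, packet)
        self._seq = count()       # tie-breaker keeps FIFO order among equal ranks

    def enqueue(self, packet, rank):
        bisect.insort(self._items, (rank, next(self._seq), packet))

    def dequeue(self):
        rank, _, packet = self._items.pop(0)
        return packet

# Example: strict priority by rank (lower rank departs first).
q = PIFO()
q.enqueue("big-flow pkt", rank=10)
q.enqueue("small-flow pkt", rank=1)
q.enqueue("medium pkt", rank=5)
print([q.dequeue() for _ in range(3)])
# ['small-flow pkt', 'medium pkt', 'big-flow pkt']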
Observations:
◦ Current P4 expectation: target architectures are fixed, specified in English
◦ FPGAs can support many different architectures
Idea:
◦ Extend P4 to allow description of target architectures
■ More precise definition than English description
[Figure: FPGAs can implement many pipeline layouts, e.g. Parser → M/A → Deparser → Output Queues; the V1Model layout Parser → M/A → TM → M/A → Deparser → Output Queues; a two-pass layout Parser → M/A → Deparser → TM → Parser → M/A → Deparser → Output Queues; or a custom "my architecture" arrangement of the same blocks.]
Provides:
• P4+ architecture declaration

Implements:
• non-P4 elements in the target architecture
• externs

Complete PX system → compile to Verilog
[Figure: INT overview, showing per-hop information a packet can collect: the rule it matched (# rule), the queue it traversed, and the time it spent. INT slides courtesy of Nick McKeown.]
In-band Network Telemetry (INT)
[Figure: a congestion-control loop: measure congestion, then adjust the flow rate; plot comparing reactive and proactive rate control on a 10 Gbps link.]
• Proactive techniques converge much more quickly than reactive ones
• Faster convergence times lead to lower flow completion times
[Figure: proactive rate control in the switch data plane. The switch computes the per-link fair share (e.g. N = 2 flows, C = 10 Gb/s, fair share = C/N = 5 Gb/s) and the sending host adjusts its sending rate to match. In the pipeline, the L2 forwarding logic checks whether a packet is a control packet, sets it to high priority ahead of the output queues, and reads/updates the link state.]
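A behavioral sketch of that loop: each switch on the path computes its local fair share C/N, the control packet carries the minimum back, and the sending host adopts it as its rate. The numbers and structure are illustrative, not a specific protocol:

# Sketch: proactive rate allocation as in the figure. Each hop computes its
# fair share C/N; the control packet accumulates the minimum along the path and
# the sending host adopts it as its rate.
def path_fair_share(hops):
    """hops: list of (capacity_gbps, active_flows) per switch on the path."""
    rate = float("inf")
    for capacity, n_flows in hops:
        rate = min(rate, capacity / n_flows)   # switch computation: C / N
    return rate

# Example matching the slide: one hop with C = 10 Gb/s and N = 2 flows.
print(path_fair_share([(10, 2)]))          # 5.0 Gb/s
# A longer path: the bottleneck hop determines the sending rate.
print(path_fair_share([(10, 2), (40, 10), (10, 4)]))   # 2.5 Gb/s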
Sapio, Amedeo, et al. "In-Network Computation is a Dumb Idea Whose Time Has Come." Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017.
[Figure: bar chart of the reduction [%] achieved by in-network computation, shown for data volume, reduce time, and number of packets (against UDP and TCP baselines).]
• http://p4.org
• Consortium of academic and industry members
• Membership is free: contributions are welcome
• Independent, set up as a California nonprofit