Lecture 3: P4→NetFPGA
[Figure: the BMv2 behavioral-model workflow. p4c-bm2-ss compiles the P4 program into test.json, which is loaded into simple_switch (BMv2). The switch pipeline consists of Parser, Ingress, Egress, and Deparser blocks, plus a packet replication engine (PRE), logging, and a debugger. Packets enter and leave through a port interface bound to veth0..n virtual interfaces in the Linux kernel, where a packet generator and a packet sniffer are attached.]
Step 1: P4 Program Compilation
[Figure: p4c-bm2-ss compiles the P4 source into test.json, which is loaded into BMv2 (Parser, Ingress, Egress, Deparser, PRE, logging). The switch ports are bound to one end of each virtual Ethernet pair (veth0/veth1, veth2/veth3, ..., veth2n/veth2n+1) created in the Linux kernel.]
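These steps can be scripted end to end. A minimal sketch, assuming the program is named test.p4 and the veth pairs (veth0/veth1, veth2/veth3, ...) have already been created:

# Sketch: compile test.p4 to test.json and launch BMv2 simple_switch with two
# ports bound to the even-numbered veth endpoints (assumed to exist already).
import subprocess

subprocess.run(["p4c-bm2-ss", "-o", "test.json", "test.p4"], check=True)

# Switch port 0 on veth0, port 1 on veth2; the odd-numbered peers stay on the
# host side for the packet generator and sniffer.
switch = subprocess.Popen(
    ["sudo", "simple_switch", "-i", "0@veth0", "-i", "1@veth2", "test.json"]
)
switch.wait()      # simple_switch runs until interrupted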
Step 4: Starting the CLI
$ simple_switch_CLI

[Figure: the BMv2 CLI. The program-independent CLI and client connect over a TCP socket (Thrift) to the program-independent control server inside the running switch, which has test.json (compiled from test.p4) loaded into its Parser / Ingress / Egress / Deparser pipeline attached to veth0..n.]
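The CLI is also scriptable: commands can be piped to simple_switch_CLI on stdin. A minimal sketch; the table and action names are hypothetical placeholders (the real names come from your test.p4), and 9090 is the default Thrift port:

# Sketch: drive the program-independent CLI from Python by piping commands on
# stdin. "my_table" / "my_action" are hypothetical placeholders.
import subprocess

commands = "\n".join([
    "table_add my_table my_action 10.0.0.1 => 1",  # match key => action params
    "table_dump my_table",
])

subprocess.run(
    ["simple_switch_CLI", "--thrift-port", "9090"],
    input=commands + "\n",
    text=True,
    check=True,
)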
Step 5: Sending and Receiving Packets
Send packets with:
• scapy
  p = Ether()/IP()/UDP()/"Payload"
  sendp(p, iface="veth0")
• Ethereal, etc.

Receive packets with:
• scapy
  sniff(iface="veth9", prn=lambda x: x.show())
• Wireshark, tshark, tcpdump

[Figure: the packet generator and packet sniffer attach to the host-side veth endpoints (veth0/veth1, veth2/veth3, ..., veth2n/veth2n+1); packets traverse the BMv2 pipeline between ports while the program-independent control server remains reachable for table updates.]
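Putting the sending and receiving sides into one runnable sketch (veth0 and veth9 mirror the slide; any host-side veth endpoints work, and the script needs privileges to open raw sockets):

#!/usr/bin/env python3
# Sketch: inject one packet into the switch and print whatever comes out on
# another port. veth0/veth9 follow the slide; adjust to your veth setup.
import time
from scapy.all import Ether, IP, UDP, sendp, AsyncSniffer

# Start capturing on the output interface before sending.
sniffer = AsyncSniffer(iface="veth9", prn=lambda x: x.show(), count=1)
sniffer.start()
time.sleep(0.5)                      # give the sniffer time to attach

p = Ether() / IP() / UDP() / "Payload"
sendp(p, iface="veth0")

sniffer.join(timeout=5)              # wait for the capture (or give up after 5 s)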
NetFPGA generations: NetFPGA-1G (2006), NetFPGA-10G (2010)

[Figure: a PC with NetFPGA. Networking software runs on a standard PC (CPU + memory); over PCI-Express it drives a hardware accelerator built with an FPGA (plus on-board memory) whose network interfaces provide 1/10/100 Gb/s links; in the figure, four 10GbE ports.]
Four elements:
• NetFPGA board
• Tools & reference designs
• Contributed projects
• Community
NetFPGA SUME board:
• FPGA: 52Mb RAM, 3 PCIe Gen. 3 hard cores
• SRAM: 3 x 9MB QDRII+, 500MHz
• PCIe Gen. 3: x8 (only), hardcore IP
• QTH-DP: 8 x 12.5Gbps serial links
• 2 x SATA connectors
• Micro-SD slot: enables standalone operation
NetFPGA reference switch design:
• Packet-based module interface
• Pluggable design
• Pipeline stages: 10GE Rx ports, Output Port Lookup, Output Queues, 10GE Tx ports
• Software side: nf0..nf3 network interfaces plus ioctl register access
NetFPGA – Host Interaction
Packet reception (NetFPGA to host):
1. Packet arrives; the forwarding table sends it to a DMA queue
2. Interrupt notifies the driver of packet arrival
3. Driver sets up and initiates a DMA transfer over the PCIe bus
4. NetFPGA transfers the packet via DMA
5. Interrupt signals completion of the DMA

Packet transmission (host to NetFPGA):
1. Software sends a packet on a network socket; the packet is passed to the driver
2. Driver sets up and initiates a DMA transfer over the PCIe bus
3. Interrupt signals completion of the DMA

Register access:
1. Software makes an ioctl call on a network socket; the ioctl is passed to the driver
2. Driver performs a PCIe memory read/write
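For the register-access path, the following is a purely illustrative Python sketch of "software makes an ioctl call on a network socket"; the ioctl request code and the address/value layout are hypothetical stand-ins, not the actual NetFPGA driver interface:

# Hypothetical sketch only: the real NetFPGA driver defines its own private
# ioctl numbers and argument layout; SIOCDEVPRIVATE and the 16s+I+I packing
# below are illustrative assumptions.
import fcntl
import socket
import struct

SIOCDEVPRIVATE = 0x89F0          # first "private" ioctl slot (driver-specific meaning)
REG_ADDR = 0x44020000            # hypothetical register address on the card

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    # Interface name plus a driver-defined payload: here a 32-bit address and a
    # 32-bit value field for the driver to fill in.
    req = struct.pack("16sII", b"nf0", REG_ADDR, 0)
    resp = fcntl.ioctl(s.fileno(), SIOCDEVPRIVATE, req)
    _, addr, value = struct.unpack("16sII", resp)
    print(f"register 0x{addr:08x} = 0x{value:08x}")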
P4→NetFPGA tools

[Figure: the generic P4 workflow applied to the NetFPGA target. A P4 program is compiled by the P4 compiler, against the P4 architecture model, into a target-specific configuration binary that is loaded into the data plane (tables and extern objects). At runtime the control plane adds/removes table entries, controls externs, and exchanges packet-in/packet-out messages over a CPU port. Here the architecture is SimpleSumeSwitch and the target is the NetFPGA SUME board.]
P4→NetFPGA Compilation Overview

[Figure: the P4 program implements the SimpleSumeSwitch architecture, which slots into the NetFPGA reference switch pipeline in front of the Output Queues.]

Flow: .p4 → Xilinx P4₁₆ compiler → .sdnet → SDNet compiler → firmware + verification environment

$ p4c-sdnet switch.p4

Key considerations:
• Throughput & latency
• Resources
• Programmability

[Figure: inside SDNet, the Parser, Ingress Match+Action, Egress Match+Action, and Deparser stages are built from parsing engines, lookup engines, and editing engines.]
SimpleSumeSwitch metadata (carried alongside the packet in the AXI-Stream tuser signal; tables and externs are configured over AXI-Lite):
• *_q_size – size of each output queue, measured in 32-byte words, sampled when the packet starts being processed by the P4 program
• src_port/dst_port – one-hot encoded
• user_metadata/digest_data – structs defined by the user
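To make the one-hot encoding of src_port/dst_port concrete, here is a small helper. The particular bit-to-port mapping (physical ports on even bits, DMA queues on odd bits) is an assumption for illustration only:

# Sketch of one-hot port encoding. The bit assignment (nf0=bit0, dma0=bit1,
# nf1=bit2, ...) is an assumed example layout, not a normative definition.
PORT_BITS = {"nf0": 0, "dma0": 1, "nf1": 2, "dma1": 3,
             "nf2": 4, "dma2": 5, "nf3": 6, "dma3": 7}

def one_hot(*ports):
    """Return the 8-bit one-hot value selecting the given ports."""
    value = 0
    for p in ports:
        value |= 1 << PORT_BITS[p]
    return value

print(f"{one_hot('nf1'):08b}")          # 00000100 -> send out nf1 only
print(f"{one_hot('nf0', 'nf2'):08b}")   # 00010001 -> multicast to nf0 and nf2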
Registers and externs:
• Implemented in HDL
• Stateless – reinitialized for each packet
• Stateful – keep state between packets
• Xilinx annotations:
  ◦ @Xilinx_MaxLatency() – maximum number of clock cycles an extern function needs to complete
  ◦ @Xilinx_ControlWidth() – size in bits of the address space to allocate to an extern function
[Figure: pipelining stateless operations. Stage 1 computes pkt.tmp = pkt.f1 + pkt.f2 and stage 2 computes pkt.f4 = pkt.tmp - pkt.f3. Because each packet carries its own copy of the fields (f1..f4, tmp), consecutive packets can occupy different stages in the same cycle: stateless operations can be pipelined.]
Stateless vs. stateful operations

Stateful operation: x = x + 1
[Figure: splitting the update across pipeline stages (pkt.tmp = x; pkt.tmp++; x = pkt.tmp) breaks under pipelining: with two back-to-back packets, both read x = 0 and both write x = 1, so after two packets x is 1 when it should be 2.]
[Figure: the fix is to perform the whole read-modify-write (x++) as a single atomic operation in one stage.]
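The failure above is easy to reproduce in a few lines of Python by replaying the per-stage operations for two back-to-back packets and comparing against an atomic update:

# Sketch: why x = x + 1 cannot be split across pipeline stages.
# Packets A and B are back to back, so B reads the state before A writes it back.

x = 0                      # switch state shared by all packets
tmp = {}                   # per-packet metadata (pkt.tmp)

tmp["A"] = x               # A: pkt.tmp = x   (reads 0)
tmp["B"] = x               # B: pkt.tmp = x   (also reads 0: A has not written back yet)
tmp["A"] += 1              # A: pkt.tmp++
tmp["B"] += 1              # B: pkt.tmp++
x = tmp["A"]               # A: x = pkt.tmp   (writes 1)
x = tmp["B"]               # B: x = pkt.tmp   (writes 1 again)
print("pipelined read/modify/write:", x)   # 1, but it should be 2

# Fix: perform the whole read-modify-write as one atomic operation per packet.
x = 0
for _ in ("A", "B"):
    x = x + 1              # x++ as a single atom
print("atomic update:", x)                 # 2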
Stateless externs (add your own!):

Atom          Description
IP Checksum   Given an IP header, compute the IP checksum
LRC           Longitudinal redundancy check, a simple hash function
timestamp     Generate a timestamp (granularity of 5 ns)
[1] Sivaraman, Anirudh, et al. "Packet transactions: High-level programming for line-rate switches." Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016.
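Software reference versions of the two hash-style atoms are short enough to sketch, which is handy when checking extern outputs in simulation (the function names are mine):

# Sketch: software reference for the "IP Checksum" and "LRC" atoms, e.g. for
# checking simulation outputs. Both operate on the raw header bytes.

def ip_checksum(header: bytes) -> int:
    """Ones'-complement sum of 16-bit words, then complemented (RFC 1071)."""
    if len(header) % 2:
        header += b"\x00"
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return (~total) & 0xFFFF

def lrc(data: bytes) -> int:
    """Longitudinal redundancy check: XOR of all bytes (a simple hash)."""
    out = 0
    for b in data:
        out ^= b
    return out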
Adding Custom Externs
Pseudocode for the extern's behavior (a per-index packet counter):

count[NUM_ENTRIES]

if (pkt.hdr.reset == 1):
    count[pkt.hdr.index] = 0
else:
    count[pkt.hdr.index]++
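When adding a custom extern it helps to keep a small software model of its behavior alongside the HDL, for example as a golden model in a testbench. A minimal sketch of the counter above; the class and method names are mine, not part of the P4→NetFPGA API:

# Sketch: behavioral model of the per-index counter extern, mirroring the
# pseudocode above. Useful as a golden model when writing HDL testbenches.
class CounterExtern:
    def __init__(self, num_entries: int):
        self.count = [0] * num_entries

    def apply(self, index: int, reset: bool) -> int:
        """One invocation per packet: reset or increment the selected entry."""
        if reset:
            self.count[index] = 0
        else:
            self.count[index] += 1
        return self.count[index]

# Example: two packets hit index 3, then a reset packet clears it.
c = CounterExtern(num_entries=64)
c.apply(3, reset=False)
c.apply(3, reset=False)
print(c.count[3])          # 2
c.apply(3, reset=True)
print(c.count[3])          # 0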
6. Build bitstream
7. Check implementation results
8. Test the hardware
Debugging P4 Programs
header Calc_h {
    bit<32> op1;
    bit<8>  opCode;
    bit<32> op2;
    bit<32> result;
}
Switch as a Calculator
[Figure: the user PC sends packets carrying the Calc header to the NetFPGA SUME board; the switch performs the requested operation (addition, subtraction, or a const[op1] register lookup) on the operands and returns each packet with the result field filled in.]
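From the user PC, the Calc header can be declared as a scapy layer and exercised directly. Only the field widths come from the header definition above; the EtherType, opcode value, and interface name are assumptions for illustration:

# Sketch: build and send a Calc packet matching the Calc_h layout
# (op1: 32 bits, opCode: 8 bits, op2: 32 bits, result: 32 bits).
# CALC_ETHERTYPE and the '+' opcode value are assumed, not taken from the slides.
import time
from scapy.all import Ether, Packet, BitField, ByteField, bind_layers, sendp, AsyncSniffer

CALC_ETHERTYPE = 0x1234          # hypothetical EtherType for Calc packets
IFACE = "eth1"                   # interface facing the NetFPGA, an assumption

class Calc(Packet):
    name = "Calc"
    fields_desc = [
        BitField("op1", 0, 32),
        ByteField("opCode", 0),
        BitField("op2", 0, 32),
        BitField("result", 0, 32),
    ]

bind_layers(Ether, Calc, type=CALC_ETHERTYPE)

# Capture the bounced-back packet (our own outgoing packet has result == 0).
sniffer = AsyncSniffer(iface=IFACE, count=1,
                       lfilter=lambda p: Calc in p and p[Calc].result != 0)
sniffer.start()
time.sleep(0.5)                  # give the sniffer time to attach

# Ask the switch to compute 2 + 3.
sendp(Ether(type=CALC_ETHERTYPE) / Calc(op1=2, opCode=ord("+"), op2=3), iface=IFACE)

sniffer.join(timeout=2)
for pkt in sniffer.results or []:
    print("result =", pkt[Calc].result)   # expect 5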
FIN
Sivaraman, Anirudh, et al. "Programmable packet scheduling at line rate." Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016.
Key observation:
● For many algorithms, the relative order in which packets are sent does not change with future arrivals
  ○ i.e., the scheduling order can be determined before enqueue
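The data structure that exploits this observation is the PIFO (push-in, first-out) queue of the paper cited above: packets are inserted at a rank-determined position and always dequeued from the head. A minimal sketch, with arbitrary ranks:

# Sketch: a PIFO (push-in first-out) queue. Packets are inserted in rank order
# at enqueue time and always dequeued from the head, so the scheduling decision
# is fixed before enqueue, as the observation above requires.
import bisect
from itertools import count

class PIFO:
    def __init__(self):
        self._items = []          # sorted list of (rank, seq, packet)
        self._seq = count()       # tie-breaker keeps FIFO order among equal ranks

    def enqueue(self, packet, rank):
        bisect.insort(self._items, (rank, next(self._seq), packet))

    def dequeue(self):
        rank, _, packet = self._items.pop(0)
        return packet

# Example: strict priority by rank (lower rank departs first).
q = PIFO()
q.enqueue("big-flow pkt", rank=10)
q.enqueue("small-flow pkt", rank=1)
q.enqueue("medium pkt", rank=5)
print([q.dequeue() for _ in range(3)])
# ['small-flow pkt', 'medium pkt', 'big-flow pkt']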
Observations:
◦ Current P4 expectation: target architectures are fixed, specified in English
◦ FPGAs can support many different architectures
Idea:
◦ Extend P4 to allow description of target architectures
■ More precise definition than English description
[Figure: FPGAs can implement many pipeline layouts, e.g. Parser → M/A → Deparser → Output Queues; the V1Model layout Parser → M/A → TM → M/A → Deparser → Output Queues; a two-pass layout Parser → M/A → Deparser → TM → Parser → M/A → Deparser → Output Queues; or a custom "my architecture" arrangement of the same blocks.]
Provides:
• P4+ architecture declaration

Implements:
• non-P4 elements in the target architecture
• externs

Complete PX system → compile to Verilog
[Figure: INT overview, showing per-hop information a packet can collect: the rule it matched (# rule), the queue it traversed, and the time it spent. INT slides courtesy of Nick McKeown.]
In-band Network Telemetry (INT)
[Figure: a congestion-control loop: measure congestion, then adjust the flow rate; plot comparing reactive and proactive rate control on a 10 Gbps link.]
• Proactive techniques converge much more quickly than reactive ones
• Faster convergence times lead to lower flow completion times
[Figure: proactive rate control in the switch data plane. The switch computes the per-link fair share (e.g. N = 2 flows, C = 10 Gb/s, fair share = C/N = 5 Gb/s) and the sending host adjusts its sending rate to match. In the pipeline, the L2 forwarding logic checks whether a packet is a control packet, sets it to high priority ahead of the output queues, and reads/updates the link state.]
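A behavioral sketch of that loop: each switch on the path computes its local fair share C/N, the control packet carries the minimum back, and the sending host adopts it as its rate. The numbers and structure are illustrative, not a specific protocol:

# Sketch: proactive rate allocation as in the figure. Each hop computes its
# fair share C/N; the control packet accumulates the minimum along the path and
# the sending host adopts it as its rate.
def path_fair_share(hops):
    """hops: list of (capacity_gbps, active_flows) per switch on the path."""
    rate = float("inf")
    for capacity, n_flows in hops:
        rate = min(rate, capacity / n_flows)   # switch computation: C / N
    return rate

# Example matching the slide: one hop with C = 10 Gb/s and N = 2 flows.
print(path_fair_share([(10, 2)]))          # 5.0 Gb/s
# A longer path: the bottleneck hop determines the sending rate.
print(path_fair_share([(10, 2), (40, 10), (10, 4)]))   # 2.5 Gb/s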
Sapio, Amedeo, et al. "In-Network Computation is a Dumb Idea Whose Time Has Come." Proceedings of the 16th ACM Workshop on Hot Topics in Networks. ACM, 2017.
[Figure: bar chart of the reduction [%] achieved by in-network computation, shown for data volume, reduce time, and number of packets (against UDP and TCP baselines).]
• http://p4.org
• Consortium of academic and industry members
• Membership is free: contributions are welcome
• Independent, set up as a California nonprofit