AI Engine Development for Versal
Olivier TREMOIS, PhD
SW Technical Marketing AI Engine Tools

© Copyright 2020 Xilinx


Versal Architecture Overview

Scalar Engines
• Platform Management Controller (PMC)
• Edge Compute

Adaptable Engines
• 2X compute density

Intelligent Engines
• AI Compute
• Diverse DSP workloads

Protocol Engines
• Integrated 600G cores
• 4X encrypted bandwidth

Network-on-Chip
• Guaranteed Bandwidth
• Enables SW Programmability

Programmable I/O
• Any sensor, any interface
• Extendable peripheral set

DDR Memory
• 2X bandwidth/pin
• Server-class density

PCIe & CCIX
• 2X PCIe & DMA bandwidth
• Cache-coherent interface to accelerators

Transceivers
• Broad range, 25G → 112G
• 58G in mainstream devices


AI Engines
Hardened Compute, Memory & Interconnect

[Figure: array of AI Engine tiles, each paired with a MEMORY block]

Huge performance improvements versus UltraScale+
˃ 8x compute density @ 40% lower power

1GHz+ VLIW / SIMD vector processors
˃ Versatile core for ML and other advanced DSP workloads

Massive array of interconnected cores
˃ Instantiate multiple tiles (10s to 100s) for scalable compute

Terabytes/sec of interface bandwidth to other engines
˃ Direct, massive throughput to adaptable HW engines
˃ Implement core application with AI for “Whole App Acceleration”

SW programmable for any developer
˃ C programmable, compile in minutes
˃ Library-based design for ML framework developers



Vitis Philosophy: Platforms and Subsystems

[Figure: the HW Platform (PL Fabric + AIE) and the SW Platform (PS) host Subsystem #1 (AIE, PL, Firmware), Subsystem #2 (PL, Firmware) through Subsystem #N (PL, Firmware), alongside the PS Application]

• Subsystems form the customer’s differentiating logic: AIE and PL kernels, operating under the supervision of the PS

• Versal platform provides essential infrastructure services (CIPS, NoC, I/Os, OS, Drivers…)

• Platform insulates developers from low-level details; lets them focus on application development (SW, PL or AIE)



Vitis 2020.2 Flow for Versal

Development and verification per domain:
• AIE: AIE Kernels, Graph; verified with AIE Simulation
• PL (HLS): PL Kernels (HLS); verified with HLS Cosimulation
• PL (RTL): RTL Kernels; verified with RTL Verification
• Platform: Vitis HW Platform; Vitis SW Platform (Linux + rootfs)
• PS: XRT, Graph API, AIE driver; PS App

System build:
• PL and AIE Integration (v++ --link): Vivado HW Build or SIM Build, Timing Closure
• Generate Binary (v++ --package)
• SSW

Execution and analysis:
• Run on Device
• HW Emulation: AIESim, QEMU, Vivado SIM
• Profile and Debug with Vitis


AI Engine Programming

[Figure: a single kernel with ports a to f, and a graph connecting the kernels polarclip, feedback, equalizer, fir_tap_11, fir_tap_7 and scale]

Single Kernel Programming
˃ Create AI Engine kernel programs
˃ The programming model allows you to use:
  ˃ Various Vector datatypes
  ˃ AI Engine intrinsics
  ˃ Window function API, …
˃ Analyze and Debug Kernel code
  ˃ Compile, Simulate, profile, …

AI Engine Application
˃ Create multi-kernel AI Engine projects
˃ ADF graph based programming
  ˃ Modular, hierarchical graph definition
  ˃ Instantiation of AI Engine memories, Streams, …
˃ Analyze and Debug
  ˃ Dataflow, Function scheduling, …



Programming Flow



Kernel Functional and Performance Validation

Kernel Development
• Single Node Development template
• Or your own single Node project
• Required for profiling and low-level analysis

Kernel Validation
• Simple connection of the graph to the environment
• In-context, full AI Engine array access, PL-connection, …
• Debug at the kernel level

Single Kernel Optimization
• Code vectorization, vector datatypes
• Vector intrinsics, optimized interface, …



AI Engine Kernel Programming Flow

[Figure: Matlab/C/C++ Reference → Kernel Vectorization → AIE Optimized C/C++ (with Directives) → AIE Compiler → AIE Assembly Code → AIE-Emulation (Cycles), with Functional Verification and Performance Verification feedback loops]

Kernel Vectorization
• Code restructuring
• Vector data-types
• Function intrinsics
• Memory optimization

Directives (pragmas)
• Loop unrolling
• Software pipelining

Software development framework
• C/C++ verification (Debugger)
• Profiler
• SW-Emulation: functional only
• AIE-Emulation: cycle true

AI Engine Programming: Standard Vector Programming Techniques


Kernel Programming

A Kernel is a ‘C/C++’ function using special IO and Vector data types. It will be launched automatically by a scheduler depending on some events.

    #include <adf.h>    // Adaptive Dataflow Library

    void fir_16taps_symm(const unsigned samples, const int32 (&taps_in)[16],
                         input_window_cint16 * w_input, output_window_cint16 * w_output)
    {
        // Vector Datatypes for vectorized computations
        v16int16 coeffs;
        v32cint16 sbuff = undef_v32cint16();
        for (unsigned i = 0; i < 12; i++)
            coeffs = shft_elem(coeffs, (int16) taps_in[15 - i]);
        const unsigned LSIZE = (samples / 4);

        // Directives to help in scheduling for performance
        for (unsigned i = 0; i < LSIZE; i += 2)
            chess_loop_range(2,)
            chess_prepare_for_pipelining
        {
            v4cacc48 acc;

            // C Window API to access data
            sbuff = upd_w(sbuff, 0, window_readincr_v8(w_input));
            sbuff = upd_w(sbuff, 1, window_readincr_v8(w_input));
            sbuff = upd_w(sbuff, 2, window_read_v8(w_input));

            // AI Engine intrinsics to perform vectorized computation
            acc = mul4_sym(sbuff, 0, 0x3210, 1, 15, coeffs, 0, 0x0000, 1);
            acc = mac4_sym(acc, sbuff, 4, 0x3210, 1, 11, coeffs, 4, 0x0000, 1);
            window_writeincr(w_output, srs(acc, SRS_SHIFT));

            acc = mul4_sym(sbuff, 4, 0x3210, 1, 19, coeffs, 0, 0x0000, 1);
            acc = mac4_sym(acc, sbuff, 8, 0x3210, 1, 15, coeffs, 4, 0x0000, 1);
            window_writeincr(w_output, srs(acc, SRS_SHIFT));

            window_decr_v8(w_input, 1);
        }
    }
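The functional verification step of the kernel flow usually compares the vectorized kernel against a scalar C/C++ reference. The block below is a minimal sketch of such a host-side golden model for a 16-tap symmetric FIR on complex 16-bit samples. The names `CInt16`, `saturate16` and `fir16_symm_ref` are hypothetical and not part of the ADF library. The sketch assumes zero history before the first sample, symmetric taps (`taps[k] == taps[15 - k]`) and a truncating right shift, so it is not claimed to be bit-exact with the intrinsic-based kernel above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side complex sample type (not the ADF cint16 type).
struct CInt16 {
    std::int16_t re;
    std::int16_t im;
};

// Clamp a wide accumulator value to the int16 range.
inline std::int16_t saturate16(std::int64_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return static_cast<std::int16_t>(v);
}

// Scalar 16-tap symmetric FIR:
//   y[n] = ( sum_{k=0..7} taps[k] * (x[n-k] + x[n-15+k]) ) >> shift
// Only taps[0..7] are read; the filter is assumed symmetric.
// Samples before the start of x are treated as zero (assumption).
std::vector<CInt16> fir16_symm_ref(const std::vector<CInt16>& x,
                                   const std::array<std::int32_t, 16>& taps,
                                   unsigned shift) {
    std::vector<CInt16> y(x.size());
    for (std::size_t n = 0; n < x.size(); ++n) {
        std::int64_t acc_re = 0;
        std::int64_t acc_im = 0;
        for (std::size_t k = 0; k < 8; ++k) {
            // Pre-add the two samples that share the same coefficient.
            std::int64_t sum_re = 0;
            std::int64_t sum_im = 0;
            if (n >= k) {
                sum_re += x[n - k].re;
                sum_im += x[n - k].im;
            }
            if (n >= 15 - k) {
                sum_re += x[n - (15 - k)].re;
                sum_im += x[n - (15 - k)].im;
            }
            acc_re += taps[k] * sum_re;
            acc_im += taps[k] * sum_im;
        }
        // Truncating shift; the hardware shift-round-saturate may round differently.
        y[n] = CInt16{saturate16(acc_re >> shift), saturate16(acc_im >> shift)};
    }
    return y;
}
```

Feeding a unit impulse through `fir16_symm_ref` with `shift = 0` returns the tap values themselves, which is a quick sanity check before comparing against simulator output.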
Graph Development, Validation and Optimization

Graph Development
• Kernel stitching within graph
• AI Engine compiler, placer and router
• Can include PL-based kernel

Graph Validation
• Emulation-SW: complete graph functional simulation
• Emulation-AIE: Cycle true graph simulation
• Debug at the graph level

Graph Optimization
• I/F optimization
• Location constraints, stamp (AI Engine graph map) and repeat
• FIFO settings, circuit/packet switch communications, …



Graph Programming

AI Engine Application described as a graph.

    #include <adf.h>    // Adaptive Dataflow Library
    using namespace adf;

    #include "kernels.h"

    class myGraph : public graph {
    private:
        kernel kernel1, kernel2;    // Single Kernel Based Graph
    public:
        // IOs of the graph are “ports”
        input_port in;
        input_port NSamples;
        input_port Coefficients;
        output_port out;

        // The constructor of the graph describes all the connections
        // and some other parameters.
        // For “Single Kernel programming” this section is very simple.
        myGraph() {
            kernel1 = kernel::create(fir_16taps_symm);
            kernel2 = kernel::create(fir_23taps_symm);
            connect< window<128> > net0 (in, kernel1.in[2]);
            connect< window<128> > net1 (kernel1.out[0], kernel2.in[0]);
            connect< window<128> > net2 (kernel2.out[0], out);
            connect<parameter> (NSamples, async(kernel1.in[0]));
            connect<parameter> (Coefficients, async(kernel1.in[1]));

            source(kernel1) = "kernels/Kernel_1.cc";
            source(kernel2) = "kernels/Kernel_2.cc";
            runtime<ratio>(kernel1) = 0.1;
            runtime<ratio>(kernel2) = 0.1;
        }
    };
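Two numbers in the graph constructor are worth a quick back-of-the-envelope check: the window size and the runtime ratios. The sketch below assumes that `window<128>` is sized in bytes and that a `cint16` sample occupies 4 bytes (two 16-bit fields), so each kernel invocation consumes 32 samples. It also treats the runtime ratio as a fraction of one core's cycle budget, so kernels whose ratios sum to at most 1 are candidates for sharing a core. All helper names (`kCint16Bytes`, `samples_per_window`, `samples_for_run`, `could_share_one_core`) are hypothetical, and the budget check is a simplification of what the AI Engine compiler actually decides.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Assumed size of one complex 16-bit sample: real + imaginary part.
constexpr std::size_t kCint16Bytes = 2 * sizeof(std::int16_t);

// Number of samples that fit in one window of the given byte size.
constexpr std::size_t samples_per_window(std::size_t window_bytes,
                                         std::size_t sample_bytes) {
    return window_bytes / sample_bytes;
}

// Number of input samples consumed by a given number of graph iterations.
constexpr std::size_t samples_for_run(std::size_t window_bytes,
                                      std::size_t sample_bytes,
                                      std::size_t iterations) {
    return samples_per_window(window_bytes, sample_bytes) * iterations;
}

// Simplified budget check: the summed runtime ratios must not exceed one core.
inline bool could_share_one_core(const std::vector<double>& runtime_ratios) {
    const double total =
        std::accumulate(runtime_ratios.begin(), runtime_ratios.end(), 0.0);
    return total <= 1.0;
}
```

With the values used on this slide, a 128-byte window holds 32 `cint16` samples, and the two kernels at a ratio of 0.1 each leave ample room on a single core under this simplified model.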
Testbench

    #include <adf.h>    // Adaptive Dataflow Library
    using namespace adf;

    #include "kernels.h"
    #include "kernels/include.h"
    #include "project.h"

    // AI Engine Graph
    kernelOptGraph mygraph;

    // Creation of a virtual platform:
    //  - Input test vector file
    //  - Output vector file
    //  - Connection of the graph
    simulation::platform<1,1> platform("data/input.txt", "data/output.txt");
    connect<> net0(platform.src[0], mygraph.in);
    connect<> net1(mygraph.out, platform.sink[0]);

    int main(void) {
        int32 taps[16] = {-100, 200, -300, 400, -500, 600, -700, 800,
                          800, -700, 600, -500, 400, -300, 200, -100};

        // Simulation control
        mygraph.init();

        mygraph.run(4);
        mygraph.update(mygraph.samples, uint32(INPUT_SAMPLES));
        mygraph.update(mygraph.coefficients, taps, 16);
        mygraph.end();

        return 0;
    }
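Once the simulation has written its output vector file, the samples are typically compared against a golden reference. The block below is a minimal host-side sketch of that comparison. It assumes the output is plain text with whitespace-separated integers and that the simulator may interleave timestamp lines starting with `T`, which are skipped; check the actual file produced by your tool version before relying on this. The helper names `read_samples` and `first_mismatch` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read whitespace-separated integers from a text stream.
// Empty lines and lines starting with 'T' (assumed timestamps) are skipped.
std::vector<long> read_samples(std::istream& in) {
    std::vector<long> values;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == 'T') continue;
        std::istringstream fields(line);
        long v = 0;
        while (fields >> v) values.push_back(v);
    }
    return values;
}

// Return the index of the first differing value, the shorter length when one
// sequence is a strict prefix of the other, or -1 when both are identical.
long first_mismatch(const std::vector<long>& got, const std::vector<long>& want) {
    const std::size_t n = std::min(got.size(), want.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (got[i] != want[i]) return static_cast<long>(i);
    }
    return got.size() == want.size() ? -1 : static_cast<long>(n);
}
```

In practice the two streams would be `std::ifstream` objects opened on the simulator output and on the reference file; string streams work equally well for unit-testing the helpers.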



Vitis Analyzer



Vitis Analyzer introduction

Compile Results Analysis:
• Graph
• Mapping
• Memory footprint
• DMAs, Locks, …

Profiling Viewer
Simulation Timeline analysis
Can be used also within Makefile flow



Vitis Analyzer Compilation View
Graph View
• Shows all the kernels defined in the AI Engine graph (AI Engine Array and PL)
• The kernels can be grouped by Tile or Subgraph or no grouping at all



Vitis Analyzer Compilation View
Array View
• Shows the complete AI Engine array, indicating which tiles are used and all connections



Vitis Analyzer Trace view

The Trace view gives information on what runs on each tile (active tiles only) of the array:
• Core, DMA, Locks and IOs

A Tile is active as soon as its AI Engine processor, its local memory or its interconnect is active.



AI Engine Project Creation in Vitis 2020.2



System Project structure in Vitis

System Project
• AI Engine
  ˃ AI Engine kernel source files
  ˃ Sub-graphs and graphs description
• Programmable Logic
  ˃ PL kernel source files (HLS)
  ˃ PL kernel source files (packaged RTL)
• HW Link
  ˃ System configuration file
• Processing System
  ˃ OS: Baremetal, Linux
  ˃ AI Engine Drivers
  ˃ PS application: XRT, OpenCL



Vitis 2020.2 Demo



Example design partitioning

Graph (from DDR to DDR):
• MM2S (PL DMA)
• Weighted sum (AI Engine)
• Average (AI Engine)
• Classifier (AI Engine)
• PolarClip HLS (PL Kernel)
• S2MM (PL DMA)



Example design partitioning

AI Engine Array
• Weighted sum (AI Engine)
• Average (AI Engine)
• Classifier (AI Engine)

Programmable Logic
• MM2S (PL DMA)
• PolarClip HLS (PL Kernel)
• S2MM (PL DMA)

DDR at both ends of the chain



Vitis 2020.2
Project Creation and AI Engine Simulation



Vitis 2020.2
PL kernel compilation and HW link



Vitis 2020.2
PS app compilation and HW Emulation



Vitis 2020.2
HW Implementation



Summary

Vitis is a unified tool that is used throughout the AI Engine development flow

AI Engine development is a 2-stage process:
• Single kernel development
• Graph development

Vitis handles all Versal ACAP domains



Thank You
