Developing Custom Plugins for Unsupported Layers in TensorRT

Kuldeep Singh, Ramanan, Kiran M. Sabu, Nagavardhan Reddy, Mr. Tejalal Choudhary (Mentor)

Abstract:
A deep learning model is optimized to its maximum extent by specifying the optimizers and loss functions. The purpose of TensorRT is to optimize the deep learning model further without compromising its performance. This process is carried out before the deployment phase by converting a network built using a common framework such as TensorFlow or PyTorch into a TensorRT network. The conversion involves translating the layers/operations native to the framework into the layers/operations of a TensorRT network. Currently, TensorRT does not support all layers; we propose to create a custom plugin for one such unsupported layer, thus optimizing the TensorRT model to the maximum extent.

Index terms: NVIDIA SDK, deep learning, TensorRT inference

1. Introduction:

TensorRT is a Software Development Kit (SDK) developed by NVIDIA for high-performance deep learning inference. TensorRT is available as an API for both C++ and Python. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. TensorRT can be used to improve the performance of a wide variety of applications such as video streaming, speech recognition, recommendation, and natural language processing. The basic idea behind TensorRT is to optimize the pretrained model before deployment so that it runs efficiently even on the simplest of devices, such as the Jetson Nano [1]. TensorRT optimizes the neural network by combining layers and optimizing kernel selection for improved latency, throughput, power efficiency, and memory consumption.

TensorRT has been integrated with TensorFlow, making TensorFlow an ideal framework for TensorRT optimization and inference. TensorRT optimizes trained neural network models to produce a deployment-ready runtime inference engine. The pre-trained neural network is optimized using several techniques such as mixed precision, layer fusion, kernel autotuning, and dynamic tensor memory [1].

TensorRT has its advantages and disadvantages. Among the advantages: it allows deployment of a deep learning model on a device with low computing power, such as the Jetson Nano, thus improving scalability. The memory usage and computation time during inference are significantly reduced, which is of utmost importance in applications such as autonomous vehicles [1]. The GraphSurgeon feature provides the ability to map TensorFlow nodes to custom layers in TensorRT, thus enabling inference for many TensorFlow networks with TensorRT.

The limitations of TensorRT are as follows. TensorRT does not support all layers of specific architectures; if it encounters an unsupported layer, that layer's functionality must be obtained from the corresponding framework. To optimize new layers, a custom plugin is created that implements the layer's operation in terms of TensorRT. This process is called registering a custom plugin. The creation of custom plugins can be done in C++ or Python, although the resources for creating custom plugins using Python are limited [1].

The rest of the paper is organized as follows: Section 2 describes the deployment of a TensorRT model. Section 3 covers the implementation and experiments. The architecture of SPP is described in Section 4. The supported and unsupported layers are listed in Section 5. Finally, Section 6 concludes the paper.

2. TensorRT Deployment:

TensorRT allows developers to import, calibrate, generate, and deploy optimized networks. Deep learning networks can be imported directly from Caffe, or from other frameworks via the UFF or ONNX formats. Users can also run custom layers through TensorRT using the Plugin interface. The GraphSurgeon feature provides the capability to map TensorFlow nodes to the corresponding custom layers in TensorRT, thus linking many TensorFlow networks with the TensorRT network. TensorRT provides a C++ implementation on all supported platforms and a Python implementation on x86, aarch64, and ppc64le. The key interfaces in the TensorRT core library are as follows.
2.1 Network Definition:

The Network Definition interface provides methods to specify the definition of a network. Input tensors and output tensors can be specified, layers can be added or modified, and there is an interface for configuring each supported layer type. In addition to layer types such as convolutional and fully connected layers, a Plugin layer type allows the application to implement functionality not natively supported by TensorRT. TensorRT provides parsers such as Caffe, UFF, and ONNX for importing trained networks to create network definitions. We used the UFF parser to parse our network.
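As a brief illustration of this step, the sketch below parses a UFF model into a TensorRT network definition. It is a minimal sketch only: the TensorRT 6-era C++ API is assumed, and the file name mnist.uff and the tensor names "input" and "output" are hypothetical placeholders for the actual model.

    #include <NvInfer.h>
    #include <NvUffParser.h>
    #include <iostream>

    using namespace nvinfer1;

    // Minimal logger that TensorRT requires for builder creation.
    class Logger : public ILogger {
        void log(Severity severity, const char* msg) noexcept override {
            if (severity <= Severity::kWARNING) std::cout << msg << std::endl;
        }
    };

    int main() {
        static Logger logger;
        IBuilder* builder = createInferBuilder(logger);
        // The UFF parser works with implicit-batch networks (flag 0).
        INetworkDefinition* network = builder->createNetworkV2(0U);

        // Hypothetical file and tensor names; adjust to the actual model.
        nvuffparser::IUffParser* parser = nvuffparser::createUffParser();
        parser->registerInput("input", Dims3(1, 28, 28), nvuffparser::UffInputOrder::kNCHW);
        parser->registerOutput("output");
        if (!parser->parse("mnist.uff", *network, DataType::kFLOAT))
            std::cerr << "failed to parse mnist.uff" << std::endl;

        // The network definition is now ready for the Builder (Section 2.2).
        parser->destroy();
        network->destroy();
        builder->destroy();
        return 0;
    }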
2.2 Builder:

The Builder interface allows the creation of an optimized engine from a network definition. It allows the application to specify the maximum batch size and workspace size, the lowest acceptable level of precision, and timing iteration counts for kernel autotuning, and it provides an interface for quantizing networks to run in 8-bit precision. It is possible to build multiple engines based on the same network definition but with different builder configurations.
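A minimal sketch of the builder step is shown below, assuming the IBuilderConfig API; the batch size and workspace size are illustrative values, not settings taken from our experiments.

    #include <NvInfer.h>

    using namespace nvinfer1;

    // Build an optimized engine from an existing network definition.
    // Batch size and workspace size below are illustrative assumptions.
    ICudaEngine* buildEngine(IBuilder& builder, INetworkDefinition& network) {
        builder.setMaxBatchSize(32);              // largest batch the engine must handle
        IBuilderConfig* config = builder.createBuilderConfig();
        config->setMaxWorkspaceSize(1 << 28);     // 256 MiB scratch space for kernel autotuning
        // Lower precision can be requested when the hardware supports it:
        // config->setFlag(BuilderFlag::kFP16);
        ICudaEngine* engine = builder.buildEngineWithConfig(network, *config);
        config->destroy();
        return engine;
    }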
2.3 Engine:

The Engine interface allows the application to execute inference. It supports synchronous and asynchronous execution, profiling, and enumeration and querying of the bindings for the engine inputs and outputs. A single engine can have multiple execution contexts, allowing a single set of trained parameters to be used for the simultaneous execution of multiple batches.

2.4 Serializing and Deserializing:

We can either serialize the engine or use it directly for inference. Serializing and deserializing a model is an optional step before using it for inference; if desirable, the engine object can be used for inference directly. During serialization, the engine is transformed into a format that can be stored and used at a later time for inference. We then apply deserialization to the stored engine in order to use it for inference.
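The sketch below illustrates both directions, assuming the C++ serialize() and deserializeCudaEngine() calls; the file path is a placeholder.

    #include <NvInfer.h>
    #include <fstream>
    #include <string>

    using namespace nvinfer1;

    // Serialize a built engine to disk so the costly optimization step
    // does not have to be repeated before every deployment.
    void saveEngine(ICudaEngine& engine, const char* path) {
        IHostMemory* blob = engine.serialize();   // engine -> flat byte buffer
        std::ofstream out(path, std::ios::binary);
        out.write(static_cast<const char*>(blob->data()), blob->size());
        blob->destroy();
    }

    // Recreate the engine from previously stored bytes.
    ICudaEngine* loadEngine(IRuntime& runtime, const std::string& bytes) {
        return runtime.deserializeCudaEngine(bytes.data(), bytes.size(), nullptr);
    }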
2.5 Performing Inference:

The engine is fed with data to perform inference. The inference step can be run on devices such as the Jetson Nano with utmost efficiency.
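A minimal sketch of synchronous execution with an implicit-batch engine is shown below; it assumes the device buffers have already been allocated and filled with the input data.

    #include <NvInfer.h>

    using namespace nvinfer1;

    // Run synchronous inference for one batch. The device buffers are assumed
    // to be already allocated (cudaMalloc) and ordered by binding index.
    bool infer(ICudaEngine& engine, void** deviceBuffers, int batchSize) {
        IExecutionContext* context = engine.createExecutionContext();
        bool ok = context->execute(batchSize, deviceBuffers);  // blocks until done
        context->destroy();
        return ok;
    }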


3. Implementation and Experiments:

In order to get a better understanding of the mixed-precision technique, a TensorFlow model for detecting handwritten digits (MNIST dataset) was trained and then run at three different precisions. The TensorFlow (FP32) model was built first and then converted to a TensorRT model with 32-bit precision. The TensorFlow model was then converted to FP16 precision and INT8 precision and run on the same dataset. Due to the reduction in precision, the models used less memory and achieved higher computation speed. When the precision is decreased, the throughput of the model increases: the throughput of the FP32 model was 277 images/s, while the throughput of the FP16 model was 741 images/s.
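The exact build settings are not listed in this paper; a plausible sketch of requesting the reduced-precision engines through the builder configuration is shown below (INT8 additionally requires a calibrator, omitted here).

    #include <NvInfer.h>

    using namespace nvinfer1;

    // Request reduced-precision kernels from the builder. FP16 needs only
    // hardware support; INT8 additionally needs a calibrator (not shown).
    void enableReducedPrecision(IBuilderConfig& config, bool fp16, bool int8) {
        if (fp16) config.setFlag(BuilderFlag::kFP16);  // 16-bit floating point
        if (int8) config.setFlag(BuilderFlag::kINT8);  // 8-bit quantized
    }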
The custom plugins present in NVIDIA's GitHub repository were also executed and their impact on the model was observed. These custom plugins are written in C++ and executed in a Linux environment. The sample code of the custom plugins was studied, and the skeleton structure for our custom plugin was then built.

The functionality of the Spatial Pyramid Pooling layer was written in C++ and the plugin was registered under the TensorRT plugin library. A registered plugin can be used whenever it is needed. The only limitation during the registration process is that the plugin must have a unique name.
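The skeleton itself is not reproduced in this paper; the abridged sketch below shows only the registration side, assuming the IPluginCreator interface and a hypothetical plugin name "SPP_TRT". The unique (name, version) pair is exactly the registration constraint mentioned above.

    #include <NvInferPlugin.h>

    using namespace nvinfer1;

    // Abridged creator for a hypothetical "SPP_TRT" plugin. The registry
    // requires the (name, version) pair to be unique.
    class SppPluginCreator : public IPluginCreator {
    public:
        const char* getPluginName() const noexcept override { return "SPP_TRT"; }
        const char* getPluginVersion() const noexcept override { return "1"; }
        const PluginFieldCollection* getFieldNames() noexcept override { return &mFields; }
        IPluginV2* createPlugin(const char* name, const PluginFieldCollection* fc) noexcept override {
            return nullptr;  // construct and return the actual SPP plugin object here
        }
        IPluginV2* deserializePlugin(const char* name, const void* data, size_t len) noexcept override {
            return nullptr;  // rebuild the plugin from its serialized form here
        }
        void setPluginNamespace(const char* ns) noexcept override { mNamespace = ns; }
        const char* getPluginNamespace() const noexcept override { return mNamespace; }

    private:
        PluginFieldCollection mFields{0, nullptr};
        const char* mNamespace{""};
    };

    // Makes the creator visible to getPluginRegistry() at library load time.
    REGISTER_TENSORRT_PLUGIN(SppPluginCreator);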
4. Spatial Pyramid Pooling:

Fig 4.1: Architecture of SPP-Net

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 64x64) input image. This requirement may decrease the recognition accuracy for images or sub-images of an arbitrary size/scale.

Using a Spatial Pyramid Pooling (SPP) layer, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. Spatial Pyramid Pooling thus allows the model to operate on images of any size, eliminating the necessity of maintaining a fixed input image size.

The layer is used between the convolution layers and the fully connected layers, so that a fixed-size output can be generated from the output of the convolution layers for the fully connected layers. The aim of this project is to create a custom plugin that maps the functionality of Spatial Pyramid Pooling onto the TensorRT network, thus reducing the time taken to perform the optimization process.
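To make the fixed-length property concrete, the small calculation below counts the output size of an SPP layer. Each pyramid level of n x n bins contributes n*n values per channel regardless of the input's spatial size; the three-level pyramid {4x4, 2x2, 1x1} giving 21 bins per channel follows [2], and it is an assumption here that our plugin uses the same levels.

    #include <iostream>

    // Number of values produced by a spatial pyramid pooling layer.
    // Each pyramid level n contributes n*n bins per feature map channel,
    // independent of the spatial size of the input feature map.
    int sppOutputSize(const int levels[], int numLevels, int channels) {
        int bins = 0;
        for (int i = 0; i < numLevels; ++i)
            bins += levels[i] * levels[i];
        return bins * channels;
    }

    int main() {
        const int levels[] = {4, 2, 1};  // 16 + 4 + 1 = 21 bins per channel
        std::cout << sppOutputSize(levels, 3, 256) << std::endl;  // 21 * 256 = 5376
        return 0;
    }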

5. Supported and Unsupported Layers:

Supported layers in TensorRT:

    Layer Name              Layer Name in TensorRT
    Activation Layer        IActivationLayer
    Concatenation Layer     IConcatenationLayer
    Convolution Layer       IConvolutionLayer
    Arithmetic Layer        IElementWiseLayer
    Fill Layer              IFillLayer
    Fully Connected Layer   IFullyConnectedLayer
    Padding Layer           IPaddingLayer

Unsupported layers in TensorRT:

    Spatial Pyramid Pooling Layer

6. Conclusion:

TensorRT is a relatively new Software Development Kit; thus, adding support for new layers allows optimization of a wider variety of architectures, such as those using the SPP layer. TensorRT optimizes the pre-trained models of different frameworks and gives lower inference time; after optimization, the model generates more throughput. The .cpp and .h files for the SPP plugin are ready.

References:

[1] NVIDIA TensorRT Developer Guide. Available: https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide/index.html. Access date: 10-12-2019.

[2] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition." Microsoft Research, China; Xi'an Jiaotong University, China; University of Science and Technology of China. Access date: 20-12-2019.
