
Published in IET Computers & Digital Techniques
Received on 16th November 2011
Revised on 16th May 2012
doi: 10.1049/iet-cdt.2011.0156

ISSN 1751-8601

Image and video processing platform for field programmable gate arrays using a high-level synthesis
C. Desmouliers1, E. Oruklu1, S. Aslan2, J. Saniie1, F.M. Vallina3
1 Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, Illinois 60616, USA
2 Ingram School of Engineering, Texas State University, San Marcos, Texas 78666, USA
3 Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, USA
E-mail: [email protected]

Abstract: In this study, an image and video processing platform (IVPP) based on field programmable gate arrays (FPGAs) is presented. This hardware/software co-design platform has been implemented on a Xilinx Virtex-5 FPGA using high-level synthesis and can be used to realise and test complex algorithms for real-time image and video processing applications. The video interface blocks are implemented at the register transfer level (RTL) and can be configured using the MicroBlaze processor, allowing the support of multiple video resolutions. The IVPP provides the required logic to easily plug in the generated processing blocks without modifying the front-end (capturing video data) and the back-end (displaying processed output data). The IVPP can be a complete hardware solution for a broad range of real-time image/video processing applications including video encoding/decoding, surveillance, detection and recognition.

1 Introduction

Today, a significant number of embedded systems focus on multimedia applications, with an almost insatiable demand for low-cost, high-performance and low-power hardware. Designing complex systems such as image and video processing, compression, face recognition, object tracking, 3G or 4G modems, multi-standard CODECs and high-definition (HD) decoding schemes requires the integration of many complex blocks and a long verification process [1]. These complex designs are based on I/O peripherals, one or more processors, bus interfaces, A/D, D/A, embedded software, memories and sensors. A complete system used to be designed with multiple chips connected together on PCBs, but with today's technology, all of these functions can be incorporated in a single chip. Such complete systems are known as systems-on-chip (SoC) [2].

Image and video processing applications require a large amount of data transfer between the input and output of a system. For example, a 1024 × 768 colour image has a size of 2 359 296 bytes. This large amount of data needs to be stored in memory, transferred to the processing blocks and sent to the display unit. Designing an image and video processing unit can be complex and time consuming, and the verification process can take months depending on the system's complexity.

The design difficulty and longer verification processes create a bottleneck for image and video processing applications. One of the methods used to ease this bottleneck is to generate a 'virtual platform' [3–6], which is a software-level design using high-level languages to test an algorithm, to create a software application even before the hardware is available and, most importantly, to create a model, also known as a 'golden reference model', that can be used for the verification of a register transfer level (RTL) design [3].

In many image and video processing applications, most of the I/O, memory interface and communication channels are common across designs, and the only block that needs to be altered is the processing block. This reuse of platform components allows for accelerated generation of the golden reference for a new processing block and faster HW/SW co-design at the system level. The RTL generation and verification process also becomes shorter. Therefore, in this paper, an embedded HW/SW co-design platform based on a reconfigurable field programmable gate array (FPGA) architecture is proposed for image and video processing applications. The proposed platform uses FPGA development tools, provides an adaptable, modular architecture for future-proof designs and shortens the development time of multiple applications with a single, common framework.

In the past, several platforms have been developed for multimedia processing, including DSP chips based on very long instruction word (VLIW) architectures. DSPs usually run at higher clock frequencies than FPGAs; however, their hardware parallelism (i.e. the number of accelerators, multipliers etc.) is inferior. More importantly, they are not as flexible and may not meet the demands of firmware updates or revisions of multimedia standards. The shortcomings of DSPs and general-purpose processors led to a more rapid adoption of reprogrammable hardware such as FPGAs in multimedia applications [7]. Schumacher et al. [8] proposed a prototyping framework for multiple hardware IP blocks on an FPGA. Their MPEG4 solution creates an abstraction of the FPGA platform by having a
virtual socket layer that is located between the design and test elements, which reside on desktop computers. A different approach [9, 10] uses a layered reconfigurable architecture based on a partially and dynamically reconfigurable FPGA in order to meet the needs for adaptability and scalability in multimedia applications. In [11], an instruction set extension is used for the motion estimation algorithm required in the H.263 video encoder, and the authors incorporate custom logic instructions into a softcore Nios II CPU within an Altera FPGA. In [12], a high-level synthesis (HLS)-based face detection algorithm is presented for implementing the convolutional face finder algorithm on a Virtex-4. UltraSONIC [13, 14] is a reconfigurable architecture offering parallel processing capability for video applications based on plug-in processing elements (PIPEs), each of which can be an individual FPGA device. It also provides an application programming interface and a software driver that abstracts the task of writing software from the low-level interactions within the physical layer. Although this is a very capable system with a scalable architecture, it is still difficult to develop for and create applications with, because of the low-level design of the PIPEs and/or accelerator functions. A new driver has to be written for each PIPE design and the corresponding video applications.

More closely related to the image and video processing platform (IVPP) presented in this study are the IP core generators from Altera and Xilinx. The Altera video and image processing (VIP) suite [15] is a collection of IP core (MegaCore) functions that can be used to facilitate the development of custom VIP designs. The functions include a frame reader, colour space converter, deinterlacer, filtering and so on. Xilinx offers a similar set of IP cores, such as the LogiCORE IP video timing controller [16], which supports the generation of output timing signals, automatic detection and generation of horizontal and vertical video timing signals, support for multiple combinations of blanking or synchronisation signals and so on.

Although these IP cores can also be used and integrated in the IVPP, the main feature of the IVPP is its support for HLS tools, achieved by generating the necessary interfacing signals that can be used in high-level C programs. This feature requires the custom IVPP blocks described in this paper.

We present a new design framework targeting FPGA-based video processing applications with the purpose of accelerating the development time by utilising pre-built hardware blocks. In our approach, the designers can put custom logic into the existing framework by adding additional accelerator units (user peripherals). These accelerator units can be used for any single- or multiple-frame video processing, and the development phase would be limited to these custom logic components. A unique design flow that incorporates the HW/SW components is used (Fig. 1). This platform provides rapid development of image and video processing algorithms because of its software-based approach. A development tool, the Synphony C HLS tool from Synopsys [17], is used to convert C-based algorithms to hardware blocks that can easily be incorporated into the IVPP.

Fig. 1 HW/SW IVPP design flow

2 IVPP design

The board used for the IVPP is the Virtex-5 OpenSPARC evaluation platform developed by Xilinx. This board has a Xilinx Virtex-5 XC5VLX110T FPGA with 69 120 logic cells, 64 DSP48Es and 5328 Kb of block RAM. It also has a 64-bit wide 256-MB DDR2 small outline DIMM. The board has an analogue-to-digital converter (ADC), the AD9980, for video input. The ADC is an 8-bit, 95 MSPS monolithic analogue interface optimised for capturing YPbPr video and RGB graphics signals. Its 95 MSPS encode rate capability and full-power analogue bandwidth of 200 MHz support all the HDTV video modes and graphics resolutions up to XGA (1024 × 768 at 85 Hz). Moreover, the board has a digital-to-analogue converter (DAC), the CH7301C, for video output. It is a display controller device that accepts a digital graphics input signal, and encodes and transmits data through a digital visual interface (DVI). The device accepts data over one 12-bit wide variable-voltage data port, which supports different data formats including RGB and YCrCb. It can support UXGA (1600 × 1200 at 60 Hz). This board is ideal as a video processing platform since it has all the hardware necessary to capture and display the data on a monitor. Nevertheless, the proposed design can be implemented on any FPGA as long as it is large enough.

Video data are captured from a camera using the VGA input port at a resolution of 1024 × 768 at 60 Hz. These video data are then buffered in the DDR2 memory and displayed on the monitor through the DVI output port. With this design, we have built a flexible architecture that enables the user to perform real-time processing on a single frame or multiple frames. The overview of the design is given in Fig. 2. Multiple processing options are then possible, giving flexibility to the user:

† The user can choose to display the RGB video data without any processing.
† The user can perform real-time processing on a single frame of the RGB video data and display the output.
† The user can perform multi-frame processing and display the output.

Fig. 2 Platform design overview

Smoothing and edge detection filters are examples of single-frame processing. Motion detection and video compression are examples of multi-frame processing. The next section
describes the constraints that need to be met by the platform and the video processing applications.

2.1 IVPP features

The IVPP must be adaptable so that any video input resolution can be used. Hardware blocks can be configured to support any resolution and they are all connected to the microprocessor. When the board is powered up and configured, the microprocessor initiates the configuration of the hardware blocks to support the resolution chosen by the user.

The IVPP must be easy to use. Users can easily plug in custom processing blocks without knowing details of the platform architecture. Moreover, by using HLS tools such as Synphony C, a user does not need to know a hardware description language; the application can be designed in C and translated automatically into hardware blocks.

The applications must be designed so that real-time processing is achieved. The Synphony C compiler tool has been used to develop our applications. The software analyses the application C code and gives advice on improving it so that the frequency and area constraints are met. In our case, the pixel clock frequency is 65 MHz; hence each application must be able to run at that frequency in order to do real-time processing. Moreover, after a known latency, the algorithm must output a pixel on every clock cycle. The Synphony C compiler will try to achieve this frequency by optimising the datapath operations in the algorithms. If this is not feasible because of algorithm complexity, frame skipping can be used in order to relax the timing constraints.

The next section describes the different options available to communicate with the IVPP.

2.2 Communication with the IVPP

Multiple options are available to the user in order to control the platform:

(i) A universal asynchronous receiver transmitter (UART) can be used for direct communication between the computer and the FPGA through a serial link. A menu is displayed on the screen and the user can specify the type of video processing needed.
(ii) The IVPP can be controlled over the Internet via an HTTP server, which can optionally be implemented on the FPGA to enable remote access to and control of the video processing platform. We have implemented a server function using the lightweight Internet protocol library [18], which is suitable for embedded devices such as a MicroBlaze processor on Xilinx FPGAs and implements the basic networking protocols such as IP, TCP, UDP, DHCP and ARP. This allows any user connected to the Internet to control the platform, choose the type of video processing to perform and see the results on their computer.
(iii) The IVPP can be controlled by MATLAB through a USB interface.
(iv) Push buttons can be used to select which processing to display on the screen.

The next section describes the video interface blocks.

2.3 Video interface and synthesis results

Video interface blocks are necessary for processing the video data coming from the ADC. We implemented several hardware modules in order to provide the necessary video interface. These hardware components form the basis of the proposed platform and they are shared by all the user applications. Pre-existing blocks include a data enable (DE) signal generator, a video-to-VFBC [video frame buffer controller (VFBC)] block, a VFBC-to-video block and a VGA-to-DVI block, which are explained next.

All the video field and line timing is embedded in the video stream (see Fig. 3). The purpose of the DE signal generator is to create a signal that is high only during the active portion of the frame and low otherwise. The timing information is given to the block during the initialisation phase. The timing information for the resolution 1024 × 768 at 60 Hz is given in Table 1.
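To make the DE generator's behaviour concrete, the following C model (our own sketch with hypothetical names, not the IVPP's RTL) derives DE from free-running pixel and line counters loaded with the Table 1 values; it assumes the active area starts after the sync pulse and back porch.

#include <stdint.h>

/* Hedged C model of the DE signal generator for 1024 x 768 at 60 Hz.
 * The real block is RTL configured by the MicroBlaze; the constants
 * are the Table 1 timing values, and the sync-first line layout is an
 * assumption of this sketch. */
#define H_VISIBLE 1024
#define H_SYNC    136
#define H_BACK    160
#define H_TOTAL   1344           /* whole line  */
#define V_VISIBLE 768
#define V_SYNC    6
#define V_BACK    29
#define V_TOTAL   806            /* whole frame */

/* Call once per pixel clock; returns 1 during the active area. */
static int de_signal(uint32_t *h, uint32_t *v)
{
    uint32_t h0 = H_SYNC + H_BACK;   /* first visible pixel of a line */
    uint32_t v0 = V_SYNC + V_BACK;   /* first visible line of a frame */
    int de = *h >= h0 && *h < h0 + H_VISIBLE &&
             *v >= v0 && *v < v0 + V_VISIBLE;

    if (++*h == H_TOTAL) {           /* wrap the counters */
        *h = 0;
        if (++*v == V_TOTAL)
            *v = 0;
    }
    return de;
}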
The visible RGB pixels are then extracted using the DE signal and written to the DDR2 memory using the VFBC [19]. The VFBC allows a user IP to read and write data in two-dimensional sets regardless of the size or the organisation of external memory transactions. It is a connection layer between the video clients and the multiple port memory controller (MPMC). It includes separate asynchronous FIFO interfaces for command input, write data input and read data output.


Fig. 3 Video format

Table 1 Timing information

Horizontal timing                 Vertical timing
Scanline part    Pixels           Frame part      Lines
visible area     1024             visible area    768
front porch      24               front porch     3
sync pulse       136              sync pulse      6
back porch       160              back porch      29
whole line       1344             whole frame     806
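As a consistency check (our arithmetic, not part of the table): the whole line, whole frame and refresh rate multiply out to the pixel clock quoted in Section 2.1,

\[
f_{pixel} = 1344 \times 806 \times 60\ \mathrm{Hz} \approx 65\ \mathrm{MHz},
\]

which is why each processing block must sustain one pixel per clock cycle at 65 MHz.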
The visible RGB pixels are then retrieved from the DDR2 memory using the VFBC-to-video block. Two frames are retrieved at the same time, so that multi-frame processing can be carried out: when 'Frame #i + 2' is being buffered, 'Frame #i' and 'Frame #i + 1' are retrieved. Finally, the data are sent to the DVI output port in the format supported by the DAC. The synthesis results of the individual modules and of the overall system are given in Tables 2–7. The IVPP uses only a small fraction of the FPGA's resources; hence space is available for additional logic such as image and video processing applications.
Table 2 Synthesis results of the data enable generator

Resource type                  Percentage of FPGA, %
slice registers                <1
slice look-up tables (LUTs)    <1

Table 3 Synthesis results of video to VFBC

Resource type      Percentage of FPGA, %
slice registers    1
slice LUTs         <1

Table 4 Synthesis results of VFBC to video

Resource type      Percentage of FPGA, %
slice registers    1
slice LUTs         1

Table 5 Synthesis results of VGA to DVI

Resource type      Percentage of FPGA, %
slice registers    <1
slice LUTs         <1

Table 6 Synthesis results of MPMC

Resource type      Percentage of FPGA, %
slice registers    10
slice LUTs         9
block RAM/FIFO     11

Table 7 Synthesis results of the proposed platform supporting multi-frame processing

Resource type      Percentage of FPGA, %
slice registers    18
slice LUTs         16
block RAM/FIFO     11

3 HLS tools

SoC design is mainly accomplished using RTL languages such as Verilog and VHDL. An algorithm can be converted to the RTL level by using the behavioural model description method or by using pre-defined IP core blocks. After completing this RTL code, a formal verification needs to be done, followed by a timing verification for proper operation.

Fig. 4 FPGA HLS flow

The RTL design abstracts logic structures, timing and registers [20]. Therefore, every clock change causes a state change in the design. This timing dependency causes every event to be simulated, which results in a slower simulation time and a longer verification period for the design. The design and verification of an algorithm in RTL can take 50–60% of the time-to-market (TTM). The RTL design becomes impractical for larger systems that have a high data flow between the blocks and require millions of gates. Even though using behavioural modelling and IP cores may improve design time, the difficulty of synthesis, poor performance results and rapid changes in the design make IP cores difficult to adapt and change. Therefore, such systems rapidly become obsolete.

The limitations of RTL and longer TTM forced designers to think of the design as a whole system rather than as blocks. In addition, software integration in a SoC was always done after the hardware was designed. When the system becomes more complex, integration of the software is desired during hardware implementation. Owing to improvements in SoC design and shorter TTM over the last two decades, designers can use alternative methods to replace RTL. Extensive work has been done in electronic system-level design; hence, HW/SW co-design and HLS [21–24] are now integrated into FPGA and ASIC design flows. The integration of HLS into the FPGA design flow is shown in Fig. 4.

An RTL description of a system can be implemented from a behavioural description of the system in C. This results in a faster verification process and a shorter TTM. It is also possible to have a hybrid design where RTL blocks are integrated with HLS.

Fig. 5 HLS design flow

Fig. 6 Synphony C-based design flow for hardware implementation

The HLS design flow (see Fig. 5) shows that a group of algorithms (which represent the whole system or parts of a system) can be implemented using one of the high-level languages such as C, C++, Java, MATLAB and so on [20, 25]. Each part of the system can be tested independently before the whole system is tested. During this testing process, the RTL testbenches can also be generated. After testing is complete, the system can be partitioned into HW and SW. This enables the SW designers to join the design process during HW design; in addition, the RTL can be tested using HW/SW together. After the verification process, the design can be implemented using FPGA synthesis tools.

Many HLS tools are available, such as Xilinx's AutoESL, Synopsys's Synphony C compiler (also known as PICO) [17], Mentor Graphics' Catapult C and so on. An independent evaluation of HLS tools for Xilinx FPGAs has been carried out by Berkeley Design Technology [26]. It shows that using HLS tools with FPGAs can improve video applications' performance significantly compared with conventional DSP processors. Moreover, this study shows that for a given application, the HLS tools will achieve results similar to hand-written RTL code, with a shorter development time.

The proposed IVPP uses the Synphony C compiler from Synopsys [1, 27] to generate the RTL code for the processing blocks (Fig. 1). Synphony C takes a C-based description of an algorithm and generates performance-driven, device-dependent synthesisable RTL code, testbench files, application drivers and simulation scripts, as well as SystemC-based transaction-level modelling (TLM) models. The Synphony C design flow is shown in Fig. 6.

4 Integration of HLS hardware blocks into the IVPP

With the integration of the Synphony C compiler into the FPGA flow, designers can create complex hardware sub-systems [1] from sequential untimed C algorithms. It allows the designers to explore programmability, performance, power, area and clock frequency. This is achieved by providing a


comprehensive and robust verification and validation environment. With these improvements to TTM [28], the production cost can be reduced.

The Synphony C compiler can explore different types of parallelism and will choose the optimal one. The results in terms of speed and area are given along with detailed reports that help the user optimise the code. When the achieved performance is satisfactory, the RTL code is generated and implemented in the targeted FPGA. Since the testing is done at the C level, RTL and testbench files are generated based on these inputs, and testing and verification time can be drastically reduced [1]. When an application is created, a wrapper is generated by the Synphony C compiler to help instantiate a user block into the design.

Each block is synchronised using the synchronisation signals VSYNC and DE. The Synphony C block needs to be enabled initially at the beginning of a frame. This is done easily by detecting the rising edge of VSYNC, which indicates the beginning of a frame. Then, the 'data ready' input of the Synphony C block is connected to the DE signal. During blanking time (when the DE signal is not asserted), additional processing time is available for user video applications, equivalent to the vertical (3 + 29 lines) and horizontal (24 + 160 pixels) blanking (see the timing information in Table 1). All the required signals such as HSYNC, VSYNC and DE are generated by the IVPP hardware base components; hence, the users can connect their custom HLS blocks to these interface signals.

The user needs to analyse the latency of the Synphony C block in order to synchronise HSYNC, VSYNC and DE with the output. For example, for a latency of ten clock cycles, those signals need to be delayed by ten cycles. This is accomplished using sliding registers. Nevertheless, if the latency is greater than 128 clock cycles, FIFOs are used instead of registers in order to reduce the hardware usage. The Canny edge detector and motion detector applications have a significant latency (multiple lines), hence FIFOs have been used; the object tracking application has a latency of three clock cycles.
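The delay rule is easy to picture in software. Below is a hedged C model (our own naming; in hardware this is a shift register or FIFO) of delaying the packed HSYNC/VSYNC/DE bits by a fixed block latency:

#include <stdint.h>

/* Sketch of the sliding-register delay: the three sync bits are packed
 * into one byte and shifted by LATENCY cycles so they line up with the
 * processed pixels.  For latencies above 128 cycles the platform uses
 * FIFOs instead; the circular buffer below plays that role. */
#define LATENCY 10

typedef struct {
    uint8_t line[LATENCY];   /* one slot per clock cycle */
    int     head;
} sync_delay;

static uint8_t delay_sync(sync_delay *d, uint8_t sync_in)
{
    uint8_t sync_out = d->line[d->head];  /* value LATENCY cycles ago */
    d->line[d->head] = sync_in;
    d->head = (d->head + 1) % LATENCY;
    return sync_out;
}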
5 Case study using the Synphony C compiler for the IVPP

As a demonstration of the possibilities of the proposed platform, three video processing applications have been designed and developed using the Synphony C compiler. If other HLS tools are used, the design flow would be very similar. Canny edge detector, motion detector and object tracking blocks have been tested with the IVPP. We have three different applications with different integrations into the IVPP. From Fig. 2, we can see four possible positions for processing. Processing block #1 can be used for stream processing before the RGB pixels are written to the memory. Processing block #2 is used for reading the current frame from memory and processing it. Processing block #3 is used for reading the previous frame from memory and processing it. Finally, processing block #4 is used for multi-frame processing, where current and previous frames are handled simultaneously.

The Canny edge detector will be at position #2 in the IVPP, the motion detector at processing blocks #2, #3 and #4, and the object tracking at processing blocks #1 and #2. The output images are real-time video results of the different hardware components generated by the Synphony C compiler. The C algorithms have been modified in order to achieve optimal results compared with the hand-written RTL. The structure of the algorithms is very similar to the RTL design. Each algorithm has a stream of pixels as input instead of a matrix.

5.1 Canny edge detector

The algorithm is shown in Fig. 7 [29].

Fig. 7 Canny edge detector algorithm

First, the RGB data are converted to grey scale. Then noise reduction is performed by applying a 5 × 5 separable filter (see Fig. 8)

\[
S = \frac{1}{256}
\begin{bmatrix}
1 & 4 & 6 & 4 & 1\\
4 & 16 & 24 & 16 & 4\\
6 & 24 & 36 & 24 & 6\\
4 & 16 & 24 & 16 & 4\\
1 & 4 & 6 & 4 & 1
\end{bmatrix}
=
\left(\frac{1}{16}
\begin{bmatrix} 1\\ 4\\ 6\\ 4\\ 1 \end{bmatrix}\right)
\left(\frac{1}{16}
\begin{bmatrix} 1 & 4 & 6 & 4 & 1 \end{bmatrix}\right)
\]

Fig. 8 HLS C code for noise reduction
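The Fig. 8 listing is an image and is not reproduced here. As a stand-in, here is a hedged frame-based sketch of the separable smoothing (our own C, not the paper's HLS source, which processes a pixel stream with line buffers):

#include <stdint.h>

/* Sketch of the separable 5 x 5 smoothing filter S = (k^T k)/256 with
 * k = [1 4 6 4 1], applied to a grey-scale frame with clamped borders. */
static const int k[5] = {1, 4, 6, 4, 1};

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

static void smooth5x5(const uint8_t *in, uint8_t *out, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int acc = 0;
            for (int dy = -2; dy <= 2; dy++) {     /* vertical weights   */
                int row = 0, yy = clamp(y + dy, 0, h - 1);
                for (int dx = -2; dx <= 2; dx++) { /* horizontal weights */
                    int xx = clamp(x + dx, 0, w - 1);
                    row += k[dx + 2] * in[yy * w + xx];
                }
                acc += k[dy + 2] * row;
            }
            out[y * w + x] = (uint8_t)(acc >> 8);  /* divide by 256 */
        }
}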
The Sobel operator [30] is a discrete differentiation operator that computes an approximation of the gradient of the image intensity function. It is based on convolving the image with a small, separable and integer-valued filter in the horizontal and vertical directions, and is therefore relatively inexpensive in terms of computations. Basically, 3 × 3 filters (F1 and F2) are applied to the video data (see Fig. 9). We obtain horizontal (O1) and vertical (O2) gradients. The magnitude and direction for each gradient are obtained as follows:

\[
G_{x,y} = |O_1(x, y)| + |O_2(x, y)|
\]

\[
D_{x,y} = 0 \ \text{if} \
\begin{cases}
|O_1| > 2|O_2| \wedge (O_1 > 0 \wedge O_2 \ge 0 \ \vee\ O_1 < 0 \wedge O_2 \le 0)\\
|O_2| > 2|O_1| \wedge (O_1 \ge 0 \wedge O_2 < 0 \ \vee\ O_1 \le 0 \wedge O_2 > 0)
\end{cases}
\]

\[
D_{x,y} = 1 \ \text{if} \
\begin{cases}
2|O_2| \ge |O_1| > |O_2| \wedge (O_1 > 0 \wedge O_2 \ge 0 \ \vee\ O_1 < 0 \wedge O_2 \le 0)\\
2|O_1| \ge |O_2| \ge |O_1| \wedge (O_1 > 0 \wedge O_2 > 0 \ \vee\ O_1 < 0 \wedge O_2 < 0)
\end{cases}
\]

\[
D_{x,y} = 2 \ \text{if} \
\begin{cases}
|O_2| > 2|O_1| \wedge (O_1 > 0 \wedge O_2 > 0 \ \vee\ O_1 < 0 \wedge O_2 < 0)\\
|O_2| > 2|O_1| \wedge (O_1 \ge 0 \wedge O_2 < 0 \ \vee\ O_1 \le 0 \wedge O_2 > 0)
\end{cases}
\]

\[
D_{x,y} = 3 \ \text{if} \
\begin{cases}
2|O_1| \ge |O_2| > |O_1| \wedge (O_1 \ge 0 \wedge O_2 < 0 \ \vee\ O_1 \le 0 \wedge O_2 > 0)\\
2|O_1| \ge |O_2| \wedge (O_1 > 0 \wedge O_2 < 0 \ \vee\ O_1 < 0 \wedge O_2 > 0)
\end{cases}
\]

where Dx,y = 0 corresponds to a direction of 0°; Dx,y = 1

corresponds to a direction of 45°; Dx,y = 2 corresponds to a direction of 90°; and Dx,y = 3 corresponds to a direction of 135°.

The HLS C code for the Sobel edge filter and gradient is shown in Fig. 10.

Fig. 9 Sobel edge filters

Fig. 10 HLS C code for Sobel edge detector and gradient direction
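The Fig. 10 listing is likewise an image. A compact reading of the ratio-and-sign tests above, as a hedged per-pixel sketch on precomputed gradients O1 and O2 (our own code; zero gradients are treated as positive):

#include <stdlib.h>   /* abs */

/* Sketch of the magnitude |O1| + |O2| and a 2-bit direction code
 * D in {0, 1, 2, 3} for {0, 45, 90, 135} degrees, using the factor-2
 * ratio tests from the text. */
static int grad_mag(int o1, int o2)
{
    return abs(o1) + abs(o2);
}

static int grad_dir(int o1, int o2)
{
    int a1 = abs(o1), a2 = abs(o2);
    int same_sign = (o1 >= 0) == (o2 >= 0);

    if (a1 > 2 * a2) return 0;        /* horizontal gradient dominates */
    if (a2 > 2 * a1) return 2;        /* vertical gradient dominates   */
    return same_sign ? 1 : 3;         /* roughly 45 or 135 degrees     */
}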
Non-maximum suppression is then carried out by comparing Gx,y with the magnitude of its neighbours along the direction of the gradient Dx,y. This is done by applying a 3 × 3 moving window. For example, if Dx,y = 0, the pixel is considered as an edge if Gx,y > Ga and Gx,y > Gb



with Ga = Gx,y−1 and Gb = Gx,y+1. The HLS C code is shown in Fig. 11.

Fig. 11 HLS C code for non-maximum suppression
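Since the Fig. 11 listing is an image, here is a hedged sketch of the suppression test on a 3 x 3 magnitude window (our own neighbour mapping; which diagonal is 45° depends on the coordinate convention):

/* Sketch of non-maximum suppression: the centre magnitude g[1][1]
 * survives only if it beats both neighbours along the quantised
 * gradient direction d (0 = 0, 1 = 45, 2 = 90, 3 = 135 degrees). */
static int is_local_max(const int g[3][3], int d)
{
    int ga, gb;
    switch (d) {
    case 0:  ga = g[1][0]; gb = g[1][2]; break;  /* left / right   */
    case 1:  ga = g[0][2]; gb = g[2][0]; break;  /* one diagonal   */
    case 2:  ga = g[0][1]; gb = g[2][1]; break;  /* above / below  */
    default: ga = g[0][0]; gb = g[2][2]; break;  /* other diagonal */
    }
    return g[1][1] > ga && g[1][1] > gb;
}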
Then, the result is usually thresholded to decide which edges are significant. Two thresholds TH and TL are applied, where TH > TL. If the gradient magnitude is greater than TH, then that pixel is considered a definite edge. If the gradient is less than TL, then that pixel is set to zero. If the gradient magnitude lies between the two, then the pixel is set to zero unless there is a path from it to a pixel with a gradient above TH. A 3 × 3 moving window operator is used: the centre pixel is said to be connected if at least one neighbouring pixel value is greater than TH, and the resultant is an image with sharp edges. The HLS C code for hysteresis thresholding is given in Fig. 12. The Synphony C block has been placed at position #2 in the platform. Table 8 shows the synthesis results of the Canny edge detector.

Fig. 12 HLS C code for hysteresis thresholding

Table 8 Synthesis results of the Canny edge detector

Resource type      Usage   Percentage of FPGA, %
slice registers    1028    1
slice LUTs         1388    2
block RAM/FIFO     11      7
DSP48Es            1       1
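The Fig. 12 listing is an image; the rule it implements can be sketched as below (our own code, with placeholder thresholds):

#include <stdint.h>

/* Sketch of 3 x 3 hysteresis thresholding: a magnitude between TL and
 * TH is kept only if some neighbour in the window is above TH. */
#define TH 96   /* placeholder values; the real thresholds are */
#define TL 32   /* application parameters                      */

static uint8_t hysteresis(const int g[3][3])
{
    int c = g[1][1];
    if (c > TH) return 255;            /* definite edge      */
    if (c < TL) return 0;              /* definitely not     */
    for (int i = 0; i < 3; i++)        /* weak: need support */
        for (int j = 0; j < 3; j++)
            if (!(i == 1 && j == 1) && g[i][j] > TH)
                return 255;
    return 0;
}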
5.2 Motion detector

A motion detection algorithm (see Fig. 13) has been implemented using the Synphony C compiler. Both the


current and the preceding frame are converted to black and white. Then, a 5 × 5 noise reduction filter is applied. Finally, a Sobel edge detector is applied on the difference of the two images and the motion is superimposed on the current frame. The Synphony C block has been placed at processing blocks #2, #3 and #4 in the platform. Table 9 shows the synthesis results of the motion detector.

Fig. 13 Motion detection algorithm

Table 9 Synthesis results of the motion detector

Resource type      Usage   Percentage of FPGA, %
slice registers    704     1
slice LUTs         1281    1
block RAM/FIFO     9       6
DSP48Es            2       3
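The multi-frame step is easy to model in C. A hedged frame-based sketch (our own code; the hardware version works on synchronised pixel streams from processing blocks #2, #3 and #4):

#include <stdint.h>
#include <stdlib.h>

/* Sketch of the motion detector: difference the (already smoothed,
 * grey-scale) current and previous frames, run a Sobel magnitude on
 * the difference and overlay detected motion on the current frame. */
static int sobel_mag(const uint8_t *p, int w, int x, int y)
{
    int o1 = -p[(y-1)*w + x-1] - 2*p[y*w + x-1] - p[(y+1)*w + x-1]
             + p[(y-1)*w + x+1] + 2*p[y*w + x+1] + p[(y+1)*w + x+1];
    int o2 = -p[(y-1)*w + x-1] - 2*p[(y-1)*w + x] - p[(y-1)*w + x+1]
             + p[(y+1)*w + x-1] + 2*p[(y+1)*w + x] + p[(y+1)*w + x+1];
    return abs(o1) + abs(o2);
}

static void motion_overlay(const uint8_t *curr, const uint8_t *prev,
                           uint8_t *diff, uint8_t *out,
                           int w, int h, int threshold)
{
    for (int i = 0; i < w * h; i++) {
        diff[i] = (uint8_t)abs((int)curr[i] - (int)prev[i]);
        out[i]  = curr[i];                 /* default: pass through */
    }
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++)
            if (sobel_mag(diff, w, x, y) > threshold)
                out[y * w + x] = 255;      /* mark moving edges */
}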

5.3 Object tracking

An object tracking algorithm has been developed and tested on the platform. It is composed of two main phases:

(i) At processing block #1 in the IVPP, noise reduction is performed on the RGB values, then RGB to hue, saturation and value (HSV) conversion is done and colour segmentation is applied.
(ii) Then, at processing block #2 in the IVPP, a boundary box is created around the pixels of a specific colour selected by




the user. The boundary box can give information on the orientation of the object and also the distance between the object and the camera. Moreover, the algorithm keeps track of all the positions of the object during 1 s and displays them on the screen.

Noise reduction is performed on the input data by applying the same Gaussian filter as seen above. Then RGB to HSV conversion is done as follows:

\[
MAX = \max(Red, Green, Blue), \qquad MIN = \min(Red, Green, Blue)
\]

\[
H = \begin{cases}
0 & \text{if } MAX = MIN\\
42\,\dfrac{Green - Blue}{MAX - MIN} + 42 & \text{if } MAX = Red\\
42\,\dfrac{Blue - Red}{MAX - MIN} + 127 & \text{if } MAX = Green\\
42\,\dfrac{Red - Green}{MAX - MIN} + 213 & \text{if } MAX = Blue
\end{cases}
\]

\[
S = \begin{cases}
0 & \text{if } MAX = 0\\
255\,\dfrac{MAX - MIN}{MAX} & \text{otherwise}
\end{cases}
\qquad
V = MAX
\]

Fig. 14 shows the HLS C code of the RGB to HSV conversion; div_tab is an array of 256 values (div_tab[0] is set to 0 since it does not affect the results) which stores the division results of 1/i for i = 1, ..., 255 with a precision of 16 bits.

Fig. 14 HLS C code for RGB to HSV conversion
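Fig. 14 is an image; a hedged fixed-point sketch of the conversion and the div_tab trick follows (our own code; an arithmetic right shift is assumed for the signed hue term):

#include <stdint.h>

/* Sketch of the fixed-point RGB-to-HSV step: div_tab[i] holds
 * (1 << 16)/i so the division by MAX - MIN becomes a multiply and a
 * shift; div_tab[0] = 0, as in the text. */
static uint32_t div_tab[256];

static void init_div_tab(void)
{
    div_tab[0] = 0;
    for (int i = 1; i < 256; i++)
        div_tab[i] = (1u << 16) / i;
}

static void rgb_to_hsv(uint8_t r, uint8_t g, uint8_t b,
                       uint8_t *hh, uint8_t *ss, uint8_t *vv)
{
    uint8_t max = r > g ? (r > b ? r : b) : (g > b ? g : b);
    uint8_t min = r < g ? (r < b ? r : b) : (g < b ? g : b);
    int32_t d = max - min;
    int32_t h;

    if (d == 0)        h = 0;
    else if (max == r) h = (42 * (g - b) * (int32_t)div_tab[d] >> 16) + 42;
    else if (max == g) h = (42 * (b - r) * (int32_t)div_tab[d] >> 16) + 127;
    else               h = (42 * (r - g) * (int32_t)div_tab[d] >> 16) + 213;

    *hh = (uint8_t)h;
    *ss = max ? (uint8_t)((255 * d * div_tab[max]) >> 16) : 0;
    *vv = max;
}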
We compare the HSV values of the pixel with an input value (Hin) chosen by the user. If H is close enough to that input value and S and V are large enough, then the pixel will be tracked. The boundary information is obtained depending on the colour (Hin) selected by the user (see Fig. 15).

Fig. 15 HLS C code for boundary information

At processing block #2, we receive the information from block #1 for the boundary box. A red box is displayed around the object selected by the user (see Fig. 16). The centre of the box is saved for a duration of 1 s before being erased. Hence, the movement of the object can be tracked (see Fig. 17). An example of real-time video output of this algorithm is given in Fig. 18. Table 10 shows the synthesis results of object tracking.

Fig. 16 HLS C code for display of object boundaries

Fig. 17 HLS C code for display of positions of object

Fig. 18 Real-time video output showing a green pen framed by a rectangular box; trace positions are also displayed, following the centre of the box for a 1 s duration

Table 10 Synthesis results of object tracking

Resource type      Usage   Percentage of FPGA, %
slice registers    2631    4
slice LUTs         4892    7
block RAM/FIFO     3       2
DSP48Es            9       14
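The Fig. 15 and Fig. 16 listings are images; the bookkeeping they perform can be sketched as below (our own types and names):

#include <stdint.h>

/* Sketch of the boundary-box bookkeeping: colour-matched pixels grow
 * the box during a frame; the stored rectangle is then drawn in red on
 * the next frame, and its centre is kept for the 1 s trace. */
typedef struct { int x0, y0, x1, y1; } bbox;

static void bbox_reset(bbox *b, int w, int h)
{
    b->x0 = w;  b->y0 = h;    /* empty box: min > max */
    b->x1 = -1; b->y1 = -1;
}

static void bbox_update(bbox *b, int x, int y, int colour_match)
{
    if (!colour_match) return;
    if (x < b->x0) b->x0 = x;
    if (y < b->y0) b->y0 = y;
    if (x > b->x1) b->x1 = x;
    if (y > b->y1) b->y1 = y;
}

/* True when pixel (x, y) lies on the border of the stored rectangle. */
static int on_box(const bbox *b, int x, int y)
{
    int inside = x >= b->x0 && x <= b->x1 && y >= b->y0 && y <= b->y1;
    int border = x == b->x0 || x == b->x1 || y == b->y0 || y == b->y1;
    return inside && border;
}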
6 Conclusion

In this work, we have developed an IVPP for real-time applications on a Virtex-5 FPGA. A new C-based HLS design flow is presented. The user can design image and video processing applications in the C language, convert them into hardware using the Synphony C compiler tool and then implement and test them easily using the IVPP. The IVPP streamlines development by providing all the necessary logic blocks for the front-end (capturing video data) and the back-end (displaying processed output data) operations. As a case study, three example applications have been discussed, showing the performance and flexibility of the proposed platform. The IVPP can be a cost-effective, rapid development and prototyping platform for key applications such as video encoding/decoding, surveillance, detection and recognition.

7 Acknowledgment

The authors would like to thank Xilinx, Inc. (www.xilinx.com) and Synfora, Inc. (www.synfora.com) for their valuable support.

8 References

1 Coussy, P., Morawiec, A.: 'High-level synthesis: from algorithm to digital circuits' (Springer Science + Business Media, Berlin, 2008), Ch. 1, 4
2 Muller, W., Rosenstiel, W., Ruf, J.: 'SystemC: methodologies and applications' (Kluwer Academic Publishing, Dordrecht, 2003), Ch. 2
3 Hong, S., Yoo, S., Lee, S., et al.: 'Creation and utilization of a virtual platform for embedded software optimization: an industrial case study'. Proc. Fourth Int. Conf. Hardware/Software Codesign and System Synthesis, 2006, pp. 235–240
4 Ruggiero, M., Bertozzi, D., Benini, L., Milano, M., Andrei, A.: 'Reducing the abstraction and optimality gaps in the allocation and scheduling for variable voltage/frequency MPSoC platforms', IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., 2009, 28, (3), pp. 378–391
5 Tumeo, A., Branca, M., Camerini, L., et al.: 'Prototyping pipelined applications on a heterogeneous FPGA multiprocessor virtual platform'. Design Automation Conf., 2009, pp. 317–322
6 Skey, K., Atwood, J.: 'Virtual radios – hardware/software co-design techniques to reduce schedule in waveform development and porting'. IEEE Military Communications Conf., 2008, pp. 1–5
7 Yi-Li, L., Chung-Ping, Y., Su, A.: 'Versatile PC/FPGA-based verification/fast prototyping platform with multimedia applications', IEEE Trans. Instrum. Meas., 2007, 56, (6), pp. 2425–2434
8 Schumacher, P., Mattavelli, M., Chirila-Rus, A., Turney, R.: 'A software/hardware platform for rapid prototyping of video and multimedia designs'. Proc. Fifth Int. Workshop on System-on-Chip for Real-Time Applications, 2005, pp. 30–34
9 Zhang, X., Rabah, H., Weber, S.: 'Auto-adaptive reconfigurable architecture for scalable multimedia applications'. Second NASA/ESA Conf. on Adaptive Hardware and Systems, 2007, pp. 139–145
10 Zhang, X., Rabah, H., Weber, S.: 'Cluster-based hybrid reconfigurable architecture for autoadaptive SoC'. 14th IEEE Int. Conf. on Electronics, Circuits and Systems (ICECS), 2007, pp. 979–982
11 Atitallah, A., Kadionik, P., Masmoudi, N., Levi, H.: 'HW/SW FPGA architecture for a flexible motion estimation'. IEEE Int. Conf. on Electronics, Circuits and Systems, 2007, pp. 30–33
12 Farrugia, N., Mamalet, F., Roux, S., Yang, F., Paindavoine, M.: 'Design of a real-time face detection parallel architecture using high-level synthesis', EURASIP J. Embed. Syst., 2008, 938256, pp. 1–9
13 Haynes, S., Epsom, H., Cooper, R., McAlpine, P.: 'UltraSONIC: a reconfigurable architecture for video image processing' (Springer, Berlin, 2002, LNCS), pp. 25–45
14 Sedcole, N.P., Cheung, P.Y.K., Constantinides, G.A., Luk, W.: 'A reconfigurable platform for real-time embedded video image processing' (Springer, Berlin, 2003, LNCS 2778), pp. 606–615
15 Altera: Video and Image Processing Suite, User Guide, 2012, available at http://www.altera.com/literature/ug/ug_vip.pdf#performance_performance
16 Xilinx: LogiCORE IP Video Timing Controller, Product Guide, 2012, available at http://www.xilinx.com/support/documentation/ip_documentation/v_tc/v4_00_a/pg016_v_tc.pdf
17 Synopsys, Inc.: Synphony High-Level Synthesis from Language and Model Based Design, available at http://www.synopsys.com/Systems/BlockDesign/HLS/Pages/default.aspx
18 Xilinx: Lightweight IP (lwIP) application examples, 2011, available at http://www.xilinx.com/support/documentation/application_notes/xapp1026.pdf
19 Xilinx: LogiCore Video Frame Buffer Controller v1.0, XMP013, October 2007, available at http://www.xilinx.com/products/devboards/reference_design/vsk_s3/vfbc_xmp013.pdf
20 Ramachandran, S.: 'Digital VLSI system design' (Springer, New York, 2007), Ch. 11
21 Hammami, O., Wang, Z., Fresse, V., Houzet, D.: 'A case study: quantitative evaluation of C-based high-level synthesis systems', EURASIP J. Embed. Syst., 2008, 685128, pp. 1–13
22 Glasser, M.: 'Open verification methodology cookbook' (Springer, New York, 2009), Ch. 1–3
23 Man, K.L.: 'An overview of SystemCFL', Research in Microelectron. Electron., 2005, 1, pp. 145–148
24 Hatami, N., Ghofrani, A., Prinetto, P., Navabi, Z.: 'TLM 2.0 simple sockets synthesis to RTL'. Int. Conf. on Design & Technology of Integrated Systems in Nanoscale Era, 2000, vol. 1, pp. 232–235
25 Chen, W.: 'The VLSI Handbook' (CRC Press LLC, Boca Raton, 2007, 2nd edn.), Ch. 86
26 Berkeley Design Technology: 'An independent evaluation of high-level synthesis tools for Xilinx FPGAs', available at http://www.bdti.com/MyBDTI/pubs/Xilinx_hlstcp.pdf
27 Haastregt, S.V., Kienhuis, B.: 'Automated synthesis of streaming C applications to process networks in hardware'. Design, Automation & Test in Europe, 2009, pp. 890–893
28 Avss, P., Prasant, S., Jain, R.: 'Virtual prototyping increases productivity – a case study'. IEEE Int. Symp. on VLSI Design, Automation and Test, 2009, pp. 96–101
29 He, W., Yuan, K.: 'An improved Canny edge detector and its realization on FPGA'. Proc. Seventh World Congress on Intelligent Control and Automation, 2008
30 Gonzales, R.C., Woods, R.E.: 'Digital image processing' (Prentice-Hall, New Jersey, 2007, 3rd edn.)

