ASIP Approach For Implementation of H.264/AVC

Journal of Signal Processing Systems 50, 53–67, 2008
© 2007 Springer Science + Business Media, LLC. Manufactured in The United States.
DOI: 10.1007/s11265-007-0109-y
SUNG DAE KIM AND MYUNG H. SUNWOO

School of Electrical and Computer Engineering, Ajou University, San 5, Wonchun-Dong,
Yeongtong-Ku, Suwon, 442-749, South Korea
Received: 3 April 2007; Accepted: 13 June 2007
Abstract. This paper presents an Application Specific Instruction Set Processor (ASIP) for the implementation of H.264/AVC, called the Video Specific Instruction-set Processor (VSIP). The proposed VSIP has novel instructions and optimized hardware architectures for specific tasks, such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has coprocessors for the computation-intensive parts of video signal processing, such as inter prediction and entropy coding. The proposed VSIP has a much smaller area and requires far fewer memory accesses than commercial DSP chips, which results in low power consumption. The proposed hardware accelerators are small and consume little power, and they can support real-time video processing. VSIP has been thoroughly verified using an FPGA board with a Xilinx Virtex II device. VSIP can implement a real-time H.264/AVC decoder. The proposed VSIP is one of the promising solutions for video signal processing.
Keywords: application specific instruction-set processor, hardware software codesign, H.264/AVC, low power design, data reuse, hardware accelerator
1. Introduction

With the rapid progress of semiconductor technology, the Application Specific Instruction-set Processor (ASIP), which combines the high performance and low power of ASICs with the flexibility of DSPs, has become increasingly important. The ASIP market is growing fast because sales of portable devices, such as cellular phones, digital cameras, MP3 (MPEG Layer-3) players, and PMPs (Portable Multimedia Players), are increasing dramatically. These applications need high performance, low power consumption, and low cost. Application-Specific Integrated Circuit (ASIC) designs can reduce the cost, size, and power consumption of systems. However, ASIC designs are poorly suited to upgrading standards, since they must be redesigned. On the other hand, programmable DSPs can greatly reduce time-to-market and allow faster changes and upgrades. However, programmable DSPs may have disadvantages in cost, size, and power consumption. An ASIP is a compromise between ASIC designs and general DSP chips [1–5]. In other words, ASIP chips adopt the high performance and low power of ASIC chips and the flexibility of DSP chips. An ASIP can reduce power at the algorithm/architecture level, which is the most efficient way to achieve low power consumption [6].

Multimedia technology has developed along with semiconductor technology. Technology related to multimedia codecs has been standardized as MPEG-2, MPEG-4, H.261, H.263, etc. The Joint Video Team (JVT) announced H.264/AVC in Dec. 2003 [7]. The new video coding standard H.264/AVC can provide about twice the compression efficiency of MPEG-4. However, it has about 2 times more hardware complexity for a decoder, and about 10 times more hardware complexity for an encoder, than the MPEG-4 visual simple profile codec [8].

In mobile communications, the implementation of a multimedia codec needs high performance, low power consumption, and low cost. The implementation also requires a flexible system that can be upgraded without being replaced. The ASIP approach is well suited to these requirements. Hence, we propose an ASIP for the implementation of mobile multimedia codecs, called the Video Specific Instruction-set Processor (VSIP) [5]. We implement the VSIP based on the ASIP design flow shown in Fig. 1 [9].

First, the target application is chosen. H.264/AVC is widely used in mobile communication standards, such as DMB, DVB-S2, DVB-T, etc. Hence, the target application of VSIP is video signal processing including H.264/AVC. Second, we profile the H.264/AVC tasks. Through profiling, we can find the complexity of the H.264/AVC tasks. According to the complexity of each task, we can divide the application into a hardware implementation for high performance and a software implementation for flexibility. H.264/AVC has computation-intensive parts such as inter prediction and entropy coding. To achieve low power consumption and real-time processing, hardware accelerators for inter prediction and entropy coding are required. Next, we design the optimized instruction set and its architecture based on the analysis. H.264/AVC has new features such as intra prediction, the in-loop deblocking filter, the integer transform, etc. It is inefficient to implement these blocks using existing DSP instructions. Hence, we propose new instructions and their architecture to implement H.264/AVC efficiently. The optimized instruction set can reduce computation complexity, redundancy, and overhead. In general, an ASIP needs far fewer computation cycles to perform its target applications than a general DSP. Finally, the functions of the proposed VSIP have been thoroughly verified using the Xilinx XC2v6000 FPGA.

The proposed VSIP can efficiently perform the new features of H.264/AVC, such as intra prediction, the in-loop deblocking filter, and the integer transform. Moreover, VSIP has hardware accelerators for inter
Figure 1. Design flow of ASIP. Steps shown: target application selection; application profiling; H/W, S/W partitioning; design of special instructions and architecture, alongside design of hardware accelerators; verification and performance comparison; chip fabrication.

Figure 2. Complexity analysis of the H.264/AVC baseline profile.
prediction and entropy coding, which account for the largest share of the power consumption and the timing-critical parts of video processing. Hence, VSIP can implement a real-time, low-power H.264/AVC baseline profile decoder in QCIF format.

This paper is organized as follows. Section 2 introduces H.264/AVC and describes existing DSP instructions used to implement multimedia standards. Section 3 proposes novel instructions and hardware accelerators, and Section 4 explains performance comparisons. Finally, Section 5 contains concluding remarks.

2. Implementation Analysis for H.264/AVC

This section briefly introduces H.264/AVC and shows the results of profiling. Then, various implementations of H.264/AVC are analyzed. This section also presents existing DSP instructions for video signal processing.

2.1. Implementation of H.264/AVC

H.264/AVC has adopted new features to improve coding efficiency, which are described as follows.
Figure 3. Computation times of various implementations. a DSP. b ASIC. c VSIP + accelerators.

src1: a0 a1 a2 a3
src2: b0 b1 b2 b3
DOTPU4: dst = (a0*b0) + (a1*b1) + (a2*b2) + (a3*b3)

Figure 4. DOTPU4 instruction in TMS320c64.
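
The operation in Fig. 4 can be modeled in a few lines of Python. This is only an illustrative sketch, not TI's implementation: the helper names `unpack_u8x4` and `dotpu4` are hypothetical, and placing a0 in the most significant byte is an assumption (the sum does not depend on it as long as both operands use the same lane order).

```python
def unpack_u8x4(word):
    """Split a 32-bit word into four unsigned 8-bit lanes.

    Assumption: lane a0 is the most significant byte.
    """
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def dotpu4(src1, src2):
    """Unsigned dot product of the four packed bytes in src1 and src2.

    Models dst = a0*b0 + a1*b1 + a2*b2 + a3*b3 with a 32-bit unsigned result.
    """
    a = unpack_u8x4(src1)
    b = unpack_u8x4(src2)
    return sum(x * y for x, y in zip(a, b)) & 0xFFFFFFFF


# Lanes (1, 2, 3, 4) and (5, 6, 7, 8).
print(dotpu4(0x01020304, 0x05060708))
```

The largest possible result, four products of 255 × 255, fits comfortably in 32 bits, so the final mask never truncates a valid result.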
H.264/AVC uses several reference frames, variable block sizes, and quarter pixel accuracy in Motion Estimation (ME)/Motion Compensation (MC). These features enable the encoder to search for the best match for the current frame. However, the memory access and hardware complexity increase significantly. Past standards, such as MPEG-2, MPEG-4, H.263, etc., transmit the first frame without compression. On the other hand, the H.264/AVC encoder adopts intra prediction, which eliminates the redundancy within an intra frame.

The block based structure causes blocking artifacts. Thus, H.264/AVC adopts the in-loop deblocking filter to eliminate blocking artifacts. Exponential Golomb Coding (EGC) and Context Adaptive Variable Length Coding (CAVLC) are also newly adopted features of the H.264/AVC baseline profile. EGC uses variable length codes with a regular construction [10]. CAVLC is the method used to encode the residual data of 4×4 blocks [11–13].

Figure 2 shows the operation complexity of the H.264/AVC baseline profile [14]. As shown in Fig. 2, ME/MC takes 53% and VLC takes 18.20% of the operation complexity. In particular, these tasks access memory frequently. To achieve low power consumption, we need dedicated hardware for these tasks. In practice, the computation complexity of VLC is not a dominant part of H.264/AVC. Intra prediction and the in-loop deblocking filter have more computation complexity than VLC. However, VLC requires bit manipulation operations, which are inefficient to implement on a general processor. Hence, we employ dedicated hardware for VLC. Moreover, inter prediction can be executed in parallel with intra prediction, and entropy coding can be executed in parallel with the in-loop deblocking filter. Thus, parallelism can be exploited for these tasks. The proposed instructions of VSIP can efficiently support intra prediction and the in-loop deblocking filter [5]. To maximize the usage of hardware resources, hardware accelerators for inter prediction and entropy coding are essential. Thus, the proposed VSIP employs hardware accelerators for these tasks.

Figure 5. Proposed VSIP architecture. Blocks shown: PCU (prefetch logic, program counter, instruction register, FSM, stack, interrupt controller); program memory; AU (16 bit address registers, two AGUs); data memory 1 and data memory 2; IPA (ME hardware accelerator, MC hardware accelerator); DPU (register file, two MACs, two ALUs, shifter); ECA (CAVLC accelerator, EGC accelerator); program bus (16 bit), data buses (32 bit), address buses (16 bit).

Figure 6. Block boundary. Samples q3 q2 q1 q0 lie on one side of the boundary and p0 p1 p2 p3 on the other.

Figure 3 shows the computation times of the DSP, ASIC, and VSIP implementations according to the profiling results. Figure 3(a) shows the computation times of the DSP implementation. If a single DSP is used to implement the H.264/AVC algorithm, the DSP serially executes all of the algorithm blocks. Figure 3(b) shows the computation times of the ASIC implementation. Each block is executed using dedicated hardware. However, not all of the blocks can be executed in parallel, since some blocks use the output of other blocks. For example, the transform block uses ME results and the entropy coding block needs transform/quantization results. Figure 3(c) shows the proposed VSIP with accelerators. The VSIP implementation with accelerators requires more computation time than the ASIC implementation. However, it requires much less computation time than the DSP and can support various profiles and standards.

2.2. Existing DSP Instructions for Video Signal Processing

Existing DSPs support various instructions to execute packed operations between two registers. These operations are used for various video signal processing tasks, such as DCT, IDCT, ME/MC, etc. The TMS320c6 of Texas Instruments supports special instructions for multimedia signal processing, such as SUBABS4, AVGx, etc. [15]. The SUBABS4 instruction calculates the absolute differences of four pairs of packed data. The AVG4 instruction calculates the averages of the packed data in two registers. After the four pairs of packed data are added, each of the four results is shifted one bit to the right for the division, with 0.5 added to each result for rounding. The TMS320c6 series also supports the DOTPU4 instruction, which calculates the dot product between four sets of packed 8 bit values. Figure 4 shows the operation flow of the DOTPU4 instruction. The values in both src1 and src2 are treated as unsigned 8 bit packed data. The 32 bit unsigned result is written into dst. Four clock cycles are required to execute this instruction.

DCT has a regular computation flow, while ME/MC and entropy coding have control based computations. The TMS320c55 has a coprocessor for DCT computations, and it requires 2.8 MIPS for DCT computations to achieve a processing speed of 30 fps for the QCIF format. The TMS320c6, which has eight function units, requires 1.1 MIPS to implement the DCT of 30 fps QCIF video data using DSP instructions [16].

In entropy coding, the code word table is looked up according to the number of successive zeros in the input bit stream. Moreover, packed compare operations are required. To execute these operations, the TMS320c64 supports the LMBD and CMPEQ/GT/LT instructions, and the Blackfin DSP of Analog Devices supports the ONES instruction [16, 17]. The

Q A B C D E F G H
I a b c d
J e f g h
K i j k l
L m n o p
M
N
O
P

Figure 7. Identification of samples for 4×4 intra prediction.
(a) dst = HADD(src): src = a0 a1 a2 a3.
(b) dst = HADD(src:mask): src = a0 a1 a2 a3, mask = 1 0 1 0; the selected data are shifted (a0 << 1, a2 << 1).
(c) dst = HADD(src:mask1.mask2): src = a0 a1 a2 a3, mask1 = 0 1 1 1, mask2 = 1 0 0 1; the selected data is shifted (a3 << 1).

Figure 8. Proposed instructions for packed additions within a register. a dst = HADD(src). b dst = HADD(src:mask). c dst = HADD(src:mask1.mask2).
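
The three HADD variants in Fig. 8 can be modeled with one small Python function. This is a minimal sketch under stated assumptions, not the VSIP hardware definition: the function name `hadd`, the tuple encoding of the masks, and the placement of a0 in the most significant byte are all hypothetical, and saturating every variant to 8 bits is an assumption (the paper states saturation only for the unmasked form).

```python
def unpack_u8x4(word):
    """Split a 32-bit word into lanes a0..a3 (assumption: a0 is the top byte)."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def hadd(src, add_mask=(1, 1, 1, 1), shift_mask=(0, 0, 0, 0), sat_max=0xFF):
    """Horizontal addition of the packed bytes inside one register.

    add_mask   -- lanes that take part in the sum (mask1 in Fig. 8(c)).
    shift_mask -- lanes shifted left by one bit before being added
                  (mask in Fig. 8(b), mask2 in Fig. 8(c)).
    sat_max    -- saturation limit; 8-bit saturation is assumed for all forms.
    """
    total = 0
    for lane, add, shift in zip(unpack_u8x4(src), add_mask, shift_mask):
        if add:
            total += (lane << 1) if shift else lane
    return min(total, sat_max)


src = 0x0A141E28  # lanes a0..a3 = 10, 20, 30, 40

print(hadd(src))                                               # Fig. 8(a)
print(hadd(src, shift_mask=(1, 0, 1, 0)))                      # Fig. 8(b)
print(hadd(src, add_mask=(0, 1, 1, 1), shift_mask=(1, 0, 0, 1)))  # Fig. 8(c)
```

With the masks of Fig. 8(c), lane a0 is excluded by mask1 even though mask2 selects it, so only a3 is doubled, which matches the single `a3 << 1` shown in the figure.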
LMBD instruction counts the number of zeros in a register. The CMPEQ/GT/LT instructions compare pairs of 8 bit or 16 bit packed data.

3. Proposed Instructions and Accelerators

This section presents the overall architecture, new instructions, and hardware accelerators for the H.264/AVC codec.

3.1. Overall Architecture of the Proposed VSIP

Figure 5 shows the overall architecture of the proposed VSIP. The proposed VSIP consists of two parts, a programmable DSP part and a hardware accelerator part. The DSP part has a program control unit (PCU), a data processing unit (DPU), and an address unit (AU). The hardware accelerator part has an Inter Prediction Accelerator (IPA) and an Entropy Coding Accelerator (ECA). IPA consists of an ME accelerator and an MC accelerator. ECA has a CAVLC accelerator and an EGC accelerator. The hardware accelerators can operate in parallel with the DSP units.

PCU consists of a prefetch logic, a program counter, an instruction register, an FSM (Finite State Machine), a stack, and an interrupt controller. DPU consists of two Multiply and Accumulate (MAC) units for two 16-bit by 16-bit multiplications and accumulations, two Arithmetic Logic Units (ALU), a barrel shifter, and a register file. AU has two address generation units (AGU) for load and store. The internal word length is 32 bits. The instruction pipeline consists of six stages, that is, pre-fetch, fetch, decode, execute1, execute2, and execute3. The proposed ASIP has 35 arithmetic instructions, 11 logical and shift instructions, 6 program control instructions, 4 move instructions, and 16 special instructions including instructions for H.264/AVC, which are described next.

3.2. Proposed Instructions for In-loop Deblocking Filter and Intra Prediction

The in-loop deblocking filter is used to eliminate blocking artifacts as mentioned in Section 2. Figure 6 shows 8 pixels of neighboring 4×4 blocks. The 8 pixel values are decided according to the boundary Strength (bS), which represents the difference of two neighboring blocks, using p0 ~ p3 and q0 ~ q3. The equations calculating the pixel values are defined in [7]. The equations can be classified into five categories as follows.

$$p_2 + p_1 + p_0 \tag{1}$$

$$p_2 + 2 p_1 + 2 p_0 \tag{2}$$

$$2 p_3 + 3 p_2 + p_1 + p_0 \tag{3}$$

$$2 p_1 + p_0 \tag{4}$$

$$(p_0 + q_0 + 1) \gg 1 \tag{5}$$

p0 ~ p3 are the packed data in a register, and q0 ~ q3 are the packed data in another register. Then, Eq. (1) shows the additions of three packed data in one register. Eq. (2) represents one bit shift left operations of two data followed by additions of three packed data in the same register. Eq. (3) shows a one bit shift left operation of one datum and a multiplication operation of another datum followed by the additions of four packed data. Eq. (4) shows one bit shift left operation of the packed data followed by an addition
Figure 9. Assembly program of core block for in-loop deblocking filter.

Figure 10. Assembly program of intra prediction.
of two packed data. Eq. (5) shows an addition of the most significant byte (MSB) of one register and the least significant byte (LSB) of the other register followed by a one bit shift operation. Even though these computations are packed operations, they do not occur between two registers as shown in Fig. 6, but between the packed data within the same register.

As mentioned in Section 2, intra prediction eliminates the redundancy within an intra frame, and within an inter frame that has little redundancy between the two frames. Figure 7 shows an identification of samples
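(a) Signal flow from inputs x(0), x(1), x(2), x(3) to outputs X(0), X(2), X(1), X(3), with multipliers 2 and -2.
(b) Signal flow from inputs X(0), X(2), X(1), X(3) to outputs x(0), x(1), x(2), x(3), with multipliers 1/2.

Figure 11. Operation flow of 4×4 integer transform. a 1-D forward transform. b 1-D inverse transform.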
(a) fTRAN
src:  A          B                C              D
dst:  A+B+C+D    (B-C)+2(A-D)     A+D-(B+C)      (A-D)-2(B-C)

(b) TRAN
R0 = A B C D    ->    R4 = A E I M
R1 = E F G H    ->    R5 = B F J N
R2 = I J K L    ->    R6 = C G K O
R3 = M N O P    ->    R7 = D H L P

Figure 12. Operation flow of fTRAN and TRAN instruction. a Operation flow of fTRAN instruction. b Operation flow of TRAN instruction.
for 4×4 intra prediction. a ~ p in Fig. 7 are predicted using A ~ Q according to the equations defined in [7], and some of the equations are represented in Eq. (6), where A, B, and C represent pixel values, and a pixel value is represented using 8 bits. For a 32 bit architecture, A, B, C and D are stored in one register since a ~ p and A ~ Q in Fig. 7 are 8 bit values.

$$\begin{aligned} &(A + 2B + C + 2) \gg 2 \\ &(A + B + 1) \gg 1 \\ &(A + 3B + 2) \gg 2 \end{aligned} \tag{6}$$

As described in Section 2, existing DSPs support only packed operations between two registers. A large number of instruction cycles is required to implement the in-loop deblocking filter and intra prediction with the existing packed instructions that execute packed operations between two registers. Hence, H.264/AVC may require a new instruction to execute packed operations within a register.

Figure 8 shows the three proposed horizontal addition (HADD) instructions. The proposed instruction in Fig. 8(a) unpacks a 32 bit register into four 8 bit data, adds the four packed data, and then saturates the result to 8 bit data. Figure 8(b) is similar to Fig. 8(a). However, the packed data selected by a mask is shifted one bit to the left. In Fig. 8(c), mask1 selects the data to be added, and mask2 selects the data to be shifted. Eqs. (1), (2), (4), (5) of the in-loop deblocking filter and Eq. (6) of the intra prediction can be implemented using the proposed HADD instructions.

Intra prediction and the in-loop deblocking filter require dot product calculation. TI's TMS320C6 supports the DOTPU4 instruction for dot product calculation, which performs packed multiplications of two registers and adds the four results in four cycles. ADI's ADSP-BF53, which does not support these special instructions, requires more clock cycles to perform dot product calculation. Hence, we only compare the proposed HADD instruction with the TMS320C6 instruction.

Figure 9 shows the assembly program of the core block for the in-loop deblocking filter. R0 and R1 are general registers. The packed pixel data is stored in R0 and R1. Each result of the instruction can be obtained after one clock cycle. Hence, the proposed VSIP can execute these equations for the in-loop deblocking filter in one clock cycle.

Figure 10 shows the assembly program of the intra prediction. Acc represents an accumulator and the packed pixel data is stored in R0. R1 and R2 have

Loop label1 #2
Acc = fTRAN(R0)
R(4) = MOVR(Acc)
Acc = fTRAN(R1)
R(5) = MOVR(Acc)
Acc = fTRAN(R2)
R(6) = MOVR(Acc)
Acc = fTRAN(R3)
R(7) = MOVR(Acc)
label1 G0 = TRAN(G1)

Figure 13. Assembly program of 4×4 forward 2-D integer transform.
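
The structure of the program in Fig. 13 (four fTRAN row operations followed by a TRAN transpose, repeated twice) can be mimicked in Python to see why two passes yield the 2-D transform. This is an illustrative model only: `ftran`, `tran`, and `forward_transform_4x4` are hypothetical names, values are kept as plain Python integers rather than packed register lanes, and register word-width effects are ignored.

```python
def ftran(row):
    """1-D forward transform of one row, using the fTRAN output expressions."""
    a, b, c, d = row
    return [
        a + b + c + d,
        (b - c) + 2 * (a - d),
        (a + d) - (b + c),
        (a - d) - 2 * (b - c),
    ]


def tran(block):
    """Transpose a 4x4 block, as the TRAN instruction does between register files."""
    return [list(col) for col in zip(*block)]


def forward_transform_4x4(block):
    """Two passes of (row transform, transpose), mirroring 'Loop label1 #2'."""
    g0 = [list(row) for row in block]
    for _ in range(2):
        g1 = [ftran(row) for row in g0]  # fTRAN on R0..R3, results in R4..R7
        g0 = tran(g1)                    # G0 = TRAN(G1)
    return g0


flat_block = [[1, 1, 1, 1] for _ in range(4)]
print(forward_transform_4x4(flat_block))
```

The first pass transforms the rows and transposes, so the second pass effectively transforms the columns and transposes back; the result equals C·X·Cᵀ for the 4×4 matrix C whose rows are the four fTRAN expressions.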
Figure 14. Operation flow of the proposed motion estimation (SAD operation between the reference picture and the 4×4 current block). a ME operation in the first cycle. b ME operation in the second cycle.
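
The row-by-row SAD accumulation illustrated in Fig. 14 can be sketched in Python as follows. This is a behavioral model under assumptions, not the accelerator's RTL: the names `row_sad` and `me_row_sads` are hypothetical, one row of 16 horizontal candidate positions and four SAD units are taken from the description of the ME architecture in Section 3.4, and the assignment of positions to units in each cycle (unit k handles position 4k + shift) is an assumption.

```python
def row_sad(cur_row, ref_row, x):
    """SAD between one 4-pixel row of the current block and the reference row at offset x."""
    return sum(abs(c - r) for c, r in zip(cur_row, ref_row[x:x + 4]))


def me_row_sads(cur_block, ref_rows, positions=16, units=4):
    """Accumulate the SADs of all candidate positions in one row of the search window.

    cur_block -- 4x4 current block (list of four 4-pixel rows).
    ref_rows  -- four reference rows, each at least positions + 3 pixels wide.
    Returns (sads, cycles), where each modeled cycle produces `units` partial SADs.
    """
    buffers = [0] * positions
    steps = positions // units
    cycles = 0
    for y in range(4):                  # one pixel row of the block at a time
        for shift in range(steps):      # the window shifts right by one pixel per cycle
            for unit in range(units):   # the SAD units work in parallel in hardware
                x = unit * steps + shift
                buffers[x] += row_sad(cur_block[y], ref_rows[y], x)
            cycles += 1
    return buffers, cycles


cur = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ref = [[0] * 5 + row + [0] * 10 for row in cur]  # the block is embedded at offset 5
sads, cycles = me_row_sads(cur, ref)
print(sads.index(min(sads)), cycles)
```

Because each reference row is consumed once while the window slides across it, the pixels shared by neighboring candidate blocks do not have to be fetched again, which is the data reuse the paper relies on for low power.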
offset values for rounding. The ADDAR instruction in VSIP calculates the sum of two source data. After the addition, the result is shifted to the right by the immediate value in the instruction. Each result of the instruction can be obtained after one clock cycle. Hence, the proposed VSIP can execute these equations for intra prediction in two clock cycles.

3.3. Proposed Instructions for Integer Transform

The 4×4 integer transform can be computed using the forward transform shown in Fig. 11(a). The forward transform is executed on four rows of four packed data. Then, the forward transform is performed again on four columns of four packed data to get the results of the 4×4 integer transform. Figure 11(b) represents the inverse transform. Similarly, the 4×4 inverse integer transform can be executed using the operations in Fig. 11(b).

This paper proposes novel instructions to efficiently execute the forward/inverse 4×4 integer transform as follows.

dst = fTRAN(src)
dst = iTRAN(src)

The two instructions perform the operations of Fig. 11(a) and (b), respectively.

Figure 15. ME computation flow. a Existing computation flow. b Proposed computation flow.
vlc=1    001X
vlc=2    0011X
vlc=3    00111X
vlc=4    001111X
vlc=5    0011111X
vlc=6    00111111X
vlc=n    00 11…11 X    (n ones)

Figure 16. Threshold value of each level decoding table.

Figure 12(a) shows the operation flow of the fTRAN instruction. The fTRAN instruction reads a 32 bit general register in one register file, which consists of four 32 bit registers, and executes the operation flow in Fig. 12(a). Then, the results are written into another register file consisting of four 32 bit registers. The iTRAN instruction performs a similar operation. These instructions can be implemented using the adders and eight additional 2:1 multiplexers. Figure 12(b) shows the operation flow of the TRAN instruction. The general register file holds a 4×4 matrix whose elements are 8 bit pixel data. The TRAN instruction in VSIP executes the transpose of a 4×4 matrix as shown in Fig. 12(b).

The 4×4 integer transform can be easily programmed with the proposed instructions. The assembly program of the 4×4 forward 2-D integer transform is shown in Fig. 13. G0 and G1 represent two general register files, which contain R0, R1, R2, and R3 and R4, R5, R6, and R7, respectively. The Loop instruction repeats the program up to the label. The number of repeats is determined by the immediate value in the instruction. The MOVR instruction moves the source data into a general register that holds four 8 bit pixel data. The general register file has four general registers. Each VSIP instruction requires one clock cycle. The program in Fig. 13 has nine instructions for the 1-D forward integer transform. To execute the 2-D transform, 18 instructions for computation and 1 instruction for program control are needed. Hence, 19 cycles are required to execute this program. If the data load cycles are added, the 4×4 forward 2-D integer transform takes 23 cycles in total. The implementation using the instructions of TMS320c55 (SW) for the integer transform requires more than 1,078 cycles [18]. Hence, the proposed fTRAN and iTRAN instructions can be more efficient than the existing DSP instructions for the integer transform.

3.4. Proposed Accelerator for Inter Prediction

The proposed MC accelerator supports motion vectors with quarter pixel accuracy, which is one of the key features of H.264/AVC. However, it does not yet support multiple reference frames, which is another key feature of H.264/AVC. As described in Section 2, ME/MC must frequently access memory. From both a performance and a low power point of view, this can be a serious problem. Thus, the sliding window method is used to alleviate this problem [19]. Figure 14 illustrates the proposed ME operation flow.

The proposed ME architecture supports the [−8, +7] search window. In the [−8, +7] search window, 16 4×4 blocks exist in a row. In the first cycle, four SADs are simultaneously calculated as shown in Fig. 14(a). Next, the search window shifts right and each operation unit repeats the SAD calculation as shown in Fig. 14(b). The SADs of the upper four pixels of every block in a row can be obtained after four cycles, and the 16 SADs are stored in buffers. The SADs of the second row from the top are calculated in the same way, and the 16 SADs are accumulated with the 16 SADs in the buffers, respectively. Then, after 16 cycles, the 16 SADs of the 4×4 blocks can be obtained.

Figure 15 shows the ME computation flows of general architectures [20] and the proposed architecture. In Fig. 15(a), the pixel values in the dotted block must be fetched again to calculate the SAD of the dotted block after the SADs of two adjacent blocks (block 1 and block 2) are obtained. However, if a 4×4 block is shifted pixel by pixel as shown in Fig. 15(b), the data in the dotted block in Fig. 15(a) can be reused. Hence, we can achieve low power consumption.

Figure 17. Comparison of flows for the level of the nonzero coefficient decoding. a Generic level of the nonzero coefficient decoding flow (First 1 detect, Level Decode, Table Update for each symbol; Stage 1 to Stage 5). b Proposed level of the nonzero coefficient decoding flow (First 1 detect, Pre Table Update, Level Decode for each symbol; Stage 1 to Stage 3).

3.5. Proposed Accelerator for Entropy Coding

As described in Section 2, H.264/AVC uses EGC and CAVLC for entropy coding. Since EGC has a regular coding structure, the EGC accelerator consists of a barrel shifter, a first one detector, an adder, etc. The CAVLC encoder finds the value and the length of the codeword in memory according to the data. Therefore, an efficient memory address generator is needed. The CAVLC decoder is usually implemented with a lookup table. In the decoding process, decoding the level of a nonzero coefficient is an iterative method, which can be implemented without a lookup table.

A generic decoding process for the level of a nonzero coefficient is as follows. First, the decoder obtains the number of successive zeros in the input bit stream. Next, the decoder calculates the current symbol length and decodes the current symbol. Finally, the decoder updates the table information used for decoding the next symbol. The decoder cannot decode the next symbol until the table information is decided.

To increase the decoding speed, we propose the pre-table update method. Whether the table is updated depends on whether the current symbol value is greater than the threshold value. Fig. 16 shows the threshold value of each level decoding table. As shown there, all threshold values have regular forms. The run of zeros has length two, and the run of ones has a length equal to the level decoding table index. Hence, we can update the table before decoding the current symbol.

The generic decoding process for the level of a nonzero coefficient is shown in Fig. 17(a). The level decoding process cannot be performed until the table update process is finished. Therefore, five pipeline stages are required to decode two symbols. Figure 17(b) shows the decoding process using the proposed pre-table update method. Since only three pipeline stages are required to decode two symbols, we can save two computation cycles in level decoding.

Figure 18. FPGA emulation.

4. Implementation and Performance Comparisons

H.264/AVC can be implemented using VSIP with hardware accelerators. The proposed VSIP has been
Table 1. Performance comparisons of 4×4 integer transform.

Parameter    TMS320c55 (SW) [18]    TMS320c55 (HW) [18]    TMS320c64 [16]    Proposed VSIP
MIPS         12.8                   2.8                    1.1               1.1
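
Two of the throughput figures quoted in Section 4 can be re-derived with simple arithmetic, shown here as a short Python check. The cycle counts (23 cycles per 4×4 transform, 368 cycles per macroblock for CAVLC) are the paper's; the variable names are hypothetical, and for the HD1080i case the coded frame size of 1920×1088 pixels and the rate of 30 frames per second are assumptions that the paper does not state.

```python
# Integer transform on VSIP, QCIF at 30 frames per second.
CYCLES_PER_4X4_TRANSFORM = 23                  # including data load cycles
BLOCKS_PER_MACROBLOCK = 16                     # a 16x16 macroblock holds sixteen 4x4 blocks
QCIF_MACROBLOCKS = (176 // 16) * (144 // 16)   # QCIF is 176x144 pixels
FRAMES_PER_SECOND = 30

transform_cycles_per_second = (
    CYCLES_PER_4X4_TRANSFORM
    * BLOCKS_PER_MACROBLOCK
    * QCIF_MACROBLOCKS
    * FRAMES_PER_SECOND
)
# One instruction per clock cycle, so cycles per second map directly to MIPS.
transform_mips = transform_cycles_per_second / 1e6

# CAVLC accelerator, HD1080i.
CAVLC_CYCLES_PER_MACROBLOCK = 368              # average reported in the paper
HD_MACROBLOCKS = (1920 // 16) * (1088 // 16)   # assumption: 1920x1088 coded size
cavlc_min_clock_mhz = (
    CAVLC_CYCLES_PER_MACROBLOCK * HD_MACROBLOCKS * FRAMES_PER_SECOND / 1e6
)  # assumption: 30 frames per second

print(f"transform: {transform_cycles_per_second} cycles/s = {transform_mips:.1f} MIPS")
print(f"CAVLC: at least {cavlc_min_clock_mhz:.1f} MHz")
```

Under these assumptions the first figure reproduces the 1,092,960 cycles behind the VSIP entry of Table 1, and the second lands just above 90 MHz, consistent with the clock requirement stated for the CAVLC accelerator.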

Figure 19. Chips of proposed hardware accelerator for ME and MC. a ME hardware accelerator. b MC hardware accelerator.
modeled by Verilog HDL and thoroughly verified using the iPROVE FPGA board with the Xilinx Virtex II shown in Fig. 18.

Several core blocks for generating an intra predictor and an in-loop deblocking filter are coded using the proposed special instructions, and the same blocks are also coded using the existing instructions of TMS320c64. The proposed architecture can reduce the number of clock cycles for generating an intra predictor by about 40% compared with TMS320c6. Moreover, the total number of clock cycles to execute the in-loop deblocking filter can be reduced by about 20–25% compared with TMS320c6. TMS320C64 supports the DOTPU4 instruction, which executes packed multiplications of two registers and adds the four results in four cycles. Other DSPs require more instructions, since they do not support the special instructions.

The fTRAN and iTRAN instructions can be executed in one cycle. Hence, 23 clock cycles are required to execute the 4×4 integer transform using the proposed instructions, and 1,092,960 clock cycles for 30 frames ((23 cycles × 16 blocks) × 99 macroblocks × 30 frames) are required for QCIF images, since a QCIF image has 99 16×16 macroblocks. Table 1 shows the number of instructions required for 30 frames on existing DSPs [16, 18] and on the proposed VSIP. For the integer transform, VSIP can be more efficient than the implementation using the instructions of TMS320c55 (SW) and the one using the coprocessor of TMS320c55 (HW). TMS320c64 is a large VLIW architecture with eight function units, while VSIP requires only two 32 bit adders.

The optimized VSIP instructions can reduce computation complexity, redundancy, and overhead. Hence, VSIP needs far fewer computation cycles to perform the H.264/AVC tasks than general DSPs. As a result, the proposed VSIP can efficiently reduce the number of memory accesses and achieve low power.

The hardware accelerators have been implemented using the MagnaChip HSI 0.25 μm standard cell library with the Synopsys Astro tool as shown in Fig. 19. The chip specifications are listed in Table 2. The chip is being fabricated by the Information Technology System on Chip (ITSOC) MPW service
Table 2. Chip summary of proposed hardware accelerator for ME and MC.

Parameter              ME hardware accelerator    MC hardware accelerator
Process technology     0.25 μm 1p4m               0.25 μm 1p4m
Logic gate count       40,000                     10,000
Maximum frequency      100 MHz                    150 MHz
On chip memory size    –                          32 kb

Table 3. Performance comparisons of the hardware ME architectures.

Parameter                Clock cycles/frame    Search range    Supported block size      Gate counts (K)
References [20]          405,603               [−16, +15]      Variable block support    154
References [21]          406,077               [−8, +7]        Variable block support    61
Proposed architecture    431,244               [−8, +7]        Variable block support    40
in Korea. The ME chip achieves a gate count of 40 K (excluding memory) and an operating frequency of 100 MHz. The MC chip achieves a gate count of 10 K (excluding memory) and an operating frequency of 150 MHz. ME requires more memory than MC. Moreover, the ME logic is almost four times larger than that of MC. However, we could not integrate the ME memory on the chip because of the size limitation of the ITSOC MPW service in Korea. Moreover, we separated ME and MC into two chips.

The proposed ME accelerator can significantly reduce the gate count compared with [21] and [22]. Table 3 compares [21], [22], and our architecture. Kim et al. [21] can support larger search ranges than the other architectures. However, it has a much larger gate count than the other architectures. The required computation cycles of [21] and [22] are comparable to those of our architecture. However, the total gate counts of [21] and [22] are much larger than that of our architecture.

The proposed hardware accelerator for CAVLC takes 368 clock cycles on average per macroblock. To meet the real-time processing requirement for H.264/AVC decoding in the HD 1080i format, the proposed design must run at over 90 MHz. The proposed design can support real-time processing since its maximum operating frequency is about 130 MHz.

5. Conclusions

This paper presents an ASIP for video signal processing, called VSIP. VSIP has special instructions and optimized hardware architectures for H.264/AVC. Moreover, VSIP has hardware accelerators for ME/MC and entropy coding. As shown in the performance comparisons, VSIP requires far fewer computation cycles for the target applications than general DSPs. Moreover, VSIP can dramatically reduce memory accesses by using the proposed special instructions and hardware accelerators. Hence, VSIP can achieve low power at the algorithm/architecture level. Since the hardware accelerators can operate concurrently, VSIP can efficiently perform real-time video processing, and it can support various profiles and standards. The proposed VSIP is one of the promising solutions for video signal processing.

Acknowledgements

This work was supported in part by the Ubiquitous Computing and Network (UCN) Project, the Ministry of Information and Communication (MIC) 21st Century Frontier R&D Program in Korea, in part by IT R&D Project funded by Korean Ministry of Information and Communications, in part by the second stage of Brain Korea 21 Project in 2006, and in part by IDEC.

References

1. J. S. Lee, Y. S. Jeon and M. H. Sunwoo, "Design of new DSP instructions and their hardware architecture for high-speed FFT," in Proc. IEEE Workshop on Signal Processing Syst., Sept. 2001, pp. 80–90.
2. J. Glossner, J. Moreno, M. Moudgill, J. Derby, E. Hokenek, D. Meltzer, U. Shavadron and M. Ware, "Trends in compilable DSP architecture," in Proc. IEEE Workshop on Signal Processing Syst., 2000, pp. 181–199.
3. J. H. Lee, J. H. Moon, K. L. Heo, M. H. Sunwoo, S. K. Oh and I. H. Kim, "Implementation of Application Specific DSP for OFDM Systems," in Proc. IEEE Int. Symp. Circuits Syst., May 2004.
4. S. H. Yoon, J. H. Moon and M. H. Sunwoo, "Efficient DSP Architecture for High-Quality Audio Algorithms," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.
5. S. D. Kim, J. H. Lee, C. J. Hyun and M. H. Sunwoo, "ASIP approach for implementation of H.264/AVC," in Proc. Asia South Pacific Design Automation Conf., Jan. 2006.
6. J. Chen and K. J. R. Liu, "Cost-effective low-power architectures of video coding systems," in Proc. IEEE Int. Symp. Circuits Syst., May 1999, pp. 153–156.
7. Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264/ISO/IEC 14496-10 (E) AVC), July 2004.
8. J. Ostermann, T. Wedi, et al., "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Magazine, vol. 4, 2004, pp. 7–28.
9. M. K. Jain, M. Balakrishnam and A. Kumar, "ASIP design methodologies: survey and issues," in Fourteenth International Conference on VLSI Design, Jan. 2001, pp. 76–81.
10. W. Di, G. Wen, H. Mingzeng and J. Zhenzhou, "An Exp-Golomb encoder and decoder architecture for JVT/AVS," in Proc. 5th International Conference on ASIC, vol. 2, 21–24 Oct. 2003, pp. 910–913.
11. G. Bjontegaard and K. Lillevold, "Context-adaptive VLC (CAVLC) coding of coefficients," Doc. JVT-028, JVT of ISO/IEC MPEG & ITU-T VCEG 3rd Meeting, Virginia, USA, May 2002.
12. H.-C. Chang, C.-C. Lin and J.-I. Guo, "A Novel Low-Cost High-Performance VLSI Architecture for MPEG-4 AVC/H.264 CAVLC Decoding," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.
13. Y.-K. Lai, C.-C. Chou and Y.-C. Chung, "A simple and cost effective video encoder with memory-reducing CAVLC," in Proc. IEEE Int. Symp. Circuits Syst., May 2005.
14. W. I. L. Choi, B. Jeon and J. Jeong, "Fast motion estimation with modified diamond search for variable motion block sizes," in Proc. International Conference on Image Processing, vol. 3, Sept. 2003, pp. 14–17.
15. TMS320C6000 CPU and Instruction Set Reference Guide, Texas Instruments Inc., Dallas, TX, 2000.
16. TMS320C64 Image/Video Processing Library, Texas Instruments Inc., Dallas, TX, 2003.
17. Blackfin DSP Instruction Set Reference, Analog Devices Inc., Norwood, MA, 2002.
18. TMS320C55 Hardware Extensions for Image/Video Applications Programmer's Reference, Texas Instruments Inc., Dallas, TX, 2002.
19. T. Wiegand, X. Zhang and B. Girod, "Long-Term Memory Motion-Compensated Prediction," Trans. Circuit Syst. Video Technol., vol. 9, no. 1, Feb. 1999, pp. 70–84.
20. I. E. G. Richardson, Video Codec Design: Developing Image and Video Compression Systems, Wiley, 2002.
21. M. H. Kim, I. G. Hwang and S. I. Chae, "A Fast VLSI Architecture for Full-Search Variable Block Size Motion Estimation in MPEG-4 AVC/H.264," in Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC 2005), Shanghai, China, Jan. 2005.
22. S. Y. Yap and J. V. McCanny, "A VLSI Architecture for Variable Block Size Video Motion Estimation," Trans. Circuit Syst. Video Technol., vol. 51, no. 7, July 2004.

Sung Dae Kim received the B.S. degree in Electronics Engineering from Ajou University, Suwon, Korea in 2002. He is currently working toward the Ph.D. degree. His current research interests are in the areas of digital image/video processing, DSP, and ASIP, specifically low power and high performance architectures.

Myung H. Sunwoo received the B.S. degree in Electronic Engineering from Sogang University in 1980, the M.S. degree in Electrical and Electronics from the Korea Advanced Institute of Science and Technology in 1982, and the Ph.D. degree in Electrical and Computer Engineering from the University of Texas at Austin in 1990. He worked for the Electronics and Telecommunications Research Institute (ETRI) in Daejeon, Korea from 1982 to 1985, and Digital Signal Processor Operations, Motorola, Austin, TX from 1990 to 1992. Since 1992, he has been a Professor with the School of Electrical and Computer Engineering, Ajou University in Suwon, Korea. In 2000, he was a Visiting Professor in the Department of Electrical and Computer Engineering, the University of California, Davis, CA. He has over 300 papers and also holds 37 patents. He received more than 20 research awards including the Best Student Paper Award from the IEEE Workshop on Signal Processing Systems (SIPS) 2005, Athens, Greece, the Ministry of Commerce, Industry and Energy, Samsung Electronics, the Institute of Electronics Engineers of Korea (IEEK), and professional foundations. His research interests include SOC architectures and design for multimedia and communications, application-specific DSP architectures, and application-specific design. He served as a Technical Program Chair of the IEEE Workshop on SIPS in 2003 and the International Conference on SOC Design in 2003 and has served on program, organizing, steering, and executive committees for major international conferences and workshops including IEEE Workshop on SIPS, Cool Chips, Design, Automation and Test in Europe (DATE), IEEE International ASIC/SOC Conference, Asian-Pacific Conference on CAS (APC-CAS), Asian-Solid State Circuits Conference (A-SSCC), International SOC Design Conference (ISOCC), International Symposium on VLSI Design, Automation and Test (VLSI-DAT), etc. He served as an Associate Editor for the IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2002–2003) and as a Guest Editor for the Journal of VLSI Signal Processing (Kluwer, 2004). He is a Director of the National Research Laboratory sponsored by the Ministry of Science and Technology, a Director of the New Growth Engine Semiconductor Center, and an Executive Director of IEEK. Currently, he is a Senior Member of IEEE and a Chair of the IEEE CAS Society of the Seoul Chapter.
