SOC Implementation Wave-Pipelined: Venkataramani
G. Seetharaman#, B. Venkataramani*
# Research Scholar, Department of ECE, National Inst. of Technology, Tiruchirappalli, India. gsraman@nitt.edu
* Professor and Head, Department of ECE, National Inst. of Technology, Tiruchirappalli, India. bvenki@nitt.edu
Abstract
In the literature, wave-pipelining is proposed as one of the techniques for increasing the operating frequency of digital circuits. Higher operating frequencies can be achieved in Wave-Pipelined (WP) circuits by adjusting the clock periods and clock skews so as to latch the outputs of combinational logic circuits during their stable periods. The major contributions of this paper are the proposal to use a soft-core processor for the automation of the above tasks, and the study of the superiority of WP circuits with regard to power dissipation. The proposed scheme is evaluated using two circuits: filters using the Distributed Arithmetic Algorithm (DAA) and a sine wave generator using the COordinate Rotation DIgital Computer (CORDIC) algorithm. Both circuits are studied by adopting three different schemes: wave-pipelining, pipelining and non-pipelining. The System-On-Chip (SOC) approach is adopted for implementation on Altera Field Programmable Gate Array (FPGA) based SOC kits with the Nios II soft-core processor. From the implementation results, it is verified that the WP circuits are faster than non-pipelined circuits. The pipelined circuits are found to be faster than the WP circuits, and this is achieved at the cost of an increase in area and power. When both pipelined and WP circuits are operated at the same frequency, the former dissipate more power for circuits with higher word sizes and for medium-tap filters. From the implementation results, it is verified that the superiority of the WP circuits with regard to power dissipation depends not only on the area but also on the logic depth of the circuit. This observation is made for the first time for WP circuits.
Index Terms- CORDIC, DAA, SOC, wave-pipelining, pipelining, FPGA.
1. Introduction
Programmable logic devices such as FPGAs offer an alternative solution for the computationally intensive functions performed traditionally by Programmable Digital Signal Processors (P-DSPs). The ability to design, fabricate and test Application Specific Integrated Circuits (ASICs) as well as FPGAs with gate count of the order of a few tens of
million, has led to the development of complex embedded SOC. Hardware components in a SOC may include one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals. One of the approaches for SOC design is the platform based approach [1], [2]. For example, platform FPGAs such as Xilinx Virtex II Pro and Altera Excalibur include custom designed fixed programmable processor cores together with millions of gates of reconfigurable logic. In addition to this, the development of Intellectual Property (IP) cores for FPGAs for a variety of standard functions, including processors, enables a multi-million gate FPGA to be configured to contain all the components of a platform based FPGA. Development tools such as the Altera System-On-Programmable Chip (SOPC) builder enable the integration of IP cores and user designed custom blocks with the Nios II soft-core processor [3]. Soft-core processors are far more flexible than hard-core processors and they can be enhanced with custom hardware to optimize them for a specific application. The increased gate count in a complex SOC results in increased power dissipation, clock routing complexity and clock skews between different parts of a synchronous system. These limitations may be partially overcome by the adoption of circuit design techniques such as wave-pipelining. Wave-pipelining enables a combinational logic circuit to be operated at a higher frequency without the use of registers and may result in lower power dissipation and clock routing complexity compared to a pipelined circuit. However, the maximization of the operating speed of the wave-pipelined circuit requires the following three tasks: adjustment of the clock period, adjustment of the clock skew and equalization of path delays. The automation of these three tasks is proposed for the first time in this paper.
Effectiveness of the automation scheme is studied using two circuits: i) filter using DAA and ii) sine wave generator using CORDIC algorithm. The organization of the rest of the paper is as follows: In section 2, the previous work related to wave-pipelining and the challenges involved in the design of wave-pipelined circuits are described. In section 3, automation schemes for wave-pipelined circuits are presented. In section 4, an
FPT 2007
overview of the DAA, parallel DAA and pipelined DAA schemes is presented. In section 5, an overview of the CORDIC algorithm is presented. In section 6, SOC approaches for the implementation of wave-pipelined circuits are discussed and the implementation results are presented. Section 7 summarizes the conclusions.
Hence, adjustment of the clock period, adjustment of the clock skew and equalization of path delays are the three tasks required for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as FPGA editor from Xilinx or Floor planner from Altera, may be used for this purpose.
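The role of the clock period can be illustrated with the simplified textbook timing bounds for wave-pipelining. The following is only a sketch: the function name, the parameter names and the delay values in the usage note are hypothetical, and clock skew, rise/fall times and delay variations are ignored. A conventional circuit must wait for the longest path, so its period is bounded by Dmax, whereas a wave-pipelined circuit only needs the period to exceed the spread between the longest and the shortest path delays.

```python
def clock_period_bounds(d_max, d_min, t_setup=0.0, t_hold=0.0):
    """Return (conventional, wave_pipelined) lower bounds on the clock period.

    Simplified model (an assumption, not an equation taken from the paper):
      conventional   : T >= d_max + t_setup
      wave-pipelined : T >= (d_max - d_min) + t_setup + t_hold
    All arguments share one time unit, e.g. nanoseconds.
    """
    if d_min < 0 or d_max < d_min:
        raise ValueError("expected 0 <= d_min <= d_max")
    conventional = d_max + t_setup
    wave_pipelined = (d_max - d_min) + t_setup + t_hold
    return conventional, wave_pipelined
```

With made-up delays, `clock_period_bounds(10.0, 7.0, 0.5, 0.5)` returns `(10.5, 4.0)`. Under this model, the closer the shortest path delay is to the longest one, the shorter the permissible period, which is why equalization of path delays appears among the three tasks.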
Fig. 2. Temporal/spatial diagram of data flow through the combinational logic circuit.
[Figure not reproduced; recoverable labels: Dmax, Tclk, Time, Clock, Clock Skew.]
These tasks are carried out manually in [8], [9]. The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor: the layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a Personal Computer (PC) based test system in [9]. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until the correct results are obtained. The design of a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in the next section.
period/clock skew depends on the interconnect delays and the delay in the logic blocks. The select inputs s(0)-s(18) are connected to one data port and a(0)-a(18) are connected to the other port of the Nios II processor. The select input of the 8:1 multiplexer (p1, p2, p3) is varied by the processor to achieve different clock frequencies. The clock and skew generator may be programmed using either an off-chip processor or an on-chip processor. The off-chip processor is used when the FPGA is used as a coprocessor or hardware accelerator for a main processor or microcontroller. The off-chip communication between the FPGA and a processor is bound to be slower than on-chip communication. In Fig. 3, a majority logic circuit with 3 inputs is used to minimize the effect of glitches which may arise due to transients in the data lines. The clocks required for the wave-pipelined circuit may also be derived using the internal system clock generator of Altera. However, the multiplication factor has to be specified at synthesis time and hence the clock frequency cannot be dynamically altered as in the scheme given in Fig. 3. The circuit using the programmable clock and skew generator is a suboptimal wave-pipelined circuit, but it can operate at a higher frequency than that reported by the commercially available synthesis tools, which use Dmax for fixing the operating frequency. In order to minimize the time required for adjustment of the parameters of the wave-pipelined circuit (clock frequency and skew), the Built In Self Test (BIST) approach for design for testability [11] may be used. In the BIST approach, a Finite State Machine (FSM) is assumed to be available off-chip and it is used for adjustment of the parameters of the wave-pipelined circuit [12]. In the SOC approach, a processor is assumed to be available on-chip and it is used for adjustment of the parameters of the wave-pipelined circuit.
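The parameter adjustment performed by the processor can be pictured as a search over the multiplexer select codes. The sketch below is not the authors' firmware: `tune`, `passes_self_test`, `fake_circuit` and the select-code ranges are hypothetical names, and the hardware access is replaced by a callback so that the example runs on a host PC. It assumes that a lower period code corresponds to a shorter clock period (higher frequency).

```python
from typing import Callable, Optional, Tuple


def tune(n_period_codes: int,
         n_skew_codes: int,
         passes_self_test: Callable[[int, int], bool]) -> Optional[Tuple[int, int]]:
    """Return the first (period_code, skew_code) pair that passes the self test.

    Period codes are tried from fastest (0) to slowest, so the first hit is
    the highest working frequency under the assumed code ordering.
    """
    for period_code in range(n_period_codes):
        for skew_code in range(n_skew_codes):
            # On hardware this step would program the clock and skew
            # generator, apply test vectors and compare the outputs.
            if passes_self_test(period_code, skew_code):
                return period_code, skew_code
    return None  # no setting produced correct outputs


def fake_circuit(period_code: int, skew_code: int) -> bool:
    """Stand-in for the real circuit (made-up behaviour): it works only for
    slow enough clocks and for one particular skew window."""
    return period_code >= 3 and skew_code in (2, 3)
```

For this stand-in circuit, `tune(8, 8, fake_circuit)` returns `(3, 2)`. An exhaustive sweep is used here for clarity; with 8 period codes and a handful of skew codes the search space is small enough that a smarter strategy is not needed.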
DSP block. Alternatively, it may consist of soft-core processors such as Nios II or MicroBlaze and a custom DSP block implemented in the FPGA. In this paper, an Altera FPGA based SOC with the Nios II soft-core processor is used for the implementation.
The computation of the output of an N tap Linear Time Invariant (LTI) filter and the computation of the transform of an N×1 vector can be generalized as the problem of computing the sum of products given by

$$ y(n) = \sum_{k=0}^{N-1} a(n,k)\, x(k) \qquad (1) $$
In the case of LTI filters and transform computation, a(n,k) is time invariant and only x(k) varies with time. In view of this, y(n) can be computed by using the look up tables for multiplication. This can be achieved as follows: The input samples x(k) may be assumed to be represented in 2's complement representation using W bits and can be written as
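$$ x(k) = -x(W-1,k) + \sum_{m=0}^{W-2} x(m,k)\, 2^{-(W-1-m)} \qquad (2) $$

Substituting equation (2) in (1) and interchanging the order of summation w.r.t. m and k, we get

$$ y(n) = -S(W-1) + \sum_{m=0}^{W-2} S(m)\, 2^{-(W-1-m)} \qquad (3) $$

where

$$ S(m) = \sum_{k=0}^{N-1} a(n,k)\, x(m,k) \qquad (4) $$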
the location of the logic elements and the interconnects used for the implementation of these blocks should be fixed so that when these blocks are integrated with the DAA/CORDIC or the processor, the interconnect delays are not altered. This is achieved by using the LogicLock feature in Altera. The operating frequency of the wave-pipelined circuit is expected to lie between that of the non-pipelined circuit and that of the pipelined circuit. Hence, the minimum and maximum frequencies of the clock generator should correspond to the maximum operating frequencies of the non-pipelined and pipelined circuits respectively. The approximate values of these two frequencies are found from the synthesis report for the circuit to which the clock is to be applied. After determining the range of frequencies to be generated by the clock circuit, the number of delay blocks is adjusted.
It may be noted that x(m,k), for m = 0, 1, ..., W-1, takes binary values 1 or 0. Hence, S(m) can be computed using a ROM addressed by the bits x(m,0), x(m,1), ..., x(m,N-1). Furthermore, the contents of the ROM are the same for all values of m.
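The ROM based evaluation of equations (3) and (4) can be sketched as follows. This is an illustrative integer model, not the hardware described in the paper: the function names are hypothetical, the samples are treated as W-bit 2's complement integers, and the result is therefore y(n) of the fractional form scaled by 2^(W-1).

```python
def build_rom(coeffs):
    """ROM contents: entry `addr` holds the sum of the coefficients whose
    bit is set in `addr` (bit k of the address selects coefficient k)."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_sum_of_products(coeffs, samples, width):
    """Compute sum(coeffs[k] * samples[k]) using ROM look-ups only.

    `samples` are signed integers that fit in `width`-bit 2's complement.
    """
    rom = build_rom(coeffs)          # the same ROM serves every bit position m
    mask = (1 << width) - 1
    acc = 0
    for m in range(width):
        # Address = bit m of every input sample: x(m,0), x(m,1), ..., x(m,N-1)
        addr = 0
        for k, x in enumerate(samples):
            addr |= (((x & mask) >> m) & 1) << k
        s_m = rom[addr]
        if m == width - 1:
            acc -= s_m << m          # the sign bit carries negative weight
        else:
            acc += s_m << m
    return acc
```

For example, with the made-up coefficients `[3, -1, 4, 2]` and 8-bit samples `[5, -7, 100, -128]`, `da_sum_of_products` returns 166, which equals the direct sum of products 15 + 7 + 400 - 256. No multiplier is used: each bit position contributes one look-up and one shifted addition.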
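$$ y(n) = \left\{ \left[-S(7) + S(6)2^{-1}\right] + \left[S(5) + S(4)2^{-1}\right]2^{-2} \right\} + \left\{ \left[S(3) + S(2)2^{-1}\right] + \left[S(1) + S(0)2^{-1}\right]2^{-2} \right\} 2^{-4} \qquad (5) $$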
Equation (5) requires multiplication of the numbers by $2^{-i}$. If 2's complement multiplication with sign extension is used, this requires shifting the number to the right i times and replicating the MSB i times. For example, multiplication of the number 10100101, represented in 2's complement form, by $2^{-4}$ results in the number 1111 1010 0101. The full parallel DAA scheme with 2's complement multiplication with sign extension is shown in Fig. 5. The logic depth, i.e. the number of stages of logic elements required for the DA filter, depends on the number of taps. The numbers of stages required for DA filters with 8, 16 and 32 taps are 4, 5 and 6 respectively.
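The sign extension step can be checked with a few lines of code. The helper names below are hypothetical and bit strings are used only to make the replication of the MSB visible.

```python
def scale_by_power_of_two(bits: str, i: int) -> str:
    """Multiply a 2's complement bit string by 2**-i without losing bits.

    The binary point moves i places to the left, so the word grows by i
    bits and the MSB (sign bit) is replicated i times at the front.
    """
    if not bits or set(bits) - {"0", "1"}:
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    return bits[0] * i + bits


def twos_complement_value(bits: str) -> int:
    """Integer value of a 2's complement bit string."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value
```

`scale_by_power_of_two("10100101", 4)` gives `"111110100101"`, the number quoted in the text. Both strings decode to the same integer (-91); the factor of $2^{-4}$ is carried entirely by the new position of the binary point.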
wave-pipelining. Implementation of the self-tuned wave-pipelined CORDIC unit is considered next. The CORDIC algorithm provides an iterative method of performing vector rotations by arbitrary angles using shifts and adds. In the rotation mode, CORDIC may be used for converting a vector in polar form to rectangular form. In the vector mode, it converts a vector in rectangular form to polar form [16]. The functionality of the circuit may be verified by taking the cosine value as output.
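$$ x_{fin} = x_{in}\cos\theta - y_{in}\sin\theta, \qquad y_{fin} = x_{in}\sin\theta + y_{in}\cos\theta \qquad (7) $$

which rotates a vector $(x_{in}, y_{in})$ in a Cartesian plane by an angle $\theta$ to another vector with the coordinates $(x_{fin}, y_{fin})$. The rotation may be achieved by performing a series of successively smaller elementary rotations $\theta_0, \theta_1, \theta_2, \ldots, \theta_N$ such that $\theta = \sum_{i=0}^{N} \theta_i$. Rotation of the vector by an angle $\theta_i$ can be written as

$$ x_{i+1} = x_i\cos\theta_i - y_i\sin\theta_i \qquad (8) $$
$$ y_{i+1} = x_i\sin\theta_i + y_i\cos\theta_i \qquad (9) $$

Dividing both sides by $\cos\theta_i$ gives

$$ \frac{x_{i+1}}{\cos\theta_i} = x_i - y_i\tan\theta_i \qquad (10) $$
$$ \frac{y_{i+1}}{\cos\theta_i} = y_i + x_i\tan\theta_i \qquad (11) $$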
The computational complexity of (10), (11) can be reduced by rewriting these equations as

$$ x_{i+1} = x_i - y_i\tan\theta_i \qquad (12) $$
$$ y_{i+1} = y_i + x_i\tan\theta_i \qquad (13) $$
Fig. 5. Distributed Arithmetic using ROM decomposition.
and performing the division by $\cos\theta_i$ together for all the N+1 iterations by dividing the value of $(x_N, y_N)$ by

$$ H_N = \prod_{i=0}^{N} \frac{1}{\cos\theta_i} \qquad (14) $$

The angle $\theta_i$, $i = 0, 1, 2, \ldots, N$, is chosen such that $\tan\theta_i$ is $2^{-i}$. This reduces the multiplication by $\tan\theta_i$ to a simple shift operation. As the iteration index increases, $\theta_i$ becomes smaller and smaller. We may terminate the iteration when the difference between $\theta$ and $\sum_{i=0}^{N}\theta_i$ becomes very small for some value of N. The remaining angle by which the vector needs to be rotated after completion of i iterations is indicated by the parameter $z_{i+1}$ and is defined by equation (15).
$$ z_{i+1} = z_i - \theta_i \qquad (15a) $$
$$ z_0 = \theta \qquad (15b) $$
$\theta_i$ is considered to be positive when the rotation required is anticlockwise and negative otherwise. To approximate an arbitrary angle using $\theta_i$ of the form $\tan^{-1}(2^{-i})$, $\theta_i$ may have to be chosen to be negative for some values of i. Since $\tan\theta_i$ is $+2^{-i}$ when $\theta_i$ is positive and $-2^{-i}$ otherwise, the iterative equations may be rewritten as (17)-(19) with

$$ d_i = \operatorname{sgn}(z_i) \qquad (16) $$
The shift and add operation required for each iteration is carried out using a single shift and add block in the serial CORDIC scheme (also referred to as the folded CORDIC scheme). Separate hardware blocks are used for each iteration in the case of the parallel or unrolled CORDIC scheme. The block diagram of the unrolled CORDIC unit with 5 stages (corresponding to 5 iterations) is shown in Fig. 6 [16]. The entire CORDIC unit is reduced to an array of interconnected adder-subtractors [16]. The functionality of the circuit is verified by taking the cosine value as output.
$$ x_{i+1} = x_i - d_i\, y_i\, 2^{-i} \qquad (17) $$
$$ y_{i+1} = y_i + d_i\, x_i\, 2^{-i} \qquad (18) $$
$$ z_{i+1} = z_i - d_i \tan^{-1}(2^{-i}) \qquad (19) $$
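The scale factor $H_N = \prod_{i=0}^{N} \frac{1}{\cos\theta_i}$ may be simplified because $\cos\theta_i$ approaches 1 for very small values of $\theta_i$: the product $K = \prod_{i}\cos\theta_i$ may be computed for N=6 and may be used for any other value of N > 6. For N=6, $K = \prod_{i=0}^{6}\cos\theta_i = 0.6073$.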
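The rotation-mode iteration of equations (17)-(19), followed by the scale correction, can be sketched in software as below. This is a floating-point illustration under stated assumptions, not the unrolled hardware of Fig. 6: the function names are hypothetical, sgn(0) is taken as +1, and the multiplications by $2^{-i}$ stand in for the shifts used in hardware.

```python
import math


def scale_factor(n_iter: int) -> float:
    """K = product of cos(theta_i), i = 0 .. n_iter-1, with tan(theta_i) = 2**-i."""
    k = 1.0
    for i in range(n_iter):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))
    return k


def cordic_rotate(theta: float, n_iter: int = 16):
    """Rotation-mode CORDIC: return approximations of (cos(theta), sin(theta)).

    Converges for |theta| up to roughly 1.74 rad (the sum of all theta_i).
    """
    x, y, z = 1.0, 0.0, theta
    for i in range(n_iter):
        d = 1.0 if z >= 0.0 else -1.0            # d_i = sgn(z_i)
        shift = 2.0 ** -i
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)                # z_{i+1} = z_i - d_i * atan(2**-i)
    k = scale_factor(n_iter)                     # compensate the accumulated gain
    return x * k, y * k
```

`scale_factor(7)` covers i = 0 to 6 and rounds to 0.6073, the value of K quoted above. Applying the correction once after the loop, rather than inside every iteration, is what keeps each stage down to shifts and additions.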
In all the three filters, wave-pipelined circuits dissipate less power than pipelined circuits if the power dissipation due to the overhead circuits is not taken into account. However, the reduction in power dissipation for the wave-pipelined circuit decreases as the logic depth increases. (As noted in Section 4, the logic depths required for the DA filter with 8, 16 and 32 taps are 4, 5 and 6 respectively.) If the power dissipation due to the overhead circuits is also taken into account, the power dissipation of the pipelined filter is higher by 15.1% and 6.1% for the 16 tap and 32 tap filters respectively. For the 8 tap filter, the power dissipation of the pipelined filter is lower by 5.9%. The power dissipation due to overheads decreases with the number of taps, as a filter with a higher number of taps operates at a lower frequency. At lower logic depths, the overheads make wave-pipelining inefficient. At higher logic depths, the overheads along with the increased capacitance make the wave-pipelined DA filter less efficient with regard to power dissipation.
Fig. 7. Implementation results of 8, 16 and 32 tap filters using DAA approach.
[Figure not reproduced; legend (type of circuit): non-pipelined circuit, pipelined circuit, wave-pipelined circuit, wave-pipelined circuit (additional overhead).]

In order to assess the superiority of wave-pipelining with regard to power dissipation, both wave-pipelined and pipelined circuits are operated at the same frequency (corresponding to the maximum operating frequency of the wave-pipelined circuit) and the power dissipated for the 8, 16 and 32 tap filters is given in Fig. 7.
number of logic elements, number of registers, maximum operating frequency and power dissipated are computed and the results are given in Fig. 8. Overheads required for the wave-pipelined circuits are also shown in Fig. 8. From this figure, it may be noted that the wave-pipelined CORDIC unit is faster by a factor of 1.19-1.26 compared to the non-pipelined CORDIC unit. The pipelined CORDIC unit is faster by a factor of 1.83-2 compared to the wave-pipelined CORDIC unit. This is achieved at the cost of an increase in the number of registers by a factor of 4.43-8.43 and an increase in the number of LEs by a factor of 1.14-1.42 compared to the wave-pipelined CORDIC unit. In order to assess the superiority of wave-pipelining with regard to power dissipation, both wave-pipelined and pipelined circuits are operated at the same frequency and the power dissipated for different word sizes is also given in Fig. 8. From this figure, it may be noted that if the overheads are not considered, then the pipelined CORDIC unit dissipates 5.5-7.3% more power than the wave-pipelined CORDIC unit. If the overheads required for the wave-pipelined CORDIC unit are also considered, then the pipelined CORDIC unit dissipates more power (2.54%) than the wave-pipelined CORDIC unit only for the higher word size (32 bit). This can be explained as follows. The area for the overhead is independent of word size. However, the highest operating frequency decreases with word size. Hence, the power dissipated by the overhead circuit decreases with word size. On the other hand, the power dissipated by the CORDIC unit increases with the word size. If the word size is small, then the power dissipated by the overheads becomes significant and this makes the wave-pipelined CORDIC unit inefficient. If the word size is increased, then the wave-pipelined CORDIC unit becomes more and more efficient compared to the pipelined CORDIC unit. It may be noted that the SOC approach is also applicable to Xilinx FPGAs.
Fig. 8. Implementation results of CORDIC unit with word
[Figure not reproduced; legend: non-pipelined circuit, pipelined circuit, wave-pipelined circuit, wave-pipelined circuit (additional overhead).]

7. Conclusion
The automation scheme proposed in this paper for the FPGA implementation of wave-pipelined circuits is tested using DAA based filters and a CORDIC based sine wave generator. It is observed that wave-pipelined circuits operate faster than non-pipelined circuits. The pipelined circuits are in turn faster than the wave-pipelined circuits, and this is achieved with an increase in the number of registers and LEs or slices. When both pipelined and wave-pipelined circuits are operated at the same frequency, the superiority of one over the other with regard to power dissipation depends on the logic depth of the circuit and the input word size. From the implementation results, it is observed that only for higher word sizes and medium-tap filters is the SOC based wave-pipelined circuit more efficient in both area and power dissipation than the pipelined circuit.

References
[1] Flavio R. Wagner, Wander O. Cesario, Luigi Carro and Ahmed A. Jerraya, "Strategies for the integration of hardware and software IP components in SOC," Elsevier Integration, The VLSI Journal, pp. 1-31, Nov. 2003.
[2] G. Martin and H. Chang, "System-on-Chip design," Proc. of Intl. Conf. on ASIC, pp. 12-17, 2001.
[3]
[4] K. K. Parhi, "VLSI signal processing systems," John Wiley & Sons, 1999.
[5] J. Nyathi and J. G. Delgado-Frias, "A hybrid wave pipelined network router," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 49, no. 12, pp. 1764-1772, Dec. 2002.
[6] W. P. Burleson, M. Ciesielski, F. Klass and W. Liu, "Wave-pipelining: a tutorial and research survey," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 3, pp. 464-474, Sep. 1998.
[7] C. Thomas Gray, W. Liu and R. Cavin, "Wave Pipelining: Theory and Implementation," Kluwer Academic Publishers, 1993.
[8] E. I. Boemo, S. Lopez-Buedo and J. M. Meneses, "Wave pipelines via LUTs," IEEE International Symposium on Circuits and Systems ISCAS '96, vol. 4, pp. 185-188, 1996.
[9] G. Lakshminarayanan and B. Venkataramani, "Optimization techniques for FPGA based wave-pipelined DSP blocks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 7, pp. 783-793, July 2005.
[11] M. J. S. Smith, "Application Specific Integrated Circuits," Pearson Education Asia Pvt. Ltd, Singapore, 2003.
[12] G. Seetharaman, B. Venkataramani and G. Lakshminarayanan, "Design and FPGA implementation of self-tuned wave-pipelined filters," IETE Journal of Research, vol. 52, no. 4, pp. 305-313, July-August 2006.
[13] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. on Electronic Computers, vol. EC-8, no. 3, pp. 330-334, Sept. 1959.
[14] J. S. Walther, "A Unified algorithm for elementary functions," Spring Joint Computer Conf., pp. 379-385, 1971.
[15] R. Andraka, "A Survey Of CORDIC Algorithm For FPGAs," Proc. of ACM/SIGDA sixth international symposium of FPGAs (FPGA'98), Monterey, CA, pp. 191-200, Feb. 22-24, 1998.
[16] W. Tuttlebee, "Software defined radio: Baseband technology for 3G," Wiley, 2004.