4.1 Unsigned Binary Multiplication: Digital Computer Arithmetic Datapath Design


56 DIGITAL COMPUTER ARITHMETIC DATAPATH DESIGN

below. Although there are various perspectives on the implementation of
multiplication, its basic building block is usually the adder.
1 Partial Product (PP) Generation - utilizes a collection of gates to generate
the partial product bits (i.e. ai · bj ).
2 Partial Product (PP) Reduction - utilizes adders (counters) to reduce the
partial products to sum and carry vectors.
3 Final Carry-Propagate Addition (CPA) - adds the sum and carry vectors to
produce the product.

4.1 Unsigned Binary Multiplication


The multiplication of an n-bit unsigned binary integer a by an m-bit unsigned
binary integer b creates the product p. This multiplication results in m
partial products, each of which is n bits wide. Each partial product bit is
formed by computing ai · bj for one bit of each operand. The m partial products
are added together to produce an (n + m)-bit product as shown below. Arranging
the partial products by weight forms a parallelogram typically called a partial
product matrix. For example, in Figure 4.1 a 4-bit by 4-bit multiplication
matrix is shown. In lieu of each value in the matrix, a dot is sometimes shown
for each partial product, multiplicand, multiplier, and product bit. This type
of diagram is typically called a dot diagram and gives arithmetic designers a
better idea of which partial products to add to form the product.

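\[
\begin{aligned}
P &= A \cdot B \\
  &= \left( \sum_{i=0}^{n-1} a_i \cdot 2^{i} \right) \cdot \left( \sum_{j=0}^{m-1} b_j \cdot 2^{j} \right) \\
  &= \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} a_i \cdot b_j \cdot 2^{i+j}
\end{aligned}
\]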
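
As a minimal behavioral sketch of the double sum above for n = m = 4, the module below forms each row of partial product bits with AND gates and adds the rows after shifting row j left by j places. The module name pp_sum4x4 and its signal names are hypothetical and do not come from the text. A real multiplier would replace the single addition expression with the reduction and CPA steps.

```verilog
// Hypothetical illustration: module and signal names are not from the text.
module pp_sum4x4 (p, a, b);
   input  [3:0] a, b;
   output [7:0] p;

   // Row j of the partial product matrix: every bit of a ANDed with b[j].
   wire [3:0] pp0 = a & {4{b[0]}};
   wire [3:0] pp1 = a & {4{b[1]}};
   wire [3:0] pp2 = a & {4{b[2]}};
   wire [3:0] pp3 = a & {4{b[3]}};

   // Row j carries weight 2^j, so widen each row to 8 bits and shift by j.
   assign p = {4'b0000, pp0}
            + ({4'b0000, pp1} << 1)
            + ({4'b0000, pp2} << 2)
            + ({4'b0000, pp3} << 3);
endmodule
```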

The overall goal of most high-speed multipliers is to reduce the number of
partial products. Consequently, this leads to a reduced amount of hardware
necessary to compute the product. Therefore, many designs that are visited in
this chapter involve trying to minimize the complexity in one of the three steps
listed above.

                        a3    a2    a1    a0
                  x     b3    b2    b1    b0

                        a3b0  a2b0  a1b0  a0b0
                  a3b1  a2b1  a1b1  a0b1
            a3b2  a2b2  a1b2  a0b2
      a3b3  a2b3  a1b3  a0b3
p7    p6    p5    p4    p3    p2    p1    p0

Figure 4.1. 4-bit by 4-bit Multiplication Matrix.

4.2 Carry-Save Concept


In multiplication, adders are organized to reduce the execution time. However,
from the topics in the previous chapter, the major source of delay in adders
comes from the carries [Win65]. Therefore, many designers have concentrated on
reducing the total time that is involved in summing carries. Since
multiplication is concerned with not just two operands but many of them, it is
imperative to organize the hardware to shorten the carry path or chain.
Therefore, many implementations consider adders according to two principles:
Carry-Save Addition (CSA) - idea of utilizing addition without the carries
connected in series, so that each adder simply counts its inputs.
Carry-Propagate Addition (CPA) - idea of utilizing addition with the carries
connected in series to produce a result in either conventional or redundant
notation.
Each adder is the same as the full adder discussed in Chapter 3; the main
difference lies in how the connections are made from adder to adder. Because
each adder computes both carry and save (sum) information, VLSI designers
sometimes refer to it as a carry-save adder or CSA. As mentioned previously,
because each adder counts the number of inputs that are 1, it is sometimes also
called a counter. A (c, d) counter is an adder where c refers to the column
height (the number of inputs) and d is the number of bits at its output. For
example, a (3, 2) counter counts 3 inputs, all with the same weight, and
produces two outputs. A (3, 2) counter is shown in Figure 4.2.

   A  B  D
   |  |  |
  +-------+
  |  CSA  |
  +-------+
    |   |
    C   S

Figure 4.2. A Carry-Save Adder or (3, 2) counter.

Therefore, an n-bit CSA can take three n-bit operands and generate an n-bit
partial sum and an n-bit carry. Large operand sizes would require more CSAs to
produce a result. In either case, a CPA is still required to produce the
correct final result. For example, Table 4.1 adds together A + B + D + E with
the values 10 + 6 + 11 + 12. The implementation utilizing the carry-save
concept for this example is shown in Figure 4.4. As seen in Table 4.1, the
partial sum is 31 and the carry is 8, which add to the correct result of 39.
This process of performing addition on a given array to produce an output array
with a smaller number of bits is called reduction. The Verilog code for this
implementation is shown in Figure 4.3.

  Row     Column Value     Column Value
          (Radix 2)        (Radix 10)
  A         1 0 1 0            10
  B         0 1 1 0             6
  D         1 0 1 1            11
+ E         1 1 0 0            12
  S       1 1 1 1 1            31
  C       0 1 0 0 0             8

Table 4.1. Carry-Save Concept Example.
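
The rows of Table 4.1 can be reproduced with a short behavioral testbench. This is only a sketch: it models each row of (3, 2) counters with bitwise operators rather than the structural code of Figure 4.3, it assumes a carry-in of 0, and the module and signal names (csa_table_tb, s0, c0, x, c1, total) are hypothetical.

```verilog
// Hypothetical testbench: reproduces Table 4.1 with bitwise operators.
module csa_table_tb;
   reg  [3:0] A, B, D, E;

   // First row of (3,2) counters: A + B + D with no carry chain between columns.
   wire [3:0] s0 = A ^ B ^ D;
   wire [3:0] c0 = (A & B) | (A & D) | (B & D);

   // First-row carries move one column left before the second row (Cin = 0).
   wire [3:0] x  = {c0[2:0], 1'b0};

   // Second row: E + first-row sums + shifted first-row carries.
   wire [4:0] S  = {c0[3], E ^ s0 ^ x};
   wire [3:0] c1 = (E & s0) | (E & x) | (s0 & x);
   wire [4:0] C  = {c1, 1'b0};           // carry vector aligned to its weight

   // Final carry-propagate addition of the sum and carry vectors.
   wire [5:0] total = {1'b0, S} + {1'b0, C};

   initial begin
      A = 4'd10; B = 4'd6; D = 4'd11; E = 4'd12;
      #1 $display("S = %0d, C = %0d, S + C = %0d", S, C, total);
   end
endmodule
```

For the operands of Table 4.1 this should print S = 31, C = 8, S + C = 39.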

CSAs can be arranged in many topologies, provided that the carry-out from one
adder is not connected to the carry-in of the next adder. Eventually, a CPA
is utilized to form the true result. An organization of CSAs and a CPA that
adds m operands, or words, each of which is n bits long is called an m-word by
n-bit multi-operand adder (MOA). An m-word by n-bit multi-operand adder can be
implemented using (m − 2) n-bit CSAs and 1 CPA. Unfortunately, because the
result grows as more operands are added together, the partial sum and carry
must grow as well. Therefore, the result will contain n + ⌈log2 (m)⌉ bits. In
our example above, this means a 4 + log2 (4) = 6-bit result is produced.
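
The bit-width rule can be checked with a short bound. This is a worked sketch rather than a derivation taken from the text: each of the m operands is at most 2^n − 1, and m never exceeds 2 raised to the ceiling of log2 (m).

```latex
\[
\sum_{k=1}^{m} x_k \;\le\; m\,(2^{n} - 1) \;<\; m \cdot 2^{n} \;\le\; 2^{\,n + \lceil \log_2 m \rceil}
\]
\[
n = 4,\; m = 4: \qquad 4 \cdot (2^{4} - 1) = 60 \;<\; 2^{6} = 64
\]
```

The sum of Table 4.1, 39, is consistent with this: it needs 6 bits.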
Higher order counters can be created by putting together various sized
counters. A higher order counter (p, q) takes p input bits and produces q
output bits.

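module moa4x4 (S, C, A, B, D, E, Cin);

   input [3:0]  A, B, D, E;
   input        Cin;
   output [4:0] S;
   // C[4] is never driven in this module, so C is declared 4 bits wide to
   // match Figure 4.4. C[i] carries weight 2^(i+1).
   output [3:0] C;

   // First-row sums and carries, named <s|c>_<row>_<column>
   wire         c_0_0, c_0_1, c_0_2;
   wire         s_0_0, s_0_1, s_0_2, s_0_3;

   // First row of CSAs: add A, B and D column by column
   fa csa1 (c_0_0, s_0_0, A[0], B[0], D[0]);
   fa csa2 (c_0_1, s_0_1, A[1], B[1], D[1]);
   fa csa3 (c_0_2, s_0_2, A[2], B[2], D[2]);
   fa csa4 (S[4], s_0_3, A[3], B[3], D[3]);

   // Second row of CSAs: add E to the first-row sums and shifted carries
   fa csa5 (C[0], S[0], E[0], s_0_0, Cin);
   fa csa6 (C[1], S[1], E[1], s_0_1, c_0_0);
   fa csa7 (C[2], S[2], E[2], s_0_2, c_0_1);
   fa csa8 (C[3], S[3], E[3], s_0_3, c_0_2);

endmodule // moa4x4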

Figure 4.3. 4-operand 4-bit Multi-Operand Adder Verilog Code.
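
Figure 4.3 instantiates a module named fa that is defined elsewhere in the book. The following is a minimal sketch of such a full adder, not the book's own definition. The port order (carry-out, sum, then the three inputs) is an assumption inferred from how Figure 4.3 connects S[4] and s_0_3 on instance csa4, and the port names are hypothetical.

```verilog
// Full adder, i.e. a (3,2) counter.
// Assumed port order, inferred from Figure 4.3: (cout, sum, a, b, cin).
module fa (cout, sum, a, b, cin);
   input  a, b, cin;
   output cout, sum;

   // Sum is 1 when an odd number of inputs are 1.
   assign sum  = a ^ b ^ cin;
   // Carry is 1 when at least two inputs are 1.
   assign cout = (a & b) | (a & cin) | (b & cin);
endmodule
```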

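[Block diagram: a first row of four CSAs adds the bits A3 B3 D3, A2 B2 D2,
A1 B1 D1, and A0 B0 D0. A second row of four CSAs adds E3, E2, E1, E0, and Cin
to the first-row outputs, producing S4 C3 S3 C2 S2 C1 S1 C0 S0.]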

Figure 4.4. A 4 operand 4-bit Multi-Operand Adder.

Since q bits can represent a number between 0 and 2^q − 1, the following
restriction is required: p ≤ 2^q − 1. In general, a (2^q − 1, q) higher order
counter requires (2^q − 1 − q) (3, 2) counters. The increase in complexity that
occurs in higher-order counters and multi-operand adders can make an
implementation complex, as seen by the Verilog code in Figure 4.3. Consequently,
some designers use programs that generate RTL code automatically. Another
useful technique is to utilize careful naming methodologies for each temporary
variable and declaration. For example, in Figure 4.3, the temporary variables
utilize s_0_2 to represent the sum from the first row of carry-save adders in
column 2.
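
As an illustration of the counter formula for q = 3, a (7, 3) counter should need 2^3 − 1 − 3 = 4 (3, 2) counters. The sketch below is one possible arrangement, not taken from the text: each continuous assignment stands in for one (3, 2) counter, and the names counter_7_3, counter_7_3_tb, and all signals are hypothetical. The small testbench compares the output with the number of 1s for every input pattern.

```verilog
// Hypothetical (7,3) counter built from four (3,2) counters.
module counter_7_3 (count, x);
   input  [6:0] x;
   output [2:0] count;
   wire         s1, c1, s2, c2, c3;

   assign {c1, s1}             = x[0] + x[1] + x[2];  // counter 1: weight-1 inputs
   assign {c2, s2}             = x[3] + x[4] + x[5];  // counter 2: weight-1 inputs
   assign {c3, count[0]}       = s1 + s2 + x[6];      // counter 3: weight-1 sums
   assign {count[2], count[1]} = c1 + c2 + c3;        // counter 4: weight-2 carries
endmodule

// Exhaustive check against a direct count of the 1 bits.
module counter_7_3_tb;
   reg  [7:0] v;        // one bit wider than x so the loop can terminate
   wire [2:0] count;
   integer    errors;

   counter_7_3 dut (count, v[6:0]);

   initial begin
      errors = 0;
      for (v = 0; v < 128; v = v + 1) begin
         #1;
         if (count != v[0] + v[1] + v[2] + v[3] + v[4] + v[5] + v[6])
            errors = errors + 1;
      end
      $display("mismatches: %0d", errors);
   end
endmodule
```

The testbench is expected to report zero mismatches.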

4.3 Carry-Save Array Multipliers (CSAM)


The simplest of all multipliers is the carry-save array multiplier (CSAM).
The basic idea behind the CSAM is that it performs paper-and-pencil style
multiplication. In other words, each partial product is added in turn. A
4-bit by 4-bit unsigned CSAM is shown in Figure 4.6. Each column of the
multiplication matrix corresponds to a diagonal in the CSAM. CSAMs are usually
laid out as a square because this reduces congestion in the metal tracks or
interconnect. This tends to lower capacitance and makes it easier for
engineers to organize the design.
The CSAM performs PP generation utilizing AND gates and uses an array of CSAs
to perform reduction. The AND gates form the partial products and the CSAs sum
these partial products together, or reduce them. Since the reduction itself
produces the lower half of the product, the final CPA only needs to add the
upper half of the product. Array multipliers are typically easy to build both
in Verilog code and in custom layout; therefore, there are many
implementations that employ both.
For the CSAM implementation, each adder is modified so that it can perform
partial product generation and an addition. A modified half adder (MHA)
consists of an AND gate that creates a partial product bit and a half adder
(HA). The MHA adds this partial product bit from the AND gate with a partial
product bit from the previous row. A modified full adder (MFA) consists of an
AND gate that creates a partial product bit, and a full adder (FA) that adds
this partial product bit with sum and carry bits from the previous row.
Figure 4.5 illustrates the block diagram of the MFA.
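
The MHA and MFA descriptions above might be coded as follows. This is a sketch under stated assumptions, not the book's code: the module names mha and mfa, the port order, and the port names (sin for the incoming sum or partial product bit, cin for the incoming carry) are all hypothetical.

```verilog
// Modified half adder: AND gate plus half adder (hypothetical port list).
module mha (cout, sum, a, b, sin);
   input  a, b, sin;
   output cout, sum;
   wire   pp = a & b;            // partial product bit

   assign sum  = pp ^ sin;       // half adder sum
   assign cout = pp & sin;       // half adder carry
endmodule

// Modified full adder: AND gate plus full adder (hypothetical port list).
module mfa (cout, sum, a, b, sin, cin);
   input  a, b, sin, cin;
   output cout, sum;
   wire   pp = a & b;            // partial product bit

   assign sum  = pp ^ sin ^ cin;                        // full adder sum
   assign cout = (pp & sin) | (pp & cin) | (sin & cin); // full adder carry
endmodule
```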
An n-bit by m-bit CSAM has n · m AND gates, m HAs, and ((n − 1) · (m − 1)) − 1
= n · m − n − m FAs. The final row of (n − 1) adders is an RCA CPA. The
worst case delay, shown by the dotted line in Figure 4.6, is equal to one AND
gate, two HAs, and (m + n − 4) FAs. In addition, due to the delay encountered
by each adder in the array, the worst-case delay can sometimes occur down a
column instead of across the diagonal. To decrease the worst case delay, the
(n − 1)-bit RCA on the bottom of the array can be replaced by a faster adder,
but this increases the gate count and reduces the regularity of the design.
Array multipliers typically have a complexity of O(n^2) for area and O(n) for
delay.
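
As a worked instance of these counts, substituting n = m = 4 into the formulas above gives:

```latex
\[
\begin{aligned}
\text{AND gates} &: \; n \cdot m = 4 \cdot 4 = 16 \\
\text{HAs} &: \; m = 4 \\
\text{FAs} &: \; (n-1)(m-1) - 1 = 3 \cdot 3 - 1 = 8 = n \cdot m - n - m \\
\text{worst-case delay} &: \; 1\ \text{AND} + 2\ \text{HA} + (m + n - 4)\ \text{FA} = 1\ \text{AND} + 2\ \text{HA} + 4\ \text{FA}
\end{aligned}
\]
```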
The Verilog code for a 4-bit by 4-bit CSAM is shown in Figure 4.8. The
partial product generation is performed by the PP module, which is shown in
Figure 4.7. The MHA and MFA modules are not utilized in this code, so that the
hardware inside the array multiplier remains visible; however, using this
nomenclature would establish a better coding structure.
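
Independently of Figures 4.7 and 4.8, the array organization described in this section can be sketched for the 4-bit by 4-bit case with vector operators. This is not the book's code. The module name csam4x4, the signal names, and the use of bitwise expressions in place of instantiated HA and FA cells are all assumptions made for illustration. Each 3-bit XOR/AND row stands for three adders, which gives 4 HAs and 8 FAs in total.

```verilog
// Hypothetical 4x4 carry-save array multiplier, written with vector operators.
module csam4x4 (p, a, b);
   input  [3:0] a, b;
   output [7:0] p;

   // Partial product generation: ppj[i] = a[i] & b[j], weight 2^(i+j).
   wire [3:0] pp0 = a & {4{b[0]}};
   wire [3:0] pp1 = a & {4{b[1]}};
   wire [3:0] pp2 = a & {4{b[2]}};
   wire [3:0] pp3 = a & {4{b[3]}};

   wire [2:0] s1, c1, s2, c2, s3, c3;
   wire       k0, k1;

   // Row 1: three HAs add pp0[3:1] and pp1[2:0] (weights 1 to 3).
   assign s1 = pp0[3:1] ^ pp1[2:0];
   assign c1 = pp0[3:1] & pp1[2:0];

   // Row 2: three FAs add pp2, the row 1 carries, and the row 1 sums
   // one column up (pp1[3] fills the top position).
   wire [2:0] x2 = {pp1[3], s1[2:1]};
   assign s2 = pp2[2:0] ^ c1 ^ x2;
   assign c2 = (pp2[2:0] & c1) | (pp2[2:0] & x2) | (c1 & x2);

   // Row 3: same pattern with pp3 and the row 2 outputs.
   wire [2:0] x3 = {pp2[3], s2[2:1]};
   assign s3 = pp3[2:0] ^ c2 ^ x3;
   assign c3 = (pp3[2:0] & c2) | (pp3[2:0] & x3) | (c2 & x3);

   // Lower half of the product comes straight out of the array.
   assign p[3:0] = {s3[0], s2[0], s1[0], pp0[0]};

   // Final 3-bit ripple-carry CPA (one HA, two FAs) forms the upper half.
   assign {k0, p[4]}   = s3[1] + c3[0];
   assign {k1, p[5]}   = s3[2] + c3[1] + k0;
   assign {p[7], p[6]} = pp3[3] + c3[2] + k1;
endmodule

// Exhaustive check of all 256 operand pairs against the * operator.
module csam4x4_tb;
   reg  [4:0] i, j;     // 5 bits so the loops can reach 16 and stop
   wire [7:0] p;
   integer    errors;

   csam4x4 dut (p, i[3:0], j[3:0]);

   initial begin
      errors = 0;
      for (i = 0; i < 16; i = i + 1)
         for (j = 0; j < 16; j = j + 1) begin
            #1;
            if (p !== i * j) errors = errors + 1;
         end
      $display("mismatches: %0d", errors);
   end
endmodule
```

The longest path in this sketch passes through one AND gate, the row 1 HA, the row 2 and row 3 FAs, and then the HA and two FAs of the final ripple-carry adder, which agrees with the delay expression given above for m = n = 4.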
