Dual-Field Multiplier Architecture For Cryptographic Applications
[Figure: block diagram of a processing unit with latch, shifter, MUX-0, MUX-1, local control logic, field select, and (3, 2) adder array]
Fig. 1. Processing unit of dual-radix architecture with radix-2 for $GF(p)$ and radix-4 for $GF(2^n)$
be in the calculations. $cc_0$ and $cs_0$ in Figure 1 are the
least significant digits of the carry and sum parts of the
partial result c.
The dual-radix architecture consists of one or more [...]

After e + 1 clock cycles, where e is the number of words in the modulus (i.e. e = n/w), a PU finishes its portion of work and becomes free for further computation. One can refer to [7] for more information about the pipeline organization.

A redundant representation (Carry-Save) is used for the partial result in the architecture. Thus, for the partial result we can write c = cc + cs, where cc and cs stand for the carry and sum parts of the partial result. The redundant format necessitates an extra addition operation to transform the final result into nonredundant format at the end of the calculations. The transformation is performed by a carry-propagate adder (e.g. a carry look-ahead adder), which is also capable of modulo-2 addition in $GF(2^n)$-mode. The adder is also useful for performing the precomputation of b + p, which is used during multiplication.

B. (3, 2) Adder Array

An n-bit (3, 2) adder array, shown in Figure 1, consists of two parts: single-bit dual-field adders (DFA) and a shift-and-alignment layer, as demonstrated in Figure 2. When used in $GF(p)$-mode, the DFA simply becomes a Carry-Save adder. A DFA cell is basically a full adder capable of doing addition with or without carry. It has an input called FSEL that enables this functionality. Our implementation results demonstrated that this additional functionality is obtained almost without any cost.

Fig. 2. Dual Field Adder Array for radix-(2,4) Unified Multiplier

C. Selection Circuitry

As stated previously, the selection logic for the radix-(2,4) multiplier, which is shown in Figure 3, determines which of the inputs of MUX-0 and MUX-1 in Figure 1 are to be added in the (3, 2) adder array, which in turn calculates $c := c + a_i \cdot b + q \cdot p$.

In $GF(p)$-mode the multiplier uses radix-2, hence $m_{00}$ and $m_{01}$ must be calculated while $m_{10}$ and $m_{11}$ are forced to be 0, since input 0 of MUX-1 is always selected in this mode. We can use the following formulae to express the control inputs of MUX-0.

$$m_{00} = a_{i,0}$$

$$m_{01} = q_0 = cs_{0,0} \oplus cc_{0,0} \oplus a_{i,0} \cdot b_{0,0}$$

where $\oplus$ stands for modulo-2 addition, $a_{i,j}$ denotes the jth bit of the digit $a_i$, $q_j$ is the jth bit of q, and $cs_{i,j}$ and $cc_{i,j}$ are the sum and carry bits of the partial result, respectively.
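The $GF(p)$-mode data path described above (carry-save partial result, (3, 2) adder array, precomputed b + p, and the selection bits for MUX-0) can be illustrated with a small word-level model. This is only a sketch under stated assumptions, not the hardware design: the function names and the `carry_enable` flag are hypothetical (the polarity of FSEL is not stated here), the MUX-0 input order (0, b, p, b + p) is assumed, the one-bit right shift after each addition is the usual radix-2 Montgomery step [5], and arbitrary-precision integers stand in for the w-bit words and the pipeline of PUs.

```python
def dfa_array(x, y, z, carry_enable=True):
    """Word-level model of a (3, 2) dual-field adder array.

    With carries enabled, x + y + z == sum + carry (carry-save addition).
    With carries disabled, sum is the modulo-2 (XOR) addition and carry is 0.
    carry_enable is a hypothetical stand-in for the FSEL input.
    """
    s = x ^ y ^ z
    carry = (((x & y) | (x & z) | (y & z)) << 1) if carry_enable else 0
    return s, carry


def montgomery_gfp_radix2(a, b, p, n):
    """Radix-2 Montgomery product with a carry-save partial result c = cc + cs.

    Assumes p is odd, 0 <= a < 2**n and 0 <= b < p. The result lies in
    [0, 2p) and is congruent to a * b * 2**(-n) modulo p.
    """
    b_plus_p = b + p                 # precomputed once and reused in every step
    mux0 = (0, b, p, b_plus_p)       # assumed input order, indexed by (m01, m00)
    cs = cc = 0                      # sum and carry parts of the partial result
    for i in range(n):
        a_i = (a >> i) & 1
        m00 = a_i                                    # m00 = a_{i,0}
        m01 = (cs & 1) ^ (cc & 1) ^ (a_i & b & 1)    # m01 = q_0
        cs, cc = dfa_array(cs, cc, mux0[(m01 << 1) | m00])
        # q_0 was chosen so that the least significant bit vanishes.
        assert cs & 1 == 0 and cc & 1 == 0
        cs >>= 1                     # exact division by 2 of both parts
        cc >>= 1
    return cs + cc                   # final carry-propagate addition


if __name__ == "__main__":
    a, b, p, n = 123, 200, 239, 8
    r = montgomery_gfp_radix2(a, b, p, n)
    # r * 2^n must be congruent to a * b modulo p.
    print((r * 2**n - a * b) % p == 0)
```

Because b + p is available as a single MUX-0 input, each step adds only three operands (cs, cc, and the selected multiple), which is what lets a (3, 2) adder array handle the whole update.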
Fig. 3. Selection logic for radix-(2,4) multiplier

[Figure: three panels for A1 and A2 plotted against word length (w): area in # of gates (NAND), time, and time x area]
Fig. 4. Implementation results: Critical path delay and area
On the other hand, the multiplier computes with radix-4 in $GF(2^n)$-mode. Thus, the select inputs of MUX-1 must also be calculated. For this, we use the formulae

$$m_{10} = a_{i,1} \cdot \mathit{FSEL}$$

$$m_{11} = q_1 = \left[ (cs_{0,1} \oplus a_{i,0} \cdot b_{0,1} \oplus a_{i,1} \cdot b_{0,0}) \oplus (cs_{0,0} \oplus a_{i,0} \cdot b_{0,0}) \cdot p_{0,1} \right] \cdot \mathit{FSEL}$$

Note that the first input of MUX-1, cc, is always zero in this mode: the redundant form is still used for the partial result, but its carry part is forced to zero.

III. Implementation results

We implemented processing units of two different multiplier architectures: (A1) the original unified multiplier in [7], and (A2) the radix-(2,4) multiplier. We used VHDL to implement the two architectures and synthesized the resulting code using Mentor Graphics tools for an ASIC technology of 0.5 µm AMI CMOS (ADK library [14]).

Figure 4 demonstrates the area and time delay of the two PU designs for different word sizes. Area consumption is always given in terms of 2-input NAND gates. Due to the highly modular nature of the design, the critical path of a PU determines the maximum clock frequency that can be applied to the whole multiplier.

As can easily be observed from Figure 4, there is an increase in the area of the new architecture. There are two basic reasons for this increase: (1) an extra interstage register for passing the precomputed value, b + p, to the next stage, and (2) the selection logic. The selection logic becomes more complicated due to what may appropriately be called a look-ahead technique, which processes the least significant bits of the operands. The fact that the two least significant bits of some operands are needed in the look-ahead technique partially explains the further increase in the area. The more complicated shift-and-alignment layer is another reason for the larger area usage in the dual-radix design. Note also that the relative increase in area becomes less significant as the word size increases. This can be explained by the fact that the area of the selection logic is independent of word size. When w = 32, the area consumed by the selection logic becomes less significant. For example, the increase in area of the dual-field multiplier (A2) is 45% when w = 32. The use of the precomputation technique in the architecture A2 improves the critical path delay by 18% to 23%.

The performance of the two multipliers in terms of the clock cycle count to perform a multiplication is determined, to a large extent, by the number of PUs (t) and the word size (w), which are subject to the limitations on the silicon area available. Therefore, the relative increase in the area of a PU may be misleading in evaluating the overall performance of the new architecture. Both architectures utilize many PUs organized in a pipeline. To provide more insight into the overall effect of the new architecture on area and time, we investigated the time to compute a multiplication for a precision range of cryptographic interest given a limited area. Figure 5 demonstrates the results for multiplier configurations in $GF(p)$-mode with approximately 30,000 gates. We designed the multipliers for each architecture by fitting as many PUs as possible into this area.

In this configuration, the new architecture, A2, offers a significant speedup in time performance over the original architecture A1 for precisions in the range [160, ~500] bits. Beyond the precision of 500 bits, the higher area requirements of the new architecture have a negative impact on the performance. For the same area, the new architecture A2 is faster by 13% to 35%. Note that the maximum speedup of the new architecture exceeds the maximum speedup provided by a single PU. This is because having more PUs does not always improve performance and may even result in a slight degradation for some bit lengths. The dual-
radix architecture offers a significant speedup over A1 in $GF(2^n)$-mode. It outperforms A1 by 56% to 67% in this mode.

[Figure: three panels, w = 8, w = 16, and w = 32, plotting time (seconds) against precision for A1 and A2]
Fig. 5. Multiplication Timings (in µs) for an area of 30,000 gates with w = 8, 16, and 32 in $GF(p)$-mode

IV. Summary and Conclusions

Using the design methodology proposed in [7], we presented a new unified multiplier architecture, called the dual-radix architecture, for binary extension and prime fields. The architecture utilizes a precomputation technique and improves critical path delay significantly. The cost of implementing the precomputation technique in hardware in terms of area is studied, and it has been concluded that the overall impact is insignificant for a large range of precision. The dual-radix architecture also facilitates faster computation of multiplication in $GF(2^n)$-mode than in $GF(p)$-mode. The area and speed characteristics of the dual-radix architecture are also extensively investigated, and its performance in terms of area and time is compared against the single-radix, unified multiplier architecture. At the expense of using extra resources, which proved to have a very limited impact on the silicon area under certain circumstances, it provides significant improvement in critical path delay compared to the original unified design in both $GF(p)$- and $GF(2^n)$-modes. Furthermore, it provides superior performance in $GF(2^n)$-mode.

References

[1] J.-J. Quisquater and C. Couvreur. Fast decipherment algorithm for RSA public-key cryptosystem. Electronics Letters, 18(21):905–907, October 1982.
[2] W. Diffie and M. E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, 22:644–654, November 1976.
[3] National Institute for Standards and Technology. Digital Signature Standard (DSS). FIPS PUB 186-2, January 2000.
[4] N. Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.
[5] P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519–521, April 1985.
[6] Colin D. Walter. An improved linear systolic array for fast modular exponentiation. IEE Proceedings - Computers and Digital Techniques, 147(5):323–328, Sept. 2000.
[7] E. Savaş, A. F. Tenca, and Ç. K. Koç. A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m). In Ç. K. Koç and C. Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2000, Lecture Notes in Computer Science No. 1965, pages 281–296. Springer, Berlin, Germany, 2000.
[8] Ç. K. Koç and T. Acar. Montgomery multiplication in GF(2^k). Designs, Codes and Cryptography, 14(1):57–69, April 1998.
[9] Johann Grossschadl. A bit-serial multiplier architecture for finite fields GF(p) and GF(2^k). In Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science, No. 2162, pages 202–219. Springer-Verlag, Berlin, 2001.
[10] A. Bernal and A. Guyot. Design of a modular multiplier based on Montgomery's algorithm. In 13th Conference on Design of Circuits and Integrated Systems, pages 680–685, Madrid, Spain, November 17–20 1998.
[11] P. Kornerup. High-radix modular multiplication for cryptosystems. In E. Swartzlander, Jr., M. J. Irwin, and G. Jullien, editors, Proceedings, 11th Symposium on Computer Arithmetic, pages 277–283, Windsor, Ontario, June 29 – July 2 1993. IEEE Computer Society Press, Los Alamitos, CA.
[12] C. D. Walter. Space/Time trade-offs for higher radix modular multiplication using repeated addition. IEEE Transactions on Computers, 46(2):139–141, February 1997.
[13] Colin D. Walter. Montgomery exponentiation needs no final subtractions. Electronics Letters, 35(21):1831–1832, October 1999.
[14] ASIC design kit. Mentor Graphics Co.