Paper formatting guidelines for FPL 2005 ... - Xun ZHANG

FINITE FIELD DIVISION IMPLEMENTATION

Jean-Pierre Deschamps
University Rovira i Virgili, Tarragona, Spain. email: [email protected]

Gustavo Sutter
Universidad Autonoma de Madrid, Madrid, Spain. email: [email protected]

ABSTRACT

A generalized version of the plus-minus algorithm is used for implementing dividers over $GF(p^n)$. Generic dividers have been synthesized in the general case of $GF(p^n)$ and in the particular cases of $GF(2^n)$ and $GF(p)$. The theoretical costs are $O(\log N)$, where $N$ is the number of field elements, and the theoretical computation times are $O(\log N)$ in the case of dividers over $GF(p^n)$ and $GF(2^n)$, and $O((\log N)^2)$ in the case of dividers over $GF(p)$. Finally, the results of FPGA implementations are reported, and a comparison is made between dividers over $GF(p)$, $GF(2^n)$ and $GF(p^n)$.

1. INTRODUCTION

Finite-field operations are used as computation primitives for executing numerous cryptographic algorithms, especially those related to the use of public keys (asymmetric cryptography). Classical examples are ciphering/deciphering, authentication and digital signature protocols based on RSA-type or elliptic curve algorithms. As regards inversion and division, several types of algorithms have been proposed. Some of them [1,3,6] are based on extensions of the Euclidean algorithm. Others are based on Fermat's little theorem and replace the inversion by multiplications [12,7,13,5]. Recently, another gcd-free inversion method has been proposed in the case of $GF(p)$ [8]. Nevertheless, as far as hardware implementations are concerned, the most efficient dividers over $GF(p)$ are based on the plus-minus algorithm [9,2,11], and a very similar algorithm [14] can also be used for implementing dividers over $GF(2^n)$. In this paper a generalized and slightly modified version of the previous algorithm is used for implementing dividers over $GF(p^n)$. The division algorithm is described in section 2. Then (section 3), generic dividers are synthesized in the general case of $GF(p^n)$ and in the particular case of $GF(2^n)$. Finally (section 4), the results of FPGA implementations are reported, and a comparison is made between dividers over $GF(p)$, $GF(2^n)$ and $GF(p^n)$.

2. ALGORITHMS

Given two polynomials $a(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_{n-1} x^{n-1}$ and $b(x) = b_0 + b_1 x + b_2 x^2 + \dots + b_{n-1} x^{n-1} + b_n x^n$, where $a_i, b_j \in GF(p)$ and $b_0 \neq 0$, then

if $a_0 = 0$: $a(x)$ is divisible by $x$ while $b(x)$ is not, so that $\gcd(a(x), b(x)) = \gcd(a(x)/x, b(x))$;
if $a_0 \neq 0$: define a new polynomial $ab(x) = a(x) - b(x)\,a_0 b_0^{-1}$; then $\gcd(a(x), b(x)) = \gcd(ab(x), b(x)) = \gcd(ab(x), a(x))$, and $ab(x)$ is divisible by $x$.

The following iterative algorithm computes the greatest common divisor of $a(x)$ and $b(x)$:

Algorithm 1 - greatest common divisor over $GF(p^n)$, first version

    while degree(a(x)) > 0 loop
      if a0 = 0 then
        a(x) := a(x)/x;
      else
        old_a(x) := a(x);
        a(x) := (a(x) - b(x).a0.b0^-1)/x;
        if degree(b(x)) > degree(old_a(x)) then b(x) := old_a(x); end if;
      end if;
    end loop;
    if a0 = 0 then gcd := monic(b(x)); else gcd := 1; end if;

If $b(x)$ is a polynomial of degree $m$, then $\mathrm{monic}(b(x)) = b(x)\,b_m^{-1}$. Observe that at each step the sum of the degrees of $a(x)$ and $b(x)$ is reduced. After a finite number of steps the degree of $a(x)$ will be equal to 0, i.e. $a(x) = a_0$. Then if $a_0 = 0$: $\gcd(0, b(x)) = \mathrm{monic}(b(x))$; if $a_0 \neq 0$: $\gcd(a_0, b(x)) = 1$.

Instead of computing the degrees of $a(x)$ and $b(x)$ at each step, a simpler solution [10,11] consists in defining upper bounds $D$ and $E$ such that $\mathrm{degree}(a(x)) \le D$ and $\mathrm{degree}(b(x)) \le E$, and in updating the values of $d = D - E$ and $Q = \min(D, E)$. Initially $D = n-1$, $E = n$, $d = -1$, $Q = n-1$.
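As a software cross-check of Algorithm 1, the following is a minimal Python sketch, not part of the synthesized circuits. It assumes that p is prime, that a polynomial is stored as a list of coefficients in GF(p) with the constant term first, and that Python 3.8 or later is available (for the modular inverse `pow(c, -1, p)`). The names `trim`, `degree` and `gcd_plus_minus` are illustrative only.

```python
def trim(poly):
    """Drop trailing zero coefficients (constant term first); zero -> []."""
    poly = list(poly)
    while poly and poly[-1] == 0:
        poly.pop()
    return poly


def degree(poly):
    """Degree of a coefficient list; -1 stands for the zero polynomial."""
    return len(trim(poly)) - 1


def gcd_plus_minus(a, b, p):
    """Monic gcd of a(x) and b(x) over GF(p), following Algorithm 1.

    Requires b0 != 0, as assumed in the text.
    """
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    assert b and b[0] != 0, "Algorithm 1 assumes b0 != 0"
    while degree(a) > 0:
        if a[0] == 0:
            a = a[1:]                            # a(x) := a(x)/x
        else:
            old_a = a
            k = a[0] * pow(b[0], -1, p) % p      # a0 . b0^-1
            width = max(len(a), len(b))
            diff = [
                ((a[i] if i < len(a) else 0) - k * (b[i] if i < len(b) else 0)) % p
                for i in range(width)
            ]
            a = trim(diff[1:])                   # constant term is 0: divide by x
            if degree(b) > degree(old_a):
                b = old_a
    if not a:                                    # a0 = 0: gcd(0, b) = monic(b)
        lead_inv = pow(b[-1], -1, p)
        return [c * lead_inv % p for c in b]
    return [1]                                   # a0 != 0: gcd(a0, b) = 1
```

For example, over GF(2) the call `gcd_plus_minus([1, 0, 1], [1, 0, 0, 1], 2)` returns `[1, 1]`, i.e. x + 1, the common factor of x^2 + 1 and x^3 + 1.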

*Supported by MEC (grant number SEG2004-05592).

0-7803-9362-7/05/$20.00 ©2005 IEEE


[Figure 1: block diagram of the AB_cell and UV_cell. Recoverable labels: inputs num(x), den(x), f(x), a(x), b(x), u(x), v(x); control signals oper_a, first_step and done (from control unit); c0 (from figure 2); one mod p inverter, mod p multipliers and mod p adders; outputs new_a(x), new_b(x), new_u(x), new_v(x).]

Fig. 1. AB_cell and UV_cell

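Algorithm 2 - greatest common divisor over $GF(p^n)$, second version

    b(x) := f(x); d := -1; q := n-1;
    while q > 0 loop
      if a0 = 0 then
        a(x) := a(x)/x;
        if d <= 0 then q := q-1; end if;
        d := d-1;
      else
        old_a(x) := a(x);
        a(x) := (a(x) - b(x).a0.b0^-1)/x;
        if d >= 0 then
          if d = 0 then q := q-1; end if;
          d := d-1;
        else
          d := -d-1; b(x) := old_a(x);
        end if;
      end if;
    end loop;
    if a0 = 0 then gcd := monic(b(x)); else gcd := 1; end if;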
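The counter updates of Algorithm 2 can be checked with a small software model. The Python sketch below is one possible transcription under stated assumptions: p is prime, polynomials are fixed-width coefficient lists with the constant term first, a(x) has degree at most n-1, f(x) has degree n with a nonzero constant term, and Python 3.8 or later is used for `pow(c, -1, p)`. The function name `gcd_with_bounds` and the `steps` counter are illustrative and do not come from the paper.

```python
def gcd_with_bounds(a, f, p):
    """Monic gcd of a(x) and f(x) over GF(p), driven by the counters d and q.

    a: coefficients a0..a(n-1) (shorter lists are zero-padded).
    f: coefficients f0..fn with f0 != 0.
    """
    n = len(f) - 1
    assert f[0] % p != 0 and len(a) <= n
    a = [c % p for c in a] + [0] * (n + 1 - len(a))   # width n+1, constant term first
    b = [c % p for c in f]
    d, q = -1, n - 1
    steps = 0
    while q > 0:
        steps += 1
        if a[0] == 0:
            a = a[1:] + [0]                           # a(x) := a(x)/x
            if d <= 0:
                q -= 1
            d -= 1
        else:
            old_a = a
            k = a[0] * pow(b[0], -1, p) % p           # a0 . b0^-1
            a = [(x - k * y) % p for x, y in zip(a, b)][1:] + [0]
            if d >= 0:
                if d == 0:
                    q -= 1
                d -= 1
            else:
                d, b = -d - 1, old_a
    # Fewer than 2n steps, and degree(a) = 0 once q = 0.
    assert steps < 2 * n and all(c == 0 for c in a[1:])
    if a[0] == 0:
        top = max(i for i, c in enumerate(b) if c)    # actual degree of b(x)
        inv = pow(b[top], -1, p)
        return [c * inv % p for c in b[:top + 1]]
    return [1]
```

The internal assertion checks the two properties claimed for the bounds: the loop stops after fewer than 2n iterations, and a(x) is then a constant. No degree computation is needed inside the loop, which is the point of the second version.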

In Algorithm 2, the bound $Q$ (counter q) is equal to 0 after fewer than $2n$ steps. Furthermore, every time that $Q$ is decreased, $b(x)$ is unchanged, so that when $Q = 0$ then $D = 0$, that is, $\mathrm{degree}(a(x)) = 0$.

An extension of Algorithm 2 computes the value of $z(x) = num(x)/den(x) \bmod f(x)$, where $f(x)$ is relatively prime to $den(x)$, i.e. $\gcd(den(x), f(x)) = 1$, and $f_0 \neq 0$. For that, two additional polynomials $u(x)$ and $v(x)$ are defined and updated at each step. Initially $a(x) = den(x)$, $b(x) = f(x)$, $u(x) = num(x)$, $v(x) = 0$. During the algorithm execution, $u(x)$ and $v(x)$ are updated in the same way as $a(x)$ and $b(x)$:

if $a(x) := a(x)/x$, then $u(x) := u(x)/x \bmod f(x)$;
if $a(x) := (a(x) - b(x)\,a_0 b_0^{-1})/x$, then $u(x) := (u(x) - v(x)\,a_0 b_0^{-1})/x \bmod f(x)$;
if $b(x) := a(x)$, then $v(x) := u(x)$.

It can be proven that $v(x)\,den(x) \equiv b(x)\,num(x) \bmod f(x)$ and $u(x)\,den(x) \equiv a(x)\,num(x) \bmod f(x)$. At the end of the algorithm execution $a(x) = a_0$ and either $a_0 = 0$ and $\mathrm{monic}(b(x)) = \gcd(den(x), f(x)) = 1$, so that $b(x) = b_0$ and $z(x) = v(x)\,b_0^{-1}$, or $a_0 \neq 0$, so that $z(x) = u(x)\,a_0^{-1}$. It remains to generate a procedure for computing $w(x)/x \bmod f(x)$:

$w(x)/x \bmod f(x) = (w(x) - f(x)\,w_0 f_0^{-1})/x$.

Algorithm 3 - division over $GF(p^n)$

    a(x) := den(x); b(x) := f(x); u(x) := num(x); v(x) := 0;
    d := -1; q := n-1;
    while q > 0 loop
      if a0 = 0 then
        a(x) := a(x)/x;
        u(x) := (u(x) - f(x).u0.f0^-1)/x;
        if d <= 0 then q := q-1; end if;
        d := d-1;
      else
        old_a(x) := a(x); old_u(x) := u(x);
        u(x) := u(x) - v(x).a0.b0^-1;
        u(x) := (u(x) - f(x).u0.f0^-1)/x;
        a(x) := (a(x) - b(x).a0.b0^-1)/x;
        if d >= 0 then
          if d = 0 then q := q-1; end if;
          d := d-1;
        else
          d := -d-1; b(x) := old_a(x); v(x) := old_u(x);
        end if;
      end if;
    end loop;
    if a0 = 0 then z(x) := v(x).b0^-1; else z(x) := u(x).a0^-1; end if;

3. CIRCUIT SYNTHESIS

3.1. Divider over $GF(p^n)$

A circuit for executing Algorithm 3 has been synthesized. The AB_cell and UV_cell blocks (figure 1) compute the new values of $a(x)$, $b(x)$, $u(x)$ and $v(x)$. The data path contains the AB_cell and UV_cell blocks, the registers which store a, b, u and v, the counters q and d, the flag generation circuit ($q > 0$, $d \le 0$, $d \ge 0$, $d = 0$) and the final-step circuit (figure 2). The combinational part is made up of about $4n$ mod p multipliers, $3n$ mod p adders, $5n + 2\log_2 p$ 2-to-1 multiplexers and 1 mod p inverter. The corresponding costs for the Spartan3 and Virtex-family programmable arrays are the following [4]: 1 LUT per 2-to-1 multiplexer, about $2(\log_2 p)^2$ LUTs per mod p multiplier, about $2\log_2 p$ LUTs per mod p adder. Thus, the total cost of the combinational part, without the mod p inverter, is approximately equal to $8n(\log_2 p)^2 + 6n\log_2 p + 5n + 2\log_2 p$ LUTs. For large values of p ($p \gg 2$), the total cost of the whole combinational part is roughly equal to

$c_1 = 8n(\log_2 p)^2 = 8(\log_2 p)(\log_2 N)$ LUTs, (1)

where $N = p^n$ is the number of field elements, plus the cost of a table storing $c^{-1}$ for every c in $GF(p)$. The delays of the AB_cell and UV_cell blocks are mainly defined by the delay of the mod p multipliers, which is a linear function of $\log_2 p$. Thus, the total computation time is roughly proportional to $n\log_2 p = \log_2 N$.

[Figure 2: final-step circuit. Recoverable labels: inputs u, v, a0, b0; flag (a = 0) generated by a NOR gate; c0 (to AB_cell); 1/c0 (from AB_cell); one mod p multiplier; output z.]

Fig. 2. Final-step circuit

3.2. Divider over $GF(2^n)$

In the binary case (p = 2), several simplifications can be made: the multipliers are AND gates, the adders are XOR gates and some blocks are no longer necessary. The simplified AB_cell and UV_cell blocks are shown in figure 3. The combinational part of the data path is made up of about $2n$ AND gates, $3n$ XOR gates and $5n$ 2-to-1 multiplexers (the fifth one selects the final result $z(x)$ among $u(x)$ and $v(x)$). Assuming that every 2-input gate and every 2-to-1 multiplexer is implemented with a LUT, the total number of LUTs is approximately equal to

$c_2 = 10n = 10\log_2 N$ LUTs, (2)

where $N = 2^n$ is the number of field elements. The delays of the AB_cell and UV_cell blocks are practically constant (there are no ripple adders). Thus, the total computation time is proportional to $n = \log_2 N$.

[Figure 3: simplified AB_cell and UV_cell for p = 2. Recoverable labels: inputs num(x), den(x), f(x), a(x), u(x), a(x)/x, b(x)/x, u(x)/x, v(x)/x, f(x)/x; control signals oper_a and first_step; outputs new_a(x), new_b(x), new_u(x), new_v(x).]

Fig. 3. AB_cell and UV_cell (p = 2)

3.3. Divider over $GF(p)$

A modular divider, based on the method proposed in [10], has also been synthesized. Its cost is about

$c_3 = 13\log_2 p = 13\log_2 N$ LUTs, (3)

where $N = p$ is the number of field elements. The delays of the AB_cell and UV_cell blocks are mainly defined by the delays of the $(\log_2 p)$-bit ripple adders, which are a linear function of $\log_2 p$. Thus, the total computation time is roughly proportional to the product of $\log_2 p$ by the number of steps. The order of magnitude of the latter is also proportional to $\log_2 p$, so that the computation time is approximately proportional to $(\log_2 p)^2 = (\log_2 N)^2$. Nevertheless, if fixed-segment-length pipelined adders are used [10], the computation time is approximately proportional to the number of steps, that is, to $\log_2 N$.

3.4. Comments and comparison

Some comments are in order:

1 The cost of the control unit is not taken into account. It is assumed to be much smaller than the data path cost.
2 The cost of the registers is not taken into account. Depending on the placement strategy they may or may not be placed into the same slices as the combinational circuits.
3 The cost of the memory which stores the mod p inverses is not taken into account. It is assumed that, whatever the application, some memory blocks will be unused.
4 The delay approximations only take into account the functional unit delays, not the connection delays. As a matter of fact the latter could be even longer than the former, and obviously depend on the circuit size and complexity.

The comparison of (1), (2) and (3) indicates that $GF(p)$ and $GF(2^n)$ generate more cost-effective circuits than $GF(p^n)$ with $p \gg 2$. Furthermore, as regards the total computation time, $GF(2^n)$ is better than $GF(p)$. Nevertheless, the actual complexity and computation time depend on the chosen field (not only on N and p) as well as on the placement and routing of the logic functions. So, in order to draw pertinent conclusions, concrete dividers must be implemented and compared.

4. FPGA IMPLEMENTATIONS

4.1. Dividers over $GF(2^n)$

Dividers over $GF(2^n)$, with n = 64, 128, 160 and 256, have been implemented. For every value of n a particular irreducible polynomial, namely a pentanomial, has been used. In the next tables, period is the minimum clock period and max_time is equal to $(2n) \cdot$ period, i.e. the maximum computation time at minimum clock period. The results for an xv2000e-6 and an xc3s2000-5 device are given in table 1.

Table 1. Dividers over $GF(2^n)$ with fixed f(x): area (slices), period and max_time (ns)

             xv2000e-6                    xc3s2000-5
    n     area  period  max_time     area  period  max_time
    64     282    6.4       819       270    5         640
    128    540    6.6      1689       571    5.4      1382
    160    666    6.7      2144       709    5.7      1824
    256   1046    7.4      3789      1109    5.8      2970

Generic dividers over $GF(2^n)$ have also been implemented. In this case f(x) is an additional input instead of a particular polynomial. The results are given in table 2.

Table 2. Generic dividers over $GF(2^n)$: area (slices), period and max_time (ns)

             xv2000e-6                    xc3s2000-5
    n     area  period  max_time     area  period  max_time
    64     319    6.8       870       348    5.2       666
    128    610    6.9      1766       723    5.6      1434
    160    852    7.1      2272       903    5.6      1792
    256   1190    7.5      3840      1411    5.7      2918

4.2. Dividers over $GF(p)$

The cost and minimum clock period of the best known implementations of the division over $GF(p)$ are shown in table 3 [10]. As before, period is the minimum clock period and max_time is equal to $(2n) \cdot$ period, where $n \cong \log_2 p$.

Table 3. Generic dividers over $GF(p)$: area (slices), period and max_time (ns)

             xv2000e-6                    xc3s2000-4
    n     area  period  max_time     area  period  max_time
    64     420   13        1664       424   13        1664
    128    778   18.2      4659       873   20.8      5325
    160    951   22.2      7104       112   22.7      7264
    256   1457   34.5     17664      1920   34.5     17664
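The max_time column of table 3 follows directly from the definition max_time = (2n).period. As a quick consistency check, the short Python sketch below recomputes it for the xv2000e-6 figures of table 3; the constant and function names are illustrative only, and the data are copied from the table.

```python
# (n, period in ns, listed max_time in ns) for the xv2000e-6 device, table 3
TABLE_3_XV2000E = [
    (64, 13.0, 1664),
    (128, 18.2, 4659),
    (160, 22.2, 7104),
    (256, 34.5, 17664),
]


def max_time_ns(n, period_ns):
    """Worst-case computation time: 2n steps at the minimum clock period."""
    return 2 * n * period_ns


for n, period, listed in TABLE_3_XV2000E:
    # Print n, the recomputed value (rounded to the nearest ns) and the listed value.
    print(n, round(max_time_ns(n, period)), listed)
```

Because the minimum period itself grows with n in these ripple-adder dividers, max_time grows faster than linearly in n, which is the behaviour predicted in section 3.3.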


If pipelined adders, with 32-bit segments, are used, the results are considerably better [10]:

Table 4. Pipelined generic dividers over $GF(p)$: area (slices), period and max_time (ns)

             xv2000e-6                    xc3s2000-4
    n     area  period  max_time     area  period  max_time
    64     460   12        1536       461   12        1536
    128    842   13.3      3405       927   12.7      3251
    160   1022   13        4160      1180   12.8      4096
    256   1612   13        6656      1847   12.5      6400

The results of tables 3 and 4 have been obtained with xc3s2000-4 devices, about 15% slower than the xc3s2000-5. Nevertheless it is quite obvious that, even in the case of the pipelined versions, the difference between dividers over $GF(p)$ and $GF(2^n)$ is significant, especially as regards the computation time.

4.3. Dividers over $GF(239^{17})$

As a last example, a divider over $GF(239^{17})$ has been implemented. The irreducible polynomial is $f(x) = 237 + x^{17}$. The cost, minimum period and maximum computation time at minimum clock period, i.e. $34 \cdot$ period, are given in table 5.

Table 5. Dividers over $GF(239^{17})$: area (slices), period and max_time (ns)

          xv2000e-6                    xc3s2000-5
    area  period  max_time     area  period  max_time
    4016   84.1      2859      4045   64.6      2195

The number N of field elements, i.e. $239^{17}$, is approximately equal to $2^{134}$. So, the results of table 5 should be compared with the results of the preceding tables with n = 128.

5. CONCLUSION

According to the cost and computation time approximations of section 3, and as long as the only requirement is the number N of field elements, the best option is $GF(2^n)$. With $GF(p)$, the computation time is longer if ripple adders are used: $O((\log N)^2)$ instead of $O(\log N)$. With pipelined adders the computation time is $O(\log N)$, but the actual cost and delay (table 4) are greater than in the previous case (tables 1 and 2). With $GF(p^n)$, $p > 2$, both the cost and the computation time have the same order of magnitude as those of a divider over an equivalent binary field ($O(\log N)$). Nevertheless, according to the approximations (1) and (2), if $8\log_2 p \gg 10$, i.e. $p \gg 2$, the binary solution is better. Furthermore, the practical implementations (section 4) indicate that the theoretical approximations give correct orders of magnitude.

6. REFERENCES

[1] K. Araki, I. Fujita, and M. Morisue, "Fast inverter over finite field based on Euclid's algorithm", IEICE Transactions, vol. E72, pp. 1230-1234, November 1989.
[2] R.P. Brent and H.T. Kung, "Systolic arrays for linear time GCD computation", Proceedings of VLSI '83, pp. 145-154, 1983.
[3] H. Brunner, A. Curiger, and M. Hofstetter, "On computing multiplicative inverses in GF(2^m)", IEEE Transactions on Computers, vol. 42, no. 8, pp. 1010-1015, 1993.
[4] J.-P. Deschamps and G. Sutter, "FPGA Implementation of Modular Multipliers", Proceedings of the XVII Design of Circuits and Integrated Systems Conference, pp. 107-112, Santander, November 19-22, 2002.
[5] J.-P. Deschamps, G. Bioul, and G. Sutter, Synthesis of Arithmetic Circuits, Wiley, 2005.
[6] J.H. Guo and C.L. Wang, "Novel digit-serial systolic array implementation on Euclid's algorithm for division in GF(2^m)", Proceedings of the International Symposium on Circuits and Systems, pp. II 478-481, 1998.
[7] T. Itoh and S. Tsujii, "A Fast Algorithm for Computing Multiplicative Inverses in GF(2^m) using Normal Bases", Information and Computation, vol. 78, no. 3, pp. 171-177, September 1988.
[8] M. Joye and P. Paillier, "GCD-free Algorithms for Computing Modular Inverses", Cryptographic Hardware and Embedded Systems - CHES 2003, LNCS 2779, pp. 243-253, Springer-Verlag, 2003.
[9] D.E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 2nd edition, Addison-Wesley, 1981.
[10] G. Meurice de Dormale, Ph. Bulens and J.-J. Quisquater, "Efficient Modular Division Implementation", J. Becker, M. Platzner and S. Vernalde (Editors), Field-Programmable Logic and Applications, Lecture Notes in Computer Science no. 3203, pp. 231-240, Springer-Verlag, 2004.
[11] N. Takagi, "A VLSI Algorithm for Modular Division Based on the Binary GCD Algorithm", IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E81-A, no. 5, pp. 724-728, May 1998.
[12] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, and J.K. Omura, "VLSI architecture for computing multiplications and inverses in GF(2^m)", IEEE Transactions on Computers, vol. C-34, no. 8, pp. 709-717, August 1985.
[13] D. Woodbury, "Elliptic Curve Cryptography on Smart Cards without Coprocessors", IFIP CARDIS, pp. 71-92, 2000.
[14] Ch.-H. Wu, Ch.-M. Wu, M.-D. Shieh, and Y.-T. Hwang, "Novel Algorithms and VLSI Design for Division over GF(2^m)", IEICE Transactions Fundamentals, vol. E85-A, no. 5, pp. 1129-1139, May 2002.
