SNOW 2.0 IP CORE FOR TRUSTED HARDWARE

Wen Hai Fang

Thomas Johansson, Lambert Spaanenburg

Dept. of Information Technology Lund University P.O.Box 118, 221 00 Lund (Sweden) email: [email protected]

Dept. of Information Technology Lund University P.O.Box 118, 221 00 Lund (Sweden) email: [email protected], [email protected]

ABSTRACT

Stream ciphers like Snow 2.0 are very promising techniques for encryption in trusted hardware, but demand specialized IP cores to enhance conventional architectures. The paper describes the design of such a core, which can be adapted to the system needs at a ratio of throughput to effective slice usage of 3.2 to 3.5 Mbps per slice. The footprint is comparable with that of a commercial floating-point unit.

Keywords: Stream cipher encryption, ISA-compliant IP core, Trusted Hardware, Pareto curve, Behavioral Synthesis.

1. INTRODUCTION

Cryptography is rapidly diffusing into society. With the explosive growth of the market for Web-based appliances, one can no longer afford to go unprotected. Cryptography is needed, and it must often be hardware supported to handle product tampering at any time. Trusted hardware will have encrypted data and instruction flows as a first line of defense [1]. However, encryption still counts as overhead and must not take too much of the design and implementation budget.

Traditionally, enciphering has been based on fixed-sized chunks of data, called blocks. A block cipher is a type of symmetric-key encryption algorithm that transforms a fixed-length block of plaintext (unencrypted text) into a block of ciphertext (encrypted text) of the same length. Famous block ciphers like DES or AES need to iterate a round function many times to keep the cipher from being broken by sheer computational power [2]. Unfortunately, the hardware implementation of block ciphers is quite costly. The best block cipher for hardware is probably AES.

Another class of ciphers, stream ciphers, is especially interesting for embedded products. A stream cipher is a type of symmetric encryption algorithm that operates on smaller units of plaintext, usually bits, in a time-varying manner. Well-known examples like A5/1 (used in GSM) or E0 (used in Bluetooth) are not acclaimed for their security [3], but are definitely low in overhead as embedded programs.
The stream cipher Snow was developed at Lund University [4]. The current version, Snow 2.0, has improved security and is one of the two dedicated stream cipher designs in the current draft of ISO/IEC 18033-4 [5]. The fact that SNOW 2.0 is likely to soon become an ISO standard makes implementation aspects of the cipher very interesting. The software performance was recently examined in [6]. Little is known about the algorithmic qualities that make for small hardware. We focus here on the macro diversity on an FPGA, as originally featured in the Xilinx Virtex-II family and even more exploited in the Virtex-4 series, for its impact on a SNOW realization.

The structure of this paper is as follows. In Section 2, the Snow 2.0 algorithm is described. Section 3 presents the design and hardware implementation. Then, in Section 4, comparisons of area and speed are outlined. Finally, some conclusions are drawn.

0-7803-9362-7/05/$20.00 ©2005 IEEE

2. THE SNOW 2.0 ALGORITHM

Snow 2.0 is a symmetric key, word-oriented stream cipher with a word size of 32 bits. It takes two parameters as input values: a secret key of either 128 or 256 bits and a publicly known 128-bit initialization value IV. The cipher generator is based on a length-16 linear feedback shift register (LFSR), feeding a Finite State Machine (FSM) that consists of (a) two 32-bit registers, called R1 and R2, (b) an S-box that performs a permutation analogous to the Rijndael round function [2], and (c) some operations to calculate the output and the next state (the next values of R1 and R2). The FSM has two input words from the LFSR, and the running key is formed as the XOR between the FSM output and the last element of the LFSR. These features are also the main ingredients for the security provision.

The operation of the cipher is as follows. First, a key initialization is performed. This operation provides the LFSR with a starting state and gives the internal FSM registers R1 and R2 their initial values. Next, the cipher is clocked once and the first keystream symbol is read out.
Then the cipher is clocked again and the second keystream symbol is read (see [4] for a detailed description of the Snow algorithm).

In Snow 2.0 there are two different elements involved in the feedback loop, $\alpha$ and $\alpha^{-1}$, where $\alpha$ is a root of a primitive polynomial of degree 4 over $\mathbb{F}_{2^8}$. Let the state of the LFSR at time $t \ge 0$ be denoted $(s_{t+15}, s_{t+14}, \ldots, s_t)$, $s_{t+i} \in \mathbb{F}_{2^{32}}$, $i \ge 0$. The element $s_t$ is the rightmost element (the first element to exit) as indicated in Fig 1, and the sequence produced by the LFSR is $(s_0, s_1, s_2, \ldots)$. Time $t = 0$ means the time instance directly after the key initialization. Then the cipher is clocked once before producing the first keystream symbol, i.e., the first keystream symbol, denoted $z_1$, is produced at time $t = 1$. The produced keystream sequence is denoted $(z_1, z_2, z_3, \ldots)$.

Fig. 1. A schematic picture of SNOW 2.0.

The values of the FSM registers at time $t \ge 0$ are denoted $R1_t$ and $R2_t$ respectively. The input to the FSM is $(s_{t+15}, s_{t+5})$ and the output of the FSM, denoted $F_t$, is calculated as

$$F_t = (s_{t+15} \boxplus R1_t) \oplus R2_t, \quad t \ge 0,$$

and the keystream is given by

$$z_t = F_t \oplus s_t, \quad t \ge 1.$$

Here the notation $\boxplus$ is used for integer addition modulo $2^{32}$ and $\oplus$ for bitwise addition (XOR). The registers R1 and R2 are updated with new values according to

$$R1_{t+1} = s_{t+5} \boxplus R2_t \quad \text{and} \quad R2_{t+1} = S(R1_t), \quad t \ge 0.$$

Snow 2.0 takes two parameters as input values: a secret key of either 128 or 256 bits and a publicly known 128-bit initialization value IV. The IV value is considered as a four-word input $IV = (IV_3, IV_2, IV_1, IV_0)$, where $IV_0$ is the least significant word. The possible range for IV is thus $0 \ldots 2^{128} - 1$. This means that for a given secret key $K$, Snow 2.0 implements a pseudo-random length-increasing function from the set of IV values to the set of possible output sequences.

The key initialization is done as follows. Denote the registers in the LFSR by $(s_{15}, s_{14}, \ldots, s_0)$ from left to right in Fig 1. Thus, $s_{15}$ corresponds to the element holding $s_{t+15}$ during normal operation of the cipher. Let the secret key be denoted by $K = (k_3, k_2, k_1, k_0)$ in the 128 bit case, where each $k_i$ is a word and $k_0$ is the least significant word. First, the shift register is initialized with $K$ and $IV$ according to

$$s_{15} = k_3 \oplus IV_0, \quad s_{14} = k_2, \quad s_{13} = k_1, \quad s_{12} = k_0 \oplus IV_1,$$
$$s_{11} = k_3 \oplus \mathbf{1}, \quad s_{10} = k_2 \oplus \mathbf{1} \oplus IV_2, \quad s_9 = k_1 \oplus \mathbf{1} \oplus IV_3, \quad s_8 = k_0 \oplus \mathbf{1},$$
$$s_7 = k_3, \quad s_6 = k_2, \quad s_5 = k_1, \quad s_4 = k_0,$$
$$s_3 = k_3 \oplus \mathbf{1}, \quad s_2 = k_2 \oplus \mathbf{1}, \quad s_1 = k_1 \oplus \mathbf{1}, \quad s_0 = k_0 \oplus \mathbf{1},$$

where $\mathbf{1}$ denotes the all-one vector (32 bits).

Fig. 2. Cipher operation during key initialisation.

After the LFSR has been initialized, R1 and R2 are both set to zero. Now, the cipher is clocked 32 times without producing any output symbol. Instead, the output of the FSM is incorporated in the feedback loop (Fig 2). Thus, during the 32 clocks of the key initialization, the next element to be inserted into the LFSR is given by

$$s_{t+16} = \alpha^{-1} s_{t+11} \oplus s_{t+2} \oplus \alpha s_t \oplus F_t.$$

After the 32 clocks the cipher turns back to normal operation (Fig 1) and is clocked once more before the first keystream symbol is produced.

The S-box, denoted by $S[w]$, is a permutation on $\mathbb{Z}_{2^{32}}$ based on the round function of Rijndael. Let $w = (w_3, w_2, w_1, w_0)$ be the input to the S-box, where $w_i$, $i = 0 \ldots 3$, are the four bytes of $w$. Assume $w_3$ to be the most significant byte. Let $w = (w_3, w_2, w_1, w_0)^T$ be a vector representation of the input to the S-box. First the Rijndael S-box, denoted by $S_R$, is applied to each byte, which gives the vector $(S_R[w_3], S_R[w_2], S_R[w_1], S_R[w_0])^T$. In the MixColumn transformation of Rijndael's round function, each 4-byte word is considered a polynomial in $y$ over $\mathbb{F}_{2^8}$, where $\mathbb{F}_{2^8}$ is defined by the irreducible polynomial $x^8 + x^4 + x^3 + x + 1 \in \mathbb{F}_2[x]$. Each word can be represented by a polynomial of at most degree 3. The vector above, as representing a polynomial over $\mathbb{F}_{2^8}$, is multiplied with the fixed polynomial $c(y) = (x+1)y^3 + y^2 + y + x \in \mathbb{F}_{2^8}[y]$ modulo $y^4 + 1 \in \mathbb{F}_{2^8}[y]$. This polynomial multiplication can (as done in Rijndael) be computed as the matrix multiplication

$$\begin{pmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \end{pmatrix} = \begin{pmatrix} x & x+1 & 1 & 1 \\ 1 & x & x+1 & 1 \\ 1 & 1 & x & x+1 \\ x+1 & 1 & 1 & x \end{pmatrix} \begin{pmatrix} S_R[w_0] \\ S_R[w_1] \\ S_R[w_2] \\ S_R[w_3] \end{pmatrix},$$

where $(r_3, r_2, r_1, r_0)$ are the output bytes from the S-box. These bytes are concatenated to form the word output from the S-box, $r = S[w]$.

Finally, in encryption the produced output sequence, called running key in Fig 1, is added bitwise to the plaintext sequence. The result is the ciphertext sequence. In decryption, the same operation is done.
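The FSM update, the S-box and the 128 bit key loading described in this section can be illustrated with a short software model. The following is a minimal sketch, not the hardware design of this paper: all function names are hypothetical, the Rijndael S-box is regenerated from its standard definition (field inversion plus affine map) rather than taken from a table, and the mapping of matrix rows to output bytes follows the matrix equation given above.

```python
MASK32 = 0xFFFFFFFF  # words are 32 bits wide


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rijndael_sbox_byte(v):
    """Rijndael S-box S_R: field inversion followed by the affine map."""
    inv = 0 if v == 0 else next(c for c in range(1, 256) if gf_mul(v, c) == 1)
    out = 0x63
    for shift in range(5):  # XOR inv with its left rotations by 0..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SR = [rijndael_sbox_byte(v) for v in range(256)]


def snow_sbox(w):
    """S[w]: S_R on each byte, then the MixColumn matrix (x = 2, x+1 = 3)."""
    b = [SR[(w >> (8 * i)) & 0xFF] for i in range(4)]  # b[i] = S_R[w_i]
    r0 = gf_mul(2, b[0]) ^ gf_mul(3, b[1]) ^ b[2] ^ b[3]
    r1 = b[0] ^ gf_mul(2, b[1]) ^ gf_mul(3, b[2]) ^ b[3]
    r2 = b[0] ^ b[1] ^ gf_mul(2, b[2]) ^ gf_mul(3, b[3])
    r3 = gf_mul(3, b[0]) ^ b[1] ^ b[2] ^ gf_mul(2, b[3])
    return (r3 << 24) | (r2 << 16) | (r1 << 8) | r0


def fsm_step(s15, s5, r1, r2):
    """Return (F_t, R1_{t+1}, R2_{t+1}) from s_{t+15}, s_{t+5}, R1_t, R2_t."""
    f = ((s15 + r1) & MASK32) ^ r2
    return f, (s5 + r2) & MASK32, snow_sbox(r1)


def load_lfsr_128(key, iv):
    """Initial LFSR contents s[0..15] for K = (k3..k0) and IV = (IV3..IV0)."""
    k3, k2, k1, k0 = key
    iv3, iv2, iv1, iv0 = iv
    ones = MASK32  # the all-one vector
    s = [0] * 16
    s[15], s[14], s[13], s[12] = k3 ^ iv0, k2, k1, k0 ^ iv1
    s[11], s[10] = k3 ^ ones, k2 ^ ones ^ iv2
    s[9], s[8] = k1 ^ ones ^ iv3, k0 ^ ones
    s[7], s[6], s[5], s[4] = k3, k2, k1, k0
    s[3], s[2], s[1], s[0] = k3 ^ ones, k2 ^ ones, k1 ^ ones, k0 ^ ones
    return s
```

A keystream word would then be obtained as `f ^ s[0]`, and a ciphertext word as the XOR of that keystream word with a plaintext word. The sketch leaves out the LFSR feedback and the 32 initialization clocks.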

3. HARDWARE IMPLEMENTATION

To ease the experimentation with various encryption algorithms, we start from a uniform interface (Fig 3). This also simplifies the use of every design variety as IP core, because all have the same functional footprint. From the outside, all designs have an elementary Instruction Set Architecture that allows external data to be entered into specific registers for internal usage without complicated timing requirements. The designs are presented in this section with different amounts of Block SelectRAM, resulting in varying area and speed considerations. For a full length description, one is referred to [7].

Fig. 3. External Interface for FPGA Implementation. (Ports shown: PlainText, Mode, FullText, MasterClock, RESET, CipherStat, Key_In, KeyInstr, Status, CipherText.)

The architecture is fully synchronous. Mode control is a 4-bit signal that differentiates between the various initialization steps required for the operation:
- no-op disables the part;
- key-in allows entering both the secret key and the public IV through port Key_In[0:31], 32 bits each clock cycle, key first and then IV;
- Active sets the encryption in motion;
- Init allows entering the internal initialization.

The parameter settings for the SNOW operation are entered by an 8-bit Instruction Key a7a6a5a4a3a2a1a0, where a7 is the MSB and a0 is the LSB, with the following meaning:
- a2a1a0: Secret Key Size Selection (32, 64, 128, 256 or other choices);
- a4a3: IV Selection (00 for no IV, 01 for a 128 bit IV, or other choices);
- a7a6a5: Plaintext Size (8, 16, 32, 64, 128 or other choices).

The plaintext can be entered on the signal FullText, while the ciphertext will become available on the output together with a signal CipherStat being high. Furthermore, there are some status signals for testing purposes. A number of different stream ciphers can be supported from the same interface. Having the data byte and the instructions organized in nibbles allows for easy adaptation to the individual needs of the embedding processor. For an FPGA this entails in-product re-configuration, a desired feature in case a successful attack appears at some stage in the future.

The internal architecture of the SNOW core is shown in Fig 4 and contains the following elements:
- Key_IV. It receives the secret (Key) and public (IV) keys and sends them to the LFSR;
- LFSR. It provides the key initialization, sends LFSR values out and receives new LFSR values;
- Calcu New_LFSR Value & KeyStream. It calculates new LFSR values and the KeyStream, and sends them out;
- FSM. It calculates the new R1 and R2 values, permutes them and sends them out;
- Encrypt_Addition. It does the actual encryption.

Two plaintexts can be stored. The FullText signal will be high when the plaintext storage is full; otherwise it is low. This guarantees that at each clock cycle there is one plaintext ready for encryption. Further, the latency of the plaintext is 1 clock cycle.

As mentioned above, $\alpha$ is a root of $x^4 + \beta^{23}x^3 + \beta^{245}x^2 + \beta^{48}x + \beta^{239} \in \mathbb{F}_{2^8}[x]$, so the degree reduction of $\alpha$ can be given by $\alpha^4 = \beta^{23}\alpha^3 + \beta^{245}\alpha^2 + \beta^{48}\alpha + \beta^{239}$.
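The bit layout of the 8-bit Instruction Key given in this section lends itself to simple shift-and-mask decoding. The sketch below is only an illustration of that layout: the type and function names are hypothetical, and because the text lists the selectable sizes without giving their numeric codes (apart from the IV codes 00 and 01), the fields are returned as raw codes.

```python
from typing import NamedTuple


class InstructionKey(NamedTuple):
    """Raw field codes of the Instruction Key a7..a0 (names are illustrative)."""
    plaintext_size_code: int  # bits a7 a6 a5
    iv_code: int              # bits a4 a3
    key_size_code: int        # bits a2 a1 a0


def decode_instruction_key(instr: int) -> InstructionKey:
    """Split an 8-bit Instruction Key into its three bit fields."""
    if not 0 <= instr <= 0xFF:
        raise ValueError("Instruction Key must fit in 8 bits")
    return InstructionKey(
        plaintext_size_code=(instr >> 5) & 0b111,
        iv_code=(instr >> 3) & 0b11,
        key_size_code=instr & 0b111,
    )


def uses_128_bit_iv(fields: InstructionKey) -> bool:
    """IV Selection: 01 selects a 128 bit IV, 00 selects no IV."""
    return fields.iv_code == 0b01
```

For example, `decode_instruction_key(0b101_01_011)` yields the raw codes 5, 1 and 3 for the plaintext size, IV selection and key size fields.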

Fig. 4. Internal structure for FPGA Implementation. (Blocks shown: Key_IV, LFSR, Calcu New_LFSR Value & KeyStream, FSM, Encrypt_Addition; inputs Key & IV and PlainText; output CipherText.)

In the feedback loop, multiplication with $\alpha$ and $\alpha^{-1}$ can be implemented as a simple byte shift plus an additional XOR with one of 256 possible patterns [4], using a look-up table stored in the Block SelectRAMs or in the Distributed ROMs of the FPGA. This choice leads to the design variations that we pursue in this paper. The S-box can also be implemented using look-up tables, like the multiplication with $\alpha$ and $\alpha^{-1}$. So only XOR, bit addition, bit shift and look-up tables are used.
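The byte shift plus table XOR can be made concrete with a small software model. This is a sketch under stated assumptions rather than the implementation used in the designs: the field polynomial for $\beta$, $x^8 + x^7 + x^5 + x^3 + 1$, is taken from the SNOW 2.0 specification and is not stated in this paper, the coefficients of $\alpha^{-1}$ are derived here from the degree reduction of $\alpha$ (using $\beta^{255} = 1$), and all names are hypothetical.

```python
BETA_POLY = 0x1A9  # x^8 + x^7 + x^5 + x^3 + 1 (assumption, from the SNOW 2.0 spec)


def gf256_mul(a, b):
    """Multiply two bytes in the field F_2^8 generated by beta."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= BETA_POLY
        b >>= 1
    return result


def beta_pow(exponent):
    """Return beta**exponent as a byte (beta itself is the byte 0x02)."""
    value = 1
    for _ in range(exponent % 255):
        value = gf256_mul(value, 2)
    return value


def build_table(exponents):
    """256 patterns: entry c holds the four bytes c * beta**e, first e on top."""
    coeffs = [beta_pow(e) for e in exponents]
    table = []
    for c in range(256):
        word = 0
        for coeff in coeffs:
            word = (word << 8) | gf256_mul(c, coeff)
        table.append(word)
    return table


# alpha^4 = beta^23 alpha^3 + beta^245 alpha^2 + beta^48 alpha + beta^239
MUL_ALPHA = build_table((23, 245, 48, 239))
# alpha^-1 = beta^16 alpha^3 + beta^39 alpha^2 + beta^6 alpha + beta^64 (derived)
MUL_ALPHA_INV = build_table((16, 39, 6, 64))


def mul_alpha(w):
    """alpha * w: shift left by one byte, XOR the pattern for the top byte."""
    return ((w << 8) & 0xFFFFFFFF) ^ MUL_ALPHA[w >> 24]


def mul_alpha_inv(w):
    """alpha^-1 * w: shift right by one byte, XOR the pattern for the low byte."""
    return (w >> 8) ^ MUL_ALPHA_INV[w & 0xFF]
```

Each table has 256 entries of 32 bits, which fits the statement that one of 256 possible patterns is selected. Multiplying by $\alpha$ and then by $\alpha^{-1}$ returns the original word, which gives a quick consistency check of the two tables.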


The purpose of this work is to get a high performance Snow 2.0 implementation on a Xilinx FPGA, mainly on the Virtex-II Pro XC2VP2 or XC2VP30 and the Virtex-4 XC4VLX15 or XC4VSX25. This adds even more variation to the basic balance between Distributed ROM and Block SelectRAM that is studied in this paper.

Fig. 5. Implementation of Snow 2.0, design A, B and C. (Blocks shown: Key_IV, Key_Initialisation, LFSR1, LFSR2, LFSR3, LFSR Value Output, Calcu New LFSR Value, Calcu Keystream, Alpha_Invalpha, sbox12, sbox34, Calcu New R1 & R2 Value, Encrypt_Addition and three FIFOs; inputs Key & IV and PlainText; output CipherText.)

Going from architecture to implementation, a number of the elements that appear in Fig 4 are further detailed. The following substitutions are made to arrive at the implementation shown in Fig 5:
- LFSR: replaced by Key_Initialisation, LFSR1, LFSR2, LFSR3 and LFSR Value Output;
- Calcu New_LFSR Value & KeyStream: replaced by 2 FIFOs, Alpha_Invalpha, Calcu New LFSR Value and Calcu Keystream;
- FSM: replaced by Sbox12, Sbox34, Calcu New R1 & R2 Value and a FIFO.

The first attempt (variation A) makes full use of the Block SelectRAMs available in the Xilinx Virtex-II family. They are configured on instantiation as Dual-Port RAMs, allowing for one read and two write accesses simultaneously. From Fig 1 it can be found that 5 LFSR values ($s_t$, $s_{t+2}$, $s_{t+5}$, $s_{t+11}$ and $s_{t+15}$) are used each time to calculate the new LFSR value, R1 and the Keystream. So 3 Block SelectRAMs are used for LFSR1, LFSR2 and LFSR3 to send these 5 LFSR values out to different blocks simultaneously. Other RAMs are used for initialization purposes, leading to a grand total of 7 Block SelectRAMs for design A (Table 1).

Table 1. Block SelectRAMs with 512 words of 36 bits and their use in Design A.

Name               Block SelectRAMs   Words used
Keyinitialization  1                  20
LFSR1              1                  16
LFSR2              1                  16
LFSR3              1                  16
Alpha_Invalpha     1                  512
sbox12             1                  512
sbox34             1                  512

In the next variations, we decrease the usage of Block SelectRAMs and apply Distributed ROMs instead. First we do this for the small blocks sbox12 and sbox34. This decreases the RAM count to 5 and is called Design B. Next, even more small blocks such as Alpha_Invalpha are treated this way (Design C), leaving 4 Block SelectRAMs. Finally, in Design D, no Block SelectRAMs are used anymore. The block diagram of Design D is pictured in Fig 6. The difference between Fig 5 and Fig 6 is that 4 small blocks in the former (Key_Initialisation, LFSR1, LFSR2 and LFSR3) are combined into a single small block Key_Initialisation & Value Out, so that no Block SelectRAMs are in use.

Fig. 6. Diagram of Snow 2.0 Design D. (Blocks shown: Key_IV, Key_Initialisation & Value Out, LFSR Value Output, Calcu New LFSR Value, Calcu Keystream, Alpha_Invalpha, sbox12, sbox34, Calcu New R1 & R2 Value, Encrypt_Addition and three FIFOs; inputs Key & IV and PlainText; output CipherText.)

Not only are both R1 and R2 used to calculate the output of the FSM, but they are also related to each other. Therefore it is necessary to re-arrange the sequence of the outputs of R1 and R2 to optimize the performance of the whole design. The diagram of calculating R1 and R2 is shown in Fig 7.

Fig. 7. Calculation of R1 and R2. (Labels shown: S5, R1, R2, Sbox12 LUT, Sbox34 LUT; address inputs '0'&R1[7:0], '1'&R1[15:8], '0'&R1[23:16], '1'&R1[31:24]; outputs Sbox1_out, Sbox2_out, Sbox3_out, Sbox4_out.)

Sbox12 and Sbox34 are two look-up tables using two dual-port Block SelectRAMs or Distributed ROMs. The address port of each look-up table is 9 bits wide, so that it can hold 512 values. From Fig 7 it can be found that calculating a new R2 value needs 2 clock cycles: 1 clock cycle for the value output from the look-up table and 1 clock cycle for the bit addition.

But only 1 clock cycle is needed for calculating the new R1 value, which adds S5 to R2. The requirement of high throughput means that R1 and R2 values must be ready in each clock cycle for calculating the output of the FSM. According to Snow 2.0, after the LFSR has been initialized, R1 and R2 are both set to zero. This means that the first new R1 value is equal to S5. As R1 is equal to zero at first, the values of Sbox1_out, Sbox2_out, Sbox3_out and Sbox4_out can be obtained at the beginning. When calculating the new R1 and R2 values, only a new R1 value is calculated in the first clock cycle. At that time, the look-up tables have both inputs and outputs. From the second clock cycle on, new R1 and R2 values are calculated simultaneously. As the new R2 value is one clock cycle late compared to the new R1 value, a FIFO is added to the output of the R1 value. Consequently, new R1 and R2 values reach the small block 'Calcu Keystream' at each clock cycle. Also, there is one Keystream output at each clock cycle for the encrypt addition.

The implementation uses VHDL as design language; Synplify 7.7.1 is applied for synthesis. Place & route is performed by Xilinx ISE 6.3.03i. The designs have been targeted on the Xilinx FPGAs Virtex-II Pro XC2VP2 and XC2VP30 and Virtex-4 XC4VLX15 and XC4VSX25. Some designs are converted into bit files and downloaded to the Memec Virtex-II Pro FF1152 development board for testing purposes. The development board can use a 100 MHz LVTTL oscillator or a programmable LVDS Clock Source that runs from 25 to 700 MHz as the clock source. Both are used in the tests. The frequency is set to 225 MHz when the programmable LVDS Clock Source is used. The Xilinx ChipScope Pro 6.3.03i is used to test these designs [8].
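The Sbox12 and Sbox34 look-up scheme of Fig 7 can be modelled in software as follows. This is a sketch built on assumptions that the paper does not spell out: each 512-entry table is taken to hold two 256-entry tables of 32-bit words, selected by the leading '0' or '1' address bit, and each entry is taken to be the precomputed contribution of one input byte to the S-box output, so that the new R2 value is the XOR of the four table outputs. All names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rijndael_sbox_byte(v):
    """Rijndael S-box S_R: field inversion followed by the affine map."""
    inv = 0 if v == 0 else next(c for c in range(1, 256) if gf_mul(v, c) == 1)
    out = 0x63
    for shift in range(5):  # XOR inv with its left rotations by 0..4 bits
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


SR = [rijndael_sbox_byte(v) for v in range(256)]

# MixColumn matrix with x = 2 and x+1 = 3; row j produces output byte r_j.
MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]


def byte_table(i):
    """Contribution of input byte w_i to the 32-bit S-box output, per value."""
    return [sum(gf_mul(MIX[j][i], SR[b]) << (8 * j) for j in range(4))
            for b in range(256)]


# Two 256-entry tables per memory; the 9th address bit selects the table.
SBOX12 = byte_table(0) + byte_table(1)
SBOX34 = byte_table(2) + byte_table(3)


def next_r2(r1):
    """New R2 = S(R1) as the XOR of four look-ups with 9-bit addresses."""
    sbox1_out = SBOX12[0x000 | (r1 & 0xFF)]          # '0' & R1[7:0]
    sbox2_out = SBOX12[0x100 | ((r1 >> 8) & 0xFF)]   # '1' & R1[15:8]
    sbox3_out = SBOX34[0x000 | ((r1 >> 16) & 0xFF)]  # '0' & R1[23:16]
    sbox4_out = SBOX34[0x100 | ((r1 >> 24) & 0xFF)]  # '1' & R1[31:24]
    return sbox1_out ^ sbox2_out ^ sbox3_out ^ sbox4_out


def sbox_direct(w):
    """Reference: apply S_R per byte, then the MixColumn matrix directly."""
    b = [SR[(w >> (8 * i)) & 0xFF] for i in range(4)]
    return sum((gf_mul(MIX[j][0], b[0]) ^ gf_mul(MIX[j][1], b[1]) ^
                gf_mul(MIX[j][2], b[2]) ^ gf_mul(MIX[j][3], b[3])) << (8 * j)
               for j in range(4))
```

In this model the two reads per table correspond to the two ports of a dual-port memory, and the final XOR corresponds to the second clock cycle mentioned for the R2 calculation.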

4. DISCUSSION

In the following we give a few snapshots of the realized results obtained from ChipScope Pro 6.3.03i. The output of design A, B and C for a 128 bit secret key = 80000000000000000000000000000000 and a 128 bit public key (IV3, IV2, IV1, IV0) = (0,0,0,0) is shown in Fig 8a, while a 256 bit secret key = AAA…AAA (total 64 A) and a 128 bit public key (IV3, IV2, IV1, IV0) = (4,3,2,1) give Fig 8b. Similarly, we find for Design D with a 128 bit secret key = 80000000000000000000000000000000 and a 128 bit public key (IV3, IV2, IV1, IV0) = (0,0,0,0) the output presented in Fig 9a, while a 256 bit secret key = AAA…AAA (total 64 A) and a 128 bit public key (IV3, IV2, IV1, IV0) = (4,3,2,1) give Fig 9b. It can be seen that there is one ciphertext output in each clock cycle after the initialization, but the clock cycle of the first ciphertext output is different. Design A and B are implemented on both XC2VP2 and XC2VP30, but design C and D are implemented only on XC2VP30 because they use more slices than are available on XC2VP2.

Fig. 8. Design B waveforms from ChipScope, panels (a) and (b).

Fig. 9. Design D waveforms from ChipScope, panels (a) and (b).

Recently Xilinx announced its newest product, the Virtex-4 FPGAs, which contain three families (platforms): LX, FX and SX. Virtex-4 LX targets high-performance logic applications, Virtex-4 FX is a high-performance, full-featured solution for embedded platform applications, and Virtex-4 SX is a high-performance solution for Digital Signal Processing (DSP) applications. To evaluate their impact, we have also implemented Design A with 7 Block SelectRAMs on the XC4VSX25 and Design D with no Block SelectRAMs on the XC4VLX15.

The overall results are listed in Table 2. It shows that the Virtex-4 LX gives the highest output frequency (252.4 MHz), while the SX device gives the highest frequency for the small area solutions. Overall, the XC2VP2 gives the smallest area, but in Design B it occupies 99% of the available slices and cannot achieve a higher output frequency because of the limited slice count.

More interesting is the compound qualification by means of the Throughput versus Area ratio, whereby area is expressed in slices. At first glance, Design A using the XC2VP2 seems superior even to the newer devices. However, the effect of the Block SelectRAM usage must be taken into account. From Table 2 it can be inferred that a fully filled Block SelectRAM is equivalent to 256 slices, and not 128 as reported in [9]. Taking this into consideration, we find that the Virtex-4 actually offers an improvement. The factor Throughput/Slices lies between 3.2 and 3.5, with the newer device at the high end and the older devices at the low end of the spectrum. The metrical variation is (at least partly) due to the structural variations between the designs (see Section 3).

5. CONCLUSIONS


Table 2. Final Results (this does not include the resources used by ChipScope). Throughput is defined as Blocksize * Frequency, where the Snow 2.0 block size is 32.

Design  Device    Freq (MHz)  Throughput (Mbps)  Slices (BRAMs)  Throughput/Slice (Mbps/slice)  Initialization clock cycles
A       XC2VP2    225.3       7209               884 (7)         8.15 (3.30)                    68
A       XC2VP30   235.4       7532               959 (7)         7.85 (3.33)                    68
A       XC4VSX25  247.4       7916               953 (7)         8.30 (3.51)                    68
B       XC2VP2    218.3       6985               1406 (5)        4.96 (3.21)                    68
B       XC2VP30   241.6       7731               1495 (5)        5.17 (3.41)                    68
C       XC2VP30   234.0       7488               1716 (4)        4.36 (3.36)                    68
D       XC2VP30   236.6       7571               2262 (0)        3.35 (3.35)                    49 (128 bit key), 53 (256 bit key)
D       XC4VLX15  252.4       8076               2359 (0)        3.42 (3.42)                    49 (128 bit key), 53 (256 bit key)
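The throughput column of Table 2 follows directly from the definition Blocksize * Frequency. The short calculation below is a sketch that reproduces the throughput and the plain Throughput/Slice figure for two rows. The function names are hypothetical, and truncating to whole Mbps is an assumption that happens to match the tabulated values. The bracketed ratios, which account for Block SelectRAM usage, are not recomputed here.

```python
WORD_BITS = 32  # Snow 2.0 delivers one 32-bit keystream word per clock cycle


def throughput_mbps(freq_mhz):
    """Throughput = block size * clock frequency, truncated to whole Mbps."""
    return int(WORD_BITS * freq_mhz)


def mbps_per_slice(freq_mhz, slices):
    """Plain throughput per slice, ignoring any Block SelectRAM usage."""
    return throughput_mbps(freq_mhz) / slices


# (design, device, frequency in MHz, slices) taken from Table 2
rows = [("A", "XC2VP2", 225.3, 884), ("D", "XC4VLX15", 252.4, 2359)]
for design, device, freq, slices in rows:
    ratio = mbps_per_slice(freq, slices)
    print(design, device, throughput_mbps(freq), f"{ratio:.2f}")
```

For Design D, which uses no Block SelectRAMs, the plain ratio and the bracketed ratio in the table coincide.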


The original SNOW 2.0 implementation in software can be taken as a reference. According to [4], the encryption algorithm takes 937 clock cycles of latency and 38 cycles of production on a Pentium 4 workstation running at 1.8 GHz with 512 Mbyte of memory. This makes the hardware 2.5 times faster at a 10 times lower clock rate, making for an overall improvement by a factor of 25. As no other papers related to the hardware implementation of Snow 2.0 can be found, no comparison can be made here. The latest high-speed implementation of the block cipher AES can serve as a reference instead: it uses a Xilinx XCV1000E-8 FPGA to reach a frequency of 168.4 MHz at 21,566 Mbps throughput and occupies 11022 slices with no Block SelectRAMs, achieving 1.956 Mbps/slice [10]. The provisional conclusion here is that a stream cipher is a factor 10 better than a block cipher.

Four Snow 2.0 designs, considering area and speed and targeted on different Virtex families, are presented. This allows drawing the trade-off curve for the design, which shows the balance between speed and area in dependence on the amount of Block SelectRAM usage. The designs seem to fit nicely to a Pareto curve [11], characterized by a factor Throughput/Slices of about 3.3. Their footprint is roughly the same as that of a commercial floating-point unit [12], ranging from about 900 to 2400 slices at a speed ranging from 225 to 250 MHz. This makes the stream cipher core suitable for inclusion in modern trusted systems.

Fig. 10. Pareto curve for the SNOW 2.0 implementation. (Axes: area in slices versus execution speed in nsec; curves labelled SNOW 2.0 and Pareto 3.5.)

6. REFERENCES

[1] S. Smith, “Fairy dust, secrets, and the real world”, IEEE Security and Privacy, Vol. 1, No. 1, 2003, pp. 89-93.

[2] J. Daemen and V. Rijmen, “The design of Rijndael”, Springer Verlag, 2002.

[3] P. Ekdahl and T. Johansson, “Some results on correlations in the Bluetooth stream cipher”, Proc. 10th Joint Conference on Communications and Coding, Obertauern, Austria, 2000.

[4] P. Ekdahl, “On LFSR based Stream Ciphers Analysis and Design”, Ph.D. Thesis, Dept. of Information Technology, Lund University, 2003.

[5] Chr. J. Mitchell and A. W. Dent, “International standards for stream ciphers: A Progress report”, Proc. State of the Art of Stream Ciphers workshop, Brugge (Belgium), 2004, pp. 121-128.

[6] Mitsuru Matsui and Sayaka Fukuda, “How to Maximize Software Performance of Symmetric Primitives on Pentium III and 4 Processors”, 12th Fast Software Encryption Workshop, Preproceedings, pp. 411-423.

[7] WenHai Fang, “A Hardware Implementation for Stream Encryption Snow 2.0”, M. Sc. Thesis, Dept. of Information Technology, Lund University, 2005.

[8] Brent Przybus, “The Need for Speed, Exploit ChipScope Pro to debug your high-performance designs”, Xcell Journal, Issue 50, pp. 54-56.

[9] G.P. Saggese et al., “An FPGA-based performance analysis of the unrolling, tiling and pipelining of the AES algorithm”, pp. 292-302, in: P.Y.K. Cheung et al. (Eds.), Proceedings FPL2003, LNCS 2778, Springer Verlag, 2003.

[10] Xinmiao Zhang and Keshab K. Parhi, “High-Speed VLSI Architecture for the AES Algorithm”, IEEE Transactions on VLSI, Vol. 12, No. 9, 2004, pp. 957-967.

[11] T. Givargis, F. Vahid and J. Henkel, “System-level exploration for Pareto-optimal configurations in parameterized system-on-a-chip”, IEEE Transactions on VLSI, Vol. 10, No. 4, 2002, pp. 416-422.

[12] D. Andersson and Å. Kullenberg, “Soft compilation of a floating-point library”, M. Sc. Thesis, Dept. of Information Technology, Lund University, 2004.