2009 International Conference on Reconfigurable Computing and FPGAs

Effects of Simplistic Online Synthesis for AMIDAR Processors

Stefan Döbrich
Dresden University of Technology
Chair for Embedded Systems
Dresden, Germany
[email protected]

Christian Hochberger
Dresden University of Technology
Chair for Embedded Systems
Dresden, Germany
[email protected]

Abstract—Future chip technologies will change the way we deal with hardware design: (1) logic resources will be available in vast amounts, and (2) engineering specialized designs for particular applications will no longer be the general approach, as the non-recurring expenses will grow tremendously. Thus, we believe that online synthesis, taking place during the execution of an application, is one way to overcome these problems. In this paper we show that even a relatively simplistic synthesis approach can have a strong impact on the performance of compute-intensive applications.

Keywords-online synthesis; dynamic reconfiguration;

I. INTRODUCTION

Following the road of Moore's law, the number of transistors on a chip doubles every 24 months. After being valid for more than 40 years, the end of Moore's law has been forecast many times now. Yet, technological advances have kept the progress intact. Further shrinking of the feature size of traditionally manufactured chips will give us two challenges:
1) Mask costs increase exponentially. This makes it prohibitively expensive to produce small quantities of chips for one particular design.
2) We will have vast numbers of logic resources at hand even for relatively small chip sizes. How can we use these logic resources without building individual designs for each application?
Reconfigurable logic in different granularities has been proposed to solve both problems [14]. It allows us to build large quantities of chips and yet use them individually. Field programmable gate arrays (FPGAs) have been in use for this purpose for more than two decades. Yet, it requires much expert knowledge to implement applications or parts of them on an FPGA. Also, reconfiguring FPGAs takes a lot of time due to the large amount of configuration information. Coarse Grain Reconfigurable Arrays (CGRAs) try to solve this last problem by working on word level instead of bit level. The amount of configuration information is dramatically reduced, and programming such architectures can be considered more software-like. The problem with CGRAs is typically the tool situation. Currently available tools require an adaptation of the source code and typically have very high runtimes, so that they need to be run by experts and only for very few selected applications.
Our approach tries to make the reconfigurable resources available for all applications. Thus, synthesis of accelerating circuits takes place during application execution. No hand-crafted adaptation of the source code shall be required. In this contribution we want to show that even a relatively simple approach to online circuit synthesis can achieve substantial application acceleration.

A. Related Work

Static transformation from high-level languages into fine-grain reconfigurable logic has been researched by a number of academic and commercial research groups. Only very few of them support the full programming language [8]. Static transformation from high-level languages into coarse-grain reconfigurable logic is also investigated by several groups. The DRESC [11] tool chain targeting the ADRES [10] architecture is one of the most advanced tools. Yet, it requires hand-written annotations to the source code and in some cases even some hand-crafted rewriting of the source code. Also, the compilation times easily get into the range of days. Dynamic transformation from software to hardware has already been investigated by other researchers. Warp processors dynamically transform assembly instruction sequences into fine-grain reconfigurable logic [9]. Yet, only very short basic blocks are taken into consideration, delivering only very limited application speedups.

B. Paper Outline

In the following chapter we present the model of our processor, which allows an easy integration of synthesized functional units at runtime. In chapter 3 we detail how we determine the performance-sensitive parts of the application by means of profiling. Chapter 4 explains our online synthesis approach. Results for some benchmark applications are presented in chapter 5. Finally, we give a short conclusion and an outlook on future work.

II. THE AMIDAR PROCESSING MODEL

A. General Model

Execution of instructions in AMIDAR processors differs from other execution schemes. Neither microprogramming nor pipelining is used to execute instructions. Instead, instructions are broken down into a set of tokens which are distributed to a set of functional units (FUs). These tokens carry the information about the type of operation that shall be executed, the version information of the input data that shall be processed and the destination of the result. Figure 1 sketches the structure of an AMIDAR processor. The token generator receives an instruction over the data bus and computes a set of tokens that composes the semantics of the instruction. This set of tokens is distributed to the FUs over a dedicated token distribution network. The token generator can be built such that every FU which shall receive a token is able to receive it in one clock cycle. Tokens which do not require input data can be executed immediately. Otherwise, FUs wait for input data that carries the appropriate tag information. Once the right data is available, the operation starts. Upon completion of the operation, the result is sent to the FU that was denoted in the token. Eventually, one of the tokens must trigger the transport of the next instruction to the token generator. Figure 2 shows a possible structure for the execution of Java bytecode on an AMIDAR processor. A more detailed explanation of the model, its application to Java bytecode execution and its specific features can be found in [4].

Figure 1. Abstract Model of an AMIDAR Processor (functional units FU 1 ... FU n and the token generator, connected by a communication structure and a token distribution network)

Figure 2. Model of a Java Virtual Machine on AMIDAR Basis (FUs: code memory, object heap, local variables, operand stack, method stack, jump unit, IALU, FALU; connected to the token generator by a communication structure and a token distribution network)

B. Adaptivity in the AMIDAR Model

The AMIDAR model exposes different types of adaptivity. Firstly, the communication structure can be adapted to minimize the bus conflicts that occur during the data transports between the FUs. In [7] we have shown how to identify the conflicting bus taps, and we have also shown a heuristic to modify the bus structure to minimize the conflicts. Also, in [4] we have shown that the characteristics of the FUs can be changed to optimally suit the needs of the running application. FUs can either be latency optimized or throughput optimized. Finally, we can augment the processor with some specialized FUs that implement often used code sequences in a highly optimized way. In [5] we have presented mechanisms to identify the right code sequences as candidates for a HW implementation. Also, we have shown that HW implementations of relevant code sequences can lead to substantial application speedup (or power savings, vice versa) [6].

C. Synthesizing Functional Units in AMIDAR

AMIDAR processors need to include some reconfigurable fabric in order to allow the dynamic synthesis and inclusion of FUs. Since fine-grained logic (like FPGAs) requires large amounts of configuration data to be computed, and since the fine-grain structure is neither required nor helpful for the implementation of most code sequences, we focus on CGRAs for the inclusion into AMIDAR processors. The model includes many features to support the usage of newly synthesized FUs in the running application. It allows bulk data transfers from and to data memories, it allows the token generator to synchronize with FU operations that take multiple clock cycles, and finally it allows synthesized FUs to inject tokens in order to influence the data transport required for the computation of a code sequence.

III. RUNTIME APPLICATION PROFILING

A major task in synthesizing hardware FUs for AMIDAR processors is runtime application profiling. This allows the identification of candidate instruction sequences for hardware acceleration. Plausible candidates are the runtime-critical parts of the current application. In previous work [5] we have shown a profiling algorithm which generates detailed information about every executed loop structure. Those profiles contain the total number of executed instructions inside the affected loop, the loop's start program counter, its end program counter and the total number of executions of this loop. The algorithm is also capable of profiling nested loops, not only simple ones. A profiled loop structure becomes a synthesis candidate in case its number of executed instructions surmounts a given threshold. The size of this threshold can be configured dynamically for each application. Furthermore, an instruction sequence has to match specific constraints in order to be synthesized. Currently, we are not capable of synthesizing every given instruction sequence. Bytecode containing the following instruction types cannot be synthesized, as our synthesis algorithm has not evolved to this point yet (a sketch of the resulting candidate check is given after the list):
• method invocation operations
• memory allocation operations
• access operations to multi-dimensional arrays
• exception handling
• thread synchronization
• some special instructions, e.g. lookupswitch
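To make the candidate selection more concrete, the following is a minimal Java sketch of the loop-profile bookkeeping and the bytecode filter described above. All class, field and method names are invented for illustration; they are not taken from the actual AMIDAR implementation.

    // Hypothetical sketch of loop profiling and candidate selection.
    import java.util.Set;

    final class LoopProfile {
        final int startPc;               // program counter of the loop entry
        final int endPc;                 // program counter of the loop exit
        long executedInstructions = 0;   // instructions executed inside the loop
        long executions = 0;             // how often the loop was entered

        LoopProfile(int startPc, int endPc) {
            this.startPc = startPc;
            this.endPc = endPc;
        }

        // A loop becomes a synthesis candidate once its executed-instruction
        // count surmounts the (dynamically configurable) threshold.
        boolean isSynthesisCandidate(long threshold) {
            return executedInstructions > threshold;
        }
    }

    final class CandidateFilter {
        // Placeholder set standing in for the unsupported instruction types
        // listed above (invocations, allocations, multi-dimensional array
        // access, exceptions, synchronization, lookupswitch, ...).
        private final Set<Integer> unsupportedOpcodes;

        CandidateFilter(Set<Integer> unsupportedOpcodes) {
            this.unsupportedOpcodes = unsupportedOpcodes;
        }

        // Rejects a loop body if it contains any bytecode the synthesis
        // cannot handle yet.
        boolean isSynthesizable(int[] opcodes) {
            for (int op : opcodes) {
                if (unsupportedOpcodes.contains(op)) {
                    return false;
                }
            }
            return true;
        }
    }

In this sketch, a loop is handed to the synthesis only if both isSynthesisCandidate and isSynthesizable return true.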

Table I. CRYPTOGRAPHY DOMAIN BENCHMARKS
Runtime (wos = without synthesis, wssm = with synthesis and shared memory, wsdm = with synthesis and distributed memory)

                          |      Rijndael      |        Twofish         |      Serpent       |        RC6
                          |  wos   wssm  wsdm  |   wos     wssm   wsdm  |  wos   wssm  wsdm  |  wos   wssm  wsdm
Round Key Generation      | 17760  3135  3135  | 525276   31666  30617  | 44276  5232  5232  | 61723  2260  1994
Single Block Encryption   | 21389  6155  6099  |  12864    8272   8263  | 34855  2673  2673  | 17371  2417  2417
Single Block Decryption   | 21381  6196  6140  |  12911    8287   8278  | 37623  2724  2724  | 17405  2412  2412

IV. ONLINE SYNTHESIS OF APPLICATION SPECIFIC FUNCTIONAL UNITS

The captured data of the profiling unit is evaluated periodically. In case an instruction sequence exceeds the given runtime threshold, the synthesis is triggered. The synthesis runs as a low-priority process concurrently to the application. Thus, it can only occur if spare computing time remains in the system. Also, the synthesis process cannot interfere with the running application.

Firstly, an instruction graph of the given sequence is created. In case an unsupported instruction is detected, the synthesis is cancelled. Furthermore, a marker of a previously synthesized functional unit may be found. If this is the case, it is necessary to restore the original instruction information and then proceed with the synthesis. This may happen if an inner loop has been transferred to hardware before, and then the wrapping loop shall be synthesized as well. Afterwards, all nodes of the graph are scanned for their number of predecessors. In case a node has more than one predecessor, it is necessary to introduce specific Φ-nodes into the graph. Furthermore, the graph is annotated with branching information. This allows the identification of the actually executed branch and the selection of the valid data when merging two or more branches by multiplexers.

In a second step the graph is annotated with a virtual stack. This stack does not contain specific data, but the information about the producer instruction that would have created it. This allows the designation of connection structures between the different instructions, as the predecessor of an instruction may not be the producer of its input. Afterwards, an analysis of access operations to local variables, arrays and objects takes place. This aims at loading data into the functional unit and storing it back to its appropriate memory after the functional unit's execution. Therefore, a list of data that has to be loaded and a list of data that has to be stored is created.

The next step transforms the instruction graph into a hardware circuit. This representation fits precisely into our simulation. All arithmetic and logic operations are transformed into their abstract hardware equivalents. The introduced multiplexer nodes are transformed into multiplexer structures. The annotated branching information helps to connect the different branches correctly and to determine the appropriate control signal. Furthermore, registers and memory structures are introduced. The registers are used to hold values at the beginning and the end of branches in order to synchronize different branches. The memory structures are connected to the consumer/producer components of their corresponding arrays or objects. A datapath equivalent to the instruction sequence is the result of this step.

Arrays and objects may be accessed from different branches that are executed in parallel. Thus, it is necessary to synchronize access to the affected memory regions. Furthermore, only valid values may be stored into arrays or objects. This is realized by special enable signals for all write operations. The access synchronization is realized through a controller synthesis. This step takes the created datapath and all information about timing and dependencies of array and object access operations as input. The synthesis algorithm has a generic interface which allows it to work with different scheduling algorithms. Currently, we have implemented a modified ASAP scheduling which can handle resource constraints, and a list scheduling. The result of this step is a finite state machine which controls the datapath and synchronizes all array and object access operations.

As mentioned above, we do not have a full hardware implementation yet, but use a cycle-accurate simulation. Hence, placement and routing for the CGRA structure are not required, as we can simulate the abstract datapath created in the previous steps. In case the synthesis has been successful, the FU needs to be integrated into the existing processor. If one or more marker instructions of previously synthesized FUs were found, the original instruction sequence has to be restored. Furthermore, the affected SFUs have to be unregistered from the processor and the hardware used by them has to be released for further use.
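The first two graph annotation steps (Φ-node insertion and the virtual stack of producer instructions) can be pictured roughly as in the following Java sketch. All names and data structures are hypothetical and chosen for illustration only; for simplicity the virtual-stack walk assumes a straight-line instruction sequence, whereas the real synthesis uses the branching annotation to handle each path.

    // Hypothetical sketch of Phi-node insertion and virtual-stack annotation.
    import java.util.ArrayList;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    final class GraphNode {
        final int opcode;
        final int pops;    // operands consumed from the operand stack
        final int pushes;  // results pushed onto the operand stack
        final List<GraphNode> predecessors = new ArrayList<>();
        final List<GraphNode> inputs = new ArrayList<>();  // producers of the operands
        int branchId;      // annotated branching information

        GraphNode(int opcode, int pops, int pushes) {
            this.opcode = opcode;
            this.pops = pops;
            this.pushes = pushes;
        }
    }

    final class InstructionGraph {
        // Nodes are kept in instruction order of the profiled sequence.
        final List<GraphNode> nodes = new ArrayList<>();

        // A node with more than one predecessor marks a merge point,
        // so a Phi-node is inserted in front of it.
        void insertPhiNodes() {
            List<GraphNode> phis = new ArrayList<>();
            for (GraphNode n : nodes) {
                if (n.predecessors.size() > 1) {
                    GraphNode phi = new GraphNode(-1, 0, 0); // pseudo opcode
                    phi.predecessors.addAll(n.predecessors);
                    n.predecessors.clear();
                    n.predecessors.add(phi);
                    phis.add(phi);
                }
            }
            nodes.addAll(phis);
        }

        // The virtual stack carries producer nodes instead of values, so each
        // instruction can be wired to the producers of its operands.
        void annotateWithVirtualStack() {
            Deque<GraphNode> virtualStack = new ArrayDeque<>();
            for (GraphNode n : nodes) {
                for (int i = 0; i < n.pops; i++) {
                    n.inputs.add(virtualStack.pop()); // connect to the producer
                }
                for (int i = 0; i < n.pushes; i++) {
                    virtualStack.push(n);             // this node produces a value
                }
            }
        }
    }

The inputs lists collected here correspond to the connection structures between instructions mentioned above; the later datapath step turns them into wires between the abstract hardware operators.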

A. Functional Unit Integration

The integration of the synthesized functional unit (SFU) into the running application consists of three major steps: (1) a token set has to be generated which allows the token generator to use the SFU, (2) the SFU has to be integrated into the existing circuit, and (3) the synthesized code sequence has to be patched in order to access the SFU.

Figure 3. Speedups of the Cryptographic Algorithms (one panel per cipher: Rijndael, Twofish, Serpent and RC6; each panel shows the speedup for round key generation, single block encryption and single block decryption, with synthesis and shared memory vs. with synthesis and distributed memory)

This last step comes with an adjustment of the profiling data which led to the decision to synthesize a FU.

The generated token set must contain all tokens and constant values that are necessary to transfer input data to the SFU, process it and write back possible output data. All tokens are stored in first-in-first-out structures. This handling assures that tokens and constants are distributed and processed in their determined order. The lists of local variables which function as input and output of the SFU can now be used as input data to the token generation algorithm. In a second step it is necessary to create tokens for the control of the SFU itself. One token has to be created in order to trigger the SFU's processing when all input data has arrived. A second token triggers the transfer of output data to its determined receiver. In a last step the output token sequence for the SFU is created. This set of tokens will be distributed to the functional units which will consume the generated output. The sequence must specify all operations that have to be processed in order to write the functional unit's output data to its receiver's corresponding memory position.

In a next step it is necessary to make the SFU known to the circuit components. Firstly, it has to be registered with the bus arbiter. This allows the assignment of bus structures to the SFU. Furthermore, the SFU must be accessible by the token generator. The token generator works on a table which holds a token set for every single instruction. Thus, the token generator needs to know the set of tokens which is necessary to control the SFU. This information is passed to the token generator in special data structures which are characterized by the ID of the SFU. This ID is necessary to distribute the correct set of tokens once the SFU is triggered by the new instruction.

Finally, the SFU needs to be accessed through a bytecode sequence in order to use it. Therefore, it is necessary to patch the affected sequence. The first instruction of the loop is patched to a specific newly introduced bytecode which signals the use of an SFU. The next byte represents the global identification number of the new SFU. It is followed by two bytes representing the offset to the successor instruction, which is used by the token generator to skip the remaining instructions of the loop. In order to be able to revoke the synthesis, it is necessary to store the patched four bytes with the token data. Patching the bytecode must not be done as long as the program counter points to one of the patched locations. Now, the sequence is not processed in software anymore but by a hardware SFU. Thus, it is necessary to adjust the profiling data which led to the decision of synthesizing a FU for this specific code sequence. The profile related to this sequence now has to be deleted, as the sequence itself no longer exists as a software bytecode fragment.
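The patching step can be illustrated with a small, hypothetical Java sketch. The opcode value, the SFU numbering and the flat code-array layout are assumptions made for illustration; they are not taken from the actual implementation.

    // Hypothetical sketch of the four-byte bytecode patch described above.
    // The caller must ensure the program counter does not currently point
    // to one of the patched locations.
    final class SfuPatcher {
        static final int USE_SFU_OPCODE = 0xE0; // assumed value for the new bytecode

        // Returns the four original bytes so that the patch can be revoked later
        // (they are stored together with the token data).
        static byte[] patchLoopEntry(byte[] code, int loopStartPc,
                                     int sfuId, int successorOffset) {
            byte[] original = new byte[4];
            System.arraycopy(code, loopStartPc, original, 0, 4);

            code[loopStartPc]     = (byte) USE_SFU_OPCODE;
            code[loopStartPc + 1] = (byte) sfuId;                    // global SFU id
            code[loopStartPc + 2] = (byte) (successorOffset >> 8);   // offset, high byte
            code[loopStartPc + 3] = (byte) (successorOffset & 0xFF); // offset, low byte
            return original;
        }

        // Revoking the synthesis simply restores the saved bytes.
        static void revoke(byte[] code, int loopStartPc, byte[] original) {
            System.arraycopy(original, 0, code, loopStartPc, 4);
        }
    }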

V. RESULTS

We used applications from two different domains to test our synthesis algorithm. Firstly, we benchmarked several cryptographic ciphers, as the importance of security in embedded systems increases steadily. Secondly, we chose frequency domain transformations as another group of benchmarks. The baseline for all measurements is the program execution without synthesis. Furthermore, there have been two benchmark runs with enabled synthesis and a slightly different setup. The first run with enabled synthesis assumes a shared memory for all objects and arrays inside an SFU. In the second run, a distributed memory architecture has been assumed. This means that every object and every array is stored in a separate memory, which is supposed to decrease the number of read/write conflicts while accessing data. The controllers of the SFUs have been created using list scheduling as the scheduling algorithm, neglecting hardware constraints.

Table II. HARDWARE CONSUMPTION OF SYNTHESIZED FUNCTIONAL UNITS FOR CRYPTOGRAPHIC BENCHMARKS

                          |    Rijndael     |    Twofish      |    Serpent      |      RC6
                          | ops  regs  mux  | ops  regs  mux  | ops  regs  mux  | ops  regs  mux
Round Key Generation      |  85    16   13  | 489   132  117  | 231    63   32  |  40    26   15
Single Block Encryption   | 225    18    9  | 103    18   12  | 451    47   31  |  54    32   14
Single Block Decryption   | 224    17    9  | 103    18   12  | 411    47   31  |  54    32   14

Table III. FREQUENCY DOMAIN BENCHMARKS

          |          IDCT          |            FFT
          |   wos    wssm   wsdm   |    wos     wssm    wsdm
Runtime   | 124881   4599   4052   | 528373    32583   23436

A. Cryptography Algorithms

The group of cryptography benchmarks contains four block ciphers: Rijndael [3], Twofish [13], Serpent [2] and RC6 [12], all of which were part of the Advanced Encryption Standard (AES) evaluation process. All of the given cryptographic algorithms were taken from a framework provided by "The Legion of the Bouncy Castle" [1]. The Bouncy Castle crypto APIs for Java provide a complete TLS/SSL implementation. The functionality of a single algorithm can be split into three parts. First, there is the round key generation out of a given master key. All our benchmarks were run with a 256 bit master key. The second and third parts are the encryption and decryption of a data block. The standard block size for all four algorithms is 16 bytes. As every encrypted communication has its own characteristics, we benchmarked each of the three parts on its own. This allows a detailed look at the synthesis impact, and no side effects influence the results.

First benchmarks with "out-of-the-box" code showed no improvement in speed through the synthesis. Analysis of these runs pointed to a failed synthesis due to unsupported bytecodes: the crucial parts of all ciphers contained either method invocations or accesses to multidimensional arrays. In order to unlock the potential of our algorithm, we inlined the affected methods and flattened the multidimensional arrays to one dimension.

Table I shows the runtime of a single round key generation out of a 256 bit master key, as well as the encryption and decryption of a single 16 byte block. The column named wos shows the runtime without synthesis. The column wssm shows the runtime under usage of an SFU based on a shared memory, while wsdm shows the runtime with synthesis and distributed memory. The corresponding speedups are shown in figure 3.
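As a worked example of how the speedups in figure 3 follow from table I (assuming the speedup is the ratio of the runtime without synthesis to the runtime with synthesis, i.e. the measurement against the stated baseline):

\[
S_{\text{wssm}}^{\text{Rijndael key gen.}} \;=\; \frac{t_{\text{wos}}}{t_{\text{wssm}}} \;=\; \frac{17760}{3135} \;\approx\; 5.7
\]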

B. Frequency Domain Transformations

Another typical application domain in embedded systems is the transformation of data between the spatial and the frequency domain. Two common domain transformations are the Fast Fourier Transformation (FFT) and the Inverse Discrete Cosine Transformation (IDCT). The 2-D IDCT has been implemented as two serialized 1-D IDCT transformations. The results of the IDCT benchmark correspond to the transformation of a single 8x8 block. The FFT transformed a 256 sample signal to the frequency domain. Table III shows the runtimes of the IDCT and FFT in clock cycles. The speedups gained on the IDCT and FFT by using our synthesis algorithm are displayed in figure 4.

Figure 4. Speedups of Frequency Domain Transformations (IDCT and FFT, with synthesis and shared object memory vs. with synthesis and distributed object memory)

C. Hardware Effort

Besides the runtime improvement that has been gained through our synthesis algorithm, it is necessary to discuss the hardware effort. As our synthesis currently does not support reuse of hardware structures inside SFUs, every processed operation allocates exclusive hardware. As mentioned above, only loop structures can be synthesized yet, and the synthesis only considers loops which consume more runtime than a given threshold. This threshold can be exceeded in different ways. Firstly, a loop with a short body can be executed a large number of times. Secondly, a loop may be executed only rarely but have a large body which consumes a lot of runtime. The first is the case in applications such as IDCT, FFT and RC6. The SFUs provide hardware implementations for short but frequently executed code segments. Thus, they consume a small amount of hardware. The critical loops of Twofish, Rijndael and Serpent have a large body. Consequently, they consume a large amount of hardware resources. In order to get a grip on the hardware effort for the SFUs, we measured the consumption of operations, registers and multiplexers for every application. Table II and table IV show the amount of hardware used by our benchmarks.

Table IV. HARDWARE CONSUMPTION OF SYNTHESIZED FUNCTIONAL UNITS FOR FREQUENCY DOMAIN TRANSFORMATION BENCHMARKS

                   |      IDCT       |       FFT
                   | ops  regs  mux  | ops  regs  mux
Single Execution   |  33    35   21  |  28    41   37

D. Discussion

All of our benchmarks increased their speed in case hardware was dynamically synthesized for their runtime-critical parts. The largest improvements were gained by synthesizing FUs for the domain transformation applications. Depending on the current synthesis configuration, speedups from 16 to 31 could be gained. Furthermore, the hardware effort for those FUs is very small. The results for the crypto cipher benchmarks are diverse. The round key generation for all of the four ciphers could be accelerated. The RC6 key generation gained a speedup of about 30, Twofish of 18, and Rijndael was accelerated by a factor of 5. The Serpent cipher's initialization could be slightly sped up by 2.5. The best results in encryption and decryption of data were achieved by the RC6 and Serpent ciphers. They gained speedups from 7 to 14. The Rijndael cipher could also be accelerated by a factor of 3.5. The only benchmark that remained almost steady was the encryption and decryption of data by Twofish. This originates from the communication overhead incurred by synthesized functional units: all arrays and objects used by the SFU have to be transferred to it before the processing can start, and afterwards they have to be transferred back to the heap. This overhead offset the speedup gained by the SFU's hardware execution.
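The Twofish observation can be summarized by a simple, assumed cost model: with \(t_{\text{in}}\) and \(t_{\text{out}}\) denoting the cycles needed to copy the SFU's arrays and objects in and out, and \(t_{\text{SFU}}\) the cycles of the hardware execution itself, the SFU only pays off if

\[
t_{\text{in}} + t_{\text{SFU}} + t_{\text{out}} \;<\; t_{\text{sw}},
\]

where \(t_{\text{sw}}\) is the runtime of the original software loop. For Twofish encryption and decryption the transfer terms consume most of the gain.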

VI. CONCLUSION

Online synthesis of functional units in AMIDAR processors is a major challenge. We have shown an algorithm of low complexity which is capable of synthesizing functional units for authentic, non-trivial applications. Depending on the application's structure and the size of the handled data, we could gain speedups of up to 31 by using our synthesis algorithm. Furthermore, the hardware costs for many of the synthesized functional units have been small. Nonetheless, the known and expected limitations of our algorithm have also been clearly shown. The synthesis could not take effect on the cryptographic ciphers until we had inlined method invocations and flattened multidimensional arrays. Furthermore, the hardware effort for the realization of some functional units has been vast. In summary, we have reached our goal of implementing a plain and working synthesis algorithm for AMIDAR processors, which provided solid speedups and affordable hardware costs for most of our benchmarks.

VII. FUTURE WORK

The full potential of online synthesis in AMIDAR processors has not been reached yet. Future work will concentrate on implementing access to multidimensional arrays and inlining of invoked methods at synthesis time. Furthermore, we are planning to introduce an abstract description layer to our synthesis. This will allow easier optimization of the algorithm itself and will open up the synthesis for a larger number of instruction sets. Currently, we are able to simulate AMIDAR processors based on different instruction sets, such as LLVM bitcode, .NET Common Intermediate Language and Dalvik executables. In the future we are planning to investigate the differences in the execution of those instruction sets on AMIDAR processors.

REFERENCES

[1] The Legion of the Bouncy Castle. www.bouncycastle.org.
[2] E. Biham, R. J. Anderson, and L. R. Knudsen. Serpent: A new block cipher proposal. In FSE, pages 222–238, 1998.
[3] J. Daemen and V. Rijmen. The Design of Rijndael: AES - The Advanced Encryption Standard. Springer, 2002.
[4] S. Gatzka and C. Hochberger. The AMIDAR class of reconfigurable processors. The Journal of Supercomputing, 32(2):163–181, 2005.
[5] S. Gatzka and C. Hochberger. Hardware based online profiling in AMIDAR processors. In IPDPS, page 144b, 2005.
[6] S. Gatzka and C. Hochberger. On the scope of hardware acceleration of reconfigurable processors in mobile devices. In HICSS, page 299, 2005.
[7] S. Gatzka and C. Hochberger. The organic features of the AMIDAR class of processors. In ARCS, pages 154–166, 2005.
[8] A. Koch and N. Kasprzyk. High-level-language compilation for reconfigurable computers. In ReCoSoC, pages 1–8, 2005.
[9] R. L. Lysecky and F. Vahid. Design and implementation of a MicroBlaze-based warp processor. ACM Trans. Embedded Comput. Syst., 8(3):1–22, 2009.
[10] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins. ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In FPL, pages 61–70, 2003.
[11] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins. Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling. In DATE, pages 10296–10301, 2003.
[12] R. L. Rivest, M. J. B. Robshaw, R. Sidney, and Y. L. Yin. The RC6 block cipher. In First Advanced Encryption Standard (AES) Conference, page 16, 1998.
[13] B. Schneier, J. Kelsey, D. Whiting, D. Wagner, C. Hall, and N. Ferguson. The Twofish Encryption Algorithm: A 128-Bit Block Cipher. John Wiley & Sons, 1999.
[14] S. Vassiliadis and D. Soudris, editors. Fine- and Coarse-Grain Reconfigurable Computing. Springer, 2007.