Designing a Runtime Reconfigurable Processor for General Purpose Applications

Adronis Niyonkuru and Hans Christoph Zeidler
Universität der Bundeswehr Hamburg
Holstenhofweg 85, D-22043 Hamburg, Germany
niyonkur, h.ch.zeidler@unibw-hamburg.de

Abstract

A superscalar microprocessor with a variable number of execution units which are dynamically configured during program execution has been modeled. The runtime behaviour of an executed application is determined using a Trace Cache, and the most suitable hardware configuration is loaded dynamically. This paper discusses major design aspects of the ongoing implementation process based on a partial reconfiguration design flow. Thus, some microarchitectural components are put together to form a fixed module, while different sets of execution units build up reconfigurable ones. The communication between fixed and reconfigurable modules is assured by Xilinx Bus Macros.

1. Introduction

In accordance with Moore's law, new and more powerful processors have been developed and brought to market at regular intervals. These general purpose processors are very efficient for a wide range of applications but fail to meet restrictive time requirements for specialized applications. Such applications can be implemented optimally using application-specific integrated circuits (ASICs). However, neither ASICs nor "hard-wired" processors offer the flexibility to implement different hardware solutions on the same device. Hence, field-programmable gate arrays (FPGAs) are more and more used to implement flexible application-specific solutions. Furthermore, a significant performance gain can be achieved if partial and runtime reconfiguration is used. There are many ways to take advantage of runtime reconfiguration, and many reconfigurable systems and architectures have already been proposed. However, to get the best performance, one must be deeply involved in the underlying hardware implementation in order to adapt one's own application to the particular reconfigurable processing machine in an optimal way. In many cases, different programming models have to be applied, so that there is no compatibility between existing reconfigurable systems [4].

The use of specialized hardware and software libraries, a modified compiler process, software partitioning, etc. are some of the additional tasks a programmer has to manage. Moreover, performance gains are often demonstrated only for well-suited applications with inherent instruction-level parallelism, such as signal processing, image processing, or encryption/decryption. Many research projects have been carried out to increase performance by using reconfigurable hardware in the processor microarchitecture. Some of them add a reconfigurable datapath as a coprocessor [5] or as an extended functional unit [13], while others implement a fully configurable microarchitecture [8]. All these reconfigurable processors showed a notable performance gain for specialized applications but were not intended to execute arbitrary general purpose applications. Therefore, the challenge today is to build a competitive but also compatible reconfigurable processor which can also be used for mainstream computing with little effort of hardware and software adaptation.

We propose a runtime reconfigurable processor based on the universal von Neumann computing model, so that a wide range of applications can be easily executed. Moreover, no additional software and hardware effort has to be spent by the user, and compatibility with existing processors implementing the same processor architecture is assured. Although the lower circuit speed of FPGAs and the reconfiguration time penalty may decrease performance, additional microarchitectural enhancements such as runtime adaptation of the superscalar execution units (the number of available execution units is updated at runtime) will help to avoid performance loss. Our basic goal is not to build the reconfigurable processor with the best performance but to implement an easy-to-use processor based on dynamic reconfiguration which runs any kind of application with at least the same performance as comparable hard-wired systems. This may lead to a wide acceptance of reconfigurable computing based on FPGAs.


In the following section we present related work which makes use of reconfigurable hardware in processor architecture. Section 3 describes the proposed reconfigurable processor microarchitecture and discusses the current implementation process in detail, presenting major design aspects with a focus on superscalar out-of-order execution as well as on the partial reconfiguration design flow. The last section contains a short conclusion.

2 Related Work

There is a wide range of research projects using reconfigurable hardware in processor architecture. The first group we want to refer to proposes fixed microarchitectures based on conventional processors or conventional functional units extended with reconfigurable processing units. The latter execute specialized hardware-coded instructions to speed up applications; they act as a hardware accelerator or as a coprocessor. Programs running on these configurable machines have to be partitioned or scheduled differently, so that one code portion is executed on the conventional execution path and another one on the reconfigurable path. Furthermore, the programmer either has to invoke pre-compiled hardware libraries that implement specialized functions or tasks in hardware, or he has to encode the respective compute-intensive tasks in hardware himself and map them to the reconfigurable processing units. This kind of computing has been tested with reported performance benefits in projects such as PRISC [11], DISC [16], CoMPARE [13], GARP [5], OneChip [3], and T1000 [19]. All these configurable processors support the conventional way of sequencing instructions. A different way of computing is presented in [4]. The SCORE execution model is based on three essential components:

- a compute page: a fixed-size amount of reconfigurable hardware resources; this represents the basic processing unit,
- a memory segment: holds local data to be processed,
- a stream link: a logical connection between successive compute pages.

If an application is executed in SCORE, it first has to be partitioned into consecutive tasks called operators (e.g. multiplier, FFT, FIR filter). These operators are statically or dynamically loaded into the reconfigurable hardware together with their corresponding data memory segments. If there are enough hardware resources to implement all operators needed to build up a given application, there is no need to reconfigure the hardware dynamically; the application will be completely loaded before program execution.

Otherwise, the first operators which fit into the hardware resources are loaded, and after their computation is finished they are swapped out to memory to let the following operators be loaded. Additionally, a conventional processor is required to sequence the compute pages.

Another particular architecture is MATRIX [8]. This is a highly flexible reconfigurable computing architecture which allows the most suitable microarchitecture to be defined for each application. The instruction memory, the instruction flow, the data memory as well as the datapath are not fixed but are updated for every application within a multilevel configuration scheme. The basic architectural building component is an array of 8-bit basic functional units (BFUs). Each BFU includes a local memory, an 8-bit ALU and control logic. It can be configured to serve as instruction memory, data memory, datapath element or control element. A hierarchical network implements the configurable interconnect and enables data transfers between BFUs. The programming model of MATRIX consists of a hand-coded mapping of an algorithm onto the BFUs and setting up appropriate connections between them.

Flexible instruction processors (FIPs) [14] and their extension to runtime adaptive flexible instruction processors [15] realize a unique approach to implementing a dynamic processor microarchitecture. FIPs consist of processor templates that allow different processor types to be implemented dynamically by varying a set of predefined parameters. These parameters determine, for example, whether the processor is stack based or register based. Furthermore, they are used to customize data and instruction widths, to change the pipeline depth, or to add or remove hardware resources. By investigating the behaviour of the application at runtime and using dynamic reconfiguration, it is possible to adapt the implementation to the requirements of the application during execution [15]. The behaviour of the application is determined by runtime statistics such as the number of times functions are called or the most frequently used opcodes. The collected data are then used to determine the most suitable microarchitecture.

The last architecture referred to in this context is the Complexity-Adaptive Processor (CAP) [18], which allows the hardware complexity and the processor clock cycle to be adapted at runtime to match the requirements of the application. The result is optimized power dissipation and improved performance [1]. The CAP is implemented on fixed conventional hardware augmented with enable partition signals. Key architectural processor components (Register Update Unit, Functional Units, Cache, Register File) are therefore split into partitions that can be enabled or disabled dynamically according to the behaviour of the software.

All these related processors make use of hardware reconfiguration/adaptivity to enhance processor performance.


However, none of them is compatible with another one or with another system, and each of them requires different hardware and software tools to map applications onto it. In our approach, we take advantage of the hardware reconfiguration provided by FPGAs to build an enhanced runtime reconfigurable microarchitecture which is compatible with existing general purpose processors. Thus, it is necessary neither to re-implement applications nor to develop new software tools, e.g. a compiler. The program code therefore passes through the same compiler process as in the case of conventional computing. This is an evolutionary approach that may lead to better software compatibility and to the use of existing programming tools. In the same way as an AMD processor executes the same program differently than an Intel processor does, our microarchitecture intends to modify only the way instructions are executed. In doing so, it will be easier for potential users to take advantage of dynamic reconfiguration without having to deal with the microarchitectural complexity.

3 A Partial Runtime Reconfigurable Processor

Designing a superscalar processor with partial reconfiguration capabilities is a challenge with many design issues. The first aspect to be considered is the choice of a suitable device which supports partial and dynamic reconfiguration. Only a few SRAM-based programmable devices provide this feature, and this project targets a Xilinx Virtex-II FPGA. Moreover, a practical design flow is needed to implement a processor based on partial reconfiguration. Therefore, the module-based partial reconfiguration design flow [17] has been adopted, since it has been successfully used by other researchers [7] in this area. The second decision to make is to find an appropriate processor architecture, especially an Instruction Set Architecture (ISA), to implement. The ARM Thumb instruction set [2] was preferred because its instructions are coded in 16 bits, thus allowing an optimized resource allocation on the FPGA. In this way, any software development tool for the ARM Thumb instruction set can also be used to develop a program to be executed on this processor. The proposed partial runtime reconfigurable microarchitecture (Fig. 1) includes fixed functional units and different sets of execution units which are swapped during program execution [9]. They are listed and described as follows:

- Instruction Memory: We implemented an Instruction Memory using 18 Kbit dual-port block SelectRAMs of type RAMB16_S36_S36 to get a memory bandwidth of 64 bits, since these RAMs provide two completely independent access ports. Therefore, we are able to fetch four 16-bit Thumb instructions per cycle. This high fetch bandwidth is needed to provide enough instructions to the superscalar execution units at the beginning of program execution, before the Trace Cache is involved. This block RAM is implemented as a read-only memory, since we provide a separate Data Memory and self-modifying code is not allowed.

- Fetch Unit/Predecoder: The Fetch Unit provides a valid address to the Instruction Memory and to the Trace Cache. At the beginning of program execution, four instructions are fetched from the Instruction Memory in each clock cycle. The Trace Cache gets and stores a copy of the instructions fetched, and subsequent fetch operations first check the current fetch address in the Trace Cache. On a Trace Cache hit, a block of coherent instructions stored previously in a cache line and starting with the current fetch address is fetched from the Trace Cache. Otherwise, fetching is still performed from the Instruction Memory as usual. Pre-decoding scans the operations contained in the fetched instructions (opcodes) and notifies the Configuration Manager about the execution units needed to execute them. It is possible to determine exactly how many integer operations, load/store operations and floating-point operations are included within an instruction block. As a result of this runtime evaluation, the Configuration Manager invokes the partial reconfiguration process to load the most suitable configuration if necessary. At the end of a clock cycle, four instructions and the content of the program counter are sent to the Decoder.

- Trace Cache: The Trace Cache holds instructions that may be executed many times within one application. The concept of a Trace Cache was originally introduced to avoid an instruction supply bottleneck [12]. Meanwhile, Intel extended the concept as one of the key improvements of its NetBurst microarchitecture and implemented it in the latest Pentium 4 processor [6]. We combine the Trace Cache and the Predecoder to determine at runtime the hardware resources required to execute a running application in an optimal way. This specialized cache consists of a circular queue which is organized in cache lines. Each line is able to store up to 16 instructions, but if a branch instruction is encountered, it will be the last instruction in the current trace line and the next block of instructions starts at the following line (a behavioural sketch of this fetch-and-predecode path is given after the component list).

- Decoder: The Decoder acts as a conventional instruction decoder. It decodes instructions as usual, reads operands from the Register File and sends them to the Register Update Unit. Since program execution can force the pipeline to stall, two instruction buffers are provided at the decoder stage. In this way we prevent decoded instructions from overwriting those waiting for execution. One buffer is used to store instructions coming from the Fetch Unit that are to be decoded within the current clock cycle. The other buffer is used to store instructions decoded during the previous clock cycle; it serves as an input buffer for the Register Update Unit.


[Figure 1. Partial Runtime Reconfigurable Microarchitecture: the fixed module communicates via Bus Macros with the configurable modules Config 1, Config 2 and Config 3, which contain different mixes of Int-ALUs, Int-MDUs, LSUs, FP-ALUs and FP-MDUs.]

- Register File: The current processor implementation uses a Register File with eight 32-bit general purpose registers. An additional four-bit register is provided to store the condition code flags. The implementation is parameterized, so that the Register File can easily be extended to 16 or 32 registers by changing the value of a single parameter. The Register File can be synthesized as distributed or block RAM and, as a constraint, fits completely into the fixed module of the processor design.

- Register Update Unit (RUU): The RUU (Fig. 2) forms the central unit that collects decoded instructions and dispatches them to the different execution units. Furthermore, it has to resolve all dependencies (hazards) that may occur between instructions. For this purpose, a Dependency Buffer is provided which keeps the dependencies between instructions and registers up to date. A circular Instruction Queue stores instructions coming from the Decoder. These instructions stay in the same position until they are completely executed, i.e. until after the write-back stage. The Dependency Buffer and the Instruction Queue allow out-of-order execution, in-order completion and operand forwarding [10]. The RUU is the only functional unit which writes computation results to the Register File during the write-back stage. At each clock cycle, the RUU gets input vectors from the Configuration Manager that indicate the execution units currently available. These vectors have the size of the maximum number of execution units allowed, and the actual number of execution units available in each clock cycle is given by counting the number of bits set to '1' within them. Moreover, if an execution unit is active on the FPGA but is still busy executing a multicycle operation, its corresponding bit is set to '0' to show that this unit is not ready to accept a new instruction. To issue an instruction, the RUU has to go through all vectors representing the execution units and find a unit which is activated and not busy (a sketch of this vector-based issue check is given after the component list).

- Config1/Config2/Config3: The proposed superscalar microarchitecture includes different kinds of execution units:
  – an arithmetic and logic unit for integer operations (Int-ALU),
  – a multiplier/divider for integer operations (Int-MDU),
  – a load/store unit for memory operations (LSU),
  – a floating-point ALU for floating-point operations (FP-ALU),
  – a floating-point multiplier/divider for floating-point operations (FP-MDU).
A specific configuration provides a set of these execution units with a fixed number of each of them. By loading a different configuration during program execution, the number of execution units can be changed dynamically. At the current implementation stage, only an integer ALU and a Load/Store Unit have been designed. The Int-MDU, the FP-ALU and the FP-MDU have not yet been implemented, since they require an enormous amount of hardware resources. The ongoing prototyping is intended first of all as a proof of concept, and these units will be added once the partial reconfiguration scheme with a variable number of ALUs and LSUs has been realized. The Int-ALU and the LSU are replicated to get up to four execution units of each kind. Hence, putting together different numbers of these units allows different configurable modules to be implemented. A start-up configuration may consist of one ALU and one LSU; another configuration includes two ALUs and two LSUs. There are many design issues that have to be considered to ensure that instruction flow, data flow, dependencies etc. are accurately implemented. We are working on the partial reconfiguration design flow to integrate the different configurations (configurable modules) with the fixed module into a fully reconfigurable processor.

- Configuration Manager: The Configuration Manager stores pre-defined configurations and performs configuration swapping dynamically. To achieve this, each execution unit (ALU, LSU, etc.) is designed and implemented separately. Afterwards, these units are replicated as many times as needed, and putting together different functional units builds up the pre-defined configurations. A common bus and some additional design arrangements ensure that all these execution units work together properly. At runtime, the Configuration Manager gets an input signal from the Predecoder which indicates the kind and the number of execution units required to execute the fetched instructions. If the currently active configuration does not meet these requirements, a more suitable configuration is loaded, again using the partial reconfiguration flow (see the configuration-selection sketch after the component list). Moreover, the input vectors fed into the RUU are updated, so that the RUU can issue instructions appropriately.

- Data Memory: Data are stored in a separate memory to obtain a Harvard architecture [10]. The Load/Store Unit provides valid addresses, and 32-bit values are loaded or stored with four-byte alignment.

- Bus Macros: Bus Macros as provided by Xilinx are used to assure data communication between the fixed module and the reconfigurable ones. In their current definition, a bus macro is only four bits wide and is therefore able to transmit only four bits from one module to another. Since the data transmitted between the different modules are more than four bits wide, we have to instantiate as many bus macros as required to transmit the corresponding data width.
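To make the fetch-and-predecode path described above more concrete, the following Python sketch models a trace-cache line being filled with up to 16 instructions and terminated early by a branch, together with a predecoder that counts how many execution units of each kind a fetched block needs. This is a simplified behavioural model rather than the implemented hardware; the opcode classification, the number of trace lines and the helper names (classify, is_branch, TraceCache) are illustrative assumptions.

from collections import Counter, deque

LINE_SIZE = 16      # instructions per trace-cache line (from the text)
TRACE_LINES = 8     # number of lines in the circular queue (assumed)

def classify(opcode):
    """Map a Thumb mnemonic to the execution-unit class it needs (illustrative)."""
    if opcode in ("LDR", "STR", "LDMIA", "STMIA", "PUSH", "POP"):
        return "LSU"
    if opcode == "MUL":
        return "Int-MDU"
    return "Int-ALU"    # remaining integer/logic operations

def is_branch(opcode):
    return opcode.startswith("B")   # B, BL, BX, Bcc (simplified)

class TraceCache:
    """Circular queue of trace lines; a branch closes the line being filled."""
    def __init__(self):
        self.lines = deque(maxlen=TRACE_LINES)  # oldest line is evicted automatically
        self.current = []
        self.start_addr = None

    def insert(self, instructions):
        """Store a fetched block of (address, opcode) pairs."""
        for addr, opcode in instructions:
            if not self.current:
                self.start_addr = addr          # a line starts at its first instruction
            self.current.append((addr, opcode))
            if is_branch(opcode) or len(self.current) == LINE_SIZE:
                self.lines.append((self.start_addr, self.current))
                self.current = []

    def lookup(self, fetch_addr):
        """Return a stored line starting at fetch_addr (a hit), or None (a miss)."""
        for start, line in self.lines:
            if start == fetch_addr:
                return line
        return None

def predecode(instructions):
    """Count how many units of each kind a fetched block needs."""
    return Counter(classify(opcode) for _, opcode in instructions)

# Example: a block of four instructions fetched from the Instruction Memory
block = [(0x00, "MOV"), (0x02, "ADD"), (0x04, "LDR"), (0x06, "STR")]
tc = TraceCache()
tc.insert(block)
print(predecode(block))   # Counter({'Int-ALU': 2, 'LSU': 2})

Note that in this model a line only becomes visible to lookup once it has been closed by a branch or by reaching 16 instructions, which matches the line-filling rule given in the text.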
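The vector-based issue check of the RUU can be sketched in the same spirit. In the model below, decoded instructions stay in the queue until they retire in order, a simple dependency buffer records which register still waits for a producing instruction, and a ready instruction is dispatched to the first execution unit of its kind whose availability bit is '1'. Operand forwarding and the exact organization of the Dependency Buffer are not modelled; all data structures here are illustrative assumptions, not the actual RTL.

from dataclasses import dataclass, field

@dataclass
class Instr:
    kind: str        # "Int-ALU", "Int-MDU", "LSU", "FP-ALU" or "FP-MDU"
    dst: int         # destination register
    srcs: tuple      # source registers
    state: str = "waiting"   # waiting -> issued -> done -> retired

@dataclass
class RUU:
    queue: list = field(default_factory=list)       # circular Instruction Queue (modelled as a list)
    dep_buffer: dict = field(default_factory=dict)  # register -> producing instruction still in flight

    def insert(self, instr):
        """Accept a decoded instruction from the Decoder."""
        self.queue.append(instr)
        self.dep_buffer[instr.dst] = instr

    def ready(self, instr):
        """True if no source register still waits for an unfinished producer."""
        return all(self.dep_buffer.get(r) is None or self.dep_buffer[r].state == "done"
                   for r in instr.srcs)

    def issue(self, vectors):
        """vectors, e.g. {"Int-ALU": [1, 0, 1, 0], "LSU": [1, 1, 0, 0]}: a '1' means the
        unit is part of the active configuration and not busy this cycle."""
        for instr in self.queue:                 # out-of-order issue over the whole queue
            if instr.state != "waiting" or not self.ready(instr):
                continue
            vec = vectors.get(instr.kind, [])
            if sum(vec) == 0:                    # popcount = number of usable units
                continue
            unit = vec.index(1)                  # first activated, non-busy unit
            vec[unit] = 0                        # mark it busy for the rest of this cycle
            instr.state = "issued"

    def writeback(self):
        """Retire finished instructions strictly in order (in-order completion)."""
        while self.queue and self.queue[0].state == "done":
            retired = self.queue.pop(0)
            if self.dep_buffer.get(retired.dst) is retired:
                del self.dep_buffer[retired.dst]

In this sketch an execution unit is expected to set an issued instruction's state to "done" when it finishes, after which writeback() retires it in program order; how results actually travel back across the Bus Macros is outside the model.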
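Finally, the configuration-selection step performed by the Configuration Manager amounts to comparing the predecoded demand against the unit counts of the pre-defined configurations and swapping in the smallest configuration that covers it. The sketch below also computes how many Xilinx bus macros are needed to carry a signal of a given width across the module boundary, since each bus macro transports four bits. The concrete unit counts assumed for config_1 to config_3 are illustrative; the real manager invokes the module-based partial reconfiguration flow where the sketch merely prints a message.

import math

# Assumed unit counts for the three pre-defined configurations; the text only
# fixes that the Int-ALU and the LSU are currently replicated up to four times.
CONFIGS = {
    "config_1": {"Int-ALU": 1, "LSU": 1},
    "config_2": {"Int-ALU": 2, "LSU": 2},
    "config_3": {"Int-ALU": 4, "LSU": 4},
}

def satisfies(config, demand):
    """True if the configuration offers at least the demanded number of each unit kind."""
    return all(config.get(kind, 0) >= n for kind, n in demand.items())

def choose_configuration(active, demand):
    """Keep the active configuration if it covers the predecoded demand; otherwise
    select the smallest pre-defined configuration that does."""
    if satisfies(CONFIGS[active], demand):
        return active
    candidates = [name for name, cfg in CONFIGS.items() if satisfies(cfg, demand)]
    if not candidates:                       # demand exceeds every configuration
        return max(CONFIGS, key=lambda n: sum(CONFIGS[n].values()))
    best = min(candidates, key=lambda n: sum(CONFIGS[n].values()))
    print(f"partial reconfiguration: swap {active} -> {best}")
    return best

def bus_macros_needed(signal_width_bits, macro_width=4):
    """Each Xilinx bus macro carries four bits across the fixed/reconfigurable boundary."""
    return math.ceil(signal_width_bits / macro_width)

# Example: the predecoder reports three ALU operations and two memory operations
print(choose_configuration("config_1", {"Int-ALU": 3, "LSU": 2}))  # -> config_3
print(bus_macros_needed(32))                                       # a 32-bit bus needs 8 bus macros

Choosing the smallest sufficient configuration also matches the power argument made further below, where loading fewer instances of rarely used units is suggested as a way to reduce power consumption.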


[Figure 2. The Register Update Unit: the Instruction Queue and the Dependency Buffer receive decoded instructions from the Decoder, and per-unit availability vectors (e.g. Int-ALU [1010], Int-MDU [1000], LSU [1100], FP-ALU [1000]) connect the RUU via Bus Macros to the execution units of config_1.]

The advantage of the described reconfigurable microarchitecture is that no additional software adaptation is needed. The microarchitectural hardware evaluation using the Trace Cache is used to choose the most suitable configuration. However, even if up to 16 instructions can be fetched from the Trace Cache, the amount of instruction-level parallelism still depends on the application executed. Nevertheless, this microarchitecture is able to execute not only specific applications but any general purpose application from mainstream computing. On a highly flexible reconfigurable platform with no hardware restrictions, such a Trace Cache could also help to dynamically reconfigure other microarchitectural components. Instead of configuring only the number of execution units, one could dynamically adapt the hardware resources deployed for the Fetch Unit, the Decoder, the RUU, the Register File etc. to the variable size of an instruction block. Extending this concept to all microarchitectural processor components leads to the approach described in [15], which means creating a complete microarchitecture at runtime.

Another aspect to be considered is the possibility of managing the power consumption of the processor dynamically. We generally want to increase the number of execution units in order to achieve better performance, but it may be possible to achieve the same performance by loading a configuration with fewer instances of the units that are not frequently used. Thus, the power consumption can be optimized dynamically. Currently we use Xilinx XPower to monitor the power consumption. In addition, we are working on a global solution to model and analyze the dynamic power consumption.

4 Conclusion

In this work, a model of a runtime reconfigurable processor based on partial and dynamic reconfiguration is presented. This processor implements the ARM Thumb Instruction Set Architecture as a pipelined superscalar microarchitecture. The key novelty is its variable number of execution units, which is updated dynamically according to the behaviour of the program at runtime. Hence, no additional software tools are needed, so that compatibility with software and hardware systems based on the ARM Thumb ISA is assured.

Furthermore, the ongoing implementation process using Xilinx partial reconfiguration within the modular design flow is described. The current state of the implementation and the major aspects of the design process are presented and discussed. The functional units which are not reconfigured (Instruction Memory, Trace Cache, Fetch Unit/Predecoder, Decoder, RUU, Register File and Configuration Manager) build up a fixed module. Sets of execution units (Int-ALU, Int-MDU, LSU, FP-ALU, FP-MDU), each with an adaptable number of instances, form the configurable modules. These are dynamically swapped to allow a reconfiguration of the number of execution units at runtime. Xilinx Bus Macros are used to assure the control and data flow between the fixed and the configurable modules. The fixed module, which includes the most complex functional unit (the RUU), has already been designed. An integer ALU and a Load/Store Unit are also available, so that multiple instances of these units already permit different configurable modules to be integrated together with the fixed one. It is too early to make a statement about the performance gain over existing systems; indeed, this is not our basic goal. The first goal of this work is to show the feasibility of implementing a general purpose processor which takes advantage of dynamic reconfiguration in an easy-to-use way.


Unlike many other reconfigurable systems, the proposed processor is compatible with other processors executing the same instruction set (ARM Thumb), so that no hardware and software adaptation is needed for new users. The main focus of our current and future work is the implementation process on real hardware. Moreover, a model to analyze the dynamic behaviour of the power consumption of the processor is also part of our future work. The impact of the reconfiguration time penalty on performance, and the performance gain obtained by changing the superscalar execution units at runtime, also have to be investigated. All these parameters will finally determine the real performance balance.

References

[1] D. H. Albonesi. Dynamic IPC/clock rate optimization. Proc. 25th Int. Symp. Computer Architecture, pages 282–292, June 1998.
[2] ARM. ARM Architecture Reference Manual, 2000.
[3] J. Carrillo and P. Chow. The effect of reconfigurable units in superscalar processors. 9th ACM Int. Symp. Field-Programmable Gate Arrays, pages 141–150, February 2001.
[4] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Stream computations organized for reconfigurable execution (SCORE). Proc. 10th Int. Conf. Field-Programmable Logic and Applications, pages 605–614, August 2000.
[5] J. Hauser and J. Wawrzynek. Garp: A MIPS processor with a reconfigurable coprocessor. Proc. IEEE Symp. Field-Programmable Gate Arrays for Custom Computing Machines, April 1997.
[6] Intel. A Detailed Look Inside the Intel NetBurst Microarchitecture of the Intel Pentium 4 Processor, November 2000.
[7] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins. Interconnection networks enable fine-grain dynamic multi-tasking on FPGAs. Proc. 12th Int. Conf. Field-Programmable Logic and Applications, pages 795–805, September 2002.
[8] E. Mirsky and A. DeHon. A reconfigurable computing architecture with configurable instructions and deployable resources. Proc. IEEE Symp. FPGAs for Custom Computing Machines, April 1996.
[9] A. Niyonkuru, G. Eggers, and H. Zeidler. A reconfigurable processor architecture. Proc. 12th Int. Conf. Field-Programmable Logic and Applications, pages 1160–1163, September 2002.
[10] D. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[11] R. Razdan and M. D. Smith. A high-performance microarchitecture with hardware-programmable functional units. Proc. 27th Ann. Symp. Microarchitecture, pages 172–180, November 1994.

[12] E. Rotenberg, S. Bennett, and J. E. Smith. Trace cache: A low latency approach to high bandwidth instruction fetching. Proc. 29th Int. Symp. Microarchitecture, pages 24–34, December 1996.
[13] S. Sawitzki, A. Gratz, and R. Spallek. CoMPARE: A simple processor architecture exploiting instruction level parallelism. Proc. 5th Australasian Conf. Parallel and Real-Time Systems, pages 213–224, April 1998.
[14] S. Seng, W. Luk, and P. Cheung. Flexible instruction processors. Proc. CASES, February 2000.
[15] S. Seng, W. Luk, and P. Cheung. Run-time adaptive flexible instruction processors. Proc. 12th Int. Conf. Field-Programmable Logic and Applications, pages 545–555, September 2002.
[16] M. J. Wirthlin and B. L. Hutchings. A dynamic instruction set computer. Proc. IEEE Workshop FPGAs for Custom Computing Machines, pages 99–107, April 1995.
[17] Xilinx. Two Flows for Partial Reconfiguration: Module Based or Small Bit Manipulations, May 2002.
[18] B. Xu and D. H. Albonesi. Runtime reconfiguration for efficient general purpose computation. IEEE Design & Test of Computers, Special Issue on Configurable Computing, pages 42–52, March 2000.
[19] X. Zhou and M. Martonosi. Augmenting modern superscalar architectures with configurable extended instructions. 7th Reconfigurable Architectures Workshop / Proc. 15th Int. Parallel and Distributed Processing Symposium, pages 141–150, May 2000.
