CONTEXT SAVING AND RESTORING FOR MULTITASKING IN RECONFIGURABLE SYSTEMS Heiko Kalte

Mario Porrmann

University of Western Australia, School of Computer Science & Software Engineering, 35 Stirling Highway, Crawley WA 6009 Email: [email protected]

Heinz Nixdorf Institute, System and Circuit Technology, University of Paderborn, 33102 Paderborn, Germany email: [email protected]

ABSTRACT
Today's Field Programmable Gate Arrays (FPGAs) can be reconfigured partially, which makes it possible to share resources between various functional modules (hardware tasks) over time. This concept is well known in the area of conventional operating systems. However, in order to transfer resource sharing concepts to operating systems on FPGAs, several underlying mechanisms have to be developed. One of these mechanisms is to suspend hardware tasks and restart them at another time and/or in another area of the FPGA. Addressing this problem, this paper discusses ways to save and restore the state information of a hardware task. Afterwards, an implementation of a state relocation mechanism is presented that uses the standard configuration port. In contrast to similar approaches, we significantly reduce the amount of readback data by reading only those configuration frames that contain state information. We finally determine the time overhead for task relocation, which is essential for most multitasking concepts, like defragmentation¹.

1. INTRODUCTION
Currently available Field Programmable Gate Arrays (FPGAs) can integrate complete systems and are partially reconfigurable during runtime, which makes them suitable for general purpose reconfigurable computing. These systems suffer neither from the inflexibility of Application Specific Integrated Circuits (ASICs) nor from the mostly sequential processing of CPUs. Arbitrary functions can be implemented in hardware and can be downloaded to the FPGA fabric as exchangeable hardware tasks. However, as the number of hardware tasks increases with the available resources, the management of these resources becomes more and more of a concern. In order to solve this problem, researchers have started to develop hardware operating systems, which are capable of managing the reconfigurable resources. Resource management in hardware operating systems includes, e.g., scheduling, placement, suspension, restoring, and any kind of reorganization or relocation of incoming and running hardware tasks. In this paper we concentrate especially on the mechanisms for relocating hardware tasks during runtime, which are necessary whenever defragmentation or any reorganization has to be performed. As most tasks contain state information (e.g. state machines or pipeline registers), not only the hardware structures of the task, but also the states have to be relocated. Only in this way can hardware tasks continue at exactly the state at which they were interrupted before the relocation process was started.

2. COMPARISON OF DIFFERENT CONTEXT RELOCATION APPROACHES
Task switching in conventional CPU based operating systems means that the execution of a running task is suspended and a previously interrupted task is restored. In order to suspend a task, all necessary state information of the software task has to be saved, e.g., control register, flag register, code and data descriptors. The state information of a hardware task in an FPGA based reconfigurable system usually does not have one specific location. A running hardware task stores its current states in flip flops or RAMs distributed over certain FPGA areas. Consequently, one basic requirement to implement task switching in reconfigurable systems is to capture the content of all memory cells. The pros and cons of two different approaches are discussed in the following. The basis of our discussion is the Xilinx Virtex architecture [9].

Task Specific Access Structures
A straightforward way to read and restore the state information of a hardware task is to add an extra read/write interface to all state registers. That means that during normal operation the registers can be read and written by the circuit of the task, and during context saving and restoring the registers can be accessed by the relocation circuitry. There are several ways to implement this interface, e.g. as a scan chain or an addressable RAM structure. The data width of the interface can vary, but the tasks are usually connected by a communication infrastructure that is used to exchange data during normal operation. This communication infrastructure can also be used to read and restore the states of the tasks. That leads to a fixed data width of the state register interface. Depending on the number of state registers of the hardware task, the additional hardware structures that are needed in this context relocation approach can significantly increase the resource consumption. One way to reduce the number of registers that have to be handled is to implement a shutdown process for each task. This process is initiated before a task suspension to shut down the task to a state in which not all registers have to be read (e.g. a pipeline that is flushed). However, the advantage of reducing the state information by a shutdown process comes with extra hardware resources, extra design effort and extra shutdown time overhead. In particular, the design effort to implement the interface to the state registers and the shutdown process can become a disadvantage. Each task is different and there is no standard or generic interface or shutdown process that can be reused for all tasks. The designer has to have detailed knowledge about the structure and the behavior of the task. Furthermore, if the task comes as an IP core (Intellectual Property) of a third party provider, it can be completely impossible to add the mentioned structures. However, one advantage of this approach is the high data efficiency, as only the raw state information is read. Besides that, there is no need to filter the state information. Another advantage is the independence from the FPGA architecture. When the interface and shutdown structures are included in the design description, these structures are mapped to the FPGA like the task itself. There is no need to know any details about the bitstream or the FPGA architecture, unlike in the following approach. This or similar approaches are discussed, e.g., in [2] and [10].

¹ This work was supported by the Deutsche Forschungsgemeinschaft.

0-7803-9362-7/05/$20.00 ©2005 IEEE

Configuration Port Access
The Configuration Port Access approach is completely different from the previous one. It is based on the bitstream readback facilities of the FPGA's configuration port (in our case the Xilinx SelectMAP interface). This port offers the possibility to read arbitrary columns of the configuration memory including the current register values and the RAM contents. After or during reading the whole bitstream or just parts of it, the state information has to be filtered out of the readback stream. Subsequently, the stored state information can be restored in a relocated instance of the task by manipulating the preset bits of the appropriate flip flops as well as the RAM content in the bitstream to be downloaded. An example of this approach can be found in [4] and [5]. In contrast to the previously mentioned concept, the Configuration Port Access approach has the major advantage that no hardware structures have to be added to the tasks themselves. Instead of using access structures which are added by the designer, this approach uses the inherent access structures of the configuration circuitry and the configuration port. Consequently, there is no additional design effort or any extra resource consumption, the bandwidth of the normal communication infrastructure is not affected, it is not necessary to shut down the tasks, and the designer needs no information about the structure or the internal behavior of the task. The only requirement is that no interaction between the suspended task and other tasks takes place during relocation. However, the biggest disadvantage of this approach is the poor data efficiency, which means the proportion of useless data in the readback stream is rather high. To give an example, the complete readback stream of an XCV600E consists of about 483 kByte, and there are 15,552 Logic Cells (each containing one flip flop) and 294,912 Block RAM bits. Using all flip flops and all Block RAM results in 38 kByte of state information, which is less than 8% of the complete readback stream. In our designs typically less than 30% of the flip flops are used, which further reduces the state information in the readback stream. Additionally, the state information bits have to be extracted out of the readback stream, which requires additional computation power. Finally, reasonable knowledge about the configuration stream is necessary, which makes this approach rather technology dependent.

3. OUR RELOCATION APPROACH
Our relocation approach is based on the Configuration Port Access approach, because this approach requires no knowledge about the internal structure or behavior of each task; all register values are simply stored. Besides that, no extra design effort has to be spent to add task specific access structures or to add a shutdown mechanism. A few implementations of the Configuration Port Access approach have been reported in literature, e.g. by Simmler et al. in [5]. However, our implementation has some significant differences compared to previously published ones. The most important difference is the fact that not all configuration data is read back, but only those data that include state information and belong to the task to be suspended. Furthermore, the actual state information extraction is not done after but during reading the configuration data. These differences to other approaches significantly reduce the amount of data to be read back, the data to be stored, and finally the processing time.
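The data-efficiency estimate given for the XCV600E in the discussion of the Configuration Port Access approach can be reproduced with a few lines of arithmetic. The following is a minimal sketch only: the device figures are the ones quoted in the text, while the function name `state_kbyte`, its utilisation parameters and the use of 1024 bytes per kByte are assumptions made for this illustration.

```python
# Figures quoted in the text for the XCV600E.
READBACK_STREAM_KBYTE = 483   # complete readback stream (approximate)
FLIP_FLOPS = 15_552           # one flip flop per Logic Cell
BLOCK_RAM_BITS = 294_912


def state_kbyte(ff_utilisation=1.0, bram_utilisation=1.0):
    """Return the amount of state information in kByte.

    Assumption for this sketch: 1 kByte = 1024 bytes, and the
    utilisation factors simply scale the number of state bits.
    """
    bits = FLIP_FLOPS * ff_utilisation + BLOCK_RAM_BITS * bram_utilisation
    return bits / 8 / 1024


def data_efficiency(ff_utilisation=1.0, bram_utilisation=1.0):
    """Fraction of the complete readback stream that is state information."""
    return state_kbyte(ff_utilisation, bram_utilisation) / READBACK_STREAM_KBYTE


if __name__ == "__main__":
    print(f"state information: {state_kbyte():.1f} kByte")
    print(f"share of readback stream: {data_efficiency():.1%}")
    # With only 30% of the flip flops in use the share drops further.
    print(f"share at 30% flip flop use: {data_efficiency(0.3):.1%}")
```

Under these assumptions the fully utilised device yields roughly 38 kByte of state data, which stays below the 8% bound stated in the text.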

3.1. FPGA Architecture and System Approach
In order to understand the mechanism of context relocation, some background information is necessary. Therefore, we give a brief introduction to the Xilinx Virtex architecture and to our reconfigurable system approach.

FPGA Architecture
The Xilinx Virtex series FPGAs consist of configurable logic blocks (CLBs), input/output blocks (IOBs), RAM, a clock distribution tree, configurable routing resources and configuration circuitry. The configuration bitstream determines the behaviour of the device and consists of a sequence of commands and data. The internal configuration memory stores the bitstream and can be visualized as a rectangular array of bits. The bits are grouped into one-bit-wide vertical columns that extend from the top of the device to the bottom. These so-called frames are the atomic unit of configuration, which means that frames are the smallest portion of configuration memory that can be read or written. According to the different resources, multiple frames are grouped together into larger IOB, RAM and CLB columns. The address of each frame is subdivided into the major address (MJA), which determines the configuration column position in the memory, and the minor address (MNA), which selects one of the frames within a column (see [9] for further information).

Reconfigurable System Approach
Our reconfigurable system approach is mainly inspired by the previously introduced Xilinx Virtex architecture. In particular, the column-wise configuration and readback facilities have led to a 1D-placement approach, as can be seen in figure 1. All hardware tasks can be dynamically placed, relocated and erased within the area of the dynamic modules. In order to ensure high resource efficiency, minimum configuration overhead and small bitstream sizes, all hardware tasks are supposed to occupy the full height of the reconfigurable array while the width varies according to the module's complexity (see figure 1). In a previously published study including the implementation of several different designs we could prove that this constraint has hardly any effect on the resulting clock frequency, resource consumption and energy consumption of the hardware tasks [3]. This fact will also reduce the number of frames to be read for capturing all state information.

Fig. 1. Our reconfigurable system approach [1]

In order to provide communication facilities between dynamically exchangeable hardware tasks, we have implemented a communication infrastructure that horizontally spans the whole FPGA (see figure 1). This bus system is completely homogeneous, which makes it possible to dynamically relocate hardware tasks along the horizontal bus structure. This relocation process can be realized by bitstream manipulations that change the column addresses (MJA) of individual hardware tasks during the download process of the configuration bitstream. A hardware implementation of the REPLICA filter, which is capable of these manipulations (relocation without state information), has been published in [8]. The next section describes the necessary extensions to this filter to relocate the state information of a task in addition to the relocation of its hardware structures. For detailed information about our system approach, mechanisms to allocate and de-allocate hardware tasks, defragmentation, the design flow and aspects of communication infrastructure optimization, see [1].

3.2. Context Relocation Approach
Overview
The architecture of our context relocation approach can be seen in figure 2. There are four main function blocks and a database to perform a relocation process. The main blocks are the Configuration Manager, the State Extraction Filter, the State Inclusion Filter and the REPLICA Filter. The context relocation procedure will be introduced briefly, followed by a detailed description of the different blocks.

A context relocation process can be initiated, e.g., by the Resource Allocation block of our reconfigurable system approach (see fig. 1) in order to perform a defragmentation of the FPGA's resources or to perform any kind of reorganization. The first step of the whole process is to stop the clock of the particular hardware task or of all hardware tasks to prevent state changes during the read process. This is realized by clock gating and can be implemented much more easily than the previously mentioned task shutdown process. Subsequently, the Configuration Manager initiates the SelectMAP interface (or ICAP) to read all frames that contain state information. The addresses of the frames are calculated on the basis of the location information given by a database entry of the task. During the read process, all frames are continuously transferred to the State Extraction Filter, which determines the state value within each frame. By this means, all state values of the task are updated in the database entry without storing the whole stream.

The task is now suspended, but not de-allocated. That means a partial "empty" bitstream has to be downloaded to completely erase the task's circuitry. This de-allocation process is necessary whenever the placement approach allows arbitrary task sizes and locations (e.g. 1D- and 2D-placement, see [1]). Task structures that are not properly erased can cause contentions when a smaller task is allocated in this area later on.

After suspending the task, the restoring process can be initiated immediately or any time later on. The restoring process starts with the State Inclusion Filter, which integrates the register values of the database into the original partial bitstream of the hardware task, which is generated and stored before runtime. The resulting bitstream would still allocate the task at its original location, but with the new initial register states. Afterwards the REPLICA Filter relocates the hardware task from its original location to the FPGA column that is determined by the New Column Location input. Finally, the new partial bitstream, which is relocated and includes the states, is downloaded by the Configuration Manager. After resetting the hardware task, all registers are set to the proper value and the task can start processing in exactly the state in which it was interrupted before.

Fig. 2. Relocation Approach Overview
(Block diagram. Readback path: FPGA SelectMAP, Configuration Manager, partial readback bitstream, State Extraction Filter, state values written to the database. Writeback path: partial bitstream with original location and original states, State Inclusion Filter, partial bitstream with original location and new states, REPLICA Filter driven by the New Column Location input, partial writeback bitstream with new location and new states, Configuration Manager, FPGA SelectMAP. The figure also shows the following example database entry.)

Database Entry {
  Task Name: CPU
  Current Start Location: Col. 4
  Current End Location: Col. 10
  Bitstream Addr (Alloc): 0x0100 0000
  Bitstream Addr (De-Alloc): 0x0200 0000
  …
  States: {
    Signal Name: PRG_CNT
    Location: Col. 7, Row 27, Slice 0, FF 1
    State Value: 1
    …
    Signal Name: PRG_CNT
    Location: Col. 7, Row 35, Slice 1, FF 0
    State Value: 0
    ...
  }
}

Configuration Manager
The Configuration Manager reads and writes configuration data from/to the SelectMAP configuration interface. The writing part of the Configuration Manager simply reads 32-bit bitstream words from arbitrary memory locations and writes them in byte portions to the interface. For performance reasons, this part is already implemented in hardware; see [8] for further details. The reading part of the Configuration Manager is more complex, because we do not want to read back the whole configuration stream, but only those frames that contain state information. Therefore, the Configuration Manager takes the column (Col), slice (Slice) and flip flop (FF) values of the database entries for each state bit and generates the address of the frame that contains the current state value. The frame address consists mainly of the MJA and the MNA. Equations 1 and 2 show the necessary calculations (Chip_Cols determines the maximum CLB column number of the FPGA). The MNA can have only four different values, which means that all flip flop states of one CLB column are stored in only 4 frames. This results in a substantial reduction of the amount of data to be read, as a complete CLB column consists of 48 frames. Consequently, it makes sense to implement tasks in as few CLB columns as possible to ensure a reasonable amount of state information in each frame that is read. The output of the Configuration Manager is finally a stream of single frames that contain the state information of the hardware task.

Database
The database stores all necessary information about each hardware task. The most important data are the current location in terms of CLB columns, the memory addresses of the partial bitstreams to allocate and to de-allocate (empty bitstream) the task, and finally the location of all state registers. The state register location entries include the row and column values of the appropriate CLB as well as the slice number, the flip flop number and the current state value.

Fig. 3. Offline module database generation
(Flow diagram. The task HDL description, the communication hard macro and the column constraint are the inputs to synthesis and place and route, which produce a complete bitstream and a logic allocation file (task.ll). The PARBIT tool turns the complete bitstream into the partial bitstream. The task.ll file parser extracts the row/col addresses of the state registers and stores them in the task database.)

Figure 3 depicts parts of the design flow and the database generation process, which is performed before runtime. The inputs of the synthesis tools are the HDL descriptions of the hardware tasks, a hard macro of the communication infrastructure and the column constraints (see also [1]). The outputs are a complete bitstream and a logic allocation file for each task. With the help of the PARBIT tool [7], we generate the partial bitstream by cutting the appropriate columns out of the complete bitstream. The logic allocation file (task.ll) is parsed by a small self-developed tool that extracts the important data (Row, Col etc.) and stores the data in a database entry. As a result, all used registers of the design are saved and restored. Optionally, the user can reduce the number of database entries if further knowledge about the design and its state registers is available. Currently we reserve 19 bit per state register (8 bit row, 8 bit col, 1 bit slice, 1 bit flip flop and 1 bit current state value).
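To make the 19-bit budget per state register concrete, the following sketch packs one database entry into a single integer. It is an illustration only: it assumes that the two 8-bit fields hold the CLB row and column, and the field order inside the word, the class name `StateRegister` and the functions `pack` and `unpack` are hypothetical choices that the paper does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateRegister:
    row: int     # CLB row, 8 bit
    col: int     # CLB column, 8 bit
    slice_: int  # slice number, 1 bit
    ff: int      # flip flop number, 1 bit
    value: int   # current state value, 1 bit


def pack(reg: StateRegister) -> int:
    """Pack one entry into 19 bits (assumed layout: row|col|slice|ff|value)."""
    if not (0 <= reg.row < 256 and 0 <= reg.col < 256):
        raise ValueError("row and col must fit into 8 bit each")
    if any(bit not in (0, 1) for bit in (reg.slice_, reg.ff, reg.value)):
        raise ValueError("slice, ff and value are single bits")
    return (reg.row << 11) | (reg.col << 3) | (reg.slice_ << 2) | (reg.ff << 1) | reg.value


def unpack(word: int) -> StateRegister:
    """Inverse of pack()."""
    return StateRegister(
        row=(word >> 11) & 0xFF,
        col=(word >> 3) & 0xFF,
        slice_=(word >> 2) & 1,
        ff=(word >> 1) & 1,
        value=word & 1,
    )


if __name__ == "__main__":
    # Entry taken from the example database entry in figure 2.
    prg_cnt = StateRegister(row=27, col=7, slice_=0, ff=1, value=1)
    word = pack(prg_cnt)
    print(f"{word:019b}", unpack(word) == prg_cnt)
```

A fixed-width word like this keeps the database compact and lets a hardware filter read the location and the state value of a register in one access.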

\[ \mathrm{MJA} = \mathrm{Chip\_Cols} - 2 \cdot \mathrm{Col} + 2 \quad \text{(valid for left chip half and Virtex only)} \tag{1} \]

\[ \mathrm{MNA} = \mathrm{Slice} \cdot (12 \cdot \mathrm{FF} - 43) - 6 \cdot \mathrm{FF} + 45 \tag{2} \]

\[ \mathrm{Slice}, \mathrm{FF} \in \{0, 1\} \Rightarrow \mathrm{MNA} \in \{45, 39, 2, 8\} \]

Currently, most of the read process is implemented in software, running on a host processor. The access to the SelectMAP interface is provided by a register that controls all SelectMAP signals. However, in order to reduce the overall time overhead, we plan to completely implement the read process in hardware.
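The address calculation of the read process can be sketched in software as follows. The sketch applies equations 1 and 2 to each state register and, looking ahead to the State Extraction Filter, also the bit index of equation 3, then groups the state bits by frame so that each frame is read only once. The function names, the tuple layout of the entries, the 1-based column numbering and the value of 120 CLB columns used in the example are assumptions made for this illustration.

```python
def major_address(col, chip_cols):
    """Equation 1: MJA of a CLB column (left chip half, Virtex only)."""
    if not 1 <= col <= chip_cols // 2:
        raise ValueError("equation 1 only covers the left chip half")
    return chip_cols - 2 * col + 2


def minor_address(slice_, ff):
    """Equation 2: MNA of the frame that holds the flip flop state."""
    if slice_ not in (0, 1) or ff not in (0, 1):
        raise ValueError("slice and ff must be 0 or 1")
    return slice_ * (12 * ff - 43) - 6 * ff + 45


def bit_index(row):
    """Equation 3: position of the state bit within the frame."""
    return 18 * row + 1


def frames_to_read(entries, chip_cols):
    """Group state bits by frame address.

    entries: iterable of (col, row, slice, ff) tuples.
    Returns {(MJA, MNA): sorted list of bit indices}.
    """
    frames = {}
    for col, row, slice_, ff in entries:
        key = (major_address(col, chip_cols), minor_address(slice_, ff))
        frames.setdefault(key, []).append(bit_index(row))
    return {key: sorted(bits) for key, bits in sorted(frames.items())}


if __name__ == "__main__":
    # State registers from the example database entry in figure 2,
    # plus a third, made-up register that shares a frame with the first.
    entries = [(7, 27, 0, 1), (7, 35, 1, 0), (7, 35, 0, 1)]
    for address, bits in frames_to_read(entries, chip_cols=120).items():
        print(address, bits)
```

Grouping by frame address reflects the observation in the text that all flip flops of one column and one type share a frame, so three state bits in this example require only two frame reads.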


State Extraction Filter
The State Extraction Filter takes the readback stream of the Configuration Manager, extracts the state values and updates the database entries. For extracting the state value, the filter has to determine the correct bit index within the readback frames passed from the Configuration Manager. The necessary calculation is given in equation 3 (see [9]).

\[ \mathrm{bit\_idx} = 18 \cdot \mathrm{row} + 1 \tag{3} \]

It can be seen that the bit index depends only on the CLB row of the appropriate flip flop, which means that all flip flop values of the same column and the same type (e.g. Slice=0, FF=1) are located within one frame. Consequently, it is usually necessary to extract multiple state values out of one frame.

State Inclusion Filter
The State Inclusion Filter is the first step of the restoring process. The filter takes the original partial bitstream of the hardware task and includes all database state values by manipulating the preset bit of the registers. Similar to the state extraction process, the frame address and bit index of all state bits have to be calculated. The equations for the MJA and the bit index are the same as for the state extraction process (see eq. 1 and 3); only the MNA values differ. The MNA values for the preset bit of each of the four flip flops in a CLB are given in equation 4. The values are taken from [6] as they are not documented in [9].

\[ \mathrm{MNA} = \begin{cases} 41 & \text{if } \mathrm{Slice} = 0 \wedge \mathrm{FF} = 0 \\ 35 & \text{if } \mathrm{Slice} = 0 \wedge \mathrm{FF} = 1 \\ 6 & \text{if } \mathrm{Slice} = 1 \wedge \mathrm{FF} = 0 \\ 12 & \text{if } \mathrm{Slice} = 1 \wedge \mathrm{FF} = 1 \end{cases} \tag{4} \]

REPLICA Filter
Downloading the output stream of the State Inclusion Filter would allocate the task at its original location (after initial place and route). However, in most cases a new location has to be found according to the current resource allocation. Therefore, we developed the REPLICA filter, which is capable of relocating tasks by manipulating the partial bitstream of the task. The REPLICA filter parses the bitstream and replaces the column addresses (MJAs) within the bitstream. The relocation process can only be performed horizontally, which suits our reconfigurable system approach, where the task can only be placed and relocated along the horizontal communication infrastructure. The necessary manipulation, including the update of the CRC (Cyclic Redundancy Check) values within the bitstream, is implemented in hardware and does not cause any extra time overhead. The architecture and the hardware implementation of the REPLICA filter as well as an example have been published in [8]. Although currently only the REPLICA filter and parts of the Configuration Manager are implemented in hardware (the remaining blocks are running on an embedded or host CPU), we intend to realize the whole approach in hardware. In this way we can ensure the minimum suspension and restoring times and consequently the minimum relocation time overhead for each task.

4. RELOCATION TIME
The overall task relocation time is a key performance issue in each operating system. Only if this time can be kept as low as possible will concepts like defragmentation make sense. The relocation time in our hardware implemented relocation approach consists of three components: the state capture time, the de-allocation time and finally the allocation time. The bitstream manipulation processes of state inclusion and task relocation do not appear as separate terms, because they are assumed to be completely hidden in the task allocation time, which has already been shown for the task relocation with the REPLICA filter in [8]. Equation 5 shows the calculation of the overall task relocation time (assuming all blocks to be implemented in hardware).

\[ T_{Reloc} = 4 N_{Cols} \frac{N_{Init} + 2 N_{Byte/Frame}}{f_{SelectMAP}} + 2 \frac{N_{BitstreamSize}}{f_{SelectMAP}} \tag{5} \]

with \( N_{Init} = 7 \cdot 4 \) Byte.

The first summand represents the state capture time, while the second represents the allocation and de-allocation time. For the state capture time we assume that all 4 frames are read for each task column (N_Cols) (see eq. 2). However, depending on the placement of the flip flops, it is often sufficient to read fewer frames to determine all state values. In order to read a frame, the SelectMAP interface has to be initialized by N_Init cycles (7 double words), including synchronization, setting the frame address and the number of bytes to be read. Basically, it is possible to read multiple frames within one access, but as the frames are not located on consecutive addresses, a new read access has to be initiated for each frame. Unfortunately, the first frame of every new read access is always a pad frame which does not contain any useful data. Therefore, we have to read two times the frame size (N_Byte/Frame) to capture the intended frame.

Allocating and de-allocating a task takes the same time, because both bitstreams must reconfigure the same number of CLB columns in the FPGA. The time for de-allocation/allocation can be simply determined by dividing the bitstream size (in byte) by the SelectMAP (byte wide) interface frequency. Equation 6 shows a suitable simplification, which depends only on the task size (N_Cols), the frame size (N_Byte/Frame) and the SelectMAP frequency (f_SelectMAP).

\[ T_{Reloc} \approx 4 N_{Cols} \frac{2 N_{Byte/Frame}}{f_{SelectMAP}} + 2 \frac{48 N_{Cols} N_{Byte/Frame}}{f_{SelectMAP}} = N_{Cols} \frac{104 N_{Byte/Frame}}{f_{SelectMAP}} \tag{6} \]

For the simplification, all SelectMAP commands are left


unconsidered and the de-/allocation bitstream size is assumed to be 48 frames per task column (see [9] for details). Table 1 shows the complete relocation time for 7 different hardware tasks we investigated in [3]. All tasks have been implemented on an XCV2000E device, which has a frame length of 196 bytes. The complete relocation time has been calculated by eq. 5, assuming a SelectMAP (or ICAP) frequency of 50 MHz. The task size ranges from 1 to 36 CLB columns (30% of the device) and the overall relocation time ranges from 0.4 ms to 14.8 ms. Notably, for all tasks the read time comes to only about 8.2% of the complete relocation time. This is because the de-allocation and allocation time outweighs the state capturing process. Consequently, it does not make sense to put more effort into reducing the read time; even the data efficient Task Specific Access approach would not significantly reduce the overall relocation time.

Table 1. Resulting relocation time for several different hardware tasks

                             8-bit     FIR       32-bit    Linear      AES       S-Core    Octchip
                             Divider   Filter    Divider   Controller  Rijndael  CPU
Design size [slices]         112       323       861       1072        2137      3795      5747
Min. CLB Columns             1         2         6         7           14        24        36
Available Flip Flops         320       640       1920      2240        4480      7680      11520
Used Flip Flops              58        213       804       1150        868       1414      2287
Bitstream Size [KB]          9.3       18.5      55.3      64.6        129.0     221.1     331.7
De-/Allocation Time [µs]     190       379       1133      1322        2642      4529      6793
Frames to Read (#)           8         16        48        56          112       192       288
Readout Data [Byte]          1568      3136      9408      10976       21952     37632     56448
Readout Time (x) [µs]        33.6      67.2      201.6     235.2       470.4     806.4     1209.6
Complete Relocation [ms]     0.4       0.8       2.5       2.9         5.8       9.9       14.8

# includes one pad frame per information frame
x includes all SelectMAP commands
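The entries of Table 1 can be checked against equations 5 and 6 with a short calculation. The following is a minimal sketch: the frame length, the SelectMAP frequency, N_Init and the frame counts are taken from the text, whereas the function names are hypothetical and the bitstream size of 9500 bytes used in the example is inferred from the 190 µs de-/allocation time of the one-column task rather than stated in the paper.

```python
F_SELECTMAP_HZ = 50e6        # SelectMAP (or ICAP) frequency, byte wide
BYTES_PER_FRAME = 196        # frame length of the XCV2000E
N_INIT_BYTES = 7 * 4         # 7 double words to set up one read access
STATE_FRAMES_PER_COLUMN = 4  # frames holding flip flop states (eq. 2)
FRAMES_PER_COLUMN = 48       # frames of a complete CLB column


def state_capture_time(n_cols):
    """First summand of eq. 5: one pad frame plus one data frame per read."""
    per_frame = N_INIT_BYTES + 2 * BYTES_PER_FRAME
    return STATE_FRAMES_PER_COLUMN * n_cols * per_frame / F_SELECTMAP_HZ


def de_allocation_time(bitstream_bytes):
    """Time to write one partial bitstream, one byte per clock cycle."""
    return bitstream_bytes / F_SELECTMAP_HZ


def relocation_time(n_cols, bitstream_bytes):
    """Eq. 5: state capture plus de-allocation plus allocation."""
    return state_capture_time(n_cols) + 2 * de_allocation_time(bitstream_bytes)


def relocation_time_simplified(n_cols):
    """Eq. 6: SelectMAP commands neglected, 48 frames per task column."""
    return n_cols * 104 * BYTES_PER_FRAME / F_SELECTMAP_HZ


if __name__ == "__main__":
    # One-column task; 9500 bytes is inferred from the 190 us in Table 1.
    t5 = relocation_time(1, 9500)
    t6 = relocation_time_simplified(1)
    print(f"readout:  {state_capture_time(1) * 1e6:.1f} us")
    print(f"eq. 5:    {t5 * 1e3:.2f} ms")
    print(f"eq. 6:    {t6 * 1e3:.2f} ms")
    print(f"readout share: {state_capture_time(1) / t5:.1%}")
```

For the one-column task this reproduces the 33.6 µs readout time and a complete relocation time of about 0.4 ms, and the readout share stays close to the roughly 8% reported in the text.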

5. SUMMARY
In this paper we discussed the pros and cons of two different context relocation approaches. While the first approach relies on task specific access structures added by the designer, the second one uses the inherent configuration port. The advantage of the configuration port approach is the fact that no hardware structures have to be added to each task and there is no need to have detailed knowledge about the internal structure or behavior of the tasks. However, the previously published implementations of the configuration port approach always read all the configuration memory. In comparison, our implementation reads only those frames that contain state data (4 out of 48 frames). In this way, we can significantly reduce the amount of data to be read and to be stored as well as the effort to extract the state values of the registers. Currently, some parts of our context relocation approach are still realized in software. However, after implementing the whole approach in hardware, the state extraction and insertion will be performed as a filter operation during the read and writeback process, respectively. This leads to a minimum relocation time overhead, which is essential for any kind of reorganization like defragmentation.

6. REFERENCES

[1] H. Kalte, M. Porrmann, and U. Rückert: System-on-programmable-chip approach enabling online fine-grained 1D-placement. In Proc. of the 11th Reconfigurable Architectures Workshop (RAW), Santa Fé, New Mexico, USA, 2004.

[2] M. Ullmann, M. Hübner, B. Grimm, J. Becker: An FPGA Run-Time System for Dynamical On-Demand Reconfiguration. In 11th Reconfigurable Architectures Workshop (RAW), Santa Fé, New Mexico, USA, 2004.

[3] H. Kalte, G. Lee, M. Porrmann, U. Rückert: Study on Column Wise Design Compaction for Reconfigurable Systems. In Proc. of the International Conference on Field Programmable Technology (FPT), Brisbane, Australia, 2004.

[4] S. A. Guccione, D. Levi, and P. Sundararajan: JBits: A Java-based interface for reconfigurable computing. In Second Annual Military and Aerospace Applications of Programmable Devices and Technologies Conference (MAPLD), September 1999.

[5] H. Simmler, L. Levinson, and R. Männer: Multitasking on FPGA Coprocessors. In Proc. of the 10th International Workshop on Field Programmable Gate Arrays (FPL), 2000.

[6] H. Simmler: Preemptive Multitasking auf FPGA Prozessoren, Dissertation, University of Mannheim, 2001.

[7] E. Horta, J. W. Lockwood: PARBIT: A tool to transform Bitfiles to Implement Partial Reconfiguration of Field Programmable Gate Arrays (FPGAs), Tech. Rep. WUCS01-13, Washington University, July 2001.

[8] H. Kalte, G. Lee, M. Porrmann, U. Rückert: REPLICA: A Bitstream Manipulation Filter for Module Relocation in Partial Reconfigurable Systems. In Proc. of the 12th Reconfigurable Architectures Workshop (RAW), Denver, Colorado, USA, April 2005.

[9] Xilinx Application Notes 151: Virtex Series Configuration Architecture User Guide, 2000, http://www.xilinx.com.

[10] J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, R. Lauwereins: Infrastructure for Design and Management of Relocatable Tasks in a Heterogeneous. In Proceedings of DATE 2003 Conference, Munich, Germany, 3-7 March 2003.