
Considering run-time reconfiguration overhead in Task Graph Transformations for dynamically reconfigurable architectures

Sudarshan Banerjee, Elaheh Bozorgzadeh, Nikil Dutt
Center for Embedded Computer Systems, University of California, Irvine, CA, USA
{banerjee, eli, dutt}@ics.uci.edu

ABSTRACT

On modern dynamic FPGA-based platforms where multiple processes may execute concurrently, partial run-time reconfiguration (RTR) is a key technique for maximizing application performance under resource constraints. For platforms with column-based partial RTR, we propose a new technique that statically transforms linear task graphs (common in image-processing applications). In our approach, the granularity of data parallelism for each task is determined while considering the reconfiguration overhead along with the architectural constraints imposed by partial RTR. On JPEG applications, our technique improves execution time by up to 37% by choosing the right granularity of task parallelism.

1. INTRODUCTION

Run-time reconfiguration (RTR) provides the ability to change the hardware configuration during application execution, enabling designers to cope with the ever-increasing demand for larger and faster applications on resource-constrained systems [4]. However, there are major challenges in realizing the potential performance of such architectures. Our target architecture is a single-context RTR device that offers the possibility of additional performance through partial reconfiguration; an example is the Virtex series from Xilinx. On such devices, the key issues include the significant reconfiguration overhead, the constraint imposed by a single reconfiguration controller, and the columnar placement constraints. These reconfiguration-related issues need to be considered by system-level approaches that synthesize task-level applications (schedule, bind, place) onto such devices. The significant reconfiguration overhead affects key decisions such as the choice of task granularity and task reuse. As a specific example, parallelizing a task (i.e., decomposing it into several simultaneously executing subtasks) typically reduces its execution time; however, on such platforms, the reconfiguration overhead for the additional resources can significantly reduce the expected speedup.

In this work, we focus on simple linear task graphs common in image-processing applications [1], [2]. In such applications, while some tasks such as Huffman coding (used in JPEG encoding) are inherently sequential, other tasks are data-parallel. As an example, the basic 2-dimensional DCT (discrete cosine transform) operates on 8 X 8 blocks of pixels. An image of 256 X 256 pixels could theoretically be processed in parallel by (256 X 256)/(8 X 8) = 1024 independent DCT blocks, subject to the availability of sufficient logic resources. However, each additional block incurs a significant reconfiguration penalty. We propose a technique that considers reconfiguration overhead and configuration prefetch [3] while selecting a suitable task granularity, and thus statically transforms the task graph. We effectively trade off data parallelism against reconfiguration overhead to decide the number of independent blocks for each data-parallel task. This transformation is followed by simultaneous scheduling and columnar placement, where the scheduling integrates prefetch to reduce overhead. Experimental results on the JPEG encoder show a potential improvement of up to 37% with our proposed technique.

2. TARGET ARCHITECTURE AND GOALS

[Figure 1: Target dynamic architecture. The device is drawn as a two-dimensional matrix of CLBs divided into a computation region and a memory region with access to off-chip memory; a configuration frame spans the device height, and tasks occupy columns across the device width.]
The target dynamically reconfigurable device, shown in Figure 1, consists of a set of configurable logic blocks (CLBs) arranged in a two-dimensional matrix. The basic unit of configuration is a frame spanning the length of the device; a task occupies a contiguous set of columns. We assume that adjacent tasks communicate through a shared-memory abstraction; this shared memory can be physically mapped to local on-chip memory and/or off-chip memory, depending on the memory requirements of the application. With this abstraction, the communication overhead between two tasks is independent of physical placement and can be integrated into the task latency. A task Ti executing on such a system is represented as a 3-tuple (ci, ti, ri), where ci is the number of resource columns occupied by the task, and ti and ri are its execution time and reconfiguration overhead, respectively. Our objective is to statically determine the shortest schedule for executing a linear chain of tasks given that C columns are available.
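To make this model concrete, here is a minimal Python sketch of the task tuple and the chain-length baseline; it only restates the definitions above (it is not the authors' implementation), and the task names and numbers are hypothetical.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    c: int      # resource columns occupied
    t: float    # execution time (ms)
    r: float    # reconfiguration overhead (ms)

def chain_length_perfect_prefetch(chain):
    # Every reconfiguration is hidden behind the predecessor's execution;
    # the constant initial configuration cost is ignored, as in the paper.
    return sum(task.t for task in chain)

def chain_length_no_prefetch(chain):
    # Worst case: each task pays its reconfiguration overhead up front.
    return sum(task.r + task.t for task in chain)

# Hypothetical two-task chain mirroring Section 3 (widths 2 and 3 columns).
v0 = Task("v0", c=2, t=6.0, r=0.4)
v1 = Task("v1", c=3, t=9.0, r=0.6)
print(chain_length_perfect_prefetch([v0, v1]))  # t0 + t1 = 15.0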

3. PROPOSED APPROACH

We explain the key concepts behind our approach with Figure 2. Let us assume a simple chain of two tasks represented as vertices v0 and v1; these tasks occupy 2 and 3 columns, respectively. Given an area constraint of 6 columns, the best possible schedule length is t0 + t1, where t0 and t1 are the execution times of the two tasks. This of course assumes perfect prefetch, i.e., that the reconfiguration overhead r1 of task v1 can be hidden. Such a schedule is shown in Figure 2 (a), where the x-axis represents the area dimension (in columns) and the y-axis represents the time dimension.


[Figure 2: Transformation. Panels (a), (b), and (c) plot columns (x-axis) against time (y-axis). Panel (a) shows v0 followed by v1 with r1 hidden by prefetch, finishing at t0 + t1; panel (b) shows v0 split into v00, v01, v02 followed by two copies of v1 reconfigured after t0/3; panel (c) shows v0 unsplit followed by two copies of v1.]
Assuming that both tasks v0 and v1 are perfectly data-parallelizable, an alternate schedule is given by Figure 2 (b). We have essentially split task v0 into 3 parallel blocks, each operating on identical volumes of data. We assume a constant cost for the initial reconfiguration of the FPGA; thus, splitting v0 into 3 parallel blocks is essentially "free of charge". However, we now cannot hide the reconfiguration overhead for task v1, so task v1 can start only at time instant t0/3 + r1 instead of at t0 as for the original task set. Once we schedule the execution of task v1, we note that there is potential for further improvement by making two copies of v1; however, we incur an additional reconfiguration penalty for the second copy. Thus, the resultant schedule is of length t0/3 + 2r1 + (t1 - r1)/2, and the corresponding task graph with 5 nodes is shown in Figure 2 (b). This transformation is beneficial only if

t0/3 + 2r1 + (t1 - r1)/2 < t0 + t1.

Note that the transformation of task v1 is distinct from traditional transforms for parallelism: having considered the effect of prefetch and the reconfiguration penalty, we obtain identical tasks that process unequal volumes of data.

There obviously are more possibilities even in this trivial example with only two tasks. Another possibility is shown in Figure 2 (c), where the transformed graph has 3 nodes and a schedule length of t0 + r1 + (t1 - r1)/2. Given an area constraint of C columns and a width of ci columns for task Ti, it is easy to see that the maximum number of copies of Ti that could possibly yield benefits is ⌊C/ci⌋. Thus, the effective search space for this transformation has size ⌊C/c0⌋ × ⌊C/c1⌋ × ...

We have integrated the concepts described above into a greedy heuristic that evaluates the potential benefits (reduction in schedule length) of a subset of the search space and chooses the most promising configuration. While our heuristic also performs scheduling and linear placement, the transformed graph can of course be scheduled by any other approach targeted at such systems, i.e., one that performs simultaneous scheduling and placement.
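As a concrete restatement of these formulas, the following minimal Python sketch (not the authors' heuristic; the numeric values are hypothetical) computes the three schedule lengths of Figure 2 and the bound on useful copies, assuming reconfigurations are serialized on the single controller as described above.

def schedule_a(t0, t1, r1):
    # Figure 2(a): v0 then v1, with r1 fully hidden by prefetch.
    return t0 + t1

def schedule_b(t0, t1, r1):
    # Figure 2(b): v0 split 3 ways; the two copies of v1 can no longer
    # be prefetched and are reconfigured one after the other.
    return t0 / 3 + 2 * r1 + (t1 - r1) / 2

def schedule_c(t0, t1, r1):
    # Figure 2(c): v0 unsplit, first copy of v1 prefetched, second copy
    # reconfigured only after v0 frees its columns.
    return t0 + r1 + (t1 - r1) / 2

def max_useful_copies(C, c_i):
    # At most floor(C / c_i) copies of a task of width c_i can coexist.
    return C // c_i

t0, t1, r1 = 6.0, 9.0, 0.6   # hypothetical values, in ms
print(schedule_a(t0, t1, r1), schedule_b(t0, t1, r1), schedule_c(t0, t1, r1))
print(max_useful_copies(6, 3))  # at most 2 copies of v1 under 6 columns
# Transformation (b) pays off only when schedule_b(...) < schedule_a(...).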











































4. EXPERIMENTS

For the numerical data in our experiments, we assume a hardware device similar to the Xilinx XC2V2000. We obtained area (number of columns) and timing data by synthesizing tasks with Synplicity and the Xilinx place-and-route tools (ISE 6.2); placement was constrained such that each task occupied the minimum number of columns, and intra-task routing was constrained to lie within the task boundary. Using data from the device manual, we estimate the reconfiguration overhead of a CLB column on this device to be 0.19 ms, i.e., for a task occupying ci columns, the reconfiguration overhead is estimated as 0.19 * ci ms.

We carried out a case study of the JPEG encoder represented as a chain of 4 tasks: RGB2YCbCr, DCT, Quantize, Huffman. Huffman is a sequential task, but the other tasks are data-parallelizable. The encoder requires a total of 11 columns; while the XC2V2000 has more than 11 columns, for experimental purposes we assume that only a few columns are available in a multi-tasking environment. We carried out experiments with image sizes of 256 X 256, 512 X 512, and 1024 X 1024, and varied the area constraint from 6 to 8 columns. While the resource consumption of a task is identical regardless of data size, its execution time is directly proportional to the data size.

Table 1 consolidates our experimental data. Column C represents the area constraint (number of columns), T_opt^simple is the optimal schedule (in ms) for the input task graph, T_new is the schedule obtained from the transformed graph, TG size is the number of nodes in the transformed graph, and % is the performance improvement. As the % column shows, there is significant potential for improvement in schedules with our proposed transformation, especially as the data size increases. It is interesting to note that the transformed task graph for a 256 X 256 image under an area constraint of 8 columns has 7 tasks, while the graph for a 512 X 512 image has 9 tasks; this demonstrates different trade-offs between execution time and reconfiguration time depending on the data size. Another interesting aspect of the data is the significantly higher improvement possible once 8 columns are available; this is because DCT, the largest (in area) and most computationally expensive task in the chain, occupies 4 columns, and two parallel copies of DCT become possible once 8 columns are available.

Table 1: Performance improvement

Test case               C   T_opt^simple (ms)   T_new (ms)   TG size   %
jpgencode 256 X 256     6   10.62               9.69         6         8.8%
                        7   10.05               9.12         7         9.3%
                        8   10.05               7.72         7         23.2%
jpgencode 512 X 512     6   40.77               32.85        7         19.4%
                        7   40.20               32.3         7         19.7%
                        8   40.20               26.47        9         34.2%
jpgencode 1024 X 1024   6   161.37              125.03       7         22.5%
                        7   160.80              124.46       7         22.6%
                        8   160.80              100.69       9         37.4%
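For reference, the overhead model and the improvement metric reported in Table 1 can be written as a short Python sketch; the per-column cost, the 4-column DCT width, and the 11-column chain width come from the text above, and the final call reproduces one Table 1 entry.

COLUMN_RECONFIG_MS = 0.19   # estimated reconfiguration cost per CLB column

def reconfig_overhead_ms(columns):
    # r_i = 0.19 * c_i, the per-task overhead estimate used above.
    return COLUMN_RECONFIG_MS * columns

print(reconfig_overhead_ms(4))    # one 4-column DCT copy: ~0.76 ms
print(reconfig_overhead_ms(11))   # full 11-column JPEG chain: ~2.09 ms

def improvement_pct(t_opt_simple, t_new):
    # Percentage improvement as reported in the last column of Table 1.
    return 100.0 * (t_opt_simple - t_new) / t_opt_simple

print(round(improvement_pct(161.37, 125.03), 1))  # 22.5 (1024 X 1024, C = 6)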

5. SUMMARY

In this paper, we presented experimental evidence demonstrating the potential for performance improvement through task graph transformations that take into account the reconfiguration overhead and the physical (placement) constraints imposed by partial RTR. Our initial approach is a simple greedy heuristic focused on the linear task graphs common in image-processing applications. In the future, we will continue to develop more sophisticated approaches and extend these concepts to generalized DAGs. We also plan to integrate more detailed communication and memory issues into our framework.

6. REFERENCES

[1] J. Noguera and R. M. Badia, "Power-Performance Trade-offs for Reconfigurable Computing", CODES+ISSS, 2004.
[2] H. Quinn, L. A. Smith King, M. Leeser, and W. Meleis, "Runtime Assignment of Reconfigurable Hardware Components for Image Processing Pipelines", FCCM, 2003.
[3] S. Hauck, "Configuration Prefetch for Single Context Reconfigurable Processors", FPGA, 1998.
[4] M. J. Wirthlin, "Improving Functional Density Through Run-time Circuit Reconfiguration", PhD Thesis, Electrical and Computer Engineering Dept., Brigham Young University, 1997.
