EFFICIENT SCHEDULING OF RATE LAW FUNCTIONS FOR ODE-BASED MULTIMODEL BIOCHEMICAL SIMULATION ON AN FPGA

Naoki Iwanaga†, Yuichiro Shibata†, Masato Yoshimi‡, Yasunori Osana‡, Yow Iwaoka‡, Tomonori Fukushima‡, Hideharu Amano‡, Akira Funahashi∗, Noriko Hiroi∗, Hiroaki Kitano∗, Kiyoshi Oguri†

† Dept. Comp. Info. Sciences, Nagasaki University, Japan
‡ Dept. Info. Comp. Science, Keio University, Japan
∗ Kitano Symbiotic Systems Project, ERATO-SORST, JST, Japan

ABSTRACT

A reconfigurable biochemical simulator that solves ordinary differential equations (ODEs) has received attention as a personal high-speed simulation environment for biochemical researchers. For efficient use of the reconfigurable hardware, static scheduling of high-throughput arithmetic pipeline structures is essential. This paper presents and compares several scheduling alternatives and analyzes the tradeoffs between performance and hardware amount. The evaluation shows that sharing first scheduling reduces the hardware cost by 33.8% on average, with up to 11.5% throughput degradation. The effects of sharing rate law functions among solver cores are also analyzed.

1. INTRODUCTION

A demand for high-speed simulation of cellular systems has been growing rapidly in the field of bioinformatics. Although many biochemical simulators have been developed so far [1][2][3][4][5], whole-cell simulation is still a challenge for both biologists and computer scientists. We are developing a reconfigurable ordinary differential equation (ODE)-based biochemical simulation system called ReCSiP [6]. The structure of ReCSiP can be tailored to each simulation target by using FPGAs, so high-speed simulation is carried out without losing flexibility. Moreover, in order to ensure wide versatility of simulation targets, ReCSiP adopts the Systems Biology Markup Language (SBML) [7] as the description language for simulation models. Since SBML is supported by many biochemical software packages, simulation models are easily exchanged with other software simulators. Using ReCSiP's front-end software, simulation models written in SBML are translated into hardware ODE solvers and executed directly on an FPGA. The ODE solvers calculate the concentrations of substances in the simulation target for each time step by integrating rate law functions. The source of ReCSiP's performance advantage is its deep pipeline structure enabled by perfect static scheduling. SBML predefines 33 frequently used rate law functions, so efficient arithmetic scheduling of these functions is crucial. This paper presents and compares scheduling policies for rate law functions and analyzes the tradeoffs between performance and hardware amount.

2. BACKGROUND

2.1. SBML and rate law functions

0-7803-9362-7/05/$20.00 ©2005 IEEE




Systems Biology Markup Language (SBML) is a commonly used description language for simulation models in systems biology. SBML uses a common XML-based format for encoding systems biology models so that it can serve as an exchange format between software tools. With modeling tools such as CellDesigner [8], a structured diagram editor for drawing gene-regulatory and biochemical networks, users can construct simulation models by drawing the target system on the screen.

SBML predefines 33 rate law functions that are frequently used in biochemical modeling. The arguments of a rate law function are the concentrations of substances, the velocities of related reactions and some coefficients. The return value is the velocity of the reaction.

2.2. ReCSiP ODE solvers

The heart of ReCSiP is its ODE solvers, which integrate the velocities of reactions and calculate the concentrations of substances for each time step. Although this requires a large number of floating point arithmetic operations, a high degree of throughput can be achieved by statically and completely scheduling the pipeline structure of the solvers. In particular, since biologists often want to launch multiple simulations for the same target with different parameters at the same time, a perfectly scheduled deep pipeline structure is quite efficient. Moreover, by configuring only the solvers that are required by a target SBML description, it is possible to devote the rest of the FPGA area to replicating and parallelizing the solvers.
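As a software analogy for what a solver core computes (this is not ReCSiP code; the Michaelis-Menten rate law and all names here are illustrative assumptions), one explicit-Euler time step evaluates a rate law to obtain a reaction velocity and integrates it into the concentrations:

```python
# Illustrative sketch (not ReCSiP code): one explicit-Euler time step that
# integrates a rate law velocity into the substance concentrations.

def michaelis_menten(S, V, Km):
    """An SBML-style rate law: returns the reaction velocity for concentration S."""
    return V * S / (Km + S)

def euler_step(conc, dt):
    """Advance the concentrations by one time step of size dt."""
    v = michaelis_menten(conc["S"], V=1.0, Km=0.5)  # assumed coefficient values
    return {
        "S": conc["S"] - v * dt,  # substrate consumed by the reaction
        "P": conc["P"] + v * dt,  # product produced by the reaction
    }

state = {"S": 1.0, "P": 0.0}
for _ in range(100):
    state = euler_step(state, dt=0.01)
```

In ReCSiP this loop body becomes a statically scheduled hardware pipeline rather than sequential software.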

Table 1. Floating point arithmetic modules

Module      Stages  Slices used (%)     Frequency (MHz)
adder       6       557/33088 (1.68)    133.3
multiplier  8       273/33088 (0.82)    207.6
divider     17      1702/33088 (5.14)   148.7


3. SCHEDULING OF SOLVER CORES

17

Cr iti ca

lP

at h

replicate and parallelize the solvers.

25

0

3.1. Floating point arithmetic modules


A solver core, the reaction-specific circuit that calculates the change in concentrations for each time step, consists of IEEE-754 compliant single-precision floating point arithmetic modules and shift registers. In order to achieve high-throughput operation, fully pipelined arithmetic modules are designed. The operation latencies of the adder, multiplier and divider are 6, 8 and 17 clock cycles, respectively, and are fixed regardless of operand values. Implementation results of the arithmetic modules are shown in Table 1. Here, synthesis, place and route were done with Xilinx ISE 6.3i. The target device is the Xilinx XC2VP70 mounted on the ReCSiP board. In terms of used slices, the FPGA can accommodate a few dozen arithmetic modules.
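A quick back-of-the-envelope check of the "few dozen modules" claim, using only the slice counts from Table 1 (the particular module mix below is an arbitrary assumption):

```python
# Rough capacity estimate from Table 1; the XC2VP70 has 33088 slices.
SLICES = {"adder": 557, "multiplier": 273, "divider": 1702}
TOTAL_SLICES = 33088

# An arbitrary example mix of 10 adders, 10 multipliers and 5 dividers:
mix = {"adder": 10, "multiplier": 10, "divider": 5}
used = sum(SLICES[m] * n for m, n in mix.items())
print(used, round(100 * used / TOTAL_SLICES, 1))  # 16810 slices, about 50.8%
```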

3.2. Critical path first scheduling policy

To achieve as high a throughput as possible, solver cores must work in a fully pipelined manner. In the pipeline structure, the floating point arithmetic modules have to be efficiently scheduled and connected with each other. Several scheduling alternatives are possible, and there are tradeoffs between performance and required resources. The following critical path first scheduling policy is intended to minimize the execution latency. For the explanation, we use the following example of a rate law function:

v = V(S + Ka) / ((Km + Ac + S + Ka)(Km + Ac))    (1)

where S and Ac are concentration arguments, while V, Km and Ka are coefficient arguments.

1. At the beginning, the pipeline pitch P of the given rate law function is determined. In ReCSiP, a solver core has two input ports: one for reading concentration data and the other for reading coefficients. Therefore, P is determined as

P = max(Nx, Nk)    (2)

where Nx is the number of concentration arguments and Nk is the number of coefficients. Since Nx = 2 and Nk = 3, P = 3 for the example function. This means the solver core for the example function can calculate velocities every three clock cycles in a pipelined manner.

2. Next, the arithmetic operations on the critical path of the corresponding data flow graph (DFG) are scheduled in an as-soon-as-possible (ASAP) manner (Fig. 1). This ensures the minimum execution latency.

3. Then, the other arithmetic operations are scheduled so as not to affect the critical path operations. Here, sharing of arithmetic units is also taken into consideration. If two operations of the same type are scheduled at times ta and tb, the corresponding arithmetic units can be shared unless ta ≡ tb (mod P). Therefore, operations not on the critical path are scheduled between their ASAP and as-late-as-possible (ALAP) times so as to minimize the number of required arithmetic units (Fig. 2).

Fig. 1. A DFG for the example solver

Fig. 2. Final scheduling results of the example solver

3.3. Sharing first scheduling policy

The critical path first scheduling policy explained above achieves the shortest latency in terms of clock cycles, since no other scheduling method can shorten the critical path of a DFG. On the other hand, by inserting shift registers between arithmetic units on the critical path, the starting time of each arithmetic unit can be adjusted so that more arithmetic units are shared. Of course, while this sharing first scheduling policy alleviates the hardware cost, the execution latency of the solver core is degraded.
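The sharing rule in step 3 can be sketched in a few lines of Python (an illustrative reimplementation, not the authors' scheduler): operations of the same type may share one pipelined unit only if their start times differ modulo the pipeline pitch P, so the minimum number of units equals the size of the busiest residue class.

```python
from collections import Counter

def can_share(ta, tb, pitch):
    """Two same-type operations can share a unit unless ta ≡ tb (mod P)."""
    return ta % pitch != tb % pitch

def units_needed(start_times, pitch):
    """Minimum units of one type: operations that collide modulo P need
    distinct units, so count the busiest residue class."""
    residues = Counter(t % pitch for t in start_times)
    return max(residues.values(), default=0)

# Example with pitch P = 3, as derived for the example rate law function.
adds = [0, 6, 7]              # assumed start cycles of three additions
print(can_share(0, 6, 3))     # False: 0 ≡ 6 (mod 3), the unit cannot be shared
print(can_share(0, 7, 3))     # True
print(units_needed(adds, 3))  # 2: two adders suffice for the three additions
```

Sliding non-critical operations within their ASAP-ALAP windows is what lets the scheduler minimize this count.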


Table 3. Performance decrease by sharing first scheduling

Function  Throughput (%)  Execution time (%)
UCTR      6.15            12.61
UMAR      11.48           13.35
UMR       11.05           13.45
UNIR      7.88            13.93
UCTI      0.00            1.83
UMAI      3.83            5.79
UMI       0.00            1.89
UNII      0.00            1.83
UUCI      0.00            1.83
UUCR      0.00            10.82
UAII      4.37            8.46
UAR       0.00            2.04
UCII      4.42            8.46
UCIR      0.00            2.17

Table 4. Shared functions

Name       Covered functions
A-shared   UMR, UCTR, UMAR, UNIR
B-shared   UCTI, UMAI, UMI, UNII
C-shared   UUCI, UUCR
D-shared   UAII, UAR, UCII, UCIR
10-shared  A-, B-, C-shared


In terms of throughput, the average degradation ratio is not severe (3.8%), although UMAR and UMR show degradation ratios of more than 11%. On the other hand, up to 14% degradation is observed in the execution time. This is due to the stretched critical path operations, which directly impact the execution time. However, since multiple sets of simulations with different parameters for the same target are often launched in bioinformatics, throughput is more important than execution time. Compared to software execution on a 1.13-GHz Intel Pentium III, sharing first scheduling achieves 2.8 times higher throughput. Moreover, small solver cores can be replicated and parallelized on the FPGA. Therefore, this degree of execution time degradation is acceptable as the cost of the hardware size reduction.

4. EVALUATION AND DISCUSSION

4.1. Required hardware

For comparison, we designed and implemented 14 solver cores for SBML predefined rate law functions. Table 2 summarizes the implementation results and performance of the solver cores and compares the impact of the two scheduling policies: the critical path first scheduling policy, which achieves the shortest critical path execution time, and the sharing first policy, which aims at reducing the hardware amount. The target device is again the Xilinx XC2VP70. As shown in Table 2, a solver core occupies 10 to 15% of the FPGA slices with critical path first scheduling, while only 7.8 to 9.0% of the slices are used when the critical path operations are stretched out. In terms of the required number of slices, a hardware reduction ratio of 33.8% is achieved on average with sharing first scheduling. The arithmetic units are shared most effectively in the UCII solver core, which shows a hardware reduction ratio of 44.3%.

4.3. Function sharing

Some of the predefined functions have similar formulas. Therefore, shared function solver cores that calculate multiple types of predefined functions can be implemented by sharing some arithmetic units and adding an extra multiplexer to select the function to be calculated. Here, 5 types of shared function cores are designed, as shown in Table 4. For instance, the A-shared solver can calculate one of the UCTR, UMAR, UMR and UNIR functions, and the 10-shared solver can calculate one of the 10 functions covered by the A-shared, B-shared and C-shared solver cores.

Table 5 shows the implementation results of the shared function solver cores when the critical path first scheduling policy is applied. All the shared function solvers occupy 10% or less of the XC2VP70. Compared to the sum of the slices used by the UCTR, UMAR, UMR and UNIR single solver cores, the A-shared solver core reduces the number of required slices by approximately 70%. Similarly, the B-shared, C-shared and D-shared solvers reduce the area by 75%, 56% and 76%, respectively. Moreover, the area reduction of the 10-shared solver core amounts to 88%.

Although function sharing efficiently reduces the hardware cost of solver cores, performance is degraded, as shown in Table 5. For instance, the A-shared solver core degrades the throughput of the UCTR and UNIR single solvers by 25%, while the throughput degradation for UMAR and UMR is small.

4.2. Performance

Here, the impact of the choice of scheduling policy on performance is analyzed. Table 2 shows the execution latency, maximum frequency, maximum throughput and execution time of the solver cores. As shown in Table 2, the solver cores calculate the velocities of 15 to 40 million reactions a second. Applying sharing first scheduling, the execution latency of the solvers increases by up to 4 clock cycles. Moreover, the maximum frequency decreases by about 5 MHz on average, since a high degree of arithmetic sharing results in the generation of complicated multiplexers. Table 3 shows how much the sharing first scheduling policy degrades the throughput and execution time of the solver cores.
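The figures in Table 2 are related by simple pipeline arithmetic: maximum throughput is fmax / P (one result every P cycles) and execution time is latency / fmax. A small sketch with assumed example numbers (not taken from Table 2):

```python
# Pipeline arithmetic relating clock frequency, pipeline pitch and latency
# to the reported throughput / execution time figures (illustrative values).

def throughput_mreactions(fmax_mhz, pitch):
    """One velocity every `pitch` cycles -> fmax/pitch million reactions/s."""
    return fmax_mhz / pitch

def exec_time_ns(latency_cycles, fmax_mhz):
    """Total latency in ns: cycles times the clock period (1000/fmax ns)."""
    return latency_cycles * 1000.0 / fmax_mhz

# Assumed solver: 60-cycle latency, pipeline pitch 3, running at 120 MHz.
print(throughput_mreactions(120.0, 3))  # 40.0 (Mreactions/s)
print(exec_time_ns(60, 120.0))          # 500.0 (ns)
```

This is why stretching the critical path (more cycles) hurts execution time directly, while throughput suffers only through the pitch and the lowered clock frequency.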


Table 2. Required FPGA resources and performance comparison
(left group: critical path first scheduling; right group: sharing first scheduling;
Tput in Mreactions/s, Exec in ns)

Function  Slices (%)    FFs    LUTs   Tput   Exec   | Slices (%)   FFs    LUTs   Tput   Exec   | Reduction ratio (%)
UCTR      3537 (10.7)   3254   6210   24.4   457.9  | 2934 (8.9)   3003   4987   22.9   524.0  | 17.0
UMAR      5100 (15.4)   4836   9087   18.3   507.6  | 2910 (8.8)   2963   4959   16.2   585.8  | 42.9
UMR       5071 (15.3)   4840   9030   18.1   509.1  | 2897 (8.8)   2961   4939   16.1   588.2  | 42.9
UNIR      3500 (10.6)   3251   6146   24.1   463.9  | 2987 (9.0)   3010   5092   22.2   539.0  | 14.7
UCTI      4330 (13.1)   4309   7641   38.9   462.6  | 2799 (8.5)   2737   4822   38.9   471.2  | 35.4
UMAI      4429 (13.4)   4405   7815   28.7   468.9  | 2888 (8.7)   2855   4976   27.6   497.7  | 34.8
UMI       4379 (13.2)   4399   7729   27.7   486.9  | 2849 (8.6)   2856   4899   27.7   496.3  | 34.9
UNII      4305 (13.0)   4305   7597   38.9   462.6  | 2765 (8.4)   2732   4752   38.9   471.2  | 35.8
UUCI      4234 (12.8)   4263   7490   38.9   462.6  | 2694 (8.1)   2687   4646   38.9   471.2  | 36.4
UUCR      3434 (10.4)   3206   6044   22.9   487.3  | 2898 (8.8)   2962   4956   21.9   546.4  | 15.6
UAII      4683 (14.2)   4601   8398   40.7   384.3  | 2630 (7.9)   2699   4605   38.9   419.8  | 43.8
UAR       4324 (13.1)   4358   6950   22.4   436.9  | 2673 (8.1)   2800   4494   22.4   446.0  | 38.2
UCII      4660 (14.1)   4599   8355   40.7   384.3  | 2596 (7.8)   2694   4536   38.9   419.8  | 44.3
UCIR      4283 (12.9)   4345   6875   22.4   436.9  | 2692 (8.1)   2818   4483   22.4   446.0  | 37.1


Table 5. Implementation results of shared function cores
(throughput in Mreactions/s, execution time in ns; critical path and pipeline pitch in clock cycles)

Solver core  Slices (%)   FFs    LUTs   Critical path  Pipe pitch  Add  Mul  Div  Throughput  Exec time
A-shared     5183 (9.1)   4874   9233   56             6           2    1    2    15.9        596.2
B-shared     4480 (8.8)   4460   7905   54             4           1    1    2    27.6        497.7
C-shared     3435 (8.8)   3203   6051   56             5           2    1    1    22.3        536.6
D-shared     4320 (8.1)   4347   6963   49             5           1    1    2    22.3        446.8
10-shared    5235 (9.1)   4872   9338   56             6           2    1    2    17.2        552.3

This is because the latency and pipeline pitch of a shared function solver are dictated by the original single solvers with the longest latency or pipeline pitch. The A-shared solver calculates the UCTR and UNIR functions with a 6-cycle pipeline pitch, one clock cycle worse than the original solvers, due to the support for the UMAR and UMR functions. Compared to the A-shared, B-shared, C-shared and D-shared solvers, the throughput of the 10-shared solver is degraded by 44%. In terms of execution time, the 10-shared solver is degraded by 14% compared with the fastest execution time among the original single solvers. This implies that the adequate degree of function sharing is 2 to 4 functions per core.
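In software terms, a shared function solver is one datapath plus a select input, analogous to the added multiplexer. The sketch below uses placeholder formulas, since the actual SBML-predefined UCTR/UMAR/UMR/UNIR rate laws are not reproduced in this paper:

```python
# Sketch of a shared-function solver: one entry point, with a "mux" selecting
# which rate law the shared datapath computes. The formulas are placeholders,
# NOT the actual SBML-predefined UCTR/UMAR definitions.

RATE_LAWS = {
    "UCTR": lambda S, V, Km: V * S / (Km + S),      # placeholder formula
    "UMAR": lambda S, V, Km: V * S * S / (Km + S),  # placeholder formula
}

def a_shared_solver(select, S, V, Km):
    """Compute the rate law chosen by `select` (the multiplexer input)."""
    return RATE_LAWS[select](S, V, Km)

v1 = a_shared_solver("UCTR", S=2.0, V=2.0, Km=0.5)  # 2*2/2.5 = 1.6
v2 = a_shared_solver("UMAR", S=2.0, V=2.0, Km=0.5)  # 2*4/2.5 = 3.2
```

In hardware, the cost of this flexibility is the multiplexer logic plus a pipeline pitch dictated by the slowest covered function, which is exactly the degradation discussed above.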

5. CONCLUSION

In this paper, two arithmetic scheduling policies for solver cores of biochemical rate law functions were presented, and the tradeoffs between performance and hardware cost were analyzed. By stretching the critical path operations and sharing arithmetic units, the hardware resources of the solvers were reduced by 33.8% on average, with up to 11.5% throughput degradation. In addition, by merging 2 to 4 rate law functions into a shared solver core, hardware resources were further reduced while alleviating the performance degradation.

6. REFERENCES

[1] B. A. Barshop et al., "Analysis of numerical methods for computer simulation of kinetic processes: Development of KINSIM - a flexible, portable system," Analytical Biochemistry, vol. 130, no. 8, pp. 134-145, Aug. 1983.
[2] P. Mendes, "Gepasi: a software package for modeling the dynamics, steady states and control of biochemical and other systems," Computer Applications in the Biosciences, vol. 9, no. 5, pp. 563-571, Oct. 1993.
[3] I. Goryanin et al., "Mathematical simulation and analysis of cellular metabolism and regulation," Bioinformatics, vol. 15, no. 9, pp. 749-758, Sept. 1999.
[4] M. Tomita et al., "E-CELL: software environment for whole-cell simulation," Bioinformatics, vol. 15, no. 1, pp. 78-84, Jan. 1999.
[5] I. I. Moraru et al., "The virtual cell: an integrated modeling environment for experimental and computational cell biology," Annals of the New York Academy of Sciences, vol. 971, pp. 595-596, Sept. 2002.
[6] Y. Osana et al., "An FPGA-based multi-model simulation method for biochemical systems," Proc. Reconfigurable Architectures Workshop, vol. 105, no. 42, pp. 49-53, May 2005.
[7] M. Hucka et al., "The systems biology markup language (SBML): A medium for representation and exchange of biochemical network models," Bioinformatics, vol. 19, no. 4, pp. 524-531, Mar. 2003.
[8] A. Funahashi, N. Tanimura, M. Morohashi, and H. Kitano, "CellDesigner: a process diagram editor for gene-regulatory and biochemical networks," BIOSILICO, vol. 1, no. 5, pp. 159-162, Nov. 2003.
