MULTIDIMENSIONAL DYNAMIC PROGRAMMING FOR HOMOLOGY SEARCH

Shingo Masuno, Tsutomu Maruyama, Yoshiki Yamaguchi
Systems and Information Engineering, University of Tsukuba
1-1-1 Ten-ou-dai Tsukuba Ibaraki, 305-8573, Japan
[email protected]

Akihiko Konagaya
RIKEN Genomic Sciences Center
1-7-22 Suehiro-cho Tsurumi-ku Yokohama Kanagawa, 230-0045, Japan

ABSTRACT

Alignment problems in computational biology have attracted much attention recently because of the rapid growth of sequence databases. By computing alignments, we can understand the similarity among sequences. Many systems for alignment have been proposed to date, but most of them are designed for two-dimensional alignment (alignment between two sequences). In this paper, we describe a compact system, consisting of an off-the-shelf FPGA board and a host computer, for alignment in more than three dimensions based on dynamic programming. In our approach, high performance is achieved (1) by configuring the optimal circuit for the alignment of each dimension, and (2) by a two phase search in each dimension using reconfiguration. In order to realize multidimensional search with a common architecture, two-dimensional dynamic programming is repeated along the other dimensions. With this approach, we can minimize the size of the units for alignment and achieve high parallelism. Our system with one XC2V6000 achieves about a 300-fold speedup over a single Intel Pentium 4 2GHz processor for four-dimensional alignment, and a 100-fold speedup for five-dimensional alignment.

1. INTRODUCTION

Alignment problems in computational biology, namely homology search, have attracted much attention recently because of the rapid growth of sequence databases [1, 2, 3]. By computing alignments, we can investigate the similarity among sequences. Dynamic programming is a technique to find the optimal alignment among sequences. In dynamic programming, all causal connections to the final result are stored, and back-traced in order to obtain the optimal alignment. Its computational complexity, however, is very large (on the order of $N^k$ to compare $k$ sequences of length $N$), and it is not realistic to use algorithms based on dynamic programming even for alignment between two sequences on desktop computers. In order to reduce the computation time, many heuristic algorithms [4, 5, 6] and hardware systems [7, 8, 9, 10, 11, 12, 13] have been proposed. Most of them, however, are designed for two-dimensional alignment (alignment between two sequences) because of the difficulty of calculating an alignment among more than two sequences under limited hardware resources.

We have already proposed a computational method for three-dimensional alignment using reconfiguration and co-processing by FPGA and software [14]. In this paper, we describe a system for alignment in more than three dimensions. Our system consists of only an off-the-shelf FPGA board and a host computer. In our approach, high performance is achieved by configuring the optimal circuit for the alignment of each dimension, and by a two phase search in each dimension using reconfiguration. In order to realize multidimensional search with a common architecture, two-dimensional dynamic programming is repeated along the other dimensions. With this repetition approach, we can minimize the size of the units for alignment and achieve high parallelism.

The two phase search makes it possible to achieve high performance under limited memory bandwidth. In the first phase, only the final result (the score of the optimal alignment) is calculated. In this phase, no causal connections to the final result are output, and the system achieves its maximum performance. Then, if the first phase gives a good score, the circuit for the second phase is configured on the FPGA, and all causal connections are output to obtain the optimal alignment. The performance of this phase is limited by the memory bandwidth of the FPGA board.

0-7803-9362-7/05/$20.00 ©2005 IEEE
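The two phase search outlined in the introduction can be illustrated in software for the two-dimensional case. The following is a minimal Python sketch and not the FPGA design: the scoring values (match 2, mismatch -1, gap -2), the acceptance threshold, and all function names are assumptions made for illustration. Phase one keeps a single row of scores and returns only the score of the last node; phase two runs only when that score reaches the threshold, and stores the chosen path at every node so that the alignment can be back-traced.

```python
# Illustrative two phase search in two dimensions (software sketch only).
# Scoring values and all names are assumptions, not taken from the paper.
MATCH, MISMATCH, GAP = 2, -1, -2


def ms(p, q):
    """Score for matching two elements (stand-in for a score matrix)."""
    return MATCH if p == q else MISMATCH


def phase1_score(a, b):
    """Phase 1: compute only the score of the last node, keeping one row."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + ms(a[i - 1], b[j - 1]),
                           prev[j] + GAP,
                           cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


def phase2_align(a, b):
    """Phase 2: store the chosen path at every node, then back-trace."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    path = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], path[i][0] = i * GAP, (1, 0)
    for j in range(1, m + 1):
        score[0][j], path[0][j] = j * GAP, (0, 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (score[i - 1][j - 1] + ms(a[i - 1], b[j - 1]), (1, 1)),
                (score[i - 1][j] + GAP, (1, 0)),
                (score[i][j - 1] + GAP, (0, 1)),
            ]
            score[i][j], path[i][j] = max(candidates, key=lambda c: c[0])
    # Back-trace from the last node to the start node.
    i, j, row_a, row_b = n, m, [], []
    while (i, j) != (0, 0):
        di, dj = path[i][j]
        row_a.append(a[i - 1] if di else "-")
        row_b.append(b[j - 1] if dj else "-")
        i, j = i - di, j - dj
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


def two_phase_search(a, b, threshold):
    """Run phase 2 only when the phase-1 score reaches the threshold."""
    s = phase1_score(a, b)
    return (s, None) if s < threshold else (s, phase2_align(a, b)[1:])
```

In this sketch phase one needs memory proportional to one sequence length, while phase two stores one path entry per node, which mirrors the bandwidth difference between the two phases described above.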

2. DYNAMIC PROGRAMMING FOR HOMOLOGY SEARCH

In dynamic programming for homology search, sequences are compared by inserting gaps at extra cost. Figure 1 shows an example of the alignment of two sequences by dynamic programming (two-dimensional). In Figure 1(A), the score of each node in the search space is calculated using the equations in Figure 2. The scores for matching two elements (Ms) and for inserting gaps (GC) are given by score matrices [15, 16]. At each node there are three candidates for its score in two-dimensional search (from the upper-left node, the upper node, and the left node), and the maximum of them is chosen. The paths which give the maximum values are stored, and after calculating the scores of all nodes, the paths are back-traced from the last node to the start node to obtain the alignment of the two sequences (Figure 1(B)). We do not need to store the scores of the nodes, because the scores of the nodes on the obtained alignment can be recalculated easily (its computation order is at most the number of nodes on the alignment).

Fig. 1. Two Dimensional Dynamic Programming. (A) Computation of scores of each node. (B) Backtracing from Last Node.

Fig. 2. Equations to calculate Scores

Two-Dimensional Search:
\[
\mathrm{score}_{x,y} = \max \left\{
\begin{array}{l}
\mathrm{score}_{x-1,y-1} + \mathrm{Ms}_{a[x],b[y]} \\
\mathrm{score}_{x-1,y} + \mathrm{GC} \\
\mathrm{score}_{x,y-1} + \mathrm{GC}
\end{array}
\right.
\]

Three-Dimensional Search:
\[
\mathrm{score}_{x,y,z} = \max \left\{
\begin{array}{l}
\mathrm{score}_{x-1,y-1,z-1} + \mathrm{Ms}_{a[x],b[y],c[z]} \\
\mathrm{score}_{x-1,y-1,z} + \mathrm{Ms}_{a[x],b[y]} + \mathrm{GC} \\
\mathrm{score}_{x-1,y,z-1} + \mathrm{Ms}_{a[x],c[z]} + \mathrm{GC} \\
\mathrm{score}_{x,y-1,z-1} + \mathrm{Ms}_{b[y],c[z]} + \mathrm{GC} \\
\mathrm{score}_{x-1,y,z} + \mathrm{GC} \\
\mathrm{score}_{x,y-1,z} + \mathrm{GC} \\
\mathrm{score}_{x,y,z-1} + \mathrm{GC}
\end{array}
\right.
\]

To obtain an alignment of more than two sequences, the same procedure is applied to the sequences. The search space of $k$-dimensional dynamic programming becomes $N^k$ (when the sequences have the same length $N$). As indicated by the equations in Figure 2,

1. the number of candidates for the score of each node is $2^k - 1$ in $k$-dimensional dynamic programming, and
2. the maximum size of the score matrices is $m^k$ ($m$ is the number of types of elements in the sequences), which becomes very large for larger $k$.

In dynamic programming, we need to store the paths to each node in order to back-trace. The total size of the paths becomes (the number of nodes in the search space) $\times$ (the data bit width of a path), which becomes very large for larger $k$. However, if the given sequences are not apparently similar, we do not need the alignment. The similarity of the sequences can be judged using the score of the last node.

Figure 3 shows the maximum parallelism. As shown in Figure 3, nodes on a diagonal line (plane) can be processed in parallel. The maximum parallelism in $k$-dimensional search is the product of the sizes of $k-1$ sequences (in the maximum case).

Fig. 3. Parallelism in Dynamic Programming. (A) Parallel Processing of two dimensional dynamic programming. (B) Parallel Processing of three dimensional dynamic programming.

3. MULTIDIMENSIONAL DYNAMIC PROGRAMMING ON FPGA

Our target system consists of one off-the-shelf FPGA board with one FPGA and a host computer. The problems in implementing multidimensional dynamic programming on a system with one FPGA are

1. the size of the score matrices,
2. the memory bandwidth needed to output all paths, and
3. how to achieve more parallelism under limited hardware resources and memory bandwidth.

Fig. 4. Parallel Processing of Two-Dimensional Dynamic Programming

3.1. Repetition of Two-Dimensional Dynamic Programming

The maximum size of the score matrices becomes $m^k$ as described above. For protein sequences, $m$ is 24, and the size becomes 13.5K words when $k = 3$, and 324K words when $k = 4$. In three-dimensional dynamic programming, we could implement a sufficient number of score matrices on the FPGA by dividing them into two kinds of sub-matrices and minimizing the size of each of them [14]. However, this method does not work well for $k$ larger than three. Therefore, we decided to calculate $k$-dimensional dynamic programming by repeating two-dimensional dynamic programming along the other dimensions. For example, in four-dimensional dynamic programming (suppose that a four-dimensional score matrix Ms is used), we repeat the following procedure.

1. Calculate the alignment between two sequences while the positions in the other two sequences are held constant.
2. Increment the position in one of the other two sequences, and then in the other (one or both of the positions are changed).
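The repetition procedure above can be sketched in software as follows. This is a minimal Python illustration under stated assumptions, not the circuit: a sum-of-pairs column score stands in for the four-dimensional score matrix, the scoring values are arbitrary, and the names `plane_2d` and `align_score_4d` are hypothetical. Only the score of the last node is computed (no paths are stored). The inner function sweeps the (x, y) plane with z and w held constant, and the outer loops increment z and then w.

```python
from itertools import product

GAP = -2  # assumed gap cost


def pair_score(p, q):
    """Assumed two-element score; a gap against a gap costs nothing."""
    if p == "-" and q == "-":
        return 0
    if p == "-" or q == "-":
        return GAP
    return 2 if p == q else -1


def column_score(column):
    """Sum-of-pairs score of one alignment column (an assumption)."""
    return sum(pair_score(column[i], column[j])
               for i in range(len(column))
               for j in range(i + 1, len(column)))


# The 2^4 - 1 = 15 predecessor offsets of a node in four dimensions.
MOVES = [m for m in product((0, 1), repeat=4) if any(m)]


def plane_2d(score, seqs, z, w):
    """Two-dimensional DP over (x, y) with z and w held constant."""
    a, b, _, _ = seqs
    for x in range(len(a) + 1):
        for y in range(len(b) + 1):
            node = (x, y, z, w)
            if node == (0, 0, 0, 0):
                score[node] = 0
                continue
            best = None
            for move in MOVES:
                prev = tuple(n - m for n, m in zip(node, move))
                if min(prev) < 0:
                    continue  # predecessor lies outside the search space
                column = [s[n - 1] if m else "-"
                          for s, n, m in zip(seqs, node, move)]
                cand = score[prev] + column_score(column)
                best = cand if best is None else max(best, cand)
            score[node] = best


def align_score_4d(a, b, c, d):
    """Score of the last node, repeating the 2D DP along z and then w."""
    seqs, score = (a, b, c, d), {}
    for w in range(len(d) + 1):
        for z in range(len(c) + 1):
            plane_2d(score, seqs, z, w)
    return score[(len(a), len(b), len(c), len(d))]
```

Because every predecessor of a node has coordinates that are no larger and at least one that is smaller, visiting the planes in increasing z and w guarantees that all predecessor scores are available when a plane is computed.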

With this procedure, we need only a part of the four-dimensional matrix, which is a two-dimensional score matrix, in the first step of the procedure. However, we need a different two-dimensional score matrix whenever the position in either of the other two sequences is changed. In our implementation, two-dimensional score matrices are implemented using dual-port RAMs in the FPGA, and the score matrices for the next positions (namely the next parts of the four-dimensional score matrix) are downloaded from external RAMs on the FPGA board in parallel with the computation of scores. The number of score matrices which are downloaded during the computation grows with $k$. Thus, at a certain value of $k$, the downloading time of the next score matrices exceeds the time for the computation of the two-dimensional dynamic programming, and becomes the bottleneck of this approach.

In the following discussion, suppose that $X$, $Y$, and $Z$ are the lengths of the sequences placed along the x, y, and z axes, and that $W_x$, $W_y$, and $W_z$ are the parts of the sequences which can be processed continuously without extra input/output for boundary data. For example, the maximum parallelism in two-dimensional dynamic programming is $X$, but because of limited hardware resources, $X$ nodes cannot be processed in parallel. Therefore, $X$ is divided by $W_x$, and $W_x$ nodes are processed in parallel. After processing the first $W_x \times Y$ nodes, the next $W_x \times Y$ nodes are processed (Figure 4).

Figure 5 shows how three-dimensional dynamic programming is executed by the repetition of the two-dimensional dynamic programming. In Figure 5, the processing of the $W_x \times W_y$ nodes in the scan window (gray square in the figure) is scanned along the z axis (the black arrow shows the scan line). When the scan window reaches the end of the scan line, it is shifted along one of the other two axes by the width of the window, and is scanned along the z axis again. After all positions along that axis have been processed, the scan window is shifted down along the remaining axis by the width of the window, and the same procedure is repeated.

Fig. 5. Three-Dimensional Dynamic Programming

Figure 6 shows the data input/output for three-dimensional dynamic programming. In Figure 6(A), the two dark gray rectangles show the inputs to the scan window (light gray square), and the two rectangles with slanted lines show the outputs of the scan window. The outputs are stored, and used for the computation of other scan lines. In Figure 6(B), in order to calculate the scores in the current scan window, the data in the previous scan window are also necessary (those data are not necessary in Figure 6(A), because the scan window is placed at the boundary of the search space, and boundary conditions are given instead of those data).

Fig. 6. Boundary Data for Three-Dimensional Dynamic Programming

Figure 7 shows the scan cube for four-dimensional dynamic programming (a cube is used instead of the window). The processing of the nodes in the cube (its size is $W_x \times W_y \times W_z$) is scanned along the fourth axis, changing the positions of the scan line. In order to calculate the scores of the nodes in the cube, the scan window in the cube (light gray square in Figure 7(A)) is scanned along the z axis. Suppose that the current cube is at position $C_k$ on the fourth axis. In order to start the calculation of the scan window (Figure 7(B)(1)), we need, as boundary data, the scores in the dark gray parts and the scores in the previous cube along the fourth axis, which are temporarily held on the FPGA (not shown in the figure). Among these data, the two dark gray rectangles in the figure can be obtained while calculating the scores of the nodes in the scan window. However, the data in the dark gray square (the last scan window in the previous scan cube along the z axis) need to be loaded before starting the calculation, because the size of the data is large, and they cannot be loaded in parallel with the computation. The outputs of the scan window are the two rectangles with slanted lines. When the scan window is inside the cube (Figure 7(B)(2)), the scores calculated in the previous scan window are held on the FPGA, and used for the calculation of the current scan window (the scores in the previous cube along the fourth axis, which are held on the FPGA, are also used). When the scan window reaches the end of the cube, the scores in the current window are stored for later processing (Figure 7(B)(3)). In this processing of the scan cube, there are two types of data:

1. data which can be loaded and output in parallel with the computation of the scores of the nodes in the scan window (two dark rectangles in Figure 7(B)(1,2,3)), and
2. data which have to be loaded before the computation (dark gray square in Figure 7(B)(1)) and which have to be stored after the computation (dark gray square in Figure 7(B)(3)).

3.2. Two Phase Search

In order to store all paths, we need to output $W_x \times W_y \times k$ bits to the external RAMs on the FPGA board at each clock cycle, because the scores of the $W_x \times W_y$ nodes in the scan window are calculated in parallel, and the data width of each path is $k$ bits. The maximum number of external memory banks on off-the-shelf FPGA boards is eight (36-bit width) as far as we know. Therefore, the parallelism ($W_x \times W_y$) is limited not by the amount of hardware resources, but by the memory bandwidth. The parallelism becomes 64 when $k$ is 4, even if all memory banks are used to store the paths (in practice, the memory banks cannot all be used only for the paths). Therefore, in our approach, the search consists of two phases. In the first phase, a circuit to obtain only the score of the last node is configured on the FPGA. With this circuit, we can achieve the maximum parallelism allowed by the amount of hardware resources. Then, users judge the similarity among the sequences, and if the alignment is necessary, another circuit for the second phase is configured on the FPGA, and all paths are output into the external memory banks on the FPGA board, though this takes more time than the first phase. With this two phase search, we can avoid computing alignments that are not useful.

3.3. Maximum Parallelism under Limited Hardware Resources

Suppose that

1. the data width of each element in the score matrices is 8 bits,
2. the data width of the score of each node is 18 bits, and the width of the external memory banks on the FPGA board is 36 bits, and
3. the external memory banks run at the same speed as the circuit on the FPGA.

Then, the total clock cycles of our approach are estimated as follows. In the following equations, the first term chooses the maximum of the computation time of the scan window and the time to update the score matrices, which is executed in parallel with the computation. In the other terms, the constant values show the time to download score matrices, and the other values show the time to input/output boundary data (some matrices cannot be loaded in parallel with the computation, and we need to download them when the positions along the other dimensions are changed).

Three-Dimensional: [...]
Four-Dimensional: [...]
Five-Dimensional: [...]
Six-Dimensional: [...]

In the equations above, the sizes $W_x$, $W_y$, ... are the parameters which decide the system performance. In order to obtain the maximum performance, we need to fix our target system first: the size of the FPGA and the number of memory banks on the FPGA board. We assume that the FPGA board has eight independent memory banks (our target board is the ADM-XRC-II by Alpha Data) and one XC2VP100 (we chose the XC2VP100 to estimate performance because it is one of the largest FPGAs and is widely available). Table 1 shows the values of the parameters which give the best performance for each search phase of each dimension (only the score of the last node is output in the first phase, and all paths are output in the second phase). In order to obtain those values, we estimated the circuit size based on the circuits for four- and five-dimensional homology search which we implemented on the XC2V6000 to evaluate actual performance (the size of a unit to calculate the score of one node, the size of the units to control data input/output, and the size of the score matrices implemented by dual-port RAMs (two-dimensional score matrices are implemented using block RAMs, and one-dimensional score matrices using distributed RAMs) are used to estimate the circuit size on the XC2VP100). As shown

in Table 1, the values are different for all cases, which means that we need a different configuration (a different circuit) for each dimension. Table 2 shows the estimated performance (total number of clock cycles) of the system described above, when the length of all sequences is 256 and only the score of the last node is output (the first phase). As shown in Table 2, when the dimension is larger than three, the downloading time for the score matrices which are downloaded in parallel with the computation of the scores in the search window becomes longer than the computation time, though they do not differ much if the dimension is less than six. The time to download score matrices and input/output boundary data, which cannot be processed in parallel with the computation of the scores, begins to dominate the total computation time when the dimension is larger than four. By using faster SRAMs as the external RAMs on the FPGA board, such as DDR SRAMs, we can halve the time to download score matrices and input/output boundary data (latency can be ignored in our implementation). Then, we can process up to five dimensions without being limited by the input/output speed of the FPGA.

Fig. 7. Four-Dimensional Dynamic Programming. (A) Scan Cube. (B) Boundary Data Given to the Current Search Cube and Stored for other Cubes.

Table 1. Values of the Parameters to Realize the Best Performance (columns: Dim., Parameters, Output Only the Last Score, Output All Paths)

Table 2. Estimated Computation Time (columns: Dim., calculation of scores, downloading score matrices, downloading score matrices & boundary data)

4. RESULTS

The details of the circuit for two-dimensional dynamic programming were described in [13]. The major differences for multidimensional dynamic programming are the number of adders and selectors (to calculate the scores of the candidates and choose one of them), the block RAMs to hold the scores in the previous scan window and cube on the FPGA, and the controllers to download score matrices and store/restore boundary data. We have implemented two circuits (four-dimensional and five-dimensional homology search) on an off-the-shelf PCI board (ADM-XRC-II by Alpha Data) with one Xilinx XC2V6000. With one XC2V6000, the number of block RAMs becomes the bottleneck (only 144, compared with 444 in the XC2VP100), and the performance of the first and second circuits becomes the same. [...] and [...] are [...] and [...], respectively. Table 3 shows the hardware usage and the operational frequency of the two circuits.

Table 3. Size and Operational Frequency of the Circuits

Dim.   Hardware Usage        Frequency
       Slices     BRAMs
4      64%        98.6%      36.6 MHz
5      73%        100%       31.0 MHz

The operational frequency of the five-dimensional circuit is lower, because we need to choose the maximum score from 31 candidates, compared with 15 in the four-dimensional circuit.
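The candidate counts (15 and 31) and the score matrix sizes quoted earlier (13.5K and 324K words) follow directly from $2^k - 1$ and $m^k$. The short Python check below reproduces them; the function names are hypothetical and are used only for this illustration.

```python
from itertools import product


def predecessor_offsets(k):
    """All non-zero 0/1 offsets; each one is a candidate path into a node."""
    return [o for o in product((0, 1), repeat=k) if any(o)]


def score_matrix_words(m, k):
    """Entries in a full k-dimensional score matrix over m element types."""
    return m ** k


for k in (3, 4, 5):
    candidates = len(predecessor_offsets(k))
    kwords = score_matrix_words(24, k) / 1024
    print(f"k={k}: {candidates} candidates, {kwords}K words")
```

Each additional dimension roughly doubles the number of adders and the width of the selector needed per node, which is consistent with the lower clock frequency of the five-dimensional circuit.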

Table 4 shows the speedup of the circuits compared with a software program on an Intel Pentium 4 2GHz processor. The drastic speedup comes from the high degree of parallelism in the computation: parallel processing by the units, parallel calculation of the scores of the $2^k - 1$ candidates in each unit and the choice of one of them, parallel accesses to many score matrices, and so on.

Table 4. Performance Compared with Software

Dim.   Length of Sequences   Execution Time (sec)       Speedup
                             Hardware     Software
4      64                    0.0508       11.5          226
4      128                   0.817        228           279
4      256                   13.1         3900          298
5      64                    11.3         1180          104

5. CONCLUSIONS

In this paper, we described a system for alignment in more than three dimensions using dynamic programming. In our approach, high performance is achieved by configuring the optimal circuit for the alignment of each dimension, and by a two phase search in each dimension using reconfiguration. In order to realize multidimensional search with a common architecture, multidimensional dynamic programming is realized by repeating two-dimensional search along the other dimensions. With this repetition approach, we can minimize the size of the units for alignment and achieve high parallelism. The circuits implemented on the XC2V6000 achieved speedups of about 100 to 300 times over software. However, many points still need to be improved. First, the operational frequencies are low. We have proposed a technique (multi-thread execution) to improve the frequency of dynamic programming [13]. The technique was not implemented in our circuits, because of the complexity of designing circuits for multidimensional dynamic programming that follow the technique. As for the second phase, according to our estimation, circuits which have the same number of units as the first phase, but which stop the calculation while storing the paths using all external memory banks, will achieve higher performance.

6. REFERENCES

[1] National Center for Biotechnology Information (NCBI), “NCBI-GenBank Flat File Release 137.0”, http://www.ncbi.nlm.nih.gov/, Aug 2003.

[2] European Molecular Biology Laboratory (EMBL), http://www.ebi.ac.uk/embl/.

[3] DNA Data Bank of Japan (DDBJ), http://www.ddbj.nig.ac.jp/.

[4] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman, “Basic Local Alignment Search Tool”, Journal of Molecular Biology, Vol.215, Issue 3, pp.403-410, 1990.

[5] Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Research, Vol.25, No.17, pp.3389-3402, 1997.

[6] William R. Pearson and David J. Lipman, “Improved tools for biological sequence comparison”, Proceedings of the National Academy of Sciences of the USA, Vol.85, pp.2444-2448, 1988.

[7] PARACEL, “GeneMatcher2”, http://www.paracel.com/.

[8] Dominique Lavenier, “SAMBA Systolic Accelerators For Molecular Biological Applications”, Technical Report RR-2845, 1996.

[9] C. Thomas White, Raj K. Singh, Peter B. Reintjes, Jordan Lampe, Bruce W. Erickson, Wayne D. Dettloff, Vernon L. Chi, and Stephen F. Altschul, “BioSCAN: A VLSI-Based System for Biosequence Analysis”, IEEE International Conference on Computer Design: VLSI in Computer & Processors, Vol.147, pp.504-509, 1991.

[10] TimeLogic Corporation, “Decypher bioinformatics acceleration solution”, http://www.timelogic.com/products.html, 2002.

[11] Kiran Puttegowda, William Worek, Nicholas Pappas, Anusha Dandapani, and Peter Athanas, “A Run-Time Reconfigurable System for Gene-Sequence Searching”, International VLSI Design Conference, 2003.

[12] Steven A. Guccione and Eric Keller, “Gene Matching Using JBits”, International Conference on Field-Programmable Logic and Applications, pp.1168-1171, 2002.

[13] Yoshiki Yamaguchi, Yosuke Miyajima, Tsutomu Maruyama, and Akihiko Konagaya, “High Speed Homology Search with Run-time Reconfiguration”, International Conference on Field-Programmable Logic and Applications, pp.281-291, 2002.

[14] Yoshiki Yamaguchi, Tsutomu Maruyama, and Akihiko Konagaya, “Three-Dimensional Dynamic Programming for Homology Search”, International Conference on Field-Programmable Logic and Applications, pp.505-514, 2004.

[15] S. Henikoff and J. G. Henikoff, “Amino Acid Substitution Matrices from Protein Blocks”, Proc. Natl. Acad. Sci. 89, pp.10915-10919, 1992.

[16] D. T. Jones et al., “The Rapid Generation of Mutation Data Matrices from Protein Sequences”, CABIOS 8, pp.275-282, 1992.