AN APPROACH TO SCALABLE MOLECULAR DYNAMICS SIMULATION USING SUPERCOMPUTING ADAPTIVE PROCESSING ELEMENTS

Luis E. Cordova, Duncan A. Buell
Department of Computer Science and Engineering
University of South Carolina
Columbia, SC 29208
email: {cordoval, buell}@engr.sc.edu

ABSTRACT

We implement an entire molecular dynamics application in floating point on reconfigurable hardware and report its performance, achieving sustainable speedup and scaling with a novel technique based on adaptive processing elements.

1. INTRODUCTION

Molecular Dynamics (MD) simulations evolve the particles of a system in time by integrating their equations of motion. MD is a numerical simulation method that can use different kernels to compute different types of interactions: many-body (MB), variable-charge (VC), tight-binding (TB), and density-functional theory (DFT). The Lennard-Jones force field is a particular case of the MB kernel. Some kernels are more computationally expensive than others, but the generic MD simulation, regardless of the kernel used, is parallelized either through data decomposition (data replication, atom decomposition, and force decomposition) or through domain decomposition methods (DDMs) [1].

In decomposition methods, the interactions of particles are distributed across processing elements (PEs), provided that the interaction between the i-th and j-th particles is null when the distance between them exceeds a threshold. Decomposition is achieved using lists of neighbor particles, a.k.a. cell-linked lists. This reduces the computation costs from O(N^2) to O(N). However, despite the high efficiency of compilers for the von Neumann processor model, the interaction computation at the PE level is performed serially and therefore poorly, and the advantage gained via cell-linked lists is lost to time spent in complex load balancing, inefficient memory caching and data buffering, and increased inter-processor communication caused by many large neighbor lists and poor data locality [1].
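To make the cutoff and cell-linked-list argument above concrete, the following Python sketch computes Lennard-Jones forces twice: once by brute force over all pairs, and once by examining only the 27 cells around each particle. It also adds a velocity Verlet step as one common way to integrate the equations of motion. It is an illustration only, not the authors' implementation: the function names, the reduced units (eps = sigma = mass = 1), the open (non-periodic) cubic box, and the jittered-lattice test configuration are all assumptions made for this example.

```python
import math
import random


def add_pair_force(forces, pos, i, j, rc2, eps=1.0, sigma=1.0):
    """Add the Lennard-Jones force between i and j if they are within the cutoff."""
    d = [pos[i][k] - pos[j][k] for k in range(3)]
    r2 = d[0] * d[0] + d[1] * d[1] + d[2] * d[2]
    if r2 >= rc2:
        return  # interaction treated as null beyond the cutoff
    s6 = (sigma * sigma / r2) ** 3
    # From U(r) = 4*eps*((sigma/r)^12 - (sigma/r)^6): F_i = coef * (r_i - r_j)
    coef = 24.0 * eps * (2.0 * s6 * s6 - s6) / r2
    for k in range(3):
        forces[i][k] += coef * d[k]
        forces[j][k] -= coef * d[k]  # Newton's third law


def forces_brute(pos, rc):
    """Reference O(N^2) loop over all unordered pairs."""
    forces = [[0.0, 0.0, 0.0] for _ in pos]
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            add_pair_force(forces, pos, i, j, rc * rc)
    return forces


def build_cells(pos, box, rc):
    """Bin particle indices into cubic cells whose edge is at least rc."""
    n = max(1, int(box / rc))  # cells per side
    edge = box / n
    cells = {}
    for i, p in enumerate(pos):
        # Clamp so a particle that drifts outside [0, box] lands in a boundary cell.
        key = tuple(min(n - 1, max(0, math.floor(c / edge))) for c in p)
        cells.setdefault(key, []).append(i)
    return cells


def forces_cell_list(pos, box, rc):
    """Same forces, but only pairs in the same or adjacent cells are examined."""
    cells = build_cells(pos, box, rc)
    forces = [[0.0, 0.0, 0.0] for _ in pos]
    examined = 0
    for (cx, cy, cz), members in cells.items():
        for ox in (-1, 0, 1):
            for oy in (-1, 0, 1):
                for oz in (-1, 0, 1):
                    others = cells.get((cx + ox, cy + oy, cz + oz), ())
                    for i in members:
                        for j in others:
                            if j > i:  # visit each unordered pair once
                                add_pair_force(forces, pos, i, j, rc * rc)
                                examined += 1
    return forces, examined


def velocity_verlet_step(pos, vel, forces, dt, box, rc, mass=1.0):
    """Advance pos and vel in place by one step; return forces at the new positions."""
    for p, v, f in zip(pos, vel, forces):
        for k in range(3):
            v[k] += 0.5 * dt * f[k] / mass  # first half kick
            p[k] += dt * v[k]               # drift
    new_forces, _ = forces_cell_list(pos, box, rc)
    for v, f in zip(vel, new_forces):
        for k in range(3):
            v[k] += 0.5 * dt * f[k] / mass  # second half kick
    return new_forces


def jittered_lattice(n_side, spacing, jitter, seed=0):
    """Hypothetical test configuration: a cubic lattice with small random offsets."""
    rng = random.Random(seed)
    return [[(a + 0.5) * spacing + rng.uniform(-jitter, jitter) for a in (ix, iy, iz)]
            for ix in range(n_side) for iy in range(n_side) for iz in range(n_side)]
```

For example, with pos = jittered_lattice(6, 1.2, 0.1), box = 7.2 and rc = 1.5, both force routines should agree up to rounding, while forces_cell_list examines fewer pairs than the N(N-1)/2 visited by forces_brute. Binning into cells with an edge of at least rc is what guarantees that no interacting pair is missed: two particles in non-adjacent cells are always separated by more than the cutoff.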

0-7803-9362-7/05/$20.00 ©2005 IEEE

2. SUPERCOMPUTING ADAPTIVE PROCESSING ELEMENT

Here we introduce a novel scalable parallel technique to accelerate MD simulations on a reconfigurable supercomputer; for the first time, an MD application with floating-point arithmetic is deployed completely on reconfigurable hardware. We compare, in terms of communication costs and processing efficiency, our technique based on supercomputing adaptive processing elements (SAPEs) against standard microprocessor-based techniques. Each SAPE uses the multi-adaptive processor (MAP) of the reconfigurable computer and is able to implement, in a single MAP, data paths and a data access mechanism that are independent of the MD kernel used, reducing the computation costs down to [...]. SAPEs model the most optimized approach to intelligent caching and operate as fully pipelined compute engines fed with data at each clock cycle. The memory access mechanism may look simple but is non-trivial; it is solved using a technique that groups common read (write) operations from (to) common memory locations.

We ported the MD application entirely to the SRC-6 supercomputer [2], leaving the host CPU with only file I/O and the initial and final data movements. The first of the two user FPGAs of the MAP does the [...] part of the computation, looping over the steps and advancing the simulation over time; it contains the velocity Verlet update module, which updates positions and velocities at each step of simulated time. The second chip does the force calculation of an N-particle subsystem. The algorithm is optimized through transformations [3] to explore the space of high-performance architectures. The techniques employed include effective use of on-chip block RAM, pipelining and scheduling of computation loops, tiling, threading, and resource factoring and sharing, among others.

Preliminary results go beyond previous work in (1) the performance record achieved, (2) more effective metrics than the ones reported regarding throughput, such as results per second or seconds per step, and (3) the fact that we know of no previous work implementing the entire simulator system on reconfigurable hardware. We therefore provide application speedup using a reconfigurable supercomputer without leaving a standard ANSI C programming flow and with no reliance on hardware description languages. Implementations based on SAPEs thus demonstrate a reduction of inter-processor communication that is not achieved by other coarse-grained schemes, such as multithreading, distributed computing, or multicore technologies.

In addition, we explore SAPE-based acceleration of more complex ab initio density functional theory (DFT++) codes. In particular, we study more useful implementations in terms of ab initio Car-Parrinello Molecular Dynamics (CPMD), a DFT type of MD kernel, where the domain decomposition makes heavy use of linear algebra tensor operations and 3-D Fast Fourier Transforms to move data back and forth between the real and reciprocal spaces [4].

Finally, a methodology is introduced for high-performance and low-power optimizations. Low-power techniques for high-performance reconfigurable computing are a new field, and some initial techniques still rely on special features of the underlying hardware such as block RAM and embedded multipliers. We propose a modification of these methods that evaluates the overall power budget and merges configuration bitstreams to achieve computation bounded by power constraints.

3. REFERENCES

[1] T. Straatsma and J. McCammon, “Load balancing of molecular dynamics simulation with NWChem,” IBM Systems Journal, vol. 40, no. 2, pp. 328–341, 2001.
[2] “SRC-6e C-programming environment guide,” SRC Computers Inc., 2004.
[3] W. Bohm and J. Hammes, “A transformational approach to high performance embedded computing,” in Proc. HPEC, vol. 1, Sept. 2004.
[4] R. Vadali, L. Kale, G. Martyna, and M. Tuckerman, “Scalable parallelization of ab initio molecular dynamics,” Technical Report, 2003.
