Research Report: GPGPU and Matlab

Florent Bouchez
Indian Institute of Science, Bangalore, India
[email protected]

January 8, 2010

Abstract

GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). Its goal is to use GPUs, which are traditionally devoted to graphics computations, for High Performance Computing (HPC), in particular for programs with massive independent parallelism. This report presents my investigations into accelerating a high-level language, Matlab, using GPUs. I introduce existing toolboxes and explain why we need to build a better framework for accelerating Matlab. I also consider real-life examples that would greatly benefit from acceleration, how I managed to accelerate them, and how these steps could be reproduced in an automated framework. Finally, I present some more ideas that I think are worth keeping in mind to improve performance in custom GPU kernels.

Contents

1 Why accelerating Matlab?
2 Existing solutions to accelerate Matlab
  2.1 CUDA with MEX files
  2.2 GPUlib
  2.3 GPUmat
  2.4 Jacket
3 Limitations in existing acceleration solutions
4 Combining operations into one kernel
5 Improving Matlab acceleration
  5.1 Selecting the code to accelerate
6 Accelerating the code
  6.1 The filter function
  6.2 The objgrad function
7 Future improvements
8 Conclusion
A Using Matlab with GPU on ccx09
B Matlab codes

With the advance of the gaming industry, Graphics Processing Units (GPUs) are now capable co-processors that can perform billions of instructions per second, relying mainly on completely independent processors executing the same instruction at the same time. Let us compare a recent Central Processing Unit (CPU) and a recent GPU:

                        CPU                  GPU
                        Intel Core i7        GeForce GTX 295
  Gflops                70                   1800
  Price                 $900                 $500
  Price per Gflop       $12.9                $0.28
  Number of "cores"     4                    480

It is clear that if one can manage to harness the full power of the GPU, it is possible to execute computations much faster than with a regular sequential CPU. However, while the different cores of a CPU can perform completely different computations at the same time, all processing units of a GPU run the same program at the same time, only on different data sets. GPUs are indeed massively parallel architectures, and are good at executing massively data-independent programs. Finding parallelism in a program has been a challenge for at least the past twenty years of research in parallelism, and still is one today. But with the arrival of GPU chips capable of executing many instructions in parallel, available to average users, it becomes even more important to be able to harness this power so that the end-users of computers can benefit from it, and not only an elite of researchers. In my opinion, languages should evolve, and low-level languages like C are not well suited for programming complex parallel tasks. It is necessary to give programmers new tools to effectively develop parallel programs while reducing the need for a full comprehension of parallelism and a good knowledge of the underlying architecture.

Matlab¹ is a high-level language with a powerful matrix-based system. It is often used for easy prototyping in early development stages because it relieves the programmer of many details like memory management and has an impressive library of built-in matrix functions. Hence, it is a useful tool for many simulation domains like fluid mechanics, biology or mechanical engineering, especially for people whose first domain of competence is not computer science and who are programmers by necessity. Unfortunately, all the user-friendliness of Matlab comes at a price: first, Matlab is not free and licences are expensive; second, its performance is well below that of an equivalent hand-programmed C code.

¹ http://www.mathworks.com

1 Why accelerating Matlab?

One particularity of Matlab is that it is a matrix-based high-level language, hence many parallel computations are easily expressed in the form of matrix operations. For example, the addition of two arrays A and B is simply expressed by the instruction C = A + B, which is much simpler than the corresponding C code:


    Matlab:
        C = A + B;

    C code:
        int i;
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];

2.2 GPUlib

>> gpuInit();
>> x = rand(N, 1, 'single');
>> xgpu = gpuZeros([N,1], 'single');    % allocate GPU memory for x
>> resgpu = gpuZeros([N,1], 'single');  % allocate GPU memory for result
>> gpuSet(xgpu, x);                     % transfer x to GPU memory
>> gpuGammaln(xgpu, resgpu);            % perform Gammaln on all elements of x
>> y = gpuGet(resgpu);                  % transfer the result back on the CPU
>> gpuFree(xgpu);                       % free GPU variables
>> gpuFree(resgpu);

² http://www.txcorp.com/products/GPULib
³ Interactive Data Language, http://www.ittvis.com/ProductServices/IDL.aspx

As one can see, the required conversion is tedious. It demands some level of memory management from the user and explicit decisions about what is computed on the CPU and what on the GPU. It is moreover limited to the list of functions supported by GPUlib, for which a GPU version has been implemented.

2.3 GPUmat

GPUmat⁴ is a freeware library and requires no fee for academic usage. It is in a more advanced state than GPUlib: memory management on the GPU is automated, even if data movements still need to be explicit. This is done through the addition of a new type of variable, GPUsingle, which makes the use of GPU memory transparent for users. Deletion of unused variables is taken care of by a garbage collector so as not to clobber the GPU memory. This is important, as data processed by GPUs tend to be very big, since computations are based on data-parallelism. The use of this new type allows Matlab functions to be overloaded so that either the GPUmat or the original Matlab function is called, depending on the type of the arguments. Let us see on an example how to compute the sine of every element of a 100 × 100 matrix:

⁴ http://www.gp-you.org

>> GPUstart;
>> A = single(rand(100,100));   % A is on CPU memory
>> g_A = GPUsingle(A);          % g_A is on GPU memory
>> B = sin(A);                  % compute sine on CPU
>> g_B = sin(g_A);              % compute sine on GPU
>> h_B = single(g_B);           % copy g_B to h_B in CPU memory
>> whos
  Name      Size        Bytes   Class       Attributes
  A         100x100     40000   single
  B         100x100     40000   single
  g_A       100x100        60   GPUsingle
  g_B       100x100        60   GPUsingle
  h_B       100x100     40000   single

>> max(max(abs(B - h_B)))       % verify B and h_B are identical
ans = 5.9605e-08

Since Matlab operates in floating point, it is not surprising to see small differences in the results, but a 10⁻⁷ precision for single-precision floats is completely acceptable. Note that this difference depends on hardware differences between the CPU and the GPU and on algorithmic differences between Matlab and CUDA. Please also note that we used single-precision float operations: Matlab works in double precision by default, but on our GPUs we only have access to single-precision operations. I discuss this problem later in Section 6.2.

This choice in GPUmat to rely on the type of variables to decide whether to compute on the CPU or the GPU is very nice for the users, especially when converting already existing Matlab code to utilize a GPU. If one is lucky, then a single conversion like A = GPUsingle(A); at the beginning of a computation, and the converse B = double(B); at the end, does the trick. This even works if a Matlab function internally performs GPU-compatible computations or function calls. However, this is often not that simple, because not all Matlab functions and constructs are supported, and we often get errors such as the following, because ismember does not have a GPU implementation:

>> ismember(1,g_A);
??? Undefined function or method 'any' for input arguments of type 'GPUsingle'.

This usually means that, at the points of the program where it happens, one has to manually copy back to the CPU the variables on which the unsupported function must be applied, execute it on the CPU, and copy the result back to the GPU. Such limitations often drastically impact the performance of the program, in many cases cancelling every benefit of using a GPU, since memory transfers between the CPU and the GPU can quickly become the bottleneck of a computation.

There is another possibility to improve acceleration using GPUmat. This toolbox also provides "User Modules," and a way to create one's own modules. This allows one, for example, to implement a Matlab function missing from GPUmat's catalogue. Of course, it again means having a good knowledge of CUDA and of NVIDIA's GPU cards. But there are many advantages compared to traditional plain MEX files (a sketch of that plain-MEX route is given at the end of this section), the main one being having access to GPUsingle variables. This allows one to interleave calls to one's own GPU kernels with GPUmat functions or Matlab CPU computations, without adding unnecessary memory transfers, since GPUsingle variables are resident in GPU memory. It also gives access to many practical functions used internally by GPUmat to create new Matlab variables and to retrieve information on the size, type, and dimensions of input arguments.

To summarize, I found that GPUmat is a very good tool for accelerating Matlab on GPUs, but only if:
• either the Matlab code is very simple and regular and uses common matrix-based operations;
• or one is willing to spend some time learning CUDA, the GPU's technical features and the User Module interface of GPUmat, and then implement customized kernels fitted to one's needs.
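To make the "plain MEX file" route mentioned above more concrete, here is a minimal sketch of how a hand-written CUDA kernel can be exposed to Matlab through a standard MEX gateway. This is not GPUmat's User Module API (which additionally gives direct access to GPUsingle variables); the kernel, its name and the launch parameters are illustrative, and error checking is omitted.

#include "mex.h"
#include <cuda_runtime.h>

/* Illustrative kernel: element-wise sine on a single-precision array. */
__global__ void sin_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinf(in[i]);
}

/* MEX gateway: y = my_gpu_sin(x), with x a single-precision array on the CPU. */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsSingle(prhs[0]))
        mexErrMsgTxt("expected one single-precision input");

    int n = (int)mxGetNumberOfElements(prhs[0]);
    const float *x = (const float *)mxGetData(prhs[0]);

    plhs[0] = mxCreateNumericMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]),
                                    mxSINGLE_CLASS, mxREAL);
    float *y = (float *)mxGetData(plhs[0]);

    float *d_x, *d_y;                                 /* explicit GPU memory management, */
    cudaMalloc((void **)&d_x, n * sizeof(float));     /* as in the GPUlib example above  */
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    sin_kernel<<<blocks, threads>>>(d_x, d_y, n);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);
}

Note that every call pays for the two host/device transfers, which is precisely what GPUmat's GPUsingle variables avoid.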

2.4 Jacket⁵

Jacket operates in a way very similar to GPUmat, so I will detail here only the main differences. A practical one is that this toolbox is very expensive even for academic usage, so all my experience with this library was gained during a mere 15-day free trial before my licence expired. Similarly to GPUmat, Jacket proposes a "Jacket Developer SDK Upgrade" which allows users to develop their own kernels within the Jacket environment, but it comes at a prohibitive price for me, so I could not test it.

⁵ http://www.accelereyes.com

Identical to the GPUsingle type introduced by GPUmat, Jacket adds a new type of variable to Matlab: gsingle. They provide an extensive list of Matlab functions with their state of support, ranging from "fully supported," meaning that they should work in any situation authorized by the original Matlab function, to "not supported," which cannot be used with gsingle arguments and for which there is no plan for support in the immediate future. The number of supported functions is much bigger in Jacket than in GPUmat at the time of this writing.

One of the most promising features of Jacket over GPUmat is the addition of a gfor/gend loop construct, whose purpose is to launch all the iterations of a for loop simultaneously on the GPU. It is still in a preliminary state and only allows simple functions to be used inside the loop: element-wise arithmetic, matrix transposition or multiplication, subscripting references, etc., but it does not support conditional statements, dependencies between loop iterations, or nesting of gfor loops. The GP-you group behind GPUmat claims they are working on a similar construct but do not provide any beta release at the moment.

>> A = gones(n,n,m);    % creates a 3-dimensional NxNxM array filled with 1's
>> B = gones(n);        % creates an NxN matrix filled with 1's
>> gfor i = 1:2:m       % perform M matrix multiplications plus additions
>>     A(:,:,i) = A(:,:,i)*B + sin(i+1);
>> gend

For the gfor construct to work, Jacket has to generate CUDA code on the fly, compile it, and then launch it on the GPU. It may also do the same for other portions of Matlab code, but it is difficult to obtain precise information on that, because the documentation does not detail it and the closed-source status of the code makes it difficult to investigate. We will see in Section 4 a way to observe some of the behaviour of Jacket for fine tuning.

3 Limitations in existing acceleration solutions

While experimenting with the existing toolboxes, I found some limitations which diminish their usefulness. I give here an example taken from actual fluid mechanics programs, which I tried to accelerate using Jacket. The excerpts of code presented here are adapted from the assignment "C2p29_PSET4_1b" from a course at MIT [3], and consist of about 200 lines of Matlab code. I will review here the portions of code where I made changes, or where I did not manage to make any.

Q calculation.

    n = [0:max_N];
    a = 1;
    b = 10;
    alpha = (n+0.5)*pi;
    Q = 2*(-2/3*a^3*b + sum(4/a./alpha.^5.*tanh(alpha*b)));


This code was very straightforward to compute on the GPU by converting the array n to gsingle. However, with the original value of max_N at 200, it was actually slower to run it on the GPU. I tested its scalability by increasing the value to 2×10⁷, at which point the CPU time was measured at 0.89 s against 0.028 s for the GPU, hence a speedup of about 32×. However, a bigger max_N only gives a better precision for Q, so there is in fact no reason to increase it to such high values, especially since our GPU works only with single-precision floats, hence gives a less precise calculation of Q in any case.

Accelerating a for loop. For one for loop in the code, I tried different solutions to accelerate it, which all failed for several reasons:
• the use of the unsupported function linspace prevents using the gfor construct;
• subscript reference problem in A(i,j): only one of i and j can be a GPU variable;
• no left matrix division available for A\B: A and B need to be moved back to the CPU, with an overall drop of performance of 5%;
• no dynamic matrices with Jacket, hence the assignment U(n+1,n+1)=0 is not valid if U is an n × n matrix; moreover, Jacket generates no warning for this error;
• no mpower to compute powers of matrices, requiring to manually transform A^2 into A × A.

In the end, the loop itself could not be run on the GPU, and only a few operations in it could. The constraints required that some of the operations be run on the CPU, which negated any benefit.

Conclusion. This was just one example of accelerating Matlab code using Jacket. Other similar experiments I conducted with GPUmat and other codes are consistent with this one, and led me to the conclusion that this technology is not mature enough to be used effortlessly by the average user.

4 Combining operations into one kernel

One interesting question is how powerful Jacket is when it comes to generating custom CUDA code on the fly. This seems to be a big advantage over GPUmat, and something definitely desirable in a toolbox that provides automated code conversion. Let us consider again the computation of Q from the previous section, in particular the part involving the sum of the elements of an array computed by some operations. This was one of the only few operations that Jacket was able to perform on the GPU without help:

    sum(4/a./alpha.^5.*tanh(alpha*b))    i.e.    Σ_i 4/(a × α[i]⁵) × tanh(α[i] × b)

Jacket is a closed-source project and its developers try to hide as many implementation details as possible. To investigate what was actually being run on the GPU, I tried two possibilities:

• GPU emulation using barra;⁶
• using the gcache function of Jacket.

Unfortunately, the barra emulator is not stable enough to work with Jacket. This would have been the best way to get as much information as possible, as it would have shown which kernels were executed as well as their frequency, for instance. I used the gcache function of Jacket, which allows the user to "save GPU compiled code for given Matlab script." Indeed, Jacket provides this function as a way to keep CUDA code generated on the fly in memory to speed up future computations of the same Matlab program. By studying the .jkt files produced by this function, it was possible to investigate what was in the generated kernels. Those files contain a lot of GPU code stored in "cubin" form (CUDA binary), embedded in binary "garbage" which is probably information linking this code to the Matlab sources, but which I was not able to decipher. So I chose to remove it and keep only the cubin code, using a custom perl script.⁷ By using the gcache function twice, once after loading Jacket and once after executing the sum, I found that there was one more kernel in the second save, while all the other kernels (about 250) were identical. The study of this kernel using decuda⁸ revealed that the code produced did not contain the instructions needed to compute the sum of an array or the tanh function. We can deduce from this observation that Jacket probably ran four different kernels in the following order:
1. multiply each element of α by b;
2. perform tanh on the result;
3. multiply the result by 4 and divide it by a × α⁵ (this was the kernel generated);
4. sum the elements.

This is not the most efficient solution, as each kernel launch has an overhead, and each kernel has to re-read values from the global memory and write its result back to it. If these operations were done in one kernel, the (much faster) shared memory could be used to store intermediate results, which speeds up the computation and relieves the global memory, which no longer needs to store temporary arrays.

CUDA kernels. To verify this intuition, I conducted the following experiment in a custom CUDA code. I considered an array α of size 512, so that the reduction (sum) could be done in only one block (otherwise, several kernel launches are required to synchronize information); this is consistent with the original Matlab code where α had size 200. I then wrote kernels in two versions: one that combines all operations, and one that separates the computations into three kernels: computing the α array; computing the expression inside the sum (including tanh); and computing the sum of the resulting array. The table below shows the time in seconds required to execute n iterations of the computation.

⁶ http://gpgpu.univ-perp.fr/index.php/Barra
⁷ http://florent.bouchez.free.fr/?download=recherche/post-doc/jkt2cubin.pl
⁸ http://wiki.github.com/laanwj/decuda


  Iterations    1        10       100     1000     100000
  1 kernel      0.056    0.133    0.93    8.8      877
  3 kernels     0.032    0.175    1.61    15.6     1560

The case with only one iteration aside, combining all the operations into only one kernel is between 1.3× and 1.7× faster. This shows the importance of generating code on the fly that combines existing functions into one kernel. In that case, tanh was easily included into the kernel using the math.h library, and the sum was tailored to fit the example, knowing that there were at most 512 elements.

This experiment shows that, for accelerating Matlab, just implementing CUDA versions of every Matlab function is not enough. It is more efficient to have generic versions of functions that can be tailored to the particular needs of a program and inlined in the middle of a bigger kernel. In the next section, I will describe how I did this manually on an example, to show how it could be done in an automated way.
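Before moving on, here is a minimal sketch (not Jacket's actual generated code) of what such a fused kernel could look like for the expression above, assuming at most 512 elements so that the reduction fits in a single block; the kernel and variable names are mine.

#include <cuda_runtime.h>
#include <math.h>

/* One block computes sum_i 4/(a*alpha[i]^5) * tanh(alpha[i]*b) entirely on chip:
   the element-wise part and the reduction live in the same kernel, so intermediate
   values never go back to global memory. Launched as fused_sum<<<1, 512>>>(...). */
__global__ void fused_sum(const float *alpha, float a, float b, int n, float *result)
{
    __shared__ float partial[512];
    int tid = threadIdx.x;

    /* element-wise part: one element per thread, inactive threads contribute 0 */
    float v = 0.0f;
    if (tid < n) {
        float al = alpha[tid];
        v = 4.0f / (a * al * al * al * al * al) * tanhf(al * b);
    }
    partial[tid] = v;
    __syncthreads();

    /* tree reduction in shared memory */
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        *result = partial[0];
}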

5 Improving Matlab acceleration

I have shown in the previous sections that: first, existing toolboxes cannot handle many Matlab situations even though they exhibit massively parallel behaviour; second, in situations handled by Jacket or GPUmat, the performance can be degraded by unnecessary kernel launches. I will present here simple rules that allowed me to still use GPU acceleration and that could easily be included in an automated framework. I will base my explanations on a simulation performing topology optimisation of a bridge (Eiffel's bridge), graciously given to me by Meenakshi Sundaram from the Mechanical Engineering department of IISc. This simulation currently uses about 6000 planar elements (quadrilaterals), each having on average about 25 neighbours. Ideally, working with giant structures would require more than 30000 elements and 3D brick elements having up to 81 neighbours. These simulations would be too time-consuming with the current version of the code, which is sequential. Many operations in the program require independent computations for every element, based on its neighbours, and are repeated until stabilization occurs. So this program was an ideal candidate for trying acceleration techniques.

5.1 Selecting the code to accelerate

Matlab provides a built-in profile option that is very useful for collecting run-time data such as the overall execution time spent in every line of Matlab code, and the number of times each was executed. This allows us to focus on the parts of the code that need acceleration the most, and can easily be used in an automated or semi-guided optimization process. Profiling timed the overall execution at 1024 seconds, 71% of which were spent in a user function called filter, presented in Fig. 1.

  Time (s)    # calls
                         function FILTER
      0.03        901    filtvar=zeros(tot_ele,1);
                  901    for ele=1:tot_ele
    594.31    6094364        if(ismember(ele,non_design))
      0.48     499154            filtvar(ele)=1;
                             else
      6.16    5595210            filtvar(ele)=sum((neigbors(ele).distances) ...
    115.47    5595210                .*topvar(neigbors(ele).e))/neigbors(ele).divisor;
      6.79    5595210        end
      6.18    6094364    end

Figure 1: Function filter, with profiling data.

The filter function. This function was called 900 times; however, all but one of these calls originate from within a while loop that changes the topvar variable. This means global synchronisation must be performed between calls to filter, hence this loop must be executed by the CPU. Moreover, about 99.5% of the time spent in the loop is used by the call to filter. Peeking inside the code of filter in Fig. 1, a quick dependence analysis shows that every iteration of the for loop is independent, making the whole function a good candidate for a GPU kernel.

The objgrad function. Of the remaining 29% of the total time, about 85% was spent in a function called objgrad. In this function, two for loops are responsible for most of the time, in particular by calling another user-defined function, stiffness. Again, the loops process every element independently, making them good candidates for GPU kernels.

One can see here the importance of first identifying the regions which need GPU acceleration the most. Other parts of the code perform matrix operations, but are responsible for only a small part of the whole compute time. I advocate putting the effort on the most time-consuming parts first, as any effort spent on another part will go unnoticed and might also impact acceleration on important parts (for instance by using up too much GPU memory).

6 Accelerating the code

Now that we have identified good candidates to be run on the GPU, we will see the constraints and problems that can arise and how they can be dealt with. It is important to decide, for each kernel, how many blocks of threads will be used (num_blocks), how many threads each block contains (num_threads), and into how many kernels the GPU code needs to be split, depending on the requirements for communication or synchronization between threads. Let us start our examples with the simpler function.

6.1 The filter function

This function (see Fig. 1) makes tot_ele independent computations using the for loop, so each iteration of the loop can be computed in a different thread. However, the loop contains a conditional statement, and if conditions that evaluate differently within a warp (i.e., 32 threads of the same block), that is, divergent conditions, cause the two branches of the condition to be executed sequentially. In our case, this would mean that even if the condition is false for only one thread of the warp, all the other 31 threads would go idle during the execution of that one. The solution to this problem is to realize that no synchronization is required between different iterations of the loop, hence they can be executed in different blocks, hence also in different warps: num_blocks = tot_ele.

It is important to note that a block for which the condition is true will have nothing to do but set the value of filtvar corresponding to its element to 1. It is highly inefficient to launch many blocks doing nearly nothing, so it is important to evaluate how often this happens. This is easily done with the profiling data, which shows that the false branch is taken at least 10 times out of 11. So the time taken by the blocks where the condition is true is likely to be negligible compared to the other blocks.

We still have to decide how many threads will be run in each block. This is driven by the computations that need to be performed for every element:
• the ismember line tests whether an element appears in an array. This is a classic reduction, like the sum, except that in this case we need to compute a logical OR over all the elements of the array: ∨_i (ele = non_design[i]). This is a classical parallel reduction problem for which good solutions are known; it is easily solved in log(#non_design) steps using #non_design/2 threads.
• the sum line computes a number for every neighbour of the element and then sums them. It would obviously benefit from as many threads as the number of neighbours of the element.

So at this point it is not clear how many threads should be launched in each block. The ideal would be that every element has the same number of neighbours, and that this number is half the size of the non_design array. If this is not the case, and if the numbers vary greatly (which is our case), it is best to generate two separate kernels, to avoid having many threads with nothing to do during one of the two computations.

Kernel ismember. This kernel is a direct parallel reduction. In our case, only one block is required, since the size of non_design is less than 1024 (there can be at most 512 threads per block), so we do not need an external CPU loop to synchronize results between blocks. In fact, this is not how I implemented this function. Since we basically need a big OR over all elements, I used a particular feature of the GPU. In a block computing element e, thread 0 first writes false at location array_is_member[e]. Then, each thread t writes true in the same location if e = non_design[t]. If at least one thread found that e is a member of the array, then array_is_member[e] will contain true after executing this block.

This works because, in CUDA, threads can write to the same location at the same time as long as they write the same value (otherwise, the result is undefined). This trick can easily be applied to many boolean computations. In this case, some automated conversion would be possible by peeking at the code of the function ismember. However, the actual Matlab code of this function is more complicated, since it provides more features. This is why I think that there should be both generic and specialized (for particular cases) GPU kernels for Matlab functions in a database, so that they can be used directly or inlined in another kernel.

Kernel for the sum line. Again, this kernel is a parallel reduction. This time the same trick is not possible, since the sum of the elements really needs to be accumulated. One easy simplification, however, is that we know from the program source that the maximum number of neighbours is 61, so there is no need to synchronize temporary results across multiple blocks. This means the whole line, including the inner computation and the sum, can fit in one kernel, saving the overhead of a kernel launch. The sum in this kernel requires some synchronization and sharing of data between the threads of one block. This is easily done using the shared memory on the GPU and the __syncthreads function. The only annoying part is that the variable neighbors is an array of "cells", a special Matlab structure that is not supported by GPUmat. So we cannot copy it as is to the GPU memory; however, it is straightforward to convert each field of the cells into regular matrices of size tot_ele × max(num_neighbours). In the end, this kernel must be launched on tot_ele blocks, each containing max(num_neighbours) threads. Of course, many threads will be inactive in blocks processing elements that have fewer neighbours, especially since there are only about 25 neighbours on average, but this cannot be easily avoided.

Launching code. Now that our two kernels are available, it remains to implement the code that makes the junction between Matlab and the CUDA code. By using GPUmat, it is easy to modify the Matlab code to add copy instructions for the data required in GPU memory. In our case, there is a lot of "read-only" data (neighbours, non_design…). This is easily detected, and the transfers can be executed before entering any loop so that memory transfers between the GPU and the CPU are minimized. On the other hand, the variable topvar is modified at each iteration of the while loop that calls filter, so this one needs to be copied to GPU memory at each iteration. Actually, it would be possible to detect automatically that this variable could be computed directly on the GPU (it involves some independent computations based on its previous values). This would reduce the memory transfers, but I did not tackle this issue. Finally, I had to write some C++ code to launch the kernels. This would be easy to generate, as it is just a MEX file including the GPUmat libraries to simplify the interfacing. We just need to get the pointers to the arguments (GPUsingle variables or scalars), create a new Matlab variable to hold the filtvar result, and launch the kernels with the numbers of blocks and threads determined above. It is also there that the amount of shared memory is set; in our case, we need max(num_neighbours) floats for each block so that the sum can be done.

Experiments. The main part of the whole program consists of a while loop that stops whenever some stability is reached. For the bridge data, it takes 30 iterations.

I timed the iterations using the tic and toc Matlab functions; in the initial program, each iteration takes on average 12 seconds. By using the GPU kernels to accelerate the filter function, each iteration was then timed at 3.6 seconds on average: the speedup is about 3.3×. This is for the whole loop, and I stated previously that the filter function was responsible for 71% of the time, meaning that the maximum speedup that could theoretically be obtained (with a filter function executing instantly) would be 100/29 = 3.45×. In other words, we have managed to accelerate the most critical part of the program in such a way that its computation time is now almost negligible. More importantly, we have done so using only simple analyses of the source code, without knowledge of the purpose of the computation, i.e., without knowledge of the underlying algorithm. This means that all the steps we have been through could have been done by an automatic optimization framework.

Note: since the variable non_design is constant and the call to ismember is so expensive, it would actually be a very good idea to execute it once and for all for every element and store the result in an array of booleans. This indeed speeds the code up a lot; however, my goal was to study the effect of using a GPU to accelerate a Matlab code by applying simple analyses to build kernels. Hence I decided not to perform this optimization, which is beside the point when studying GPU acceleration. I nevertheless timed the acceleration of the function filter without ismember (i.e., only the branches of the if-condition), and found a speedup of about 17×.
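For reference, here is a simplified sketch of the two kernels described in this section: the concurrent same-value write used for ismember, and the shared-memory sum over the neighbours of each element. The variable names and the flattened tot_ele × max_neigh data layout are assumptions on my part, and the real code goes through the GPUmat MEX interface described above.

/* Kernel 1: "ismember" via concurrent same-value writes.
   One block per element; thread t compares the element against non_design[t]. */
__global__ void is_member_kernel(const float *non_design, int n_non_design,
                                 int *is_member)
{
    int ele = blockIdx.x;                  /* element handled by this block */
    int t   = threadIdx.x;

    if (t == 0)
        is_member[ele] = 0;
    __syncthreads();

    if (t < n_non_design && (int)non_design[t] == ele + 1)  /* Matlab indices are 1-based */
        is_member[ele] = 1;                /* several threads may write 1: same value, so OK */
}

/* Kernel 2: weighted sum over the neighbours of each element.
   One block per element, max_neigh threads per block; the shared memory size
   (max_neigh floats) is set at launch time, as explained above. */
__global__ void filter_kernel(const float *distances, const int *neigh_idx,
                              const float *topvar, const float *divisor,
                              const int *num_neigh, const int *is_member,
                              int max_neigh, float *filtvar)
{
    extern __shared__ float partial[];
    int ele = blockIdx.x;
    int t   = threadIdx.x;

    float v = 0.0f;
    if (t < num_neigh[ele]) {
        int j = neigh_idx[ele * max_neigh + t] - 1;          /* back to 0-based */
        v = distances[ele * max_neigh + t] * topvar[j];
    }
    partial[t] = v;
    __syncthreads();

    if (t == 0) {                          /* thread 0 accumulates the partial products */
        if (is_member[ele]) { filtvar[ele] = 1.0f; return; }
        float s = 0.0f;
        for (int k = 0; k < num_neigh[ele]; ++k)
            s += partial[k];
        filtvar[ele] = s / divisor[ele];
    }
}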

6.2 The objgrad function

We have seen in the previous section how a simple function can be converted to GPU code. We will now see a different example, based on the objgrad function. As explained before, this function consists mainly of two similar for loops. Between them there is some code that cannot be run on the GPU, so we will focus on one of the loops to see what can be done. I arbitrarily chose the second one. The code of the loop and of the stiffness function it calls is given in Appendix B.

The particularity of the core of the loop and of the stiffness function is that there are many operations involving matrices of small sizes: XX is of size 4 × 2, Dmat is 3 × 3, K is 8 × 8, B is 3 × 8… On top of this, the loop over GSP has only 4 iterations. Still, the whole computation is inside a for loop over all "active" elements. For these reasons, creating a kernel for stiffness alone would not be wise, as the level of parallelism in this function is a bit low and, more importantly, not scalable. However, all calls to stiffness are independent, hence it is a good idea to write the whole loop in a kernel, inlining in it the code of stiffness. Since each active element computed in the loop is independent, different blocks can be used to compute the Grad of different elements. Inside the stiffness function, the loop with index GSP can be executed in parallel by 4 threads. It is probably possible to run more than 4 threads per block to speed up the computation of intermediate matrix multiplications, but it is tedious to do so by hand and I did not investigate this. I believe that in an automated framework it could be done more easily, by automatically computing the number of threads required to optimize each computation and taking the minimum over all computations so that no thread goes idle. A skeleton of this block/thread mapping is sketched below.
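The sketch shows only the structure of the mapping: one block per active element and one thread per Gauss point, with the four per-Gauss-point contributions combined by thread 0. The function gauss_contrib is a dummy placeholder standing in for the unrolled body of the GSP loop of STIFFNESS, and all names are mine.

/* One block per active element, 4 threads per block (one per Gauss point). */
__device__ float gauss_contrib(int ele, int gsp, const float *data)
{
    return data[4 * ele + gsp];            /* placeholder for the real per-Gauss-point term */
}

__global__ void per_element_kernel(const float *data, const int *act_ele, float *Grad)
{
    __shared__ float partial[4];
    int incr = blockIdx.x;                 /* position in the list of active elements   */
    int ele  = act_ele[incr];              /* element processed by this block (0-based) */
    int gsp  = threadIdx.x;                /* Gauss point processed by this thread      */

    partial[gsp] = gauss_contrib(ele, gsp, data);
    __syncthreads();

    if (gsp == 0)                          /* combine the 4 Gauss-point contributions   */
        Grad[incr] = -(partial[0] + partial[1] + partial[2] + partial[3]);
}

/* launched as: per_element_kernel<<<n_act_ele, 4>>>(d_data, d_act_ele, d_Grad); */

This structure works because Grad(incr) = -U'*dKe*U is itself a sum over the four Gauss-point contributions to dKe.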


Since the sizes of all matrices involved in the computation of an element of Grad are known at compile time, we can use this information to fully unroll every matrix computation: the multiplications, but also the matrix creations. I did only some of the most important unrolling, since it would have been too error-prone to unroll every computation by hand. It would, however, be squarely within the competence of an automated framework.

Experiments. I replaced the whole second loop with a call to a custom gGrad function, which launches n_act_ele × 4 CUDA threads. The effect on the whole program is that each iteration of the main stabilization loop, previously timed at about 3.8 seconds, took 2.31 seconds afterwards. This is a significant speedup of 1.65×. But it is more interesting to look at some details of the profiling information (time in seconds):

  Function     Time before   Time after   Speedup
  topopt             3.2         33.6       0.1×
  objgrad           79.9         90.2       0.89×
  stiffness        124.7         66.0       1.9×
  gGrad               —          "0.06"      —

It is important to know that topopt calls objgrad, which in turn calls stiffness (and gGrad in the accelerated version). It seems obvious that stiffness had the highest benefit from acceleration, but this is in fact misleading. Indeed, remember that there were two for loops in objgrad that called stiffness, and we optimized one of them. The result is that there are no calls at all in the second loop, while all the calls of the first loop remain. Hence only half of the calls remain, meaning that stiffness takes only half as much time overall as before.

So, how is the missing half of the calls to stiffness computed? By gGrad. But CUDA calls are asynchronous, so the 0.06 second reported by the profiler is in fact wrong, and that is the reason I put it in quotes. Actually, careful examination of the objgrad profile suggests that it actually takes 4.4 seconds, and maybe up to 10.3 seconds, as one can see that this function now takes longer than before to execute, maybe because of kernel launch overheads or memory transfers. Still, this makes for a speedup of at least 5.7× on the second loop of objgrad.

Finally, it is important to note that topopt is now much slower to execute. After investigation, it appears this is because of a matrix multiplication involving the matrix Grad. Because Grad is now computed on the GPU, it needs to be copied back to the CPU, which makes this transfer a new bottleneck of the computation. With a more involved analysis of the program, one can see that it would be possible to perform all subsequent computations involving Grad on the GPU, hence probably solving this problem. This shows that it can be difficult to predict precisely the gain of a particular transformation, and that iterative optimization steps seem to be a good way to tackle problems that appear after improvements.

Precision. One final point worth mentioning is the problem of manipulating single-precision floats. Indeed, Matlab internally operates in double precision, while NVIDIA's GPUs either do not support double precision (as in our case), or support it but

are much slower at double precision than at single-precision float operations. While it is probable that future releases of GPUs will handle double floats much faster, it is still important to be aware of this problem for today's computations. In my case, I did not have the choice and had to use single-precision floats.

Of course, when accelerating a program, it is very important not to fundamentally modify its behaviour. In our example, the modified program still had to perform exactly 30 iterations before reaching stabilization, and the result values are the same. However, I sometimes noticed significant changes in intermediate results; luckily, they did not affect the outcome. For example, in the computation of Grad, the results are in the range of 20-40 (in absolute value). On average there was very little difference between the computation on the CPU and on the GPU: 4.6 × 10⁻⁴; but the maximum difference was 0.23, i.e., about 1%. In fact, such differences could be observed even when Matlab was also working in single precision! This is because float operations are not exact; for example, a sum (a + b) + c is in general different from a + (b + c): in single precision, (10⁶ − 10⁶) + 10⁻⁶ = 10⁻⁶, but (10⁶ + 10⁻⁶) − 10⁶ = 0. Hence, the mere fact that sums on a GPU are done in parallel, while they are sequential on a CPU, can create significant differences. In conclusion, it is important to evaluate the degree of precision required, in order not to falsify the results by using inappropriate GPU acceleration.
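This effect is easy to reproduce on the CPU alone. The following snippet uses the values from the example above; compiled as single-precision C code, it prints two different results for the two orderings.

#include <stdio.h>

int main(void)
{
    float big = 1e6f, small = 1e-6f;

    float left  = (big - big) + small;   /* = 1e-6: the small term survives            */
    float right = (big + small) - big;   /* = 0: 1e-6 is lost when added to 1e6,       */
                                         /*   because it is far below 1e6's precision  */
    printf("%g %g\n", left, right);      /* prints "1e-06 0"                           */
    return 0;
}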

7 Future improvements

I did not get the opportunity to try every possible optimization I have in mind. In this section, I quickly describe some situations that could benefit from particular optimizations in the context of automated acceleration.

CPU/GPU parallelism: CUDA kernel launches are asynchronous, meaning that we automatically benefit from both the CPU and the GPU whenever a GPU computation is followed by independent CPU computations. Re-scheduling the Matlab code to take advantage of this is a more involved possibility. In the general case, it is a difficult parallelization problem, but in the case of a linear execution it is probably possible to write good heuristics based on the Directed Acyclic Graph (DAG) of the dependencies between computations.
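A minimal sketch of this overlap is shown below; the kernel and the CPU function are stand-ins for real computations. The launch returns immediately, so the host work runs while the GPU is busy, and explicit synchronization is only needed when the GPU results are used.

#include <cuda_runtime.h>

__global__ void gpu_work(float *data, int n)     /* stand-in for a real GPU computation */
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f;
}

void cpu_work(float *data, int n)                /* independent CPU computation */
{
    for (int i = 0; i < n; ++i)
        data[i] = data[i] + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_a, *h_b = new float[n]();
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));

    gpu_work<<<(n + 255) / 256, 256>>>(d_a, n);  /* asynchronous: returns immediately   */
    cpu_work(h_b, n);                            /* overlaps with the kernel execution  */
    cudaThreadSynchronize();                     /* wait only when the results are needed
                                                    (cudaDeviceSynchronize in newer CUDA) */
    cudaFree(d_a);
    delete[] h_b;
    return 0;
}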

Redundancy: if two parallel computations are separated by one (simple) sequential computation, it can be beneficial for each thread to compute the intermediate result itself instead of launching multiple kernels. For example, given the Matlab code:

    A = c*D;
    x = A(1);
    B = A + x;

it is possible either to launch an array multiplication kernel, assign x, then launch an array addition kernel, or to generate only one kernel in which all threads "compute" the value x:

    A[thread_id] = c * D[thread_id];
    __syncthreads();
    x = A[1];
    B[thread_id] = A[thread_id] + x;


Register allocation: in classical register allocation, the goal is to keep the register pressure just below the number of registers, while avoiding temporarily storing variables to memory (spilling) as much as possible. On a GPU, threads execute concurrently on streaming multiprocessors (SMs), each of them having a register file shared by eight processors. Hence, the running threads compete for register resources, which can limit the number of simultaneous threads on an SM. It is sometimes better to reduce the register pressure of a thread to a little below the actual number of registers, so that more threads can run at the same time. "Proactive spilling," i.e., spilling done in advance by the programmer, exploits this property; however, this is too specific and restricted to experienced programmers. Whenever the register pressure is a bit too high, I believe it would be possible to spill to the shared memory (which is almost as fast as the registers) instead of the local memory (which resides in off-chip device memory and is two orders of magnitude slower). If done automatically, this would make it possible to run more threads at no cost. Currently, I have been able to exploit this idea in a special case; however, it is very difficult to control what is actually done by the NVIDIA compiler, so I did not manage to guide register allocation in a reliable fashion. Still, I think that it could be possible to better understand the compiler and implement this feature in a more general framework, in particular in a context where the compiler is known to the programmer.

Memory accesses. The non-uniform memory architecture of GPUs differs from the standard memory hierarchy. In particular, data should be accessed using particular patterns to improve efficiency: for example, accesses to the global memory of NVIDIA's GPUs must be coalesced, i.e., the needed data must be contiguous, to maximize the available bandwidth. This means special care must be taken when writing kernels. A less important and often disregarded fact is that the shared memory is organized into different banks. Threads accessing shared memory at the same time should access different banks, otherwise conflicts arise and the accesses are serialized. The shared memory being very fast, this is an optimization often left aside in the literature, but it can be critical for some applications, for instance parallel reduction [2]. I tried different ways of rearranging data in shared memory so as to prevent conflicts: padding the data, as is already done elsewhere [1], or shuffling elements.

Padding data means inserting unused cells inside an array so that subsequent elements fall in different banks. For example, the elements of an array are usually stored like this:

    0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15

Suppose the GPU memory is organized into 4 banks (actually 16 on our GPUs); then the memory layout is such that the first four elements of the array are in different banks, after which the pattern repeats. Hence, for a fixed i, every element with index i + 4j is in the same memory bank; this is the case in our example with 1, 5, 9, and 13. Suppose now that each thread of a kernel needs to access 4 consecutive elements of the array, computing for example:


    x  = array[4*thread_id]
    x += array[4*thread_id + 1]
    x += array[4*thread_id + 2]
    x += array[4*thread_id + 3]

Then, at the first step, threads 0, 1, 2 and 3 will access elements 0, 4, 8 and 12, which are all in the same bank. This causes a 4-way conflict, hence the memory accesses will be serialized and the operation will be 4× slower. The same problem arises in each of the three following operations. But with a 1-cell padding every 4 elements, the array layout in memory becomes:

    0  1  2  3  —  4  5  6  7  —  8  9  10  11  —  12  13  14  15

In that case, elements 0, 4, 8 and 12 are all in four different banks, as are elements {1, 5, 9, 13}, {2, 6, 10, 14}, and {3, 7, 11, 15}. There are no conflicts anymore, but the code needs to be rewritten as:

    x  = array[4*thread_id + thread_id/4]
    x += array[4*thread_id + 1 + thread_id/4]
    x += array[4*thread_id + 2 + thread_id/4]
    x += array[4*thread_id + 3 + thread_id/4]

where / represents integer division. An extra computation is thus required, but since it is faster than suffering a memory bank conflict, it is still worth doing. This padding technique, however, has the drawback of wasting shared memory storage, since the "padding" cells do not hold any data: in our case, we need 25% more memory to hold the same array.

I investigated the use of data "re-arrangement," i.e., shuffling elements inside the array to avoid bank conflicts. This can be done, for example, by putting all even indices in the first half of an array, then all odd indices (good for an odd-even parallel sort algorithm). In our example, I used a rotation every 4 elements, like this:

    0  1  2  3  5  6  7  4  10  11  8  9  15  12  13  14

and the code to compute in this array is then:

    x  = array[4*thread_id + mod(0 - thread_id, 4)]
    x += array[4*thread_id + mod(1 - thread_id, 4)]
    x += array[4*thread_id + mod(2 - thread_id, 4)]
    x += array[4*thread_id + mod(3 - thread_id, 4)]

Unfortunately, the modulo operation is costly on NVIDIA's GPUs, so the added computation needed to figure out the new location of the array's elements outweighs the benefit of avoiding bank conflicts.

To conclude, I have proposed and experimented with a method of re-arranging the data layout of arrays and matrices to avoid bank conflicts without losing memory storage, as the padding method does. Unfortunately, it is not applicable on current GPUs, where bank conflicts cost only a few cycles, hence the computation required to find the new placement of elements is often more expensive than the conflicts themselves. It should still be worth trying on an architecture where conflicts cost more and modulo operations are cheaper.
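As a concrete sketch of the padding variant described above, assuming 16 banks, 64 threads per block and the same access pattern (each thread reads 4 consecutive elements), the shared array can simply be declared with one unused cell every 16 entries; all names are illustrative.

#define NUM_BANKS   16                          /* banks on the GPUs considered here        */
#define NUM_THREADS 64
#define PADDED(i)   ((i) + (i) / NUM_BANKS)     /* one unused cell every NUM_BANKS entries  */

/* launched as: sum4_padded<<<1, NUM_THREADS>>>(d_in, d_out),
   with d_in holding at least 4*NUM_THREADS floats */
__global__ void sum4_padded(const float *in, float *out)
{
    __shared__ float s[4 * NUM_THREADS + (4 * NUM_THREADS) / NUM_BANKS];
    int t = threadIdx.x;

    for (int k = 0; k < 4; ++k)                 /* stage the data at padded positions       */
        s[PADDED(4 * t + k)] = in[4 * t + k];
    __syncthreads();

    float x = 0.0f;
    for (int k = 0; k < 4; ++k)                 /* each thread reads 4 consecutive elements;
                                                   the padding spreads them over 16 banks
                                                   instead of causing a 4-way conflict      */
        x += s[PADDED(4 * t + k)];
    out[t] = x;
}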


8 Conclusion

Matlab is a high-level, matrix-based language commonly used for simulation purposes because of its user-friendliness. It however suffers from poor performance compared to lower-level languages like C. The intrinsic parallelism exhibited by matrix computations makes Matlab a good candidate for GPU-based acceleration.

In this report, I discussed the use of existing toolboxes to accelerate Matlab, mainly Jacket and GPUmat. Although they provide accelerated versions of many Matlab functions, we have seen that this is not enough to provide good acceleration to users unless programs use regular matrix computations on very large data. My investigation also shows that Jacket's generation of CUDA code on the fly, although a very interesting feature, misses many opportunities for grouping consecutive computations, at a cost in performance.

I have shown that it is possible to build custom kernels for a wider range of code, to inline specialized versions of common functions or algorithms to reduce the number of kernel launches, and to exploit the parallelism of for loops. Although this was done manually, I am convinced that all the additional code could have been generated automatically by an optimizing framework, using simple dependence analyses and a database of CUDA codes of generic functions that can be inlined. Finally, I think that a general framework would eventually generate more efficient code than my manual implementation, by fine-tuning register usage (to maximize resource utilization) and GPU memory accesses (to avoid shared-memory bank conflicts).


A Using Matlab with GPU on ccx09

I used the machine ccx09.cc.serc.iisc.ernet.in for my experiments. It is an 8-core machine with an NVIDIA GeForce 8800 GTS 512 GPU. It runs Ubuntu 8.10 with CUDA version 2.3 and Matlab version 7.6.0.324 (R2008a). For whoever might need to use this machine with the current settings, Matlab is installed in the /scratch directory, and the other toolboxes are in my home: ~florent/installs. Please email me for the root password in case it gets lost.

B Matlab codes

Extract from the objgrad function:

Grad = zeros(n_act_ele,1);
incr = 0;
for ele = act_ele
    incr = incr+1;
    nod_con = nodcon(ele,:);
    eledof([1 3 5 7]) = 2*nod_con-1;
    eledof([2 4 6 8]) = 2*nod_con;
    XX = nodcoord(nod_con,:);
    dxval = 3*topvar(ele)^2;
    dKe = STIFFNESS(Dmat*dxval,XX);
    Grad(incr) = -U(eledof)'*dKe*U(eledof);
end

The stiffness function:

function [K] = STIFFNESS(D,XX)
K = zeros(8);
Thic = 1.0;
c = 1/sqrt(3);
XG  = [c c -c -c];
YG  = [c -c c -c];
WGT = [1.0 1.0 1.0 1.0];
B = zeros(3,8);
for GSP = 1:4
    R = XG(GSP); S = YG(GSP);
    RP = 1.0+R; SP = 1.0+S; RM = 1.0-R; SM = 1.0-S;
    P = [0.25*SP -0.25*SP -0.25*SM  0.25*SM;
         0.25*RP  0.25*RM -0.25*RM -0.25*RP];
    XJ = P*XX;
    Det = det(XJ);
    XJI = [XJ(2,2) -XJ(1,2); -XJ(2,1) XJ(1,1)]/Det;
    B(1,[1 3 5 7]) = XJI(1,:)*P;
    B(2,[2 4 6 8]) = XJI(2,:)*P;
    B(3,[1 3 5 7]) = B(2,[2 4 6 8]);
    B(3,[2 4 6 8]) = B(1,[1 3 5 7]);
    WT = WGT(GSP)*Thic*Det;
    K = K + WT*B'*D*B;
end


References

[1] Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. A compiler framework for optimization of affine loop nests for GPGPUs. In ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, pages 225-234, New York, NY, USA, 2008. ACM.

[2] Mark Harris. Parallel Prefix Sum (Scan) with CUDA. Technical report, NVIDIA, February 2007.

[3] MIT. Numerical Fluid Mechanics, 2007.

[4] NVIDIA. Accelerating MATLAB with CUDA Using MEX Files, September 2007.
