Optical Flow Computation on Compute Unified Device Architecture

Yoshiki Mizukami and Katsumi Tadamura
Yamaguchi University, Faculty of Engineering
2-16-1 Tokiwadai, 755-8611 Ube, Japan
{mizu,tadamura}@yamaguchi-u.ac.jp

Abstract

In this study, the implementation of an image processing technique on Compute Unified Device Architecture (CUDA) is discussed. CUDA is a new hardware and software architecture developed by NVIDIA Corporation for general-purpose computation on graphics processing units. CUDA features an on-chip shared memory with very fast general read and write access, which enables threads in a block to share their data effectively. CUDA also provides a user-friendly development environment through an extension to the C programming language. This study focused on a CUDA implementation of the representative optical flow computation proposed by Horn and Schunck in 1981. Their method produces a dense displacement field and has a straightforward processing procedure. A CUDA implementation of Horn and Schunck's method is proposed and investigated based on simulation results.

1 Introduction

Optical flow computation is one of the most fundamental problems in the field of computer vision [1, 7]. In particular, there is a real need to shorten the computational time required in practical applications such as motion analysis and security systems. Many researchers have studied dedicated hardware for this problem. The earliest works developed analog circuits or VLSI chips [12]. From the end of the 20th century, FPGAs (Field Programmable Gate Arrays) attracted great attention [16]. Graphics Processing Units (GPUs) were originally developed for fast rendering in computer graphics and have recently been applied to general-purpose computation [10]. For instance, in the field of computer vision and pattern recognition, Yang et al. proposed a real-time GPU implementation for computing depth in binocular stereopsis [15], Fung et al. [4] proposed the use of multiple graphics cards, and Strzodka [13] discussed the GPU implementation of an image registration


method proposed by Clarenz et al. in 2002 [3]. Horn and Schunck proposed a regularization method for computing the dense displacement field in 1981 [5]. Since they linearized the matching error between two images using a Taylor expansion, their iterative equations are very straightforward and, owing to their local parallelism, well suited to GPU implementation. For instance, Warden discussed a GPU implementation of their method using a low-level programming language, the ARB fragment program instruction set [14]. Mizukami et al. investigated another GPU implementation using the high-level programming language Cg (C for graphics) [9]. The same group also conducted a comparison study between Horn and Schunck's method and March's method [6, 8]. Since the matching error is not linearized in March's method, his iterative equations are more complicated than those of Horn and Schunck. However, the displacement field obtained by March's method is more accurate, especially in the case of large displacements. March's method was also implemented on GPUs, and the computational time was drastically shortened [9]. As noted above, GPU implementations were originally programmed with low-level programming languages. The release of high-level programming languages such as Cg in 2002 and HLSL (High Level Shader Language) in 2004 facilitated the GPU implementation of general-purpose computation. However, learning these high-level languages still assumed fundamental knowledge of computer graphics, for example, rasterization, rendering to the framebuffer and the OpenGL graphics library. Therefore GPU programming was not very user-friendly. More recently, a team at NVIDIA Corporation developed the Compute Unified Device Architecture (CUDA) [2]. CUDA provides an instruction set for GPU programming as an extension to the C programming language, which enables programmers to write code for CPU and GPU computation in the same C source. In addition, it does not assume any knowledge of computer graphics on the programmer's part. CUDA has a new mechanism called shared memory for sharing data

effectively between threads in a thread block. With its extension to the C programming language and its shared memory, CUDA provides a user-friendly development environment for fast parallel computation on GPUs. This study proposes a CUDA implementation of Horn and Schunck's regularization method for computing optical flow. Among the various methods for optical flow computation, the main advantages of their method are the dense displacement field and the simplicity of the iterative procedure. The second section describes the implementation of Horn and Schunck's method on CUDA, and the third section presents the simulation results. The final section concludes this study.

2 Optical flow computation on CUDA

First, Horn and Schunck's method and the multiscale search method are overviewed. Next, the implementation on CUDA is described.

2.1 Horn and Schunck's regularization method

Figure 1 illustrates a set of horizontal and vertical displacement functions (u(x, y), v(x, y)) making the connection between corresponding coordinates on two sequential images, f and g. In the framework of regularization theory [11], Horn and Schunck formulated this optimization of the displacement functions as the minimization of the following functional E(u, v),

E(u, v) = P(u, v) + \lambda S(u, v),  (1)

P(u, v) = \int (f_x u + f_y v + f_t)^2 \, dx \, dy,  (2)

S(u, v) = \int \left( (u_x^2 + u_y^2) + (v_x^2 + v_y^2) \right) dx \, dy,  (3)

where the functional P is the matching error between the two images f and g, and the functional S is a constraint penalizing departure from smoothness in the computed displacement. The subscripts denote partial differentiation, and λ is a so-called regularization parameter controlling the effect of the functional S. On the basis of the calculus of variations, the following iterative equations were derived,

u^{[t+1]} = \bar{u}^{[t]} - \frac{f_x \left( f_x \bar{u}^{[t]} + f_y \bar{v}^{[t]} + f_t \right)}{\lambda + f_x^2 + f_y^2},  (4)

v^{[t+1]} = \bar{v}^{[t]} - \frac{f_y \left( f_x \bar{u}^{[t]} + f_y \bar{v}^{[t]} + f_t \right)}{\lambda + f_x^2 + f_y^2},  (5)

where (\bar{u}, \bar{v}) denotes the four-neighborhood average of (u, v).
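Since Eqs. (4) and (5) update every pixel independently from the four-neighborhood averages of the previous iterate, a CPU reference implementation is a pair of nested loops. The following C listing is a minimal sketch under our own naming, not the authors' original code; the row-major buffer layout and the Jacobi-style separate output buffers are our assumptions.

/* One iteration of Eqs. (4)-(5). u, v, fx, fy, ft are row-major
 * width*height arrays; the updated field is written to u_new, v_new.
 * A sketch under our own naming, not the authors' original code. */
void hs_iterate(const float *u, const float *v,
                const float *fx, const float *fy, const float *ft,
                float *u_new, float *v_new,
                int width, int height, float lambda)
{
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            int i = y * width + x;
            /* four-neighborhood averages (u-bar, v-bar) */
            float ub = 0.25f * (u[i - 1] + u[i + 1] + u[i - width] + u[i + width]);
            float vb = 0.25f * (v[i - 1] + v[i + 1] + v[i - width] + v[i + width]);
            /* update terms of Eqs. (4) and (5) */
            float num = fx[i] * ub + fy[i] * vb + ft[i];
            float den = lambda + fx[i] * fx[i] + fy[i] * fy[i];
            u_new[i] = ub - fx[i] * num / den;
            v_new[i] = vb - fy[i] * num / den;
        }
    }
}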


Figure 1. Displacement function (u(x,y), v(x,y)).

The multiscale search method (or coarse-to-fine search strategy) is one of the fundamental techniques in signal processing. Before the displacement between two images is computed, several pairs of lower-resolution images are generated from the original two images. The computation of the displacement starts with the lowest-resolution pair, and the obtained low-resolution displacement is used as the initial value for the next finer pair. Eventually the displacement at the original resolution is computed with the original pair.
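The following recursive C driver is a minimal sketch of this strategy; the helpers downsample(), upsample_flow() and solve_flow() are hypothetical placeholders of our own, not the authors' API (the paper itself uses a 3 x 3 averaging filter and two levels, see Section 3.3).

#include <stdlib.h>

/* Hypothetical helper prototypes, for illustration only. */
float *downsample(const float *img, int w, int h);       /* e.g., 3x3 averaging */
void upsample_flow(const float *cu, const float *cv, int cw, int ch,
                   float *u, float *v);                   /* x2 grid, x2 values */
void solve_flow(const float *f, const float *g, int w, int h,
                float *u, float *v, float lambda, int iterations);

/* Coarse-to-fine driver sketch. The caller supplies zero-initialized
 * u, v; a coarse solution seeds the next finer level. The paper uses
 * a different lambda per level, which this sketch omits for brevity. */
void multiscale_flow(const float *f, const float *g, int w, int h,
                     float *u, float *v, int levels,
                     float lambda, int iterations)
{
    if (levels > 1) {
        int cw = w / 2, ch = h / 2;
        float *cf = downsample(f, w, h);
        float *cg = downsample(g, w, h);
        float *cu = calloc((size_t)cw * ch, sizeof(float));
        float *cv = calloc((size_t)cw * ch, sizeof(float));
        multiscale_flow(cf, cg, cw, ch, cu, cv,
                        levels - 1, lambda, iterations);
        upsample_flow(cu, cv, cw, ch, u, v);   /* seed the finer level */
        free(cf); free(cg); free(cu); free(cv);
    }
    solve_flow(f, g, w, h, u, v, lambda, iterations);  /* Eqs. (4)-(5) */
}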

2.2 CUDA implementation

As shown in Eqs. (4) and (5), Horn and Schunck's iterative equations contain two types of locally parallel procedure: the four-neighborhood smoothing procedure of the first term and the updating procedure of the second term on the right-hand side. In the CUDA implementation of this study, a thread is in principle assigned to one pixel coordinate of the image. After the pixel data, including the image intensities, derivatives and displacement, are transferred from the main memory on the host to the global memory on the device, all threads can access any part of the pixel data in the global memory. Therefore, the four-neighborhood smoothing procedure can be realized on the global memory alone, as shown in Fig. 2. First, each thread reads the pixel data of its own coordinate and the displacement of the neighboring coordinates from the global memory. Next, the displacement modified by the smoothing and updating procedures is written back to the global memory. However, this type of CUDA implementation with global memory (CUDA-GM) is expected to suffer from the slow, high-latency access to the global memory.
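The paper does not list source code; the following CUDA kernel is a minimal sketch of CUDA-GM under our own naming. One thread handles one pixel, and all neighborhood reads are served by global memory; the separate output buffers (Jacobi style) are our assumption to keep the per-pixel updates independent.

// Minimal CUDA-GM sketch (our naming, not the authors' code).
__global__ void hs_iteration_gm(const float *u, const float *v,
                                const float *fx, const float *fy,
                                const float *ft,
                                float *u_new, float *v_new,
                                int width, int height, float lambda)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || x >= width - 1 || y < 1 || y >= height - 1) return;

    int i = y * width + x;
    // four-neighborhood averages, read directly from global memory
    float ub = 0.25f * (u[i - 1] + u[i + 1] + u[i - width] + u[i + width]);
    float vb = 0.25f * (v[i - 1] + v[i + 1] + v[i - width] + v[i + width]);
    // update terms of Eqs. (4) and (5)
    float num = fx[i] * ub + fy[i] * vb + ft[i];
    float den = lambda + fx[i] * fx[i] + fy[i] * fy[i];
    u_new[i] = ub - fx[i] * num / den;
    v_new[i] = vb - fy[i] * num / den;
}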

Figure 2. CUDA implementation with global memory: (a) read from the global memory; (b) write to the global memory.

Figure 4. CUDA implementation with global and shared memories: (a) read from the global memory; (b) write to the global memory.

CUDA provides an on-chip shared memory on each multiprocessor; for instance, the GeForce 8800 GTX has 16 multiprocessors. Each multiprocessor deals with several blocks of threads. Within a block, threads can share the pixel data via the shared memory, as shown in Fig. 3. Since the shared memory is embedded on the multiprocessor, it provides very fast read and write access for the threads. However, the threads can access only the pixel data within the block-sized region, which means that the threads at the block border cannot access all the four-neighborhood pixel data necessary for the smoothing procedure.

Figure 3. Shared memory in CUDA.

Therefore, as shown in Fig. 4, this study employs redundant threads around the block border to realize the four-neighborhood smoothing procedure at every pixel coordinate. First, each thread reads the corresponding pixel data and writes only the displacement into the shared memory. Here, the threads in the two outermost rows and columns of the block read pixel data from the global memory in a manner that overlaps with the neighboring block; for instance, the threads indexed 4 and 5 in Block 0 access the pixel data at coordinates 3 and 4, respectively, and the threads indexed 0 and 1 in Block 1 access the same pixel data. Second, all the threads other than those at the block border read the four-neighborhood displacement from the shared memory and conduct the smoothing and updating procedures. Finally, they write the modified displacement back to the corresponding pixel data in the global memory. This type of CUDA implementation with the global and shared memories (CUDA-GSM) is expected to be faster than CUDA-GM due to its reduced access to the global memory.
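A CUDA-GSM kernel might look as follows; again this is our own sketch, not the paper's code. The 22 x 22 block with a one-pixel halo ring of redundant threads, so that each block updates a 20 x 20 tile, follows the arrangement described above, and buffer ping-ponging is assumed as before.

#define B 22   // block edge; the outermost ring of threads is halo only

// Minimal CUDA-GSM sketch (our naming): every thread stages its
// pixel's displacement into shared memory, then only the interior
// (B-2) x (B-2) threads smooth, update and write back.
__global__ void hs_iteration_gsm(const float *u, const float *v,
                                 const float *fx, const float *fy,
                                 const float *ft,
                                 float *u_new, float *v_new,
                                 int width, int height, float lambda)
{
    __shared__ float su[B][B];
    __shared__ float sv[B][B];

    int tx = threadIdx.x, ty = threadIdx.y;
    // overlapped mapping: neighboring blocks re-read the halo pixels,
    // matching the Block 0 / Block 1 example in the text
    int x = blockIdx.x * (B - 2) + tx - 1;
    int y = blockIdx.y * (B - 2) + ty - 1;
    bool inside = (x >= 0 && x < width && y >= 0 && y < height);
    int i = y * width + x;

    su[ty][tx] = inside ? u[i] : 0.0f;
    sv[ty][tx] = inside ? v[i] : 0.0f;
    __syncthreads();

    // halo threads only load; they neither compute nor write
    if (tx == 0 || tx == B - 1 || ty == 0 || ty == B - 1) return;
    if (!inside || x < 1 || x >= width - 1 || y < 1 || y >= height - 1) return;

    float ub = 0.25f * (su[ty][tx - 1] + su[ty][tx + 1] +
                        su[ty - 1][tx] + su[ty + 1][tx]);
    float vb = 0.25f * (sv[ty][tx - 1] + sv[ty][tx + 1] +
                        sv[ty - 1][tx] + sv[ty + 1][tx]);
    float num = fx[i] * ub + fy[i] * vb + ft[i];
    float den = lambda + fx[i] * fx[i] + fy[i] * fy[i];
    u_new[i] = ub - fx[i] * num / den;
    v_new[i] = vb - fy[i] * num / den;
}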


Figure 5. Yosemite Fly-Through: (a) image f(x, y); (b) image g(x, y).

3 Simulations

Computer simulations were conducted to investigate the CUDA implementations of Horn and Schunck's method for computing optical flow. First, their method was implemented on a CPU to study the relationship between the regularization parameter and the computational error, as well as the time required by the CPU implementation. Then the two types of CUDA implementation, CUDA-GM and CUDA-GSM, were investigated. Finally, the combinations of the multiscale search method with these implementations were discussed.

From a practical viewpoint, it is desirable to conduct the simulations on a sequence of real images captured with a video camera. However, in order to have the correct solution of the displacement field available, this study employed a computer graphics sequence, Yosemite Fly-Through, which consists of 256-level grayscale images with a size of 316 x 252 pixels. The middle-frame image and the next image were used as images g and f, respectively. Figure 5 shows these images. This study adopted the following root-mean-square error as the criterion of computational error,

\varepsilon = \sqrt{ \frac{ \sum_{x = X_b,\, y = Y_b}^{X - X_b,\, Y - Y_b} \left( (u - u')^2 + (v - v')^2 \right) }{ (X - 2X_b)(Y - 2Y_b) } },  (6)

where X and Y are the width and height of the images, respectively, and (u'(x, y), v'(x, y)) is the correct value of the horizontal and vertical displacement at the coordinate (x, y) on g. This equation takes no account of the region less than Xb and Yb pixels away from the border in the horizontal and vertical directions, respectively. In this simulation, both Xb and Yb were set to 10.

The main specifications of the computer were a Pentium 4 CPU (3.2 GHz), 1 GB of main memory and the Microsoft Windows XP SP2 operating system. The graphics card was an NVIDIA GeForce 8800 GTX, which follows the IEEE-754 standard for single-precision binary floating-point arithmetic. Its main specifications were a 575 MHz core frequency, 768 MB of GDDR3 video memory and 128 scalar shader units. NVIDIA Windows Display Driver version 97.93 was installed on the computer. CUDA Toolkit version 0.8 was used as the CUDA development environment, while Microsoft Visual C++ 2005 was used for the CPU implementation.
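Eq. (6) translates directly into a short routine; the following C sketch (our own code and naming) evaluates the error with the Xb = Yb = 10 pixel border excluded, as in the simulations.

#include <math.h>

/* Root-mean-square error of Eq. (6); uc, vc hold the correct field. */
float rmse(const float *u, const float *v,
           const float *uc, const float *vc,
           int X, int Y, int Xb, int Yb)
{
    double sum = 0.0;
    for (int y = Yb; y < Y - Yb; ++y) {
        for (int x = Xb; x < X - Xb; ++x) {
            int i = y * X + x;
            double du = u[i] - uc[i], dv = v[i] - vc[i];
            sum += du * du + dv * dv;  /* (u-u')^2 + (v-v')^2 */
        }
    }
    return (float)sqrt(sum / ((X - 2 * Xb) * (Y - 2 * Yb)));
}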

3.1 CPU implementation

The regularization parameter λ was optimized with respect to the computational error under the condition of 3,000 iterations. Figure 6 shows that regularization parameters of 2,000 or 2,500 gave the smallest error, 0.865 pixels; incidentally, the initial error was 1.742 pixels. Figure 7 shows the correct displacement field and the computed displacement field. It should be noted that, on the left side of the hill and on the ground surface ahead to the right, the computed displacement is smaller than the correct displacement. In the sky region of the background, the direction of the computed displacement is distorted due to the changes in the cloud shapes. These inaccuracies mainly come from the linearization of the matching error between the two images [8]. This CPU implementation required 6,931 msec for 3,000 iterations. Below, the speed-up brought by CUDA is discussed on the basis of 3,000 iterations or a computational error of 0.865 pixels.

Figure 6. Effect of the regularization parameter on the computational error.

Figure 7. Computed displacement field: (a) correct displacement; (b) computed displacement.

3.2 CUDA implementation

In the CUDA implementations of this study, 22 x 22 threads were arranged in a block in a square-shaped manner, since the maximum number of threads in a block is 512. Figure 8 illustrates the relationship between the iteration number and the required time for the CUDA-GM, CUDA-GSM and CPU implementations. The two CUDA implementations consumed 4,420 msec and 3,031 msec for 3,000 iterations, respectively; that is, CUDA-GM and CUDA-GSM provided about 1.6 and 2.3 times the execution speed of the CPU implementation, which took 6,931 msec. The intercepts of the CUDA implementations, about 46 msec, come from the memory allocation on the device and the data transfer between the main memory on the host and the global memory on the device. The computational errors of CUDA-GM and CUDA-GSM were identical to that of the CPU implementation, 0.865 pixels.
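The host-side fragment below is a hypothetical sketch consistent with these figures (22 x 22 blocks, a 316 x 252 image and λ = 2,000; h_u denotes a host buffer of our own naming). The allocation and transfer shown are what produce the roughly 46 msec intercept mentioned above; cudaThreadSynchronize() is the synchronization call of that era's toolkit.

// Hypothetical host-side setup and launch (our code, not the paper's).
size_t bytes = 316 * 252 * sizeof(float);
float *d_u, *d_v, *d_u_new, *d_v_new, *d_fx, *d_fy, *d_ft;
cudaMalloc((void **)&d_u, bytes);       // ... and likewise for the others
cudaMemcpy(d_u, h_u, bytes, cudaMemcpyHostToDevice);   // host -> global

dim3 block(22, 22);                           // 484 threads <= 512 per block
dim3 grid((316 + 19) / 20, (252 + 19) / 20);  // each block yields a 20x20 tile
for (int t = 0; t < 3000; ++t) {
    hs_iteration_gsm<<<grid, block>>>(d_u, d_v, d_fx, d_fy, d_ft,
                                      d_u_new, d_v_new, 316, 252, 2000.0f);
    float *tmp;                               // ping-pong the flow buffers
    tmp = d_u; d_u = d_u_new; d_u_new = tmp;
    tmp = d_v; d_v = d_v_new; d_v_new = tmp;
}
cudaThreadSynchronize();
cudaMemcpy(h_u, d_u, bytes, cudaMemcpyDeviceToHost);   // global -> host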



Figure 8. Influence of the iteration number on the required time.

Figure 9. Multiscale implementations on CUDA-GM, CUDA-GSM and CPU: (a) iteration number and the computational error; (b) iteration number and the required time.

Table 1. Required time for displacement computation [msec].

               CPU     CUDA-GM   CUDA-GSM
singlescale    6,931   4,420     3,031
multiscale     901     606       443

Table 2. Required time for each procedure [msec].

                           CUDA-GM   CUDA-GSM
access to global memory    687       660
smoothing procedure        2,693     970
updating procedure         1,313     1,176
total time                 4,693     2,806

3.3 Multiscale search method

Here, the multiscale implementations with two resolution levels on CUDA-GM, CUDA-GSM and the CPU are discussed. The finer resolution was 316 x 252 pixels, the same as the original, while the lower resolution was 158 x 128 pixels. The lower-resolution images were generated on the host with a 3 x 3-pixel averaging filter. The regularization parameter for the lower resolution was empirically set to 100, while that for the finer resolution was set to 2,000. Figure 9 shows the influence of the iteration number on both the computational error and the required time. Figure 9(a) shows that 300 iterations at each stage give a computational error of 0.865 pixels, the same as that of the singlescale implementation with 3,000 iterations. Figure 9(b) indicates that multiscale CUDA-GM and CUDA-GSM required 606 msec and 443 msec for the iterations, respectively. Table 1 summarizes the required times of the singlescale and multiscale implementations on the CPU, CUDA-GM and CUDA-GSM. In all the implementations, the multiscale search method provided more than 6.8 times the execution speed of the singlescale search method. With the multiscale search method, CUDA-GSM gave about twice the execution speed


of the CPU implementation. The multiscale CUDA-GSM implementation was about 16 times faster than the singlescale CPU implementation.

3.4 Discussion

Finally, Table 2 shows the time required for the read and write access to the corresponding pixel data in the global memory, for the smoothing procedure and for the updating procedure in CUDA-GM and CUDA-GSM. The total times in each implementation differ slightly from those in Table 1 (4,420 and 3,031 msec) due to the measurement conditions. Since CUDA-GM reads the four-neighborhood pixel data from the global memory with its slow access, its smoothing procedure required 2,693 msec. In contrast, CUDA-GSM reads the four-neighborhood pixel data from the shared memory and required 970 msec, about one third of the time of CUDA-GM. Both implementations required almost the same time for the updating procedure. These results clarify that the shortening of the computation time in CUDA-GSM was brought about by the effective use of the shared memory on the multiprocessor.
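Obtaining such per-procedure times requires timing the kernels on the device. The paper does not describe its measurement setup; one hedged possibility, using the CUDA event API of later toolkits (grid, block and the kernel are those of the earlier sketches), is the following.

// Hypothetical timing of the iteration loop with CUDA events (our
// sketch; the event API belongs to toolkits later than the 0.8 beta).
cudaEvent_t start, stop;
float msec = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
for (int t = 0; t < 3000; ++t) {
    hs_iteration_gsm<<<grid, block>>>(d_u, d_v, d_fx, d_fy, d_ft,
                                      d_u_new, d_v_new, 316, 252, 2000.0f);
    float *tmp;  // ping-pong the buffers as before
    tmp = d_u; d_u = d_u_new; d_u_new = tmp;
    tmp = d_v; d_v = d_v_new; d_v_new = tmp;
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msec, start, stop);   // elapsed time in msec
cudaEventDestroy(start);
cudaEventDestroy(stop);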

4 Conclusion

This study investigated the implementation of Horn and Schunck's regularization method on CUDA for speeding up optical flow computation. CUDA is a new architecture for general-purpose computation on graphics processing units and is expected to become increasingly popular in computer vision. This study focused on the shared memory, with its fast read and write access, and leveraged it for

speeding up the smoothing procedure in computing the displacement. The simulation results clarified that the combination of CUDA and the multiscale search method is about 16 times faster than an ordinary CPU implementation with the singlescale search method. Since the CUDA development environment used in this study was still a beta version, CUDA is expected to bring even greater acceleration to the proposed implementation after its official release. In future work, the implementation of March's regularization method on CUDA will be studied for computing more accurate displacement [8].

5 Acknowledgements

We express our gratitude to L. Quam et al. of the SRI laboratory, the authors of Yosemite Fly-Through. This study was partially supported by a JSPS Grant-in-Aid for Scientific Research (16700208).

References

[1] J. Barron, D. Fleet, and S. Beauchemin. Performance of optical flow techniques. International Journal of Computer Vision, 12:43–77, 1994.
[2] I. Buck. GeForce 8800 & NVIDIA CUDA: A new architecture for computing on the GPU. Website of the Supercomputing '06 workshop "General-Purpose GPU Computing: Practice and Experience", 2006. http://www.gpgpu.org/sc2006/workshop/presentations/Buck NVIDIA Cuda.pdf.
[3] U. Clarenz, M. Droske, and M. Rumpf. Towards fast non-rigid registration. Inverse Problems, Image Analysis and Medical Imaging, AMS Special Session on Interaction of Inverse Problems and Image Analysis, 313:67–84, 2002.
[4] J. Fung and S. Mann. Using multiple graphics cards as a general purpose parallel computer: Applications to computer vision. In Proceedings of the International Conference on Pattern Recognition, volume 1, pages 805–808, 2004.
[5] B. Horn and B. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.
[6] R. March. Computation of stereo disparity using regularization. Pattern Recognition Letters, 8(3):181–188, Mar. 1988.
[7] B. McCane, K. Novins, D. Crannitch, and B. Galvin. On benchmarking optical flow. Computer Vision and Image Understanding, 84(1):126–143, 2001.
[8] Y. Mizukami, T. Sato, and K. Tanaka. A comparison study for displacement computation. Pattern Recognition Letters, pages 825–831, 2001.
[9] Y. Mizukami and K. Tadamura. A study on GPU implementation of March's regularization method for optical flow computation. In Proceedings of the 21st International Conference on Image and Vision Computing New Zealand, volume 1, pages 517–522, 2006.
[10] J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. In Eurographics 2005, State of the Art Reports, pages 21–51, Aug. 2005.


[11] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317(6035):314–319, Sept. 1985.
[12] R. Sarpeshkar, J. Kramer, G. Indiveri, and C. Koch. Analog VLSI architectures for motion processing: From fundamental limits to system applications. Proceedings of the IEEE, 84(7), 1996.
[13] R. Strzodka, M. Droske, and M. Rumpf. Image registration by a regularized gradient flow - a streaming implementation in DX9 graphics hardware. Computing, 73(4):373–389, 2004.
[14] P. Warden. GPU optical flow. Pete's GPU Notes, 2005. http://petewarden.com/notes/archives/2005/05/gpu optical flo.html.
[15] R. Yang and M. Pollefeys. Multi-resolution real-time stereo on commodity graphics hardware. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, pages 211–217, 2003.
[16] A. Zuloaga, J. Martin, and J. Ezquerra. Hardware architecture for optical flow estimation in real time. In Proceedings of the 1998 IEEE International Conference on Image Processing, volume 3, pages 972–976, 1998.