Multi-Frame Information Fusion For Image And Video Enhancement

A Thesis Presented to The Academic Faculty by

Bahadir K. Gunturk

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

School of Electrical and Computer Engineering Georgia Institute of Technology August 2003

Multi-Frame Information Fusion For Image And Video Enhancement

Approved by:

Professor Yucel Altunbasak, Adviser

Professor Russell M. Mersereau

Professor Monson H. Hayes

Date Approved

To my parents.


ACKNOWLEDGEMENTS

I would like to thank my advisor, Prof. Yucel Altunbasak, for his invaluable support and guidance. My deepest thanks are also extended to the many members of the CSIP. I am grateful to Prof. Russell M. Mersereau for his support and encouragement throughout my Ph.D., and also to Prof. Monson H. Hayes for his support in my academic pursuit. I also thank the members of the Imaging Science Research Laboratory at the Eastman Kodak Company for giving me the opportunity to work in an exceptional industrial research environment; special thanks go to Dr. Majid Rabbani and Chris Honsinger. Finally, I would like to thank Prof. Xiaoming Huo and Prof. Faramarz Fekri for serving on my committee, Aziz U. Batur for sharing his expertise on face recognition with me, and John Glotzbach for the discussions on demosaicking.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
SUMMARY

I    INTRODUCTION
     1.1  Spatial Resolution Enhancement Problem
          1.1.1  CFA Interpolation: Multi-Channel Resolution Enhancement
          1.1.2  Super-Resolution Reconstruction: Multi-Frame Resolution Enhancement
     1.2  Contributions

II   COLOR FILTER ARRAY INTERPOLATION USING ALTERNATING PROJECTIONS
     2.1  Introduction
     2.2  Demosaicking Using Alternating Projections
          2.2.1  Inter-Channel Correlation
          2.2.2  Color Filter Array Sampling
          2.2.3  Constraint Sets
          2.2.4  Alternating Projections Algorithm
     2.3  Experimental Results
     2.4  Computational Complexity
     2.5  Extensions to the Algorithm
     2.6  Conclusions

III  SUPER-RESOLUTION FOR COMPRESSED VIDEO
     3.1  Introduction
     3.2  Imaging Model
     3.3  Bayesian Super-Resolution Reconstruction
     3.4  Experimental Results
     3.5  Conclusions

IV   SUPER-RESOLUTION FOR FACE RECOGNITION
     4.1  Introduction
     4.2  Dimensionality Reduction for Face Recognition
     4.3  Face-Space Super-Resolution Reconstruction
          4.3.1  Imaging Model
          4.3.2  Reconstruction Algorithm
     4.4  Computational Complexity
     4.5  Experimental Results
     4.6  Conclusions

V    JOINT SPATIAL AND GRAY-SCALE ENHANCEMENT
     5.1  Introduction
     5.2  Imaging Model
     5.3  Joint Gray-Scale and Spatial Domain Enhancement
          5.3.1  Constraint Sets from Amplitude Quantization
          5.3.2  Projection Operations
          5.3.3  Complete Algorithm
          5.3.4  Implementation Details
     5.4  Experimental Results
     5.5  Conclusions

VI   CONCLUSIONS AND FUTURE WORK

APPENDIX A — CONVEXITY OF THE CONSTRAINT SETS

REFERENCES

VITA

LIST OF TABLES

1  Inter-channel correlation in different subbands.
2  Correlation among the full-channel (original) images and bilinearly interpolated (CFA-sampled) images in different subbands.
3  Mean square error comparison of different methods.
4  Mean square error (MSE) comparison of different methods for the Aerial image.
5  Mean square error (MSE) comparison of different methods for the Boat image.

LIST OF FIGURES

1  Image acquisition model.
2  Multi-chip digital cameras split the light into three optical paths; each path has a spectral filter and a sensor array. At the end, three full color channels are produced.
3  Single-chip digital cameras have color filter arrays with specific patterns. The most commonly used pattern is the "Bayer" pattern, in which red, green, and blue filters are placed in a checkerboard fashion.
4  When demosaicking is not performed well, images suffer from color artifacts. (a) Image captured with a multi-chip digital camera. (b) Image captured with a single-chip digital camera; the missing samples are estimated by bilinear interpolation.
5  The green (G) sample at location "7" is estimated using edge-directed interpolation. The other green samples and the chrominance samples can be found in a similar way.
6  Constant color ratio based interpolation is used to find the red (R) sample at location "7" using the green (G) samples. The other red samples and blue samples can be found similarly.
7  Second-order derivatives of red (R) samples are used as the correction terms to find the green (G) sample at location "5". When the missing green pixel is at a blue pixel, the blue values are used.
8  Super-resolution reconstruction algorithms increase the resolution by exploiting the correlation among multiple frames. An application of super-resolution reconstruction is license plate recognition.
9  Image acquisition model used in super-resolution reconstruction. Because the reconstructed signal must be digital, the model includes a discretization block to convert f(x, y) to f(n1, n2); f(n1, n2) is the signal to be reconstructed.
10 Video acquisition model.
11 Each pixel in a low-resolution observation is obtained as a weighted sum of several pixels in the high-resolution image. The weights are determined by the point spread function. The relative motion between the observations can be simulated by warping the high-resolution image.
12 Illustration of super-resolution reconstruction. Samples of different images are given in different colors. These images are registered with sub-pixel accuracy, and the irregular samples are interpolated to a regular grid.
13 In the POCS technique, the initial estimate is projected onto the constraint sets iteratively.
14 Images used in the experiments. (These images are referred to as Image 1 to Image 20 in the thesis, enumerated from left-to-right and top-to-bottom.)
15 Bayer pattern.
16 CFA sampling of the color channels. (a) Frequency support of an image. (b) Spectrum of the sampled green channel. (c) Spectrum of the sampled red and blue channels.
17 Analysis and synthesis filterbanks for one-level decomposition.
18 Convergence for one-level decomposition.
19 Analysis and synthesis filterbanks for two-level decomposition.
20 Convergence for two-level decomposition.
21 Fine tuning of the green channel is done from observed red and blue samples. Initial green pixel estimates corresponding to red and blue observations are combined to form smaller-size green images, which are then updated using red and blue observations.
22 Comparison of the methods for Image 4.
23 Comparison of the methods for Image 6.
24 MPEG compression is appended to the video acquisition.
25 (a) Original Aerial image. (b) Original Boat image.
26 (a) Bilinearly interpolated Aerial image. (Downsampling factor is two.) (b) Reconstructed Aerial image. (Four observations are used.) (c) Bilinearly interpolated Aerial image. (Downsampling factor is four.) (d) Reconstructed Aerial image. (Sixteen observations are used.)
27 (a) Bilinearly interpolated Boat image. (Downsampling factor is two.) (b) Reconstructed Boat image. (Four observations are used.) (c) Bilinearly interpolated Boat image. (Downsampling factor is four.) (d) Reconstructed Boat image. (Sixteen observations are used.)
28 Results for the License Plate sequence; quantization factor is 0.25; enhancement factor is four. (a) to (f) are the images used in reconstruction. (g) The reconstructed image using the DCT-domain MAP algorithm.
29 Results for the Text sequence; quantization factor is 0.25; enhancement factor is two. (a) to (f) are the images used in reconstruction. (g) The reconstructed image using the DCT-domain MAP algorithm.
30 Super-resolution applied as a preprocessing block to face recognition.
31 Super-resolution embedded into eigenface-based face recognition.
32 Error in feature vector computation.
33 (a) Original 40 × 40 image. (b) 10 × 10 low-resolution observation interpolated using nearest neighbor interpolation. (c) 10 × 10 low-resolution observation interpolated using bilinear interpolation. (d) Pixel-domain super-resolution applied. (e) The result of pixel-domain super-resolution reconstruction projected into the face subspace. (f) Representation of the feature vector reconstructed using the eigenface-domain super-resolution in the face subspace.
34 (a) Original 40 × 40 image. (b) 10 × 10 low-resolution observation interpolated using nearest neighbor interpolation. (c) 10 × 10 low-resolution observation interpolated using bilinear interpolation. (d) Pixel-domain super-resolution applied. (e) The result of pixel-domain super-resolution reconstruction projected into the face subspace. (f) Representation of the feature vector reconstructed using the eigenface-domain super-resolution in the face subspace.
35 Effect of feature vector length on performance.
36 Effect of observation noise on performance.
37 Effect of motion estimation error on performance.
38 The imaging model includes effects acting on the gray-scale domain as well as on the spatial domain. In our formulation we will not try to reconstruct q(t; λ; x, y, z) but f(x, y).
39 Pixel intensities are compressed and digitized observations of real-valued quantities. For each intensity level, there is a lower bound Tl(·) and an upper bound Tu(·) within which the input intensity lies. These bounds are used to impose constraints on the reconstructed image.
40 It is possible to increase the gray-scale extent and resolution when there are multiple observations of different range spans.
41 In order to determine the saturation function, the images are first registered in range to form a high-dynamic-range composite image. The saturation function is then determined by comparing the midtones of the composite with the observations.
42 Non-parametric estimation of the camera saturation function. Three images of different exposure times are used to estimate the curve. If more images were used, the estimation would be more accurate and the outliers could be eliminated. Parametric models could also be used with the algorithm.
43 Images from the first sequence. (a) First image. (b) Second image. (c) Third image. (d) Corners detected in the third image.
44 Images from the second sequence. (a) First image. (b) Second image. (c) Third image. (d) Corners detected in the third image.
45 Reconstructed image from the first sequence. The color map on the right shows the extended dynamic range.
46 Reconstructed image from the second sequence. The color map on the right shows the extended dynamic range.
47 Zoomed regions from the first sequence. (a) First image (interpolated by pixel replication). (b) Second image (interpolated by pixel replication). (c) Third image (interpolated by pixel replication). (d) Reconstructed image scaled to the intensity range [0-255].
48 Zoomed regions from the second sequence. (a) First image (interpolated by pixel replication). (b) Second image (interpolated by pixel replication). (c) Third image (interpolated by pixel replication). (d) Reconstructed image scaled to the intensity range [0-255].

SUMMARY

The need to enhance the resolution of a still image or of a video sequence arises frequently in digital cameras, security/surveillance systems, medical imaging, aerial/satellite imaging, scanning and printing devices, and high-definition TV systems. In this thesis, we address several aspects of the resolution-enhancement problem.

We first look into the color filter array (CFA) interpolation problem, which arises because of the patterned sampling of the color channels in single-chip digital cameras. At each pixel location, one color sample (red, green, or blue) is taken, and the missing samples are estimated by a CFA interpolation process. When the CFA interpolation is not performed well, the resulting images suffer from highly visible color artifacts. We demonstrate that there is a high correlation among the color channels and that this correlation varies across frequency components, and we propose an iterative CFA interpolation algorithm that exploits this frequency-dependent inter-channel correlation. The algorithm defines constraint sets based on the observed data and the inter-channel correlation, and employs the projections onto convex sets (POCS) technique to estimate the missing samples.

To increase the resolution further, to subpixel levels, we need to use multiple frames. By using subpixel-accurate motion vector estimates among the observed images, it is possible to reconstruct an image or a sequence of images that has higher spatial resolution than any of the observations. Such a multi-frame reconstruction process is called super-resolution reconstruction. Although there is a large body of work in the area of super-resolution reconstruction, most of it assumes that there is no compression during the imaging process; the input signal (video/image sequence) is assumed to exist in a raw (uncompressed) format. However, because of limited resources (bandwidth, storage space, I/O requirements, etc.), this is rarely the case. We therefore look into the super-resolution problem where compression is part of the imaging process. The most popular image compression standards are based on the discrete cosine transform (DCT). We add a DCT-based compression process to a typical imaging model, and develop a Bayesian super-resolution algorithm for compressed signals. Unlike previous methods, the proposed algorithm handles both the quantization noise and the additive noise in a stochastic framework.

In super-resolution reconstruction, the most commonly used regularization term is spatial smoothness. When we have a model of the image to be reconstructed, we should be able to use that model to improve the reconstruction. Motivated by this idea, we develop a super-resolution reconstruction algorithm for face images. Face images can be represented in a lower dimension (the so-called face space) through the application of the Karhunen-Loeve Transform (KLT). Since the KLT is a linear transform, it can be embedded into a linear observation model easily. In the end, we obtain a super-resolution algorithm where the reconstruction is performed in a reduced-dimensional face space instead of the pixel domain. The resulting algorithm has lower computational complexity and turns out to be more robust to noise.

Finally, we question the sufficiency of the imaging models used in super-resolution algorithms. All the work done in the area of super-resolution assumes that there are no changes in illumination. However, the observed data may have gray-scale diversity because of changes of illumination in the scene or imaging device adjustments such as exposure time, gain, or white balance. All of these potential changes should be considered for an effective and robust reconstruction. We include such effects in our imaging model and develop a reconstruction algorithm that handles the illumination changes and improves both spatial and gray-scale resolution.


CHAPTER I

INTRODUCTION

Information fusion is the process of combining data from multiple sources such that the resulting entity or decision is in some sense better than would be possible if any of the sources were used alone. When the data coming from multiple sources are images, the process is called image fusion. In image fusion, the data may differ in sensor type, viewing condition, camera position, or capture time; and depending on the synergy of the information inherent in the data, it is possible to enhance the spatial, temporal, and/or spectral resolution, extract the 3D structure, and improve the decision-making (detection, classification, recognition, etc.) performance.

Among the image fusion problems, spatial resolution enhancement is the most active research area, mainly because of its wide variety of application areas. With the development of visual communications and image processing applications, there is a high demand for high-resolution images, not only to give the viewer a high-quality picture but also to provide additional detail that may be critical in various applications. Digital cameras, surveillance systems, medical imaging, aerial/satellite imaging, scanning/printing devices, and high-definition TV systems are some of the application areas where high-resolution images are desired. For example, high-resolution images are required in medical imaging to make correct diagnostic and operational decisions. Surveillance systems require high-resolution images to recognize faces, license plates, etc. In aerial/satellite imaging, high-resolution images are required to resolve small objects and to make correct detection/classification decisions.

The most direct way of increasing spatial resolution is to increase the number of sensor sites per unit area. Although this can be achieved by reducing the sensor size and placing the sensors more densely, the cost of producing such sensor arrays may not be acceptable for general-purpose commercial applications. More importantly, as the sensor size decreases, the image quality degrades because of shot noise. This shot noise is due to the inherent quantum uncertainty in the electron-hole pair generation process and remains roughly the same with reduced sensor size, whereas the signal power decreases proportionally to the sensor size reduction. It is estimated that for a CMOS-based sensor, the optimal sensor area is about 40 µm², which has almost been achieved with the current technology. An alternative approach is to use signal processing techniques to improve the spatial resolution. When there are multiple observations of the scene, such as different spectral components or successive video frames, it is possible to increase the spatial resolution by exploiting the correlation among those multiple observations, which is the focus of this thesis.

In the remainder of this chapter, we first investigate a typical imaging process, which will help us to understand not only the sources of spatial resolution loss but also the other crucial steps of an image acquisition process. We will then focus on the spatial resolution enhancement problem and provide a comprehensive survey of the state of the art. We will conclude the chapter with a summary of our contributions. In the next four chapters, we explain our contributions in detail. Finally, we conclude the thesis with future research directions in Chapter 6.

1.1 Spatial Resolution Enhancement Problem

An image is a two-dimensional projection of a real-valued photoquantity q(t; λ; x, y, z) that is a function of time (t), spectral wavelength (λ), and space (x, y, z). During the imaging process, this photoquantity is degraded in several ways. The degradation may occur on the domain (t; λ; x, y, z) (such as projection from 3D to 2D, spectral filtering, temporal sampling, spatial sampling, and blurring) and/or on the range q (such as quantization and noise).

Consider a typical digital image acquisition process, which is illustrated in Figure 1. The camera collects light through its aperture and optical system, and passes it through a spectral filter to form a continuous image on its sensor array. The continuous image is a two-dimensional projection of a three-dimensional scene; it is limited in spatial extent; and it has been filtered spectrally. In a charge-coupled-device (CCD) camera, there is a rectangular grid of electron-collection sites laid over a silicon wafer to record a measure of the amount of light energy reaching each of them. When photons strike these sensor sites, electron-hole pairs are generated, and the electrons generated at each site are collected over a certain period of time. The numbers of electrons are eventually converted to pixel values.

Figure 1: Image acquisition model for the photoquantity q(t; λ; x, y, z).

The time period over which the sensors are exposed to light is controlled by an exposure timer. Because of the shot noise, the signal-to-noise ratio is low for short exposure times. When the exposure time is kept too long, there may be saturation of the number of electrons that are stored at each sensor site. Another problem associated with the exposure time is motion blur. If there is a fast-moving object in the scene, the light coming from that object is accumulated over different sensor sites during the exposure interval. The result in the captured image is a blur along the trajectory of the object. In the case of video capture, the exposure is repeated at a certain frame rate; that is, the photoquantity q(t; λ; x, y, z) is sampled in time (t). The sensors are placed on the sensor array with a certain spatial density, and each has nonzero physical dimensions. The result is further loss of information in the spatial domain in the form of spatial resolution reduction and blurring. There is also information loss in the range q of the photoquantity q(t; λ; x, y, z). The sensors capture a certain portion of the dynamic range, and each pixel intensity is represented by a certain number of bits. Moreover, the pixel intensities may be corrupted by noise as a result of thermal/quantum effects and quantization.

To produce color pictures, the image acquisition process explained above is modified in one of two ways. The first method uses beam-splitters to split the light into several optical paths and uses different spectral filters on each path to capture different spectral components. This is illustrated in Figure 2. The second method uses a mosaic of color filters to capture only one spectral component at each sensor site. The color samples obtained with such a color filter array must then be interpolated to estimate the missing samples. Such a single-sensor imaging system is illustrated in Figure 3. Although the latter method causes a further loss of spatial resolution in each color channel, it is usually the preferred one because of its simplicity and lower cost.

Figure 2: Multi-chip digital cameras split the light into three optical paths; each path has a spectral filter and a sensor array. At the end, three full color channels are produced.

Figure 3: Single-chip digital cameras have color filter arrays with specific patterns. The most commonly used pattern is the "Bayer" pattern, in which red, green, and blue filters are placed in a checkerboard fashion.

To sum up, during an image acquisition process, spatial resolution may decrease because of color filter array (CFA) sampling, optical blur (defocus, diffraction limit, etc.), sensor blur (due to the physical area of each sensor), sensor density (distance between the sensor sites), and motion blur (due to the nonzero exposure time). Reconstruction of a high-resolution image from the observed data is a highly ill-posed problem. When there are multiple sources that provide some sort of diversity, the complementary information in those sources can be combined to enhance the resolution. In CFA interpolation, the missing pixel samples are estimated, and the correlation among the color channels can be exploited. To increase the resolution further to subpixel levels, that is, to recover the resolution loss caused by spatial sampling and blurring, we need to exploit the correlation among multiple frames. Such multi-frame resolution enhancement is referred to as super-resolution reconstruction. Although it is possible to treat CFA interpolation and super-resolution reconstruction together, they have been treated separately in the literature. As a result of this division, we review CFA interpolation and super-resolution reconstruction in two subsections.

1.1.1 CFA Interpolation: Multi-Channel Resolution Enhancement

The human visual system is sensitive to a small portion (330 nm - 730 nm) of the electromagnetic spectrum. In a human eye, there are two types of photoreceptors: rods and cones. The rods are sensitive to light, whereas the cones provide color vision. There are three types of cone receptors; they have different spectral sensitivities, and they are occasionally called
blue, green, and red cones because of their peak sensitivities. When the purpose is to produce a color picture with an imaging device, three spectral filters (e.g., red, green, and blue) provide sufficient perceptual quality. As mentioned earlier, there are two ways of producing a color picture with a digital camera. One approach is to use beam-splitters to form three optical paths, and to place different spectral filters on each path to detect a particular color channel. (See Figure 2.) This is a perfectly valid approach, and some professional digital cameras use this design. The problem is that this is an expensive solution, and this kind of design results in larger cameras. An alternative approach is to use a single sensor array and to cover the surface of the sensors with a mosaic of spectral filters, which is called a color filter array (CFA). Most of the commercially available digital cameras are CFA-based single-chip cameras. The problem with this approach is that there is only one color sample at each sensor site, and the missing color samples must be estimated using the neighboring pixels. Because of the mosaic pattern of the color samples, this CFA interpolation problem is also referred to as “demosaicking” in the literature. The most commonly used pattern is the “Bayer” pattern, which is shown in Figure 3. In a Bayer pattern, green samples are obtained on a quincunx lattice (checkerboard pattern), and red and blue samples are obtained on rectangular lattices. The green channel is more densely sampled than the red and blue channels because the spectral response of a green filter is similar to the human eye’s luminance frequency response. That is why the green channel is usually referred to as the luminance channel. The simplest way of estimating the missing samples is to use basic interpolation techniques, such as bilinear interpolation and bicubic interpolation.

Although these non-adaptive algorithms (e.g., bilinear interpolation, bicubic interpolation) can provide satisfactory results in smooth regions of an image, they usually fail in high-frequency regions, especially along edges. An example where bilinear interpolation is used in demosaicking is shown in Figure 4. As seen in that figure, there are highly visible color artifacts in the high-frequency regions of the image.

Figure 4: When demosaicking is not performed well, images suffer from color artifacts. (a) Image captured with a multi-chip digital camera. (b) Image captured with a single-chip digital camera; the missing samples are estimated by bilinear interpolation.

A better approach is to employ an adaptive strategy. One example is edge-directed interpolation. Performing an edge detection for each pixel in question and averaging along the edges rather than across them works better than the non-adaptive approach. An edge-directed interpolation algorithm is illustrated in Figure 5. In [34], first-order horizontal and vertical gradients are computed at each missing green location on the Bayer pattern. If the horizontal gradient is larger than the vertical gradient, suggesting a possible edge in the horizontal direction, interpolation is performed along the vertical direction. If the vertical gradient is larger than the horizontal gradient, interpolation is performed only in the horizontal direction. When the horizontal and vertical gradients are equal, the green value is obtained by averaging its four neighbors. It is also possible to compare the gradients against a predetermined threshold value. This approach works only if the edge is horizontal or vertical; when the edge is diagonal, there is little that can be done, since there are no green samples along the diagonal.

Figure 5: The green (G) sample at location "7" of a 4 × 4 Bayer neighborhood (locations numbered 1-16) is estimated using edge-directed interpolation: compute the horizontal gradient H = |G6 - G8| and the vertical gradient V = |G3 - G11|; if H > V, G7 = (G3 + G11)/2; if H < V, G7 = (G6 + G8)/2; otherwise G7 = (G3 + G11 + G6 + G8)/4. The other green samples and the chrominance samples can be found in a similar way.
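To make the rule in Figure 5 concrete, here is a minimal sketch (not taken from the thesis) of the gradient test for one missing green sample; it assumes the raw Bayer data are stored in a single 2-D NumPy array and that (r, c) is a red or blue site whose four horizontal/vertical neighbors carry green samples.

```python
import numpy as np

def edge_directed_green(mosaic, r, c):
    """Estimate the missing green value at (r, c) of a Bayer mosaic."""
    # Cast to float to avoid unsigned-integer wrap-around in the differences.
    left, right = float(mosaic[r, c - 1]), float(mosaic[r, c + 1])
    up, down = float(mosaic[r - 1, c]), float(mosaic[r + 1, c])

    h = abs(left - right)   # horizontal gradient H
    v = abs(up - down)      # vertical gradient V

    if h > v:               # intensity changes horizontally: average vertically
        return (up + down) / 2.0
    if h < v:               # intensity changes vertically: average horizontally
        return (left + right) / 2.0
    return (up + down + left + right) / 4.0

# Example with hypothetical 8-bit Bayer data:
# bayer = np.array(..., dtype=np.uint8)
# g = edge_directed_green(bayer, 5, 6)
```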

For natural images, better performance can be achieved by exploiting the correlation among the color channels. It is usually assumed that the color ratios (or differences) within a small neighborhood are constant. This is a decent assumption for most portions of an image. Demosaicking methods that are based on this assumption are called constant-hue-based interpolation methods. The constant-hue-based approach is the most common approach to interpolating the chrominance (red and blue) channels [20, 78, 1]. As a first step, these algorithms interpolate the luminance (green) channel, which is done using bilinear or edge-directed interpolation. The chrominance channels are then estimated from the bilinearly interpolated "red hue" (red-to-green ratio) and "blue hue" (blue-to-green ratio). To be more explicit, the interpolated "red hue" and "blue hue" values are multiplied by the green value to determine the missing red and blue values at a particular pixel location. (See Figure 6.) Instead of interpolating the hue, it is also possible to interpolate the logarithm of the hue.

An extension to these approaches is to use larger regions around the pixel in question and to use more complex predictors. In [42], Laroche and Prescott proposed a different version of the edge-directed interpolation given in [34]. They used the chrominance channels (in the 5 × 5 neighborhood of the pixel in question) instead of the luminance channel to determine the gradients. In order to determine the horizontal and vertical gradients at a blue (red) sample, second-order derivatives of blue (red) values are computed in the corresponding direction. The red and blue channels are interpolated as in the constant-hue-based interpolation approach, but this time the color differences are interpolated instead of the color ratios.

Figure 6: Constant color ratio based interpolation is used to find the red (R) sample at location "7" using the green (G) samples: first obtain the full green channel (using edge-directed or bilinear interpolation), then set R7 = G7 × (R2/G2 + R4/G4 + R10/G10 + R12/G12)/4. The other red samples and the blue samples can be found similarly.
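A minimal sketch of the constant color ratio rule of Figure 6 follows (illustrative only, not the thesis implementation); it assumes `mosaic` holds the raw Bayer samples, `green` is an already interpolated full green channel, and the four diagonal neighbors of (r, c) carry red samples.

```python
def constant_hue_red(mosaic, green, r, c):
    """Estimate a missing red value at (r, c) from the red-to-green ratios
    ("red hue") of the four diagonal red neighbors, as in Figure 6."""
    eps = 1e-6  # guard against division by zero in dark regions
    hues = []
    for dr, dc in ((-1, -1), (-1, 1), (1, -1), (1, 1)):
        rr, cc = r + dr, c + dc
        hues.append(float(mosaic[rr, cc]) / (float(green[rr, cc]) + eps))
    # Constant-hue assumption: red/green is locally constant.
    return green[r, c] * (sum(hues) / len(hues))
```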

In [31], Hamilton and Adams used second-order derivatives of the chrominance samples as correction terms in the green channel interpolation. To determine the gradient at a blue (red) sample location, the second-order derivative of the blue (red) pixel values is added to the first-order derivative of the green values. The second-order derivative of the blue (red) pixels is also added to the average of the green values in the minimum-gradient direction. (See Figure 7 for an illustration.) The red and blue channels are interpolated similarly, with second-order green derivatives used as the correction terms.

There are also some other variations that determine edge directions and exploit similar texture in the color channels. In [15], Chang et al. applied interpolation using a threshold-based variable number of gradients. In that approach, a set of gradients is computed in the 5 × 5 neighborhood of the pixel under consideration. A threshold is determined for those gradients, and the missing value is computed using the pixels corresponding to the gradients that pass the threshold. A similar algorithm was proposed in [79], where the green channel is used to determine the pattern at a particular pixel, and then a missing red (blue) pixel value is estimated as a weighted average of the neighboring pixels according to the pattern. Another texture-based algorithm was proposed in [19], where the texture is classified as edge, stripe, or corner, and the interpolation is done accordingly.

In a recent paper, Kimmel combined the constant-hue-based interpolation and edge-directed interpolation approaches in an iterative scheme [40]. In his algorithm, first-order derivatives of the green channel are used to compute edge indicators in eight possible directions. Hue values are interpolated using these edge indicators, and missing color intensities are determined according to the interpolated hues.

Figure 7: Second-order derivatives of red (R) samples are used as correction terms to find the green (G) sample at location "5": compute H = |G4 - G6| + |R5 - R3 + R5 - R7| and V = |G2 - G8| + |R5 - R1 + R5 - R9|; if H > V, G5 = (G2 + G8)/2 + (R5 - R1 + R5 - R9)/4; if H < V, G5 = (G4 + G6)/2 + (R5 - R3 + R5 - R7)/4; otherwise G5 = (G2 + G8 + G4 + G6)/4 + (R5 - R1 + R5 - R9 + R5 - R3 + R5 - R7)/8. When the missing green pixel is at a blue pixel, the blue values are used.
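The second-order correction of Figure 7 can be sketched as follows; this is an illustrative rendering of the Hamilton-Adams rule, not code from the thesis, under the assumption that the same-color (red) neighbors lie two pixels away along each row and column of the Bayer mosaic.

```python
def corrected_green_at_red(mosaic, r, c):
    """Hamilton-Adams-style green estimate at a red location (r, c).

    Gradients combine first-order green differences with second-order red
    differences; the red Laplacian in the chosen direction is used as a
    correction term. At a blue site, blue values are used in place of red.
    """
    m = lambda rr, cc: float(mosaic[rr, cc])
    h = abs(m(r, c - 1) - m(r, c + 1)) + abs(2 * m(r, c) - m(r, c - 2) - m(r, c + 2))
    v = abs(m(r - 1, c) - m(r + 1, c)) + abs(2 * m(r, c) - m(r - 2, c) - m(r + 2, c))
    if h > v:    # interpolate vertically, correct with the vertical red term
        return (m(r - 1, c) + m(r + 1, c)) / 2 + (2 * m(r, c) - m(r - 2, c) - m(r + 2, c)) / 4
    if h < v:    # interpolate horizontally, correct with the horizontal red term
        return (m(r, c - 1) + m(r, c + 1)) / 2 + (2 * m(r, c) - m(r, c - 2) - m(r, c + 2)) / 4
    return (m(r - 1, c) + m(r + 1, c) + m(r, c - 1) + m(r, c + 1)) / 4 \
        + (4 * m(r, c) - m(r - 2, c) - m(r + 2, c) - m(r, c - 2) - m(r, c + 2)) / 8
```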

The color channels are then updated iteratively to obey the color-ratio rule. He also proposed an inverse diffusion process to enhance the images further.

All the demosaicking methods reviewed so far are heuristic approaches; the inter-channel correlation is utilized in simple ways (such as edge determination and constant color ratios) without any in-depth analysis. On the other hand, there are a few approaches that look at the problem from a more realistic perspective. In [67] and [74], the CFA samples are modeled as observations obtained from a spatially and spectrally continuous scene, and a least mean square estimate (LMSE) of the original scene is computed. In [67], wide-sense stationary models are assumed for the noise and image priors. The problem is that the LMSE is computationally complex and the underlying assumptions may not always be valid. Another reasonable idea is proposed in [28]: based on the fact that the green channel is less likely to be aliased than the red and blue channels, Glotzbach et al. proposed to decompose the green channel into its frequency components and then add the high-frequency components of the green channel to the low-pass filtered red and blue channels. This is based on the idea that the high-frequency components of the red, blue, and green channels are similar. There are, however, a couple of problems with this approach. First of all, it is not possible to construct ideal low-pass and high-pass filters with finite spatial extent. Secondly, the high-frequency components of the red, green, and blue channels may not be identical; therefore, it is not a good idea to simply replace the high-frequency components of the red and blue channels with those of the green channel. And finally, this method does not guarantee that the reconstructed channels are consistent with the observed data.

1.1.2 Super-Resolution Reconstruction: Multi-Frame Resolution Enhancement

In the demosaicking problem, we are trying to achieve pixel-level resolution; that is, we are estimating the missing pixel intensities of the color channels. This brings up the next question: Is it possible to achieve subpixel-level resolution? The answer is "yes" when there are multiple images that are slightly different from each other. By "slightly different images" we mean images of the same scene with subpixel-level relative motion among them. The diversity of information among those images can be exploited (by using subpixel-accurate motion estimates) to increase the resolution beyond the measured pixel levels; such multi-frame resolution enhancement is referred to as super-resolution reconstruction in the literature.

The need to enhance the resolution of a still image extracted from a video, or of the video sequence itself, arises frequently in security/surveillance systems, medical imaging, aerial/satellite imaging, scanning and printing devices, and high-definition TV systems. (See Figure 8 for an illustration of super-resolution reconstruction.) In aerial/satellite imaging, small objects cannot be properly resolved because of the vast distances involved. When a video of the scene is available, super-resolution reconstruction algorithms can be used to resolve details that would otherwise be impossible to recover. In medical imaging, super-resolution reconstruction can help in diagnosis and operational decisions by producing high-resolution images. The resolving power of scanners can be improved in a cost-effective way by super-resolution reconstruction: the scanner can capture multiple images that are slightly different from each other because of a predetermined shift of the sensor array, and then apply a super-resolution reconstruction algorithm to produce a higher-resolution image. Another application area of super-resolution is high-definition TV (HDTV) systems. A standard NTSC video has 480 vertical lines; an HDTV signal, on the other hand, can have up to 1080 vertical lines. The standard TV signal can be improved by super-resolution reconstruction to match the quality and resolution of HDTV displays. A related super-resolution problem arises when we need to create an enhanced-resolution still image from a video sequence, as when printing stills from video sources. The human visual system requires a higher resolution for a still image than for a sequence of frames to achieve the same perceptual quality. An NTSC video yields 480 vertical lines, whereas more than twice as many lines are required to print with reasonable resolution on modern 1200-dpi printers.

Figure 8: Super-resolution reconstruction algorithms increase the resolution by exploiting the correlation among multiple frames: a low-resolution video sequence is fused into a single reconstructed image. An application of super-resolution reconstruction is license plate recognition.

As we have seen, during the image acquisition process, the spatial resolution decreases because of optical blur, sensor blur, sensor density, and motion blur. Super-resolution reconstruction algorithms model these processes and establish a relation between the high-resolution original image (which we are trying to reconstruct) and the low-resolution observations obtained from it. The super-resolution problem then becomes the inverse problem of estimating the original image from multiple low-resolution observations.

Before reviewing the basic super-resolution approaches, we first review a typical image acquisition model used in super-resolution reconstruction. This model, depicted in Figure 9, is derived from the general image acquisition model presented in Figure 1. According to the model in Figure 9, a spatially and temporally continuous input signal f(x, y, t) is sampled in time to form a spatially continuous image f(x, y). Here, (x, y) represents the continuous spatial coordinates, and t represents time.

Figure 9: Image acquisition model used in super-resolution reconstruction: f(x, y, t) is sampled in time to give f(x, y), which is converted from continuous to discrete (without aliasing) to give f(n1, n2), and then blurred, downsampled, and corrupted by noise n(l1, l2) to produce g(l1, l2). Because the reconstructed signal must be digital, the model includes a discretization block to convert f(x, y) to f(n1, n2); f(n1, n2) is the signal to be reconstructed.

Because we are dealing with digital images, this continuous image is converted to a discrete image f(n1, n2). Here, (n1, n2) are the discrete spatial coordinates, and f(n1, n2) is the high-resolution image that we are trying to reconstruct through super-resolution reconstruction. f(n1, n2) is then distorted by sensor and optical blurs. Sensor blur is caused by integrating the received light over the finite nonzero sensor cell area, while optical blur includes optical distortions such as defocus. It is also possible to include motion blur, which is due to the nonzero shutter time, in the blurring process. The blurred image is then downsampled to account for the insufficient sensor density. There may also be additive noise causing further degradation. The final image is represented by g(l1, l2), where (l1, l2) are the discrete spatial coordinates.

We now extend this image acquisition model to a video acquisition model by incorporating the relative motion among different frames of a video. The model is depicted in Figure 10. f(n1, n2) is the high-resolution discrete image we want to reconstruct, and gi(l1, l2) are the low-resolution frames that are observed; i is the frame number, and M is the total number of frames. The warping block accounts for the relative motion between the observations, which can be global as well as dense. The rest of the model (blurring, downsampling, and additive noise) is the same as in the image acquisition model. All the processes in this video acquisition model are linear, and the relationship between the high-resolution image f(n1, n2) and the recorded low-resolution images gi(l1, l2) can be formulated as follows [49]:

g_i(l_1, l_2) = \sum_{n_1, n_2} h_i(l_1, l_2;\, n_1, n_2)\, f(n_1, n_2) + n_i(l_1, l_2),    (1)

where hi(l1, l2; n1, n2) is the linear mapping that includes the warping, blurring, and downsampling operations, and ni(l1, l2) is the additive noise. The high-resolution and low-resolution sampling lattice indices (i.e., pixel coordinates) are (n1, n2) and (l1, l2), respectively. Note that (1) provides a linear set of equations that relates the high-resolution image to the low-resolution frames gi(l1, l2) for different values of i.

Figure 10: Video acquisition model: the high-resolution image f(n1, n2) is warped, blurred, downsampled, and corrupted by additive noise ni(l1, l2) to produce each low-resolution observation gi(l1, l2), i = 1, ..., M.

The relation in (1) can also be expressed in a simpler matrix-vector notation, which we will use in the remainder of this thesis. Letting f, g(i), and n(i) denote the lexicographically ordered versions of f(n1, n2), gi(l1, l2), and ni(l1, l2), respectively, we write

g^{(i)} = H^{(i)} f + n^{(i)}, \qquad i = 1, \ldots, M,    (2)

where M is the total number of observations and H(i) is a matrix constructed from the blur mapping hi(l1, l2; n1, n2). (Although it is possible to construct this matrix explicitly, a more convenient way is to think of H(i) as a linear operator that applies the warping, blurring, and downsampling operations successively; Figure 11 helps in understanding this approach. We will elaborate on this issue in Section 3.4.)

Figure 11: Each pixel in a low-resolution observation is obtained as a weighted sum of several pixels in the high-resolution image. The weights are determined by the point spread function. The relative motion between the observations can be simulated by warping the high-resolution image.
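As an illustration of the operator view of H(i), the sketch below (not the thesis implementation) composes the three stages with SciPy routines; it assumes the motion is a global translation (dx, dy) and approximates the combined sensor/optical blur by a Gaussian point spread function.

```python
import numpy as np
from scipy.ndimage import shift, gaussian_filter

def apply_H(f, dx, dy, psf_sigma=1.0, factor=2):
    """Simulate one low-resolution observation g_i = H_i f.

    f          : high-resolution image (2-D array)
    (dx, dy)   : assumed global sub-pixel translation of frame i (warping)
    psf_sigma  : Gaussian approximation of the sensor/optical blur
    factor     : downsampling factor of the sensor grid
    """
    warped = shift(f, (dy, dx), order=3, mode="nearest")   # warping
    blurred = gaussian_filter(warped, psf_sigma)           # blur
    return blurred[::factor, ::factor]                     # downsampling

# Example: four shifted observations of the same (hypothetical) scene
# f = np.random.rand(128, 128)
# obs = [apply_H(f, dx, dy) for dx, dy in [(0, 0), (0.5, 0), (0, 0.5), (0.5, 0.5)]]
```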

We now review the basic super-resolution reconstruction approaches. A simple form of the super-resolution idea is illustrated in Figure 12: multiple images are registered (with subpixel accuracy) on top of each other, and the problem becomes interpolation from nonuniform samples. The nonuniform samples are interpolated to a regular grid, which is finer than the grids of the input images. In a real-life scenario, the problem is more complicated because of non-global motion, occlusion, blurring, nonzero aperture intervals, and illumination changes.

Another early approach to super-resolution reconstruction was proposed by Tsai and Huang [75]. They used the shifting property of the Fourier transform and the aliasing relationship between the continuous Fourier transform (CFT) of a high-resolution image and the discrete Fourier transform (DFT) of observed low-resolution images to derive a set of equations relating the aliased DFT coefficients of the observations to the samples of the CFT coefficients of the unknown image. This work was followed by similar frequency-domain approaches [38, 39, 72, 77, 63]. In [38], the idea of Tsai and Huang was extended to blurred and noisy images, which resulted in a weighted least squares formulation. The approach was further refined to include cases where the low-resolution images have different blur and noise characteristics [39]. Ur and Gross [77] presented an algorithm similar to the one in [75], where the interpolation step is implemented in the spatial domain and a deblurring step is included as post-processing.

Figure 12: Illustration of super-resolution reconstruction: registered low-resolution images (left) are interpolated to the high-resolution grid (right). Samples of different images are given in different colors. These images are registered with sub-pixel accuracy, and the irregular samples are interpolated to a regular grid.

Although the frequency-domain approach is theoretically simple and convenient for parallel implementation, it is limited to enhancement in the presence of linear shift-invariant blurs and global translational motion. It is also difficult to use spatial-domain a priori knowledge (such as spatial smoothness) for regularization. There are more realistic approaches that allow for linear shift-variant blur and arbitrary motion between the frames, and that enable the use of various regularization techniques easily. We can examine these methods in two groups: stochastic and deterministic.

The first group of methods is based on a statistical formulation, such as maximum likelihood (ML) or maximum a posteriori probability (MAP) estimation [16, 57, 7, 70]. The high-resolution image to be estimated and the additive sensor noise are assumed to be stochastic variables. In the MAP formulation, the observations g(i), the original high-resolution image f, and the additive noise n(i) are all assumed to be random processes.

Denoting p(f | g^{(1)}, \cdots, g^{(M)}) as the conditional probability density function (PDF), the MAP estimate \hat{f} is given by

\hat{f} = \arg\max_{f} \left\{ p\left(f \mid g^{(1)}, \cdots, g^{(M)}\right) \right\}.    (3)

Using the Bayes rule, equation (3) can be rewritten as

\hat{f} = \arg\max_{f} \left\{ p\left(g^{(1)}, \cdots, g^{(M)} \mid f\right) p(f) \right\}.    (4)

In order to find the MAP estimate \hat{f}, the conditional PDF p(g^{(1)}, \cdots, g^{(M)} | f) and the prior PDF p(f) need to be modeled. Gaussian models and Markov Random Fields are the most commonly used statistical distributions. Using convex energy functions in the priors ensures uniqueness of the solution and allows for efficient gradient descent techniques in optimization. In [16], Cheeseman et al. applied the MAP estimation technique to obtain high-resolution satellite images. The original high-resolution images were assumed to be spatially smooth, which is incorporated into the estimation problem through a Gaussian prior. The additive sensor noise was also taken as a Gaussian process. In [57], Schultz and Stevenson used a Markov Random Field model for the high-resolution image, aiming to preserve the edges in the reconstruction by means of a Huber edge penalty function. Borman and Stevenson [7] extended the approach in [57] to incorporate temporal smoothness constraints in the prior image model. In these methods, motion parameters are obtained as a preprocessing step. In [70], Tom and Katsaggelos proposed an expectation-maximization (EM) algorithm to estimate the motion parameters, noise variances of each image, and the high-resolution image simultaneously. A similar joint estimation method was also presented in [32].
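For Gaussian noise and a Gaussian smoothness prior, maximizing (4) amounts to minimizing a quadratic cost, which the following sketch does by plain gradient descent. This is only an illustration of the idea, not the algorithm of any of the cited papers; H_ops and Ht_ops are assumed to be user-supplied callables implementing each H(i) and its adjoint (for example, the hypothetical apply_H above and a matching transpose).

```python
import numpy as np
from scipy.ndimage import laplace

def map_sr(observations, H_ops, Ht_ops, shape, lam=0.01, step=0.1, iters=50):
    """Gradient-descent MAP estimate with a quadratic smoothness prior.

    Approximately minimizes sum_i ||g_i - H_i f||^2 + lam * ||L f||^2,
    the negative log-posterior for Gaussian noise and a Gaussian
    (smoothness) prior; constant factors are absorbed into lam and step.
    """
    f = np.zeros(shape)                           # could also start from an interpolation
    for _ in range(iters):
        grad = lam * laplace(laplace(f))          # prior term: L^T L f
        for g, H, Ht in zip(observations, H_ops, Ht_ops):
            grad += Ht(H(f) - g)                  # data term: H_i^T (H_i f - g_i)
        f -= step * grad                          # gradient-descent update
    return f
```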

17

of acceptable solutions [21]. Consistency with observed data is one of the constraint sets used in super-resolution reconstruction. Let x(n1 , n2 ) be an estimate of the high-quality image f (n1 , n2 ). Then, the observed data consistency can be imposed by the following constraint set: (

C [gi (l1 , l2 )] =

x(n1 , n2 ) : |gi (l1 , l2 ) −

X

)

hi (l1 , l2 ; n1 , n2 )x(n1 , n2 )| ≤ Ti (l1 , l2 ) ,

(5)

n1 ,n2

By projecting an initial estimate onto these constraint sets iteratively, a solution closer to the original signal is obtained. POCS-based methods are effective and easy to implement, but they do not guarantee uniqueness of the solution. (See Figure 13 for an illustration of the POCS technique.)

In [64], Stark and Oskoui proposed a POCS-based super-resolution reconstruction to reduce the blur introduced by low-resolution sensors. Tekalp et al. extended this POCS formulation to include sensor noise [69]. In [50], Patti et al. further extended this approach by incorporating the time-varying motion blur induced by the motion of the camera, and an arbitrary sampling lattice, into the imaging model. In [35], Irani and Peleg used the method of iterated back-projections, where the high-resolution image is estimated by back-projecting the difference between the observed low-resolution images and simulated low-resolution images (generated from the current high-resolution image estimate). Their method assumed translational and rotational motion between the low-resolution frames. Later, [45] and [36] extended this method to more general motion models. A similar method proposed by Komatsu et al. [41] used a Landweber iteration technique. In addition to these approaches, there are also some other methods. In [23], Elad and Feuer proposed a hybrid method that applies the POCS and stochastic estimation approaches iteratively. Although it is computationally complex, this approach ensures uniqueness of the solution, in contrast to the POCS approach. Elad and Feuer also proposed adaptive filtering [25] and Kalman filtering [25] approaches to the super-resolution problem.

Although the super-resolution problem has been studied extensively, there are still shortcomings in the existing super-resolution methods. All the methods reviewed so far are based on the assumption that there is no compression during the imaging process.

Figure 13: In the POCS technique, the initial estimate is projected onto the constraint sets iteratively; the space of acceptable solutions is the intersection of the constraint sets.

The input signal (video/image sequence) is assumed to exist in a raw format instead of a compressed format. However, because of limited resources (bandwidth, storage space, I/O requirements, etc.), compression has become a standard component of almost every data communication application. Unfortunately, super-resolution algorithms designed for uncompressed data do not perform well when directly applied to decompressed image sequences. The reason is that the quantization noise may be the dominant source of error, especially at high compression rates. Therefore, there is definitely a need to develop super-resolution algorithms designed for compressed data.

A second shortcoming of current super-resolution algorithms is that the regularization constraints are limited. Spatial smoothness and pixel intensity range are the two most commonly used constraints. Depending on the image to be reconstructed, there must be other ways of regularization. For example, face images can be modeled and represented in a lower-dimensional space through the application of the Karhunen-Loeve Transform (KLT). We must be able to develop super-resolution algorithms that can exploit such model-based information.


Another major shortcoming of the current super-resolution algorithms is that illumination changes and camera settings that affect the gray-scale levels (such as exposure time, gain, and white balance adjustment) are not considered in the imaging models. Such changes are likely to occur during an imaging process, and there is a clear need for imaging models and super-resolution algorithms that can handle them.

1.2 Contributions

We have made several contributions in the area of resolution enhancement, which are collected in four chapters. We first look into the problem of color filter array sampling and propose a totally new approach to estimate missing color samples. We then aim to increase the resolution further by exploiting the correlation among multiple images. We address the case when there is compression during the imaging process. We then propose a super-resolution algorithm for face images, where model-based information is used in regularization. We finally explore the shortcomings of traditional imaging models used in the resolution enhancement problem and develop a new model and a super-resolution algorithm based on that model. In the next four chapters, we detail these contributions. These chapters are as follows. In Chapter 2, we present a color filter array (CFA) interpolation algorithm. Most of the CFA interpolation methods are heuristic approaches that do not utilize the inter-channel correlation effectively. We have developed a projections onto convex sets (POCS) based algorithm that exploits both the frequency-dependent correlation among the color channels and the observed data to improve the spatial resolution. We have compared our algorithm with several state-of-the-art algorithms available in the literature, and ours turned out to be the best one. In Chapter 3, we propose a super-resolution reconstruction algorithm for compressed video. Most of the work done in the area of super-resolution reconstruction do not consider compression, and assume that images are available in their raw (uncompressed) format.


Super-resolution algorithms designed for uncompressed data do not consider the quantization noise, which may become the dominant source of error at high compression rates. We propose a Bayesian super-resolution algorithm that models compression, and exploits the quantization step size information (available in the data bitstream) effectively in reconstruction. The proposed algorithm allows us to use the statistical information about the quantization noise and the additive noise at the same time. In Chapter 4, we present a model-based super-resolution algorithm to improve face recognition performance. Face images that are captured by surveillance cameras usually have a very low resolution, which significantly limits the performance of face recognition systems. The immediate solution to this problem is to apply super-resolution reconstruction on the face images, and then pass the result to the recognition system. Considering that most state-of-the-art face recognition systems use an initial dimensionality reduction method, we proposed embedding the super-resolution algorithm into the face recognition system so that super-resolution is not performed in the pixel domain, but is instead performed in a reduced dimensional face space. The resulting face-space super-resolution algorithm has reduced computational complexity and is more robust to noise than the pixel-domain super-resolution algorithms. In Chapter 5, we propose a generalization to the super-resolution reconstruction. None of the work done in the area of super-resolution considers potential changes in gray-scale. However, the observations may provide diverse information in gray-scale due to changes of illumination in the scene or imaging device adjustments such as exposure time, gain, or white balance, all of which should be considered for an effective and robust reconstruction. We include such effects in our imaging model and propose a POCS-based reconstruction algorithm that can handle gray-scale changes and improves both spatial and gray-scale resolution.


CHAPTER II

COLOR FILTER ARRAY INTERPOLATION USING ALTERNATING PROJECTIONS

2.1 Introduction

Single-chip digital cameras use color filter arrays to sample different spectral components, such as red, green, and blue. At the location of each pixel, only one color sample is taken, and the other colors must be interpolated from neighboring samples. This color plane interpolation is known as demosaicking, and it is one of the important tasks in a digital camera pipeline. In the literature review given in the previous chapter, we have seen that most of the demosaicking algorithms are heuristic and do not exploit the inter-channel correlation that exist in natural images. In this chapter we present a very effective means of using interchannel correlation in demosaicking. The algorithm defines constraint sets based on the observed color samples and prior knowledge about the correlation between the channels. It reconstructs the color channels by projecting the initial estimates onto these constraint sets iteratively. We have compared our algorithm with the various other techniques that we have outlined earlier in Section 1.1.1, and it outperforms them both visually and in terms of its mean square error. The chapter is organized as follows. We first present the motivation and details of our algorithm in Section 2.2. In Section 2.3, we provide a comprehensive comparison of our algorithm with several state-of-the-art algorithms in the literature. We finally provide a computational complexity analysis (Section 2.4) and propose several extensions to the algorithm that would improve its performance (Section 2.5).


2.2 Demosaicking Using Alternating Projections

There are two observations that are important for the demosaicking problem. The first is that for natural images, there is a high correlation among the red, green, and blue channels. All three channels are very likely to have the same texture and edge locations. (Because of the similar edge content, we expect this inter-channel correlation to be even higher when it is measured between the high-frequency components of the channels.) The second observation is that digital cameras use a color filter array (CFA) in which the luminance (green) channel is sampled at a higher rate than the chrominance (red and blue) channels. Therefore, the green channel is less likely to be aliased, and details are preserved better in the green channel than in the red and blue channels. (The high-frequency components of the red and blue channels are affected the most in CFA sampling.)

In demosaicking, it is the interpolation of the red and blue channels that is the limiting factor in performance. Color artifacts, which become severe in high-frequency regions such as edges, are caused primarily by aliasing in the red and blue channels. Although this fact is acknowledged by the authors of most demosaicking algorithms, inter-channel correlation has not been used effectively to retrieve the aliased high-frequency information in the red and blue channels. The algorithm that is proposed in this chapter removes aliasing in these channels using an alternating-projections scheme. It defines two constraint sets based on the inter-channel correlation and the observed data, and reconstructs the red and blue channels by projecting initial estimates onto these constraint sets.

Section 2.2.1 quantifies the degree of cross-correlation among the color channels. Section 2.2.2 illustrates the aliasing that results from the CFA sampling and motivates a detail-retrieving interpolation scheme. Section 2.2.3 derives the constraint sets used by the proposed demosaicking scheme. Section 2.2.4 presents the complete algorithm.

2.2.1 Inter-Channel Correlation

In natural images the color channels are highly mutually correlated. Since all three channels are very likely to have the same edge content, we expect this inter-channel correlation to be even higher when it is measured among the high-frequency components. (The reason


for investigating correlation in the high-frequency components will become more evident in the next section.) To illustrate this, we decomposed the three color channels of 20 natural images (Figure 14) into subbands. We used two-dimensional separable filters constructed from a low-pass filter (h0 = [1 2 1]/4) and a high-pass filter (h1 = [1 −2 1]/4) to decompose each image into its four subbands: (LL) both rows and columns are low-pass filtered, (LH) rows are low-pass filtered, columns are high-pass filtered, (HL) rows are high-pass filtered, columns are low-pass filtered, and (HH) both rows and columns are high-pass filtered. The inter-channel correlation coefficients for each of these four subbands were computed using the formula

$$C_{x,y} = \frac{\sum_{(n_1,n_2)} \left(x(n_1,n_2)-\mu_x\right)\left(y(n_1,n_2)-\mu_y\right)}{\sqrt{\sum_{(n_1,n_2)} \left(x(n_1,n_2)-\mu_x\right)^2}\;\sqrt{\sum_{(n_1,n_2)} \left(y(n_1,n_2)-\mu_y\right)^2}}, \qquad (6)$$

where (n1, n2) are integers denoting the spatial coordinates, x(n1, n2) and y(n1, n2) are the samples of two different color channels within a subband, and µx and µy are the means of x(n1, n2) and y(n1, n2), respectively. The summation is done over all possible (n1, n2) in a subband. The correlation coefficients between the red and green, and between the blue and green channels are tabulated in Table 1. As seen in that table, the correlation coefficients for the high-frequency subbands (LH, HL, HH) are larger than 0.9 for all images, and the highest correlation coefficient for a particular image is among these subbands. The low-frequency subbands (LL) are also highly correlated (their correlation coefficients are greater than 0.8 for most of the images), but not as highly as the high-frequency subbands. The next section examines the effects of CFA sampling on these subbands. In particular, we show that the high-frequency subbands of the red and blue channels are the most affected.
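The measurement above can be reproduced with a few lines of code. The following is a minimal sketch (not code from the thesis), assuming the channels are available as 2-D floating-point arrays and using the filters h0 = [1 2 1]/4 and h1 = [1 −2 1]/4 given above; the exact band-naming convention in the dictionary is an assumption for illustration.

```python
import numpy as np
from scipy.signal import convolve2d

h0 = np.array([1.0, 2.0, 1.0]) / 4.0   # low-pass analysis filter
h1 = np.array([1.0, -2.0, 1.0]) / 4.0  # high-pass analysis filter

def separable(img, f_axis0, f_axis1):
    # filter along axis 0 (vertical), then along axis 1 (horizontal)
    tmp = convolve2d(img, f_axis0[:, np.newaxis], mode='same')
    return convolve2d(tmp, f_axis1[np.newaxis, :], mode='same')

def subbands(channel):
    # undecimated one-level decomposition into four subbands
    return {'LL': separable(channel, h0, h0), 'LH': separable(channel, h1, h0),
            'HL': separable(channel, h0, h1), 'HH': separable(channel, h1, h1)}

def corr(x, y):
    # correlation coefficient of equation (6)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Example with a random stand-in image instead of the 20 natural test images.
rgb = np.random.rand(64, 64, 3)
red, green = subbands(rgb[:, :, 0]), subbands(rgb[:, :, 1])
for band in ('LL', 'LH', 'HL', 'HH'):
    print(band, corr(red[band], green[band]))
```

Applied to natural (rather than random) images, the same routine yields the per-subband coefficients reported in Table 1.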

2.2.2 Color Filter Array Sampling

As seen in Figure 15, in a Bayer pattern the green channel, sampled with a quincunx lattice, is less likely to be aliased than the red and blue channels, which are sampled with less dense rectangular lattices. This can easily be illustrated in the frequency domain. Figure 16(a) depicts the Fourier spectrum of an image with fm being the maximum observable frequency.

Figure 14: Images used in the experiments. (These images are referred to as Image 1 to Image 20 in the thesis, enumerated from left to right and top to bottom.)


Table 1: Inter-channel correlation in different subbands.

                   Red/Green Corr. Coef.             Blue/Green Corr. Coef.
Image no     LL      LH      HL      HH         LL      LH      HL      HH
1          0.8376  0.9959  0.9903  0.9866     0.9911  0.9970  0.9940  0.9838
2          0.4792  0.9275  0.9346  0.9414     0.9723  0.9891  0.9889  0.9431
3          0.8854  0.9275  0.9899  0.9822     0.8917  0.9754  0.9856  0.9682
4          0.9765  0.9963  0.9926  0.9794     0.9924  0.9917  0.9810  0.9605
5          0.8234  0.9789  0.9800  0.9483     0.9045  0.9717  0.9761  0.9391
6          0.9629  0.9947  0.9937  0.9822     0.9725  0.9932  0.9931  0.9770
7          0.9462  0.9906  0.9861  0.9404     0.8464  0.9834  0.9776  0.9252
8          0.9540  0.9867  0.9841  0.9458     0.9698  0.9807  0.9706  0.9224
9          0.8083  0.9863  0.9870  0.9768     0.9730  0.9942  0.9933  0.9765
10         0.9112  0.9899  0.9759  0.9316     0.9670  0.9878  0.9784  0.9120
11         0.9793  0.9968  0.9953  0.9932     0.9603  0.9895  0.9854  0.9805
12         0.8595  0.9548  0.9445  0.9421     0.9882  0.9809  0.9841  0.9506
13         0.9838  0.9962  0.9895  0.9599     0.9492  0.9943  0.9853  0.9471
14         0.9852  0.9864  0.9831  0.9519     0.9778  0.9811  0.9784  0.9418
15         0.9120  0.9832  0.9812  0.9771     0.8113  0.9746  0.9630  0.9700
16         0.9652  0.9951  0.9911  0.9562     0.9078  0.9923  0.9919  0.9578
17         0.9956  0.9873  0.9913  0.9588     0.9767  0.9594  0.9773  0.9177
18         0.8850  0.9926  0.9901  0.9771     0.9094  0.9871  0.9838  0.9614
19         0.8660  0.9542  0.9448  0.9352     0.8698  0.9693  0.9737  0.9598
20         0.9765  0.9809  0.9741  0.9460     0.9654  0.9696  0.9629  0.9259

Figure 15: Bayer pattern.

When this image is captured with a digital camera, the color planes are sampled according to a CFA, which is generally the Bayer pattern. As illustrated in Figures 16(b) and 16(c), while there is no aliasing in the green channel, the red and blue channels are aliased. This can also be confirmed for the images in Figure 14. In Table 2, the correlation coefficients between the original channels and the bilinearly-interpolated (from the CFA samples) channels are displayed for all subbands. Two important things can be observed in that table. First, the high-frequency (LH, HL, HH) subbands are degraded the most. Second, this degradation is more severe in the red and blue channels than in the green channel, especially in the LH and HL subbands. In Section 2.2.1, we showed that the color channels are highly mutually correlated,

especially in the high-frequency subbands. In this section, we illustrated the fact that the high-frequency subbands of the red and blue channels are affected the most in CFA sampling. These two observations imply that the high-frequency information of the green channel can be used to help estimate the high-frequency components of the red and blue channels. One way to achieve this is with a set-theoretic reconstruction.
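For concreteness, the CFA sampling discussed above can be simulated as follows. This is a small illustrative sketch (an assumed GRBG phase, not code from the thesis) showing the quincunx green lattice and the sparser rectangular red and blue lattices.

```python
import numpy as np

def bayer_masks(height, width):
    """Boolean sampling masks (Lambda_R, Lambda_G, Lambda_B) for a GRBG-style
    Bayer pattern; the exact phase is an assumption for illustration."""
    r = np.zeros((height, width), dtype=bool)
    g = np.zeros((height, width), dtype=bool)
    b = np.zeros((height, width), dtype=bool)
    g[0::2, 0::2] = True   # green on even rows / even columns
    g[1::2, 1::2] = True   # and on odd rows / odd columns (quincunx lattice)
    r[0::2, 1::2] = True   # red on a rectangular sub-lattice
    b[1::2, 0::2] = True   # blue on the complementary sub-lattice
    return r, g, b

# Simulate CFA capture of a full-color image: keep one color sample per pixel.
rgb = np.random.rand(8, 8, 3)
mask_r, mask_g, mask_b = bayer_masks(8, 8)
cfa = rgb[:, :, 0] * mask_r + rgb[:, :, 1] * mask_g + rgb[:, :, 2] * mask_b
print(cfa.round(2))
```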


Figure 16: CFA sampling of the color channels. (a) Frequency support of an image. (b) Spectrum of the sampled green channel. (c) Spectrum of the sampled red and blue channels.

2.2.3 Constraint Sets

Set-theoretic reconstruction techniques produce solutions that are consistent with the information arising from observed data or prior knowledge about the solution.

Table 2: Correlation among the full-channel (original) images and bilinearly interpolated (CFA-sampled) images in different subbands. Img. no 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

LL 0.972 0.992 0.991 0.995 0.997 0.985 0.996 0.997 0.992 0.996 0.984 0.999 0.996 0.998 0.993 0.992 0.999 0.991 0.996 0.994

Red channel LH HL 0.594 0.642 0.713 0.597 0.609 0.633 0.589 0.599 0.693 0.699 0.575 0.572 0.665 0.664 0.666 0.681 0.619 0.599 0.641 0.735 0.555 0.537 0.675 0.673 0.624 0.710 0.669 0.671 0.578 0.615 0.659 0.580 0.621 0.765 0.617 0.613 0.709 0.602 0.647 0.639

HH 0.303 0.259 0.339 0.302 0.273 0.283 0.251 0.294 0.296 0.256 0.301 0.285 0.269 0.281 0.292 0.288 0.298 0.299 0.284 0.299

LL 0.998 0.998 0.998 0.999 0.999 0.999 0.999 0.999 0.999 0.999 0.997 0.999 0.999 0.999 0.997 0.999 0.999 0.999 0.999 0.999

Green channel LH HL 0.897 0.808 0.876 0.849 0.868 0.849 0.929 0.676 0.919 0.875 0.795 0.894 0.901 0.869 0.901 0.839 0.883 0.806 0.917 0.824 0.883 0.718 0.787 0.907 0.943 0.653 0.859 0.868 0.843 0.811 0.895 0.829 0.826 0.845 0.916 0.756 0.852 0.871 0.875 0.805

HH 0.301 0.353 0.358 0.312 0.307 0.253 0.251 0.321 0.338 0.276 0.402 0.326 0.316 0.355 0.353 0.245 0.307 0.326 0.323 0.366

LL 0.979 0.987 0.989 0.992 0.997 0.981 0.994 0.995 0.993 0.997 0.986 0.999 0.994 0.997 0.981 0.991 0.999 0.993 0.995 0.992

Blue channel LH HL 0.596 0.527 0.675 0.638 0.646 0.554 0.619 0.551 0.696 0.678 0.601 0.539 0.681 0.629 0.681 0.623 0.621 0.575 0.663 0.622 0.573 0.500 0.669 0.514 0.603 0.662 0.663 0.575 0.587 0.536 0.641 0.560 0.694 0.725 0.624 0.623 0.664 0.570 0.631 0.604

HH 0.305 0.264 0.336 0.302 0.262 0.284 0.250 0.284 0.292 0.236 0.302 0.272 0.256 0.281 0.284 0.279 0.276 0.297 0.287 0.325

Each piece of information is associated with a constraint set in the solution space, and the intersection of these sets represents the space of acceptable solutions [21]. For the demosaicking problem, we define two types of constraint sets: one coming from the observed data, and the other based on the prior knowledge of the inter-channel correlation.

The first constraint set comes from the observed color samples. The interpolated color channels must be consistent with the color samples captured by the digital camera. We denote O(n1, n2) as this observed data, which has red, green, and blue samples placed according to the CFA used; (n1, n2) are ordered pairs of integers denoting the pixel locations. By defining ΛR, ΛG, and ΛB as the sets of pixel locations (n1, n2) that have the samples of the red, green, and blue channels, respectively, we write the "observation" constraint set Co as follows:

$$C_o = \left\{ S(n_1,n_2) : S(n_1,n_2) = O(n_1,n_2) \;\; \forall (n_1,n_2) \in \Lambda_S, \;\; S = R, G, B \right\}, \qquad (7)$$

where S is a generic symbol for the interpolated color channels, which can be R for the red channel, G for the green channel, and B for the blue channel. The second constraint set is a result of the previous two sections. In Section 2.2.1, it

was shown that the color channels have very similar detail (high-frequency) subbands. This information would not be enough to define constraint sets if all channels lost the same amount of information in sampling. However, Section 2.2.2 pointed out that the red and blue channels lose more information (details) than the green channel when captured with a color filter array. Therefore, we can define constraint sets on the red and blue channels that force their high-frequency components to be similar to the high-frequency components of the green channel. This proves to be a very effective constraint set, since the main source of color artifacts in a demosaicked image is the inconsistency of the channels, especially along the edges. Before formulating this constraint set, we need to provide some information about the filter bank structure that is used to decompose the channels. Referring to Figure 17, the filter bank performs an undecimated wavelet transform, with H0(z) and H1(z) denoting low-pass and high-pass filters, respectively. These analysis filters (H0(z) and H1(z)) constitute a perfect reconstruction filter bank with the synthesis filters G0(z) and G1(z). The perfect reconstruction condition can be written as

$$H_0(z)G_0(z) + H_1(z)G_1(z) = 1. \qquad (8)$$
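The condition in (8) can be checked numerically. The snippet below is a quick check (not from the thesis), assuming the filter values that are listed later in Section 2.3; the cascaded responses sum to a unit impulse, i.e., perfect reconstruction up to the filter delay.

```python
import numpy as np

h0 = np.array([1, 2, 1]) / 4.0
h1 = np.array([1, -2, 1]) / 4.0
g0 = np.array([-1, 2, 6, 2, -1]) / 8.0
g1 = np.array([1, 2, -6, 2, 1]) / 8.0

# Polynomial multiplication H(z)G(z) corresponds to convolving the impulse responses.
total = np.convolve(h0, g0) + np.convolve(h1, g1)
print(total)   # -> [0. 0. 0. 1. 0. 0. 0.], a unit impulse (up to the filter delay)
```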

By denoting h0 (·) and h1 (·) as the impulse responses of H0 (z) and H1 (z), respectively, we write the four subbands of a two-dimensional signal S(n1 , n2 ) as follows: (W1 S) (n1 , n2 ) = h0 (n1 ) ∗ [h0 (n2 ) ∗ S(n1 , n2 )]

(9)

(W2 S) (n1 , n2 ) = h1 (n1 ) ∗ [h0 (n2 ) ∗ S(n1 , n2 )]

(10)

(W3 S) (n1 , n2 ) = h0 (n1 ) ∗ [h1 (n2 ) ∗ S(n1 , n2 )]

(11)

(W4 S) (n1 , n2 ) = h1 (n1 ) ∗ [h1 (n2 ) ∗ S(n1 , n2 )] ,

(12)

where (W1 S) is the approximation subband, and (W2 S), (W3 S), and (W4 S) are the horizontal, vertical, and diagonal detail subbands, respectively.

Figure 17: Analysis and synthesis filterbanks for one-level decomposition. (a) Analysis filterbank. (b) Synthesis filterbank.

Now we can define the "detail" constraint set Cd that forces the details (high-frequency components) of the red and blue channels to be similar to the details of the green channel as follows:

$$C_d = \left\{ S(n_1,n_2) : \left| (W_k S)(n_1,n_2) - (W_k G)(n_1,n_2) \right| \le T(n_1,n_2) \;\; \forall (n_1,n_2) \in \Lambda_S, \;\; \text{for } k = 2,3,4 \text{ and } S = R, B \right\}, \qquad (13)$$

where T(n1, n2) is a positive threshold that quantifies the "closeness" of the detail subbands to each other. If the color channels are highly correlated, the threshold should be small; if the correlation is not high, the threshold should be larger. Although T(n1, n2) is a function of the image coordinates in general, it is also possible to use a predetermined fixed value for it. One choice is to set T(n1, n2) to zero for all (n1, n2), which is a result of the high-correlation assumption. Later in the chapter, we also discuss how to choose a nonuniform threshold.

2.2.4 Alternating Projections Algorithm

This section presents an alternating-projections algorithm to reconstruct the red and blue channels. We first derive the projection operators corresponding to the "observation" and "detail" constraint sets given in the previous section. Convergence issues and enhancement of the green channel are then addressed. Finally, the complete algorithm is presented.

A. Projection Operators

The first constraint set that is used in the reconstruction is the "observation" constraint


set given in (7). Referring to that equation, we write the projection Po [·] onto the “observation” constraint set as follows:

$$P_o\left[S(n_1,n_2)\right] = \begin{cases} O(n_1,n_2), & (n_1,n_2) \in \Lambda_S \\ S(n_1,n_2), & \text{otherwise}, \end{cases} \qquad (14)$$

(15)

(U2 X) (n1 , n2 ) = g1 (n1 ) ∗ [g0 (n2 ) ∗ X(n1 , n2 )]

(16)

(U3 X) (n1 , n2 ) = g0 (n1 ) ∗ [g1 (n2 ) ∗ X(n1 , n2 )]

(17)

(U4 X) (n1 , n2 ) = g1 (n1 ) ∗ [g1 (n2 ) ∗ X(n1 , n2 )] ,

(18)

where U1 , U2 , U3 , U4 are the synthesis filtering operators. As stated earlier, these form a perfect reconstruction filter bank with the analysis filtering operators W1 , W2 , W3 , W4 : S(n1 , n2 ) = U1 (W1 S) (n1 , n2 ) + U2 (W2 S) (n1 , n2 ) + U3 (W3 S) (n1 , n2 ) + U4 (W4 S) (n1 , n2 ). (19) Now, we write the projection Pd [S(n1 , n2 )] of a color channel S(n1 , n2 ) onto the “detail” constraint set Cd as follows. Referring to equation (13), we define rk (n1 , n2 ) as the residual: rk (n1 , n2 ) = (Wk S) (n1 , n2 ) − (Wk G) (n1 , n2 ).

(20)

When this residual is less than the threshold T (n1 , n2 ) in magnitude, the subband value (Wk S) (n1 , n2 ) is not changed. Otherwise, it has to be changed so that the residual rk (n1 , n2 ) is less than T (n1 , n2 ) in magnitude. This projection operator can be written as Pd [S(n1 , n2 )] = U1 (W1 S) (n1 , n2 ) +

4 X k=2

31

¡

¢

Uk Wk 0 S (n1 , n2 ),

(21)

where

¡

¢

Wk 0 S (n1 , n2 ) =

   (Wk G + T ) (n1 , n2 )           

(Wk S) (n1 , n2 )              (W G − T ) (n , n ) 1 2 k

; rk (n1 , n2 ) > T (n1 , n2 )

             

; |rk (n1 , n2 )| ≤ T (n1 , n2 ) . 

(22)

          ; rk (n1 , n2 ) < −T (n1 , n2 ) 

The “observation” projection ensures that the interpolated channels are consistent with the observed data; the “detail” projection reconstructs the high-frequency information of the red and blue channels, and imposes edge consistency between the channels. By alternately applying these two projections onto the initial red and blue channel estimates, we are able to enhance these channels.
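The two operators are simple to realize in code. The following is a small sketch (an illustration, not the thesis implementation): the "detail" rule of equation (22) reduces to clipping the residual of equation (20) to ±T, and the "observation" projection of equation (14) simply restores the observed samples.

```python
import numpy as np

def project_detail_subband(Wk_S, Wk_G, T):
    r = Wk_S - Wk_G                      # residual r_k of equation (20)
    return Wk_G + np.clip(r, -T, T)      # the three cases of equation (22)

def project_observation(S, observed, mask):
    # equation (14): restore the observed CFA samples of this channel
    return np.where(mask, observed, S)

# toy example for the detail rule with T = 0.05
Wk_S = np.array([0.40, -0.10, 0.05])
Wk_G = np.array([0.10, -0.30, 0.02])
print(project_detail_subband(Wk_S, Wk_G, T=0.05))   # -> [ 0.15 -0.25  0.05]
```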

B. Convergence

The constraint sets given in equations (7) and (13) are convex. (The proofs are provided in Appendix A.) Therefore, an initial estimate converges to a solution in the feasibility set when it is projected onto these constraint sets iteratively. We have also verified this experimentally. Using the proposed algorithm, we updated the chrominance (red and blue) channels iteratively from the initial estimates. In each iteration, the chrominance channels are updated by the "detail" projection, followed by the "observation" projection. A typical convergence plot is given in Figure 18. As seen in that figure, the mean square error of the red and blue channels converges in about five iterations. (That plot is for Image 16 in Figure 14. The initial estimates for the red and blue channels were obtained by bilinear interpolation. The green channel was interpolated using a method that will be explained in the next subsection.) Instead of performing a one-level subband decomposition, it is also possible to decompose the signals further. As is done with undecimated wavelet transforms, the low-pass (LL) subbands can be decomposed by using the filters H0(z^2) and H1(z^2). This filterbank structure is shown in Figure 19 for a two-level decomposition. Convergence for the two-level decomposition, which is illustrated in Figure 20, is faster than for the one-level decomposition.


Figure 18: Convergence for one-level decomposition.


(a) Analysis filterbank.

(b) Synthesis filterbank.

Figure 19: Analysis and synthesis filterbanks for two-level decomposition.

C. Updating the Green Channel

The algorithm we have discussed so far reconstructs the high-frequency information of the red and blue channels. The performance of this reconstruction directly depends on the accuracy of the green channel interpolation. The edge-directed interpolation methods discussed in Section 1 provide satisfactory performance in general, but it is still possible to obtain better results using a method similar to the red-blue interpolation we have presented. Referring to Figure 21, we can update the green channel as follows:



Figure 20: Convergence for two-level decomposition.

1. Interpolate the green channel to get an initial estimate. Either bilinear or edge-directed interpolation methods can be used for this step.

2. Use the observed samples of the blue channel to form a downsampled version of the blue channel. (Note that all pixels of this downsampled image are observed data.)

3. Use the interpolated green samples at the corresponding (blue) locations to form a downsampled version of the green channel. (Note that the pixels of this downsampled image are all interpolated values.)

4. Decompose these blue and green downsampled channels into their subbands, as was done in the previous section.

5. Replace the high-frequency (LH, HL, HH) subbands of the green channel with those of the blue channel. (Note that this corresponds to setting the threshold T(n1, n2) to zero.)

6. Reconstruct the downsampled green channel, and insert the pixels in their corresponding locations in the initial green channel estimate.

7. Repeat the same procedure for the pixels at the red samples.

With this scheme, significant improvement over bilinear interpolation and other adaptive algorithms can be achieved in the green channel. We used the edge-directed interpolation procedure proposed in [31] to obtain the initial green channel estimates. The results are discussed in Section 2.3.

D. Complete Algorithm

The pseudo-code of the complete algorithm is as follows:

1. Initial interpolation: Interpolate the red, green, and blue channels to obtain initial estimates. Bilinear or edge-directed interpolation algorithms can be used for this initial interpolation.

2. Update the green channel: Update the green channel using the scheme explained in Section 2.2.4.C.

3. "Detail" projection: Decompose all three channels with a filter bank. At each level of decomposition, there will be four subbands. Update the detail (high-frequency) subbands of the red and blue channels using equation (22) and reconstruct these channels using equation (21).

4. "Observation" projection: Compare the samples of the reconstructed red and blue channels with the original (observed) samples. Insert the observed samples into the reconstructed channels at their corresponding pixel locations as given in equation (14).

5. Iteration: Go to Step 3, and repeat the procedure until a stopping criterion is achieved. (Typically, the iterations are repeated a predetermined number of times.)
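The loop of steps 3-5 can be condensed into a short routine. The following is a sketch under simplifying assumptions (it is not the thesis implementation): a one-level undecimated decomposition, T = 0 unless specified, the filters listed in Section 2.3, a crude mean-fill in place of the bilinear/edge-directed initial interpolation, and no green-channel refinement (step 2 is omitted for brevity). The function and variable names are illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

h0 = np.array([1, 2, 1]) / 4.0;  h1 = np.array([1, -2, 1]) / 4.0              # analysis
g0 = np.array([-1, 2, 6, 2, -1]) / 8.0;  g1 = np.array([1, 2, -6, 2, 1]) / 8.0  # synthesis

def sep(img, fv, fh):
    tmp = convolve2d(img, fv[:, np.newaxis], mode='same')
    return convolve2d(tmp, fh[np.newaxis, :], mode='same')

def analysis(x):
    return {'LL': sep(x, h0, h0), 'LH': sep(x, h1, h0),
            'HL': sep(x, h0, h1), 'HH': sep(x, h1, h1)}

def synthesis(sb):
    return (sep(sb['LL'], g0, g0) + sep(sb['LH'], g1, g0) +
            sep(sb['HL'], g0, g1) + sep(sb['HH'], g1, g1))

def naive_fill(samples, mask):
    # crude initial interpolation: fill missing pixels with the channel mean
    out = samples.copy()
    out[~mask] = samples[mask].mean()
    return out

def demosaic(cfa, mask_r, mask_g, mask_b, n_iter=8, T=0.0):
    G = naive_fill(cfa * mask_g, mask_g)              # green estimate (step 1)
    sb_G = analysis(G)
    channels = {}
    for name, mask in (('R', mask_r), ('B', mask_b)):
        S = naive_fill(cfa * mask, mask)              # red/blue estimates (step 1)
        for _ in range(n_iter):
            sb_S = analysis(S)
            for k in ('LH', 'HL', 'HH'):              # step 3: "detail" projection
                r = sb_S[k] - sb_G[k]
                sb_S[k] = sb_G[k] + np.clip(r, -T, T)
            S = synthesis(sb_S)
            S = np.where(mask, cfa, S)                # step 4: "observation" projection
        channels[name] = S
    return np.dstack([channels['R'], G, channels['B']])
```

With the Bayer masks sketched earlier, a call such as `demosaic(cfa, mask_r, mask_g, mask_b)` returns the reconstructed three-channel image.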

2.3 Experimental Results

In our experiments, we used the images shown in Figure 14. These images are film captures that were digitized with a photo scanner. Full color channels are available, and the CFA is simulated by sampling the channels. The sampled channels are used to test the demosaicking

algorithms. We used bilinear interpolation for the red and blue channels, and the edge-directed interpolation method given in [31] for the green channel to get the initial estimates. The method proposed in Section 2.2.4.C was used to refine the initial estimate of the green channel. The following filters were used in the experiments: h0 = [1 2 1]/4; h1 = [1 − 2 1]/4; g0 = [−1 2 6 2 − 1]/8; and g1 = [1 2 − 6 2 1]/8. The threshold T (n1 , n2 ) was set to zero for all (n1 , n2 ). We did the experiments for both one-level decomposition and two-level decomposition. The number of iterations for one-level (1-L) and two-level (2-L) decompositions was eight and four, respectively. The performance in terms of mean squared error can be seen in Table 3 for both our and various other demosaicking algorithms [20, 34, 42, 31, 40, 15]. As seen in that table, the proposed algorithm has the lowest mean squared error in almost all cases. Among these algorithms, [31] and [15] have comparable performance in the green channel for some images. (The ones whose performance was better than the proposed method are highlighted.) However, their red and blue channel performance was worse in all cases, which make them worse visually. Another successful method was Kimmel’s method [40]. In that paper, the red, green, and blue channels were corrected iteratively to satisfy the color ratio rule, and the number of iterations was set to three. However, we found that algorithm to be prone to color artifacts, and iterating three times made the results worse both visually and in terms of mean squared error. Therefore, in our implementation we did color correction only once. We also provide some examples from the images used in the experiments for visual comparison. Figures 22 and 23 show cropped segments from original images (Images 4 and 6 in Figure 14), and the corresponding reconstructed images from the demosaicking algorithms that were used in comparison. Close examination of those figures verifies the effectiveness of the proposed algorithm.

2.4 Computational Complexity

In this section, we analyze the computational complexity of the proposed algorithm. Let lh0 , lh1 , lg0 , and lg1 denote the lengths of the filters h0 , h1 , g0 , and g1 , respectively, and let M and



N denote the width and height of an image. Each channel is decomposed into four subbands by convolving its rows and columns with the filters h0 and h1. This requires approximately (2lh0 + 4lh1)MN multiplications and additions for each channel. Including the reconstruction stage, the total number of additions and multiplications is [2(lh0 + lg0) + 4(lh1 + lg1)]MN for each channel. Typically, three iterations are enough for updating the red and blue channels, which will require a total of [12(lh0 + lg0) + 24(lh1 + lg1)]MN operations for the red and blue channels. As a result, [14(lh0 + lg0) + 28(lh1 + lg1)]MN operations are required for the iteration stages. We also update the initial estimate of the green channel as proposed in Section 2.2.4.C with a one-level decomposition and one iteration. This adds approximately [2(lh0 + lg0) + 4(lh1 + lg1)]MN operations to the total count, which brings the total operation count to [16(lh0 + lg0) + 32(lh1 + lg1)]MN. For the filters used in the experiments, this number is 384MN. If four iterations are done, the total complexity is 480MN. If a two-level decomposition is performed, a single iteration should be sufficient. Under this assumption, using the filters in this chapter, the total computational complexity for a two-level decomposition is also 384MN additions and multiplications. Considering the dedicated hardware in digital cameras, this is a reasonable computational complexity.

Figure 21: Fine tuning of the green channel is done from observed red and blue samples. Initial green pixel estimates corresponding to red and blue observations are combined to form smaller-size green images, which are then updated using the red and blue observations.
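As a quick check (not from the thesis) of the total count derived above, the filter lengths used in the experiments (3, 3, 5, and 5) give:

```python
lh0, lh1, lg0, lg1 = 3, 3, 5, 5
print(16 * (lh0 + lg0) + 32 * (lh1 + lg1))   # -> 384, i.e., about 384*M*N operations
```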


2.5 Extensions to the Algorithm

It is also possible to extend the proposed algorithm in several different ways. We discuss two of them in this section.

1. Correlation surface: The threshold T(n1, n2) in the "detail" projection provides a way of controlling the amount of inter-channel correlation that is used by the algorithm. If the channels are totally uncorrelated, the threshold should be large enough to turn the "detail" projection into an identity projection. If the channels are highly correlated, the threshold should be close to zero. One problem, however, is that the correlation between the channels is not necessarily uniform; there may be both high-correlation and low-correlation regions within the same image. This can be overcome by estimating the correlation locally and adjusting the threshold T(n1, n2) accordingly. One way to compute a local correlation surface is to move a small window over the color planes, compute the correlation between them, and assign a correlation coefficient to the pixel at the center of the window. By mapping the values on the correlation surface to the threshold T(n1, n2), the algorithm can be made more effective for images that have nonuniform correlation surfaces. Denoting KS(n1, n2) as the correlation surface between channel S (red or blue) and the green channel, the proposed method computes the correlation surface as

$$K_S(n_1,n_2) = \frac{\sum_{(i,j)\in N_{(n_1,n_2)}} \left(S(i,j)-\mu_S\right)\left(G(i,j)-\mu_G\right)}{\sqrt{\sum_{(i,j)\in N_{(n_1,n_2)}} \left(S(i,j)-\mu_S\right)^2}\;\sqrt{\sum_{(i,j)\in N_{(n_1,n_2)}} \left(G(i,j)-\mu_G\right)^2}}, \qquad (23)$$

where N(n1 ,n2 ) is a neighborhood about location (n1 , n2 ), and µS and µG are the means of channels S and G in that neighborhood. One choice for N(n1 ,n2 ) might be a 5×5 window. This formula will give a correlation surface with values ranging between zero and one. This correlation surface is then passed to a function that will return a large value when KS (n1 , n2 ) is small and a small value when KS (n1 , n2 ) is large. The choice of such a function requires further research and experimentation, and we leave it as an open problem.


2. Smoothness projection: Other constraint sets can be included in the algorithm easily. One such constraint is a smoothness constraint. Smooth hue (color ratio) and smooth color difference transitions are the basis of some demosaicking algorithms that we have already cited [20, 78, 1]. An easy way to include a smoothness projection is to interpolate the color ratio or difference to get an estimated color value at a certain location (n1 , n2 ), and constrain the results to lie in a certain neighborhood of that estimate. This is also an open area that should be investigated.
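The local correlation surface of equation (23) from item 1 above is easy to compute with sliding-window means. The following sketch (an assumed implementation, not thesis code) uses a 5×5 neighborhood and includes one simple, purely hypothetical mapping from KS to the threshold T, since the thesis leaves the choice of that mapping open.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def correlation_surface(S, G, size=5):
    mean = lambda x: uniform_filter(x, size=size)
    cov = mean(S * G) - mean(S) * mean(G)                    # local covariance
    var_s = mean(S * S) - mean(S) ** 2
    var_g = mean(G * G) - mean(G) ** 2
    return cov / np.sqrt(np.maximum(var_s * var_g, 1e-12))   # K_S(n1, n2)

def threshold_from_correlation(K, t_max=0.1):
    # hypothetical mapping: zero threshold where the channels are fully
    # correlated, a larger threshold where the local correlation is weak
    return t_max * (1.0 - np.clip(K, 0.0, 1.0))

S = np.random.rand(32, 32)
G = 0.8 * S + 0.2 * np.random.rand(32, 32)
T = threshold_from_correlation(correlation_surface(S, G))
print(T.min(), T.max())
```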

2.6 Conclusions

In this chapter we presented a demosaicking algorithm that exploits inter-channel correlation in an alternating projections scheme. Two constraint sets are defined based on the observed data and the prior knowledge about the correlation of the channels, and initial estimates are projected onto these constraint sets to reconstruct the channels. The proposed algorithm was compared with well-known demosaicking algorithms, and it showed outstanding performance both visually and in terms of mean square error at a reasonable computational complexity. The question of uncorrelated color channels has also been addressed, and a threshold selection procedure has been proposed. However, in the experiments this was not needed, and setting the threshold to zero worked very well. Threshold selection and the inclusion of other constraint sets are left as future work. It should also be noted that the test images used are film captures that were digitized with a photo scanner. Therefore, they have different noise power spectra compared to actual digital camera captures, and a more thorough performance analysis of the demosaicking algorithm should be done for different capture and digitization paths.


Table 3: Mean square error comparison of different methods. Img. no 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

Chan. Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue Red Green Blue

Linear interp. 187.909 72.116 205.010 41.761 15.217 38.682 169.874 75.373 169.524 142.337 51.622 135.016 36.163 14.724 36.984 361.167 117.532 365.623 46.292 17.578 45.693 46.086 19.404 49.297 97.204 38.247 93.830 40.905 15.062 44.311 317.266 144.096 324.362 54.025 21.871 62.387 60.180 21.764 60.683 46.788 22.730 54.574 122.764 60.462 130.166 128.874 44.327 124.869 61.450 30.399 56.928 111.057 45.455 113.392 65.977 30.196 76.188 149.287 76.334 192.404

Method [20] 92.820 72.116 94.536 51.591 15.217 20.912 96.889 75.373 79.687 66.218 51.622 69.594 17.880 14.724 18.705 157.405 117.532 162.068 21.493 17.578 21.791 23.412 19.404 23.022 50.513 38.247 47.578 22.182 15.062 20.703 144.769 144.096 167.395 55.571 21.871 26.279 29.423 21.764 30.731 23.156 22.730 26.904 60.729 60.462 69.556 56.512 44.327 62.621 33.403 30.399 30.211 51.810 45.455 57.070 35.494 30.196 38.142 70.542 76.334 99.693

Mean square error for different methods Method Method Method Method Method [42] [34] [31] [15] [40] 53.575 84.335 51.991 71.911 28.633 46.351 67.242 21.482 15.710 20.098 52.972 83.481 53.664 82.832 32.140 17.759 76.505 16.654 17.912 59.249 11.807 15.797 5.278 4.700 9.160 13.450 23.584 12.130 13.216 12.948 48.414 56.180 42.229 37.541 57.929 50.715 67.056 16.282 17.810 73.238 51.047 95.325 42.734 41.193 32.513 36.467 60.666 37.417 50.866 21.430 32.574 51.879 15.589 12.419 16.250 39.906 63.509 38.084 48.826 29.324 9.711 16.598 9.801 9.096 12.501 9.379 12.319 3.882 3.994 29.424 10.872 22.882 10.569 9.174 11.586 71.201 116.381 84.796 164.431 40.532 58.950 87.148 28.431 25.536 26.853 73.190 112.644 86.273 169.961 44.726 11.402 16.767 11.637 14.350 7.3858 10.229 14.333 4.469 4.273 7.4135 12.072 18.697 11.851 15.441 7.678 12.979 19.744 11.792 12.364 8.3602 11.914 16.373 4.662 4.028 4.621 14.697 20.061 13.410 13.702 7.6636 29.832 53.881 28.679 30.664 23.227 27.045 37.719 12.098 9.992 22.540 29.249 51.113 26.730 28.675 20.820 10.474 16.946 10.886 14.578 7.1526 8.175 12.655 3.696 3.416 3.8649 10.482 15.544 11.187 15.597 7.3193 124.463 170.964 107.939 80.282 55.657 121.663 163.540 54.104 39.239 99.185 135.853 201.601 115.388 88.377 91.654 22.957 67.477 21.382 22.056 43.500 15.175 18.986 7.248 5.504 7.611 17.595 24.368 18.184 22.407 9.0273 16.150 28.300 16.509 23.093 9.5122 13.554 22.694 6.810 5.659 8.1522 16.710 28.035 16.637 22.595 13.275 16.170 34.736 14.021 11.542 10.266 17.443 22.142 7.223 6.332 17.254 19.205 29.978 17.388 14.080 14.419 46.906 75.698 40.300 30.760 30.852 47.830 64.225 21.046 16.428 51.057 54.552 92.709 47.407 38.246 38.894 22.710 37.201 28.222 63.810 13.460 20.630 29.580 9.226 8.867 14.978 23.859 45.855 27.333 64.511 22.137 15.635 21.988 16.010 12.969 8.8625 15.320 19.850 6.815 5.359 15.165 19.477 30.404 18.660 16.554 13.350 34.622 55.655 32.467 33.786 15.732 32.402 46.324 14.640 10.815 16.430 38.229 57.972 34.685 38.656 24.590 24.858 38.901 22.287 23.255 16.170 21.957 29.614 10.301 9.390 11.386 26.276 37.082 25.062 32.014 20.078 61.527 83.298 49.391 40.426 29.034 64.112 83.370 28.544 22.360 26.109 85.460 119.561 78.480 64.625 50.765


Prop. (1-L) 11.154 7.123 11.630 8.018 8.796 7.746 10.481 8.660 17.656 8.928 5.692 11.276 3.630 3.654 6.562 18.450 11.383 22.897 3.546 3.342 6.767 3.813 3.661 5.676 8.178 5.851 8.962 3.388 3.158 4.325 23.325 15.025 31.084 8.486 8.108 7.815 3.966 3.042 5.267 4.632 3.739 7.019 11.805 10.186 17.574 6.449 5.292 8.643 4.961 7.132 9.243 8.072 6.274 11.305 13.436 8.387 14.248 25.743 17.790 32.839

Prop. (2-L) 11.927 7.123 10.041 13.064 8.7961 7.720 14.841 8.660 23.713 8.225 5.692 11.162 6.243 3.654 9.374 19.950 11.383 23.105 4.310 3.342 9.443 4.850 3.661 6.643 10.203 5.851 8.908 4.787 3.158 4.895 17.504 15.025 28.241 13.524 8.108 8.063 3.531 3.042 5.369 4.498 3.739 7.475 12.948 10.186 19.808 6.683 5.292 9.777 5.202 7.132 11.040 7.561 6.274 11.792 15.816 8.387 16.975 25.875 17.790 32.197

Figure 22: Comparison of the methods for Image 4

(a) Crop from the original Image 4.

(b) Method in [20].

(c) Method in [34].

(d) Method in [42].


(e) Method in [31].

(f) Method in [40].

(g) Method in [15].

(h) Proposed (1-L, 8 iterations).


Figure 23: Comparison of the methods for Image 6

(a) Crop from the original Image 6.

(b) Method in [20].

(c) Method in [34].

(d) Method in [42].


(e) Method in [31].

(f) Method in [40].

(g) Method in [15].

(h) Proposed (1-L, 8 iterations).


CHAPTER III

SUPER-RESOLUTION FOR COMPRESSED VIDEO

3.1 Introduction

An important problem that arises frequently in visual communications and image processing is the need to enhance the resolution of a still image extracted from a video sequence or of the video sequence itself. As we have mentioned earlier, it is possible to achieve subpixel-level resolution by exploiting the spatial correlation among successive frames of a video sequence. Such a multi-frame reconstruction process is usually called super-resolution reconstruction. All of the methods reviewed in Section 1.1.2 are based on the assumption that there is no compression during the imaging process. The input signal (video/image sequence) is assumed to exist in a raw format instead of a compressed format. However, because of the limited resources that are often available, compression has become a standard component of almost every data communication application. Printing from MPEG video sources, by definition, involves compressed video, standardized digital TV (DTV) signals are MPEG2 compressed, and digital video cameras typically store images in a compressed format. Unfortunately, super-resolution algorithms designed for uncompressed data do not perform well when directly applied to decompressed image sequences, especially for highcompression rates. The reason is that the quantization error introduced during the compression/quantization process is often the dominant source of error when the compression rate is high and this error is not modeled. In contrast to the abundance of methods proposed to enhance raw video, there are only a few methods that have been proposed for compressed video. Chen and Schultz [17] propose to decompress the MPEG video and then use the uncompressed-video algorithm given in [57]. The drawback is that decompression discards important information about the quantization error that was introduced when the video was compressed. In [48], Patti


and Altunbasak demonstrated the importance of properly handling the quantization information, and suggested a solution that explicitly incorporates the compression process. This method extends the model given in [50] by adding the MPEG stages, and uses the quantization information as the basis for a projections onto convex sets (POCS) algorithm that operates in the compressed domain. However, in this approach, all sources of error except for the quantization error are ignored, which may not be a good assumption at medium-to-high bit rates. It is also difficult with the POCS approach to impose additional regularization priors on the reconstructed frame. In [58], Segall et al. proposed a Bayesian algorithm that is designed to penalize the artifacts (such as blocking artifacts) formed during the compression process. This is not a very effective approach because it is difficult to distinguish between the artifacts resulting from compression and the natural texture of the image. In this chapter we propose a Bayesian super-resolution reconstruction technique that uses the statistical information about the quantization noise and the additive sensor noise. The proposed method allows us to take full advantage of Bayesian methods without neglecting quantization. Although the method is designed for the DCT-based video standards such as MJPEG, MPEG, H.261, and DV, it can easily be generalized to any compression method where the transform involved is linear. The proposed algorithm is also different from the previous approaches in the sense that the quantization information is utilized directly in the DCT domain. As a result, it is possible to treat the DCT coefficients separately in a way that depends on their statistical distribution or reliability. The method also allows the use of source statistics and additional reconstruction constraints, such as those that might aid in blocking artifact reduction and edge enhancement. We will assume Gaussian models and derive the equations necessary for a maximum a posteriori probability (MAP) estimator. The proposed method is compared with two basic super-resolution approaches [57, 50]. The rest of the chapter is organized as follows. In Section 3.2, we present a general image acquisition and video compression model. Section 3.3 provides a Bayesian framework for the compressed-domain resolution enhancement problem and a possible approach for its solution. Experimental results are presented in Section 3.4. Finally, in Section 3.5, we


generalize our method to a transform-domain reconstruction idea and discuss our major findings.

3.2 Imaging Model

This section extends the video acquisition model given in Chapter 1 to accommodate blockDCT based compression. The result is a linear set of equations that relates the (unobserved) high-resolution source images to the observed data: the quantized DCT coefficients of lowresolution frames. We use this set of equations to establish the Bayesian framework in the next section. We start by adding the MPEG compression stages to the video acquisition model given in Figure 10. As shown in Figure 24, the low-resolution frame g(i) is motion compensated (i.e., the prediction frame is computed and subtracted from the original to get a residual image), and the residual is transformed using a series of 8 × 8 block-DCTs to produce the DCT coefficients d(i) . Defining g ˆ(i) as the prediction frame and T as the DCT matrix for the lexicographically ordered images, we write d(i) = TH(i) f − Tˆ g(i) + Tn(i) .

(24)

The prediction frame ĝ(i) is obtained using neighboring frames except for the case of intra-coded frames, where the prediction frame is zero. The DCT coefficients d(i) are then quantized to produce the quantized DCT coefficients d̃(i). The quantization operation is a nonlinear process that will be denoted by the operator Q{·}:


$$\tilde{d}^{(i)} = Q\left\{ T H^{(i)} f - T \hat{g}^{(i)} + T n^{(i)} \right\}.$$

(25)

In MPEG compression, quantization is realized by dividing each DCT coefficient by a quantization step size followed by rounding to the nearest integer. The quantization step size is determined by the location of the DCT coefficient, the bit rate, and the macroblock mode [68]. The quantized DCT coefficients d̃(i) and the corresponding step sizes are available at the decoder, i.e., they are either embedded in the compressed bit-stream or specified as part of the coding standard. Since the quantization takes place in the transform domain,



d (i )

Figure 24: MPEG compression is appended to the video acquisition. the natural way to exploit this information is to use it in the DCT domain without reverting back to the spatial domain. Equation (25) is the fundamental equation that represents the relation between the ˜ (i) . In the next section, we high-resolution image f and the quantized DCT coefficients d formulate a Bayesian super-resolution reconstruction framework based on this equation.

3.3

Bayesian Super-Resolution Reconstruction

With a Bayesian estimator, not only the source statistics but also various regularizing constraints can be incorporated into the solution. Bayesian estimators have been frequently used for super-resolution reconstruction. However, in these approaches either the video source is assumed to be available in uncompressed form, or it is simply decompressed prior to enhancement without considering the quantization process. Additive noise is considered as the only source of error. On the other hand, the POCS-based approaches treat the quantization error as the only source of error without considering the additive noise [49]. Clearly, neither of these approaches provides a complete framework for super-resolution. As will be shown, a Bayesian estimator that considers the quantization process can be applied successfully. In the maximum a posteriori probability (MAP) formulation, the quantized DCT coeffi˜ (i) , the original high-resolution frame f , and the additive noise n(i) are all assumed to cients d ´

³

˜ (1) , · · · , d ˜ (M ) as the conditional probability density be random processes. Denoting p f |d function (PDF), the MAP estimate ˆ f is given by n ³

˜ (1) · · · d ˜ (M ) ˆ f = arg max p f |d f

48

´o

.

(26)

Using the Bayes rule, equation (26) can be rewritten as n ³

´

o

ˆ ˜ (1) · · · d ˜ (M ) |f p (f ) , f = arg max p d

(27)

f

³

˜ (1) , · · · , d ˜ (M ) where we used the fact that p d

´

is independent of f . In order to find the ³

´

˜ (1) , · · · , d ˜ (M ) |f and the prior MAP estimate ˆ f , we need to model the conditional PDF p d PDF p (f ). Before proceeding, we rewrite equation (25) by letting e(i) denote the error introduced by quantization: ˜ (i) = TH(i) f − Tˆ d g(i) + Tn(i) + e(i) .

(28)

The quantization error e(i) is a deterministic quantity that is defined as the difference ˜ (i) and d(i) , but it can also be treated as a stochastic vector for reconstruction. between d There have been a number of studies directed toward modeling the statistical distribution of the quantization error e(i) . In this chapter, we model it as a zero-mean independent identically distributed (IID) Gaussian random process, which leads to a mathematically tractable solution. Using the notation N (µ, C) for a normal distribution with mean vector µ and covariance matrix C, we write e(i) ∼ N (0, Ce ) ,

(29)

where Ce is the covariance matrix of the quantization error e(i) . The additive noise n(i) is also modeled as a zero-mean IID Gaussian process: n(i) ∼ N (0, Cn ) ,

(30)

where Cn is the covariance matrix of the additive noise. Since the discrete cosine transform is unitary, the DCT of the noise n(i) is also an IID Gaussian random process with covariance matrix TCn TT :

³

´

Tn(i) ∼ N 0, TCn TT .

(31)

Because the additive noise and the quantization error have independent Gaussian distributions, the overall noise Tn(i) + e(i) is also a Gaussian distribution with a mean equal to

49

the sum of the means, and a covariance matrix equal to the sum of the covariance matrices of Tn(i) and e(i) : ³

´

Tn(i) + e(i) ∼ N 0, TCn TT + Ce .

(32)

Equation (32) gives us an elegant way of combining the statistical information of two different noise processes. This is in contrast to the previous approaches, where only one noise source is considered. We now pursue with the derivation by writing the explicit forms of the probability density functions. Denoting u(i) ≡ Tn(i) +e(i) as the total noise term, and K ≡ TCn TT + Ce as the overall covariance matrix, the probability distribution function of u(i) is

µ

p(u(i) ) =



´T ³ ´ 1 1³ exp − u(i) K−1 u(i) , Z 2

(33)

where Z is a normalization constant. Using equation (28) and the PDF of the noise u(i) , ³

´

˜ (i) |f is found to be the conditional PDF p d µ



³ ´ ³ ´T ³ ´ ˜ (i) |f = 1 exp − 1 d ˜ (i) − TH(i) f + Tˆ ˜ (i) − TH(i) f + Tˆ p d g(i) K−1 d g(i) . Z 2 ³

(34)

´

˜ (1) · · · d ˜ (M ) |f is the Since the noise is assumed to be an IID process, the joint PDF p d product of the individual PDFs. As a result, we obtain !

Ã

M ³ ´T ³ ´ ´ X ˜ (i) − TH(i) f + Tˆ ˜ (i) − TH(i) f + Tˆ ˜ (1) · · · d ˜ (M ) |f = 1 exp − 1 d g(i) K−1 d g(i) , p d Z 2 i=1 (35)

³

where Z is again a normalization constant. We now need to model the prior distribution p (f ) to complete the MAP formulation. Again, we will assume a joint Gaussian model: µ



1 1 p (f ) = exp − (f − µ)T Λ−1 (f − µ) , Z 2

(36)

with Λ being the covariance matrix, µ being the mean of f , and Z being a normalization constant. Substituting (35) and (36) into (26), we end up with the following MAP estimate: ˆ f = arg min f

 · ³ ´¸ ´T M ³ P  (i) (i) (i) (i) (i) (i) −1   d − TH f + Tˆ g d − TH f + Tˆ g K i=1

      

   + (f − µ)T Λ−1 (f − µ)

.

(37)

We finish this section by presenting an approach to solve (37). In the next section we detail the implementation and selection of the parameters and covariance matrices. We will 50

explain how to incorporate the quantization step size information in reconstruction through selection of covariance matrices. One approach to obtain the MAP estimate in (37) is to use an iterative steepest descent technique. Let E(f ) be the cost function to be minimized, then the high-resolution image f can be updated in the direction of the negative gradient of E(f ). At the nth iteration, the high-resolution image estimate is fn = fn−1 − α∇E(fn−1 ),

(38)

where α is the step size. From (37), we can choose a slightly generalized cost function as follows: E (f ) =

1−λ 2

+

λ 2

· ³ ´T ´¸ M ³ P d(i) − TH(i) f + Tˆ g(i) K−1 d(i) − TH(i) f + Tˆ g(i) i=1 T

(f − µ)

Λ−1 (f

(39)

− µ)

where λ is a number, (0 ≤ λ ≤ 1), that controls the relative contributions of the conditional and prior information in the reconstruction. When λ is set to zero, the estimator behaves like a maximum likelihood (ML) estimator. When λ is made larger, the prior information is more and more important to reconstruction. Taking the derivative of E(f ) with respect to f , the gradient of E(f ) can be calculated as ∇E(f ) = −(1 − λ)

M X

³

´

T ˜ (i) − TH(i) f + Tˆ H(i) TT K−1 d g(i) + λΛ−1 (f − µ) .

(40)

i=1

The step size α in (38) can be fixed or updated adaptively during the iterations. One way is to update it using the Hessian of E(f ). In that case, α is updated at each iteration using the formula α=

(∇E(fn−1 ))T (∇E(fn−1 )) (∇E(fn−1 ))T H (∇E(fn−1 ))

,

(41)

where H is the Hessian matrix found by H = (1 − λ)

M X

T

H(i) TT K−1 TH(i) + λΛ−1 .

(42)

i=1

In the reconstruction, everything but f is known or can be computed in advance. For ˜ (i) , the prediction frames a specific observation sequence, the quantized DCT coefficients d 51

g ˆ(i) , and the quantization step sizes are known; the blur mappings H(i) and the other statistical/reconstruction parameters are computed or determined beforehand.
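To make the update of (38) with the gradient (40) and the Hessian-based step size (41)-(42) concrete, the following is a toy-scale sketch (an illustration, not the thesis implementation): all operators are built as small explicit matrices, the problem sizes and random test data are assumptions, and the covariance settings mirror the diagonal choices described in the experiments (additive-noise variance of two, prior variances of 25, and λ = 0.15).

```python
import numpy as np

rng = np.random.default_rng(0)
n_high, n_low, n_frames, lam = 16, 4, 3, 0.15

f_true = rng.random(n_high)                                  # unknown high-resolution signal
T = np.linalg.qr(rng.standard_normal((n_low, n_low)))[0]     # stand-in unitary transform ("DCT")
H = [rng.random((n_low, n_high)) / n_high for _ in range(n_frames)]   # warp/blur/downsample maps
g_hat = [np.zeros(n_low) for _ in range(n_frames)]           # prediction frames (intra: zero)
K_inv = np.eye(n_low) / 2.0                                  # (T C_n T^t + C_e)^-1, diagonal here
Lam_inv = np.eye(n_high) / 25.0                              # inverse prior covariance
mu = np.full(n_high, f_true.mean())                          # prior mean (e.g., interpolated frame)
d = [T @ (H[i] @ f_true - g_hat[i]) + 0.05 * rng.standard_normal(n_low)
     for i in range(n_frames)]                               # observed (noisy) DCT coefficients

f = mu.copy()                                                # initial estimate
for _ in range(20):
    grad = lam * Lam_inv @ (f - mu)                          # gradient of (39), equation (40)
    for i in range(n_frames):
        r = d[i] - T @ (H[i] @ f - g_hat[i])
        grad -= (1 - lam) * H[i].T @ T.T @ K_inv @ r
    Hess = lam * Lam_inv + (1 - lam) * sum(H[i].T @ T.T @ K_inv @ T @ H[i]
                                           for i in range(n_frames))
    alpha = grad @ grad / (grad @ Hess @ grad)               # step size of (41)-(42)
    f = f - alpha * grad                                     # update (38)

print(np.linalg.norm(f - f_true))
```

In the actual reconstruction, as described above, the matrix products are replaced by image operations (warping, convolution, sampling, and block-DCTs), but the structure of the iteration is the same.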

3.4

Experimental Results

We have designed a set of experiments to examine the performance of the proposed algorithm for different quantization levels. We also tested the spatial-domain POCS [50] and spatialdomain MAP [57, 23] algorithms, and compared their results with the results of our DCTdomain MAP algorithm. Before getting into details of the experiments, we want to address some of the implementation issues. A. Implementation Although the matrix-vector notation provides a neat formulation, implementing the algorithm with images converted into vectors is problematic. When dealing with large images, the matrices become large enough to cause memory problems and slow reconstruction. Instead, the algorithm was implemented using simple image operations, such as warping, convolution, sampling, scaling, and block-DCT transformation. After determining the blur point spread function (PSF) and the motion vectors between the observations, the high-resolution image is reconstructed as follows. We start by interpolating one of the observed images to obtain an initial estimate f0 . According to (38), we need to calculate the gradient of the cost function and the step size α. Referring to equation (40), we first need to calculate H(i) f0 . This is done by motion warping f0 for the ith frame, convolving with the PSF, and then downsampling. The resulting image and the prediction image g ˆ(i) are then transformed to the DCT domain by 8 × 8 block-DCTs. After finding the T

residual, we need to apply the operations K−1 , TT , and H(i) . As we mentioned earlier, we assumed statistical independence between the DCT quantization errors, and this results in the covariance matrix K being diagonal. Therefore, in our implementation, the K−1 is simply computed by dividing each DCT coefficient of the residual with the corresponding variance in K. This is followed by the TT operation, which is done by taking the inverse block-DCT. Finally, H(i)

T

is implemented by upsampling the image (with zero padding),

convolving with the flipped PSF, and motion warping back to the reference frame. (If we

Similar to K^{-1}, Λ^{-1} is also implemented by scaling each pixel by a number, because the pixels are also modeled as being statistically independent. In the computation of α, we need to calculate (∇E(f_{n−1}))^T ∇E(f_{n−1}), which is done by squaring each element of ∇E(f_{n−1}) and summing the results. The denominator of equation (41) is obtained similarly: apply the operations in equation (42) to ∇E(f_{n−1}), multiply the result element by element with ∇E(f_{n−1}), and sum. With this procedure, the reconstruction is achieved faster than working with lexicographically ordered images. We now turn to the experiments.

B. Experimental Setup

In order to test the proposed algorithm, we designed a controlled experiment. The Aerial and Boat images shown in Figure 25 are downsampled by two horizontally and vertically to create four low-resolution observations. These observations are then block transformed using 8 × 8 DCTs, and the DCT coefficients are quantized using the MPEG-2 quantization table for the luminance channel. Spatial-domain POCS, spatial-domain MAP, and the proposed DCT-domain MAP algorithms are tested; in the reconstructions, all four observations are used. The experiments are repeated for different quantization scales, obtained by multiplying the quantization table by a positive real number. The scaling factors used in the experiments are 0.25, 0.5, 0.75, 1.0, 1.25, and 1.5.

In addition to the simulated data, we also tested the algorithm with observations captured with a digital camera. We captured six images of a text document from slightly different viewing positions and six images (zooming in on the license plate) of a moving car. The observations are quantized with the quantization scaling factor set to 0.25. For the Text sequence, the dense motion fields between the observations are calculated using a two-level hierarchical block-based motion estimation algorithm; block sizes of 12 pixels are used with the mean absolute difference as the matching criterion, and in the final level of the search, quarter-pixel motion vectors are sought. For the License Plate sequence, we used the Harris corner detector [33] to select a set of points in the reference image, found the corresponding points in the other images using normalized cross-correlation, and estimated the affine motion parameters from these correspondences in the least-mean-square sense.
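To make the observation-generation step concrete, the sketch below applies an 8 × 8 block DCT and quantizes the coefficients with a scaled table. The flat placeholder table stands in for the MPEG-2 intra luminance matrix used in the experiments, and the rounding rule is a simplified stand-in for the encoder's exact quantizer.

```python
import numpy as np
from scipy.fft import dctn

def quantized_block_dct(image, q_table, scale=1.0):
    """8x8 block DCT followed by table quantization (image sides must be multiples of 8)."""
    step = scale * np.asarray(q_table, dtype=float)
    coeffs = np.zeros(image.shape)
    for r in range(0, image.shape[0], 8):
        for c in range(0, image.shape[1], 8):
            block = dctn(image[r:r + 8, c:c + 8], norm="ortho")
            coeffs[r:r + 8, c:c + 8] = np.round(block / step) * step
    return coeffs

# Example with a flat placeholder table (the experiments use the MPEG-2 intra matrix).
obs = np.random.default_rng(1).uniform(0, 255, (16, 16))
d_tilde = quantized_block_dct(obs, np.full((8, 8), 16.0), scale=0.25)
```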

C. Parameter Selection

The spatial-domain POCS algorithm [50] starts with an initial estimate of the high-resolution image, which is obtained by bilinearly interpolating one of the observations. The mapping H^{(i)} is applied to this initial estimate to compute a prediction of one of the observed images. The difference between the predicted image and the real observed image is back-projected to update the initial estimate. This is repeated for a predetermined number of iterations (typically 15) or until the change in the mean-square error (MSE) is less than 0.1.

The formulation of the spatial-domain MAP algorithm [57, 23] is similar to that of the DCT-domain MAP algorithm derived in this chapter. In the spatial-domain MAP algorithm, the observations are the low-resolution images, not the quantized DCT coefficients. The spatial-domain MAP algorithm requires the covariance matrices for the additive noise and the prior image; our DCT-domain MAP algorithm requires, in addition, the covariance matrix for the quantization error.

The parameters in the experiments are chosen first intuitively and then finalized by trial-and-error. (Obviously, this is not an optimal approach; we comment on this issue later in this chapter.) In our formulation, all of the covariance matrices are diagonal as a result of the IID assumption. The variance of the additive noise is set to two, i.e., C_n is chosen to be a diagonal matrix with twos along the diagonal. After some trial-and-error, λ is set to 0.15, and the diagonal entries of Λ are set to 25. The mean vector µ is set to the bilinearly interpolated reference frame. Again, the iterations are repeated for a predetermined number of times or until the change in the MSE is less than 0.1.

For the DCT-domain MAP algorithm, the standard deviation of the quantization noise is assumed to be proportional to the quantization step size. Although there is no exact analytical relationship, this is a valid assumption and directly affects the reconstruction. Heuristically, we set the standard deviation to one-fifth of the quantization step size in our experiments.

The diagonal entries of C_e are computed according to the corresponding quantization step sizes, which are available in the data bitstream. The covariance matrix K is then calculated using the formula K = T C_n T^T + C_e. (Since C_n and C_e are diagonal and T is a unitary transform, K is also diagonal.)

For the real video experiments, the same reconstruction parameters are used except for λ: for the Text sequence, λ is set to 0.1, and for the License Plate sequence, λ is set to 0.7. An additional problem with the real video experiments is that the PSF of the camera is unknown. Therefore, we chose a typical PSF and tested the reconstruction using that PSF; in the experiments, the PSF was set to a 7 × 7 Gaussian blur with a standard deviation of two.
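Under the diagonal assumptions above, the diagonal of K can be assembled directly from the bitstream's quantization step sizes. The following sketch assumes C_n = σ_n² I (so that T C_n T^T keeps σ_n² on its diagonal, T being unitary) and the one-fifth rule for the quantization-error standard deviation; the function name and array layout are illustrative.

```python
import numpy as np

def k_diagonal(quant_steps, noise_var=2.0, scale=1.0):
    """Diagonal of K = T C_n T^T + C_e for one 8x8 block of DCT coefficients."""
    qe_std = scale * np.asarray(quant_steps, dtype=float) / 5.0   # one-fifth rule
    return noise_var + qe_std.ravel() ** 2                        # sigma_n^2 + quantization variance
```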

D. Results

The spatial-domain MAP, spatial-domain POCS, and DCT-domain MAP algorithms are tested for quantization scaling factors of 0.25, 0.5, 0.75, 1.0, 1.25, and 1.5. The same observations are used for all algorithms. The MSE comparisons for the Aerial and Boat images are given in Table 4 and Table 5, respectively. As seen in these tables, in all of the experiments the DCT-domain MAP algorithm performed better than the spatial-domain MAP and POCS algorithms, which do not utilize the quantization information. As the quantization step size increases, the relative performance of the DCT-domain MAP algorithm improves. Although the quantitative comparison shows that the DCT-domain MAP algorithm performs better than the other algorithms, the difference is not visually obvious. In Figures 26 and 27, we provide visual results of the DCT-domain MAP algorithm for downsampling factors of two and four to demonstrate the resolution enhancement. For the experiments with a downsampling factor of two, four observations are used; for the experiments with a downsampling factor of four, sixteen observations are used. (The quantization scaling factor in these experiments is 0.25.)

The real video experiments were performed to assess the robustness of the algorithm when the true motion vectors and the PSF are not known. The parameters used in these experiments are given in the previous two subsections. The observations from the License Plate sequence and the reconstructed image are given in Figure 28; the resolution enhancement factor is four in the horizontal and vertical directions. For the Text sequence, the resolution enhancement factor is two, and the observations and the reconstructed image are given in Figure 29. Although we do not have the true motion vectors and the exact PSF, we still observe an improvement in readability in these experiments.

3.5 Conclusions

In this chapter we introduced a super-resolution algorithm that incorporates both the quantization operation and the additive sensor noise in a stochastic framework. Since the resolution enhancement problem is cast in a Bayesian framework, additional constraints can easily be incorporated in the form of prior image models. Although a block-based hybrid transform coder was emphasized throughout our derivations, the framework is valid for all video coding standards where the transform utilized is linear.

The proposed algorithm enables distinct treatment of each DCT coefficient, which can be exploited for better performance. For instance, the information coming from high-frequency DCT coefficients can be discarded altogether since they are quantized severely, and the information obtained from those coefficients is likely to be noise. One step beyond this idea is to realize the whole reconstruction in the transform domain; that is, the high-resolution image is reconstructed in a transform domain and then converted back to the spatial domain at the end. This way, we can suppress noise and achieve reconstruction at a lower computational complexity. (This will be demonstrated in the next chapter.)

The experimental results were encouraging. However, there are still several open issues, as summarized below:

• Both the POCS-based and Bayesian-based solutions are computationally expensive. For software implementations, fast, but perhaps suboptimal, solutions need to be investigated. For hardware implementations, algorithm parallelization issues also need to be examined.

• All super-resolution methods, including ours, require accurate motion estimates. However, for a typical video sequence, we will almost surely have inaccurate motion estimates for some frames or regions because of the ill-posed nature of motion estimation. We need to deal with these model failure regions for successful application of the proposed super-resolution algorithms to general video sequences.

• Here, we demonstrated the algorithm for a limited set of parameters, which are probably not optimal. One approach to solve this problem is to employ an iterative scheme, where a single parameter or a set of parameters is optimized with the others held constant, and this is repeated for the rest of the parameters iteratively.

• The stochastic entities in the problem were assumed to be Gaussian random processes. Different statistical models need to be investigated. Even if they do not have analytical solutions, improved reconstruction may be achievable.

Figure 25: (a) Original Aerial image. (b) Original Boat image.


Figure 26: (a) Bilinearly interpolated Aerial image. (Downsampling factor is two.) (b) Reconstructed Aerial image. (Four observations are used.) (c) Bilinearly interpolated Aerial image. (Downsampling factor is four.) (d) Reconstructed Aerial image. (Sixteen observations are used.)


Figure 27: (a) Bilinearly interpolated Boat image. (Downsampling factor is two.) (b) Reconstructed Boat image. (Four observations are used.) (c) Bilinearly interpolated Boat image. (Downsampling factor is four.) (d) Reconstructed Boat image. (Sixteen observations are used.)


Table 4: Mean square error (MSE) comparison of different methods for the Aerial image.

Method                   MSE of the reconstructed image for different quantization factors
                          0.25    0.50    0.75    1.00    1.25    1.50
Bilinear interpolation   109.1   130.4   153.5   176.6   197.9   215.5
Spatial-domain POCS       21.1    63.4   105.0   139.4   168.1   193.2
Spatial-domain MAP        21.2    60.2   102.2   141.9   172.6   196.3
DCT-domain MAP            19.8    55.3    90.3   119.4   147.5   169.6

Table 5: Mean square error (MSE) comparison of different methods for the Boat image.

Method                   MSE of the reconstructed image for different quantization factors
                          0.25    0.50    0.75    1.00    1.25    1.50
Bilinear interpolation   143.5   154.9   167.8   179.9   192.0   206.2
Spatial-domain POCS       17.1    42.5    67.1    89.1   109.6   129.6
Spatial-domain MAP        17.1    43.4    68.3    89.7   108.4   126.3
DCT-domain MAP            16.4    38.6    60.8    81.4   101.7   119.1


Figure 28: Results for License Plate sequence; quantization factor is 0.25; enhancement factor is four. (a) to (f) are the images used in reconstruction. (g) The reconstructed image using the DCT-domain MAP algorithm.


Figure 29: Results for Text sequence; quantization factor is 0.25; enhancement factor is two. (a) to (f) are the images used in reconstruction. (g) The reconstructed image using the DCT-domain MAP algorithm.


CHAPTER IV

SUPER-RESOLUTION FOR FACE RECOGNITION

4.1 Introduction

The performance of existing face recognition systems decreases significantly if the resolution of the face image falls below a certain level. This is especially critical in surveillance imagery, where often only a low-resolution video sequence of the face is available. If these low-resolution images are passed to a face recognition system, the performance is usually unacceptable. Therefore, super-resolution techniques have been proposed for face recognition that attempt to obtain a high-resolution face image by combining the information from multiple low-resolution images [9, 5, 43, 14]. In general, super-resolution algorithms try to regularize the ill-posedness of the problem using prior knowledge about the solution, such as smoothness or positivity. (See Section 1.1.2 for a comprehensive literature review on super-resolution.) Recently, researchers have proposed algorithms that attempt to use model-based constraints in regularization. While [9] demonstrates how super-resolution (without model-based priors) can improve the face recognition rate, [5], [43], and [14] provide super-resolution algorithms that use face-specific constraints for regularization.

All these systems propose super-resolution as a separate preprocessing block in front of a face recognition system. In other words, their main goal is to construct a high-resolution, visually improved face image that can later be passed to a face recognition system for improved performance. This is perfectly valid as long as computational complexity is not an issue. However, in a real-time surveillance scenario, where the super-resolution algorithm is expected to work on continuous video streams, computational complexity is usually a very critical issue. In this chapter, we propose an efficient super-resolution method for face recognition that transfers the super-resolution problem from the pixel domain to a low-dimensional face space. This is based on the observation that nearly all state-of-the-art face recognition systems use some kind of front-end dimensionality reduction, and that much of the redundant information generated by the preprocessing super-resolution algorithm is not used by the face recognition block. Hence, we perform the super-resolution reconstruction in the low-dimensional framework so that only the necessary information is reconstructed. In addition, we show that face-space super-resolution is more robust to registration errors and noise than pixel-domain super-resolution because of the addition of model-based constraints.

There are two important sources of noise in this problem. One is the observation noise that results from the imaging system; the other is the representation error, which is a result of the dimensionality reduction. We derive the statistics of these noise processes for the low-dimensional face space by using examples from the human face image class. Substituting this model-based information into the algorithm provides a higher robustness to noise. We test our system on both real and synthetic video sequences.

Currently, by far the most popular dimensionality reduction technique in face recognition is to use subspace projections based on the Karhunen-Loeve Transform (KLT). This type of dimensionality reduction has been central to the development of face recognition algorithms for the last ten years. We propose to use a similar KLT-based dimensionality reduction technique to decrease the computational cost of the super-resolution algorithm by transforming it from a problem in the pixel domain to a problem in the lower-dimensional subspace, which we call the face subspace.

In Section 4.2, we briefly review the KLT-based dimensionality reduction method for face recognition. Then, in Section 4.3, we formulate the super-resolution problem in the low-dimensional framework and provide the details of the reconstruction algorithm. Although the derivations are similar to those in the previous chapter, we do not skip them, for the sake of self-completeness. Section 4.4 analyzes the computational advantage of the proposed approach, and Section 4.5 provides experimental results addressing several issues, such as sensitivity to noise and motion estimation errors. Conclusions are given in Section 4.6.


4.2 Dimensionality Reduction for Face Recognition

KLT-based dimensionality reduction for face images was first proposed by Sirovich and Kirby [62]. They showed that face images could be represented efficiently by projecting them onto a low-dimensional linear subspace that is computed using the KLT. Later, Turk and Pentland demonstrated that this subspace representation could be used to implement a very efficient and successful face recognition system [76]. Since then, eigenface-based dimensionality reduction has been used widely in face recognition.

Mathematically, the eigenface method tries to represent a face image as a linear combination of orthonormal vectors, called eigenfaces. These eigenfaces are obtained by finding the eigenvectors of the covariance matrix of the training face image set. Let I_1, I_2, . . . , I_K be a set of K face images, each ordered lexicographically. The eigenvectors of the matrix

C = \sum_{i=1}^{K} I_i I_i^T   (43)

that correspond to the largest L eigenvalues span a linear subspace that can reconstruct the face images with minimum reconstruction error in the least squares sense. This L-dimensional subspace is called the face space. Assuming x is a lexicographically ordered face image and Φ is the matrix that contains the eigenfaces as its columns, we write

x = Φ a + e_x,   (44)

where a is the feature vector that represents the face, and e_x is the subspace representation error for the face image. As a larger training data set is used and the dimensionality of the face space is increased, the representation error e_x gets smaller. Letting

a ≜ [ a_1  a_2  ⋯  a_L ]^T   (45)

be the feature vector, and

Φ ≜ [ φ_1  φ_2  ⋯  φ_L ]   (46)

be the matrix where φ_1, . . . , φ_L are the eigenface vectors, a_i is computed as follows:

a_i = φ_i^T x,  for i = 1, . . . , L.   (47)
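A small sketch of the eigenface computation and projection in (43)-(47) is given below. It follows the definition in (43) directly (no mean subtraction) and obtains the eigenvectors of C through the SVD of the data matrix; the array sizes mirror the 40 × 40 training images used later in this chapter, and the function names are illustrative.

```python
import numpy as np

def eigenfaces(images, L):
    """First L eigenvectors of C = sum_i I_i I_i^T in (43), via the SVD of the data matrix."""
    _, _, Vt = np.linalg.svd(np.asarray(images, dtype=float), full_matrices=False)
    return Vt[:L].T                          # P x L matrix Phi, eigenfaces as columns

def project(Phi, x):
    """Feature vector a = Phi^T x, i.e., a_i = phi_i^T x as in (47)."""
    return Phi.T @ x

rng = np.random.default_rng(0)
train = rng.random((134, 1600))              # e.g., 134 lexicographically ordered 40x40 images
Phi = eigenfaces(train, L=60)
a = project(Phi, train[0])
```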

4.3 Face-Space Super-Resolution Reconstruction

In this section, we formulate the super-resolution problem in the low-dimensional face subspace. In such a formulation, the observations are inaccurate feature vectors of a subject, and the reconstruction algorithm estimates the true feature vector.

4.3.1 Imaging Model

We start with the observation model for pixel-domain super-resolution, and then derive the observation model for face-space super-resolution using the eigenface representation. In pixel-domain super-resolution, the observations are low-resolution images that are related to a high-resolution image by a linear mapping. By ordering images lexicographically, such a relation can be written in matrix-vector notation as follows:

y^{(i)} = H^{(i)} x + n^{(i)},  for i = 1, . . . , M   (48)

where x is the unknown high-resolution image, y^{(i)} is the ith low-resolution image observation, H^{(i)} is a linear operator that incorporates the motion, blurring, and downsampling processes, n^{(i)} is the noise vector, and M is the number of observations. Assuming that s is the downsampling factor (0 < s < 1), and that the high-resolution image is of dimension N × N, y^{(i)}, H^{(i)}, x, and n^{(i)} have dimensions s²N² × 1, s²N² × N², N² × 1, and s²N² × 1, respectively. The matrix H^{(i)} can be written as

H^{(i)} = D^{(i)} B^{(i)} W^{(i)},   (49)

where D^{(i)}, B^{(i)}, and W^{(i)} are the downsampling, blurring, and motion warping matrices, respectively. An overview of such modeling is given in the previous chapter, and more details can be found in [57, 23, 50]; therefore, we will not elaborate on it in this chapter. (Here, we also want to note that it is possible to include an upsampling matrix in H^{(i)} that will make the sizes of y^{(i)} and x equal.)

The images x and y^{(i)} have components that lie in and are orthogonal to the face space. Only the components that lie in the face space are necessary in recognition. We will now derive the observation model for the reconstruction of the components that lie in the face space. The formulation and reconstruction algorithm will not neglect the spatial-domain observation noise and the subspace representation error, which is initially orthogonal to the face space but which has an effect during the imaging process. We start by writing the face-space representations:

x = Φ a + e_x,   (50)

y^{(i)} = Ψ â^{(i)} + e_y^{(i)},  for i = 1, . . . , M   (51)

where Φ and Ψ are N² × L and s²N² × L matrices that contain the eigenfaces in their columns, â^{(i)} is the L × 1 feature vector that is associated with the ith observation, and e_x and e_y^{(i)} are the N² × 1 and s²N² × 1 representation error vectors. Note that we have two different eigenvector bases, Φ and Ψ, corresponding to high- and low-resolution face images, respectively. (If we had included an upsampling matrix in H^{(i)}, then we could use the same basis matrix.) We substitute equations (50) and (51) into (48) to obtain

Ψ â^{(i)} + e_y^{(i)} = H^{(i)} Φ a + H^{(i)} e_x + n^{(i)}.   (52)

Now, we will project equation (52) into the lower-dimensional face space using the fact that the representation errors e_y^{(i)} are orthogonal to the face space Ψ. Since

Ψ^T e_y^{(i)} = 0,  for i = 1, . . . , M,   (53)

and

Ψ^T Ψ = I,   (54)

and by multiplying both sides of equation (52) by Ψ^T on the left, we obtain

â^{(i)} = Ψ^T H^{(i)} Φ a + Ψ^T H^{(i)} e_x + Ψ^T n^{(i)}.   (55)

This is the observation equation that is analogous to (48). It gives the relation between the unknown "true" feature vector a and the observed "inaccurate" feature vectors â^{(i)}.

In the traditional way of applying super-resolution, the unknown high-resolution image x in (48) is reconstructed from the low-resolution observations y^{(i)}. Then, the reconstructed x is fed into a face recognition system. (See Figure 30.)

Figure 30: Super-resolution applied as a preprocessing block to face recognition.

For eigenface-based face recognition systems, a better way is to directly reconstruct the low-dimensional feature vector. Using the relation provided in (55), accurate feature vectors of a face image can be obtained from the inaccurate feature vector observations. This is illustrated in Figure 31. The face observations y^{(i)} are first projected into the face space, and the computationally intensive super-resolution reconstruction is performed in the low-dimensional face subspace instead of in the spatial domain. A quantitative comparison of the computational complexity of these two approaches is provided in the next section.

While we are reconstructing the feature vectors in the low-dimensional subspace, we can (and will) substitute face-specific information in the form of statistics of the prior distributions of the feature vectors and distributions of the noise processes. Using model-based information in regularizing the super-resolution algorithm has been shown to be successful in previous work [5, 43, 14]. This helps to obtain more robust results when compared to traditional super-resolution algorithms. Our experiments in this chapter also confirm the advantages of using such model-based information. Our main difference, however, with respect to previous model-based algorithms is that we specifically transform all of the prior information to the low-dimensional face space so that the computational complexity is kept low with little or no sacrifice in performance. This is in contrast to previous approaches that use complicated pixel-domain model-based statistical information.

4.3.2 Reconstruction Algorithm

In this section we present a reconstruction algorithm to solve (55) based on Bayesian estimation. The algorithm handles the observation noise and subspace representation error in the low-dimensional face subspace.

Figure 31: Super-resolution embedded into eigenface-based face recognition.

The maximum a posteriori probability (MAP) estimator ã is the argument that maximizes the product of the conditional probability p(â^{(1)}, ⋯, â^{(M)} | a) and the prior probability p(a):

ã = arg max_a { p(â^{(1)}, ⋯, â^{(M)} | a) p(a) }.   (56)

We now need to model the statistics p(â^{(1)}, ⋯, â^{(M)} | a) and p(a). The prior probability p(a) can simply be assumed to be jointly Gaussian:

p(a) = \frac{1}{Z} \exp( −\frac{1}{2} (a − µ_a)^T Λ^{-1} (a − µ_a) ),   (57)

where Λ is the L × L covariance matrix, µ_a is the L × 1 mean of a, and Z is a normalization constant. In order to find p(â^{(1)}, ⋯, â^{(M)} | a), we first model the noise process in the spatial domain, and then derive its statistics in face space. We define a total noise term v^{(i)} that consists of the noises resulting from the subspace representation error e_x and the observation noise n^{(i)} in the spatial domain:

v^{(i)} ≡ H^{(i)} e_x + n^{(i)}.   (58)

Using this definition, we rewrite equation (55) for convenience:

â^{(i)} = Ψ^T H^{(i)} Φ a + Ψ^T v^{(i)}.   (59)

The reason we defined H^{(i)} e_x + n^{(i)} as the total noise term, instead of Ψ^T H^{(i)} e_x + Ψ^T n^{(i)}, is the modeling convenience in the spatial domain. It has been demonstrated that modeling the noise (resulting from the imaging system and the estimation of H^{(i)}) in the spatial domain as an independent identically distributed (IID) Gaussian process is a good assumption [57, 23]. We assume that the covariance matrix of this Gaussian process is diagonal so that the statistical parameters can be estimated easily even with the limited training data. Using these assumptions, it is easy to find the distribution of Ψ^T v^{(i)} in the face space, as will be shown shortly.

Defining K as the s²N² × s²N² positive definite diagonal covariance matrix and µ_v^{(i)} as the s²N² × 1 mean of v^{(i)}, we write the probability distribution of v^{(i)} as

p(v^{(i)}) = \frac{1}{Z} \exp( −\frac{1}{2} (v^{(i)} − µ_v^{(i)})^T K^{-1} (v^{(i)} − µ_v^{(i)}) ),   (60)

where Z is a normalization constant. Now, we need to derive the distribution of the projected noise, p(Ψ^T v^{(i)}), in order to get the conditional PDF p(â^{(1)}, ⋯, â^{(M)} | a). From the analysis of functions of multivariate random variables [65], it follows that p(Ψ^T v^{(i)}) is also jointly Gaussian since Ψ^T Ψ is nonsingular (by construction). As a result, we have

p(Ψ^T v^{(i)}) = \frac{1}{Z} \exp( −\frac{1}{2} (Ψ^T v^{(i)} − Ψ^T µ_v^{(i)})^T Q^{-1} (Ψ^T v^{(i)} − Ψ^T µ_v^{(i)}) ),   (61)

where Ψ^T µ_v^{(i)} is the new mean and Q is the new covariance matrix computed by

Q = Ψ^T K Ψ.   (62)

The covariance matrix Q has dimension L × L while K is of dimension s²N² × s²N². Using (59) and (61), we find the conditional PDF p(â^{(i)} | a):

p(â^{(i)} | a) = \frac{1}{Z} \exp( −\frac{1}{2} (â^{(i)} − Ψ^T H^{(i)} Φ a − Ψ^T µ_v^{(i)})^T Q^{-1} (â^{(i)} − Ψ^T H^{(i)} Φ a − Ψ^T µ_v^{(i)}) ).   (63)

Since we assumed that v^{(i)} is IID, it follows that the probability density function p(â^{(1)}, ⋯, â^{(M)} | a) is the product of p(â^{(i)} | a) for i = 1, ⋯, M. Defining η ≜ Ψ^T µ_v^{(i)} as the mean of the process Ψ^T v^{(i)}, we write

p(â^{(1)}, ⋯, â^{(M)} | a) = \frac{1}{Z} \exp( −\frac{1}{2} \sum_{i=1}^{M} (â^{(i)} − Ψ^T H^{(i)} Φ a − η)^T Q^{-1} (â^{(i)} − Ψ^T H^{(i)} Φ a − η) ).   (64)

Substituting the conditional and prior PDFs given in (57) and (64) into (56), we obtain the MAP estimator ã as follows:

ã = arg min_a { \sum_{i=1}^{M} (â^{(i)} − Ψ^T H^{(i)} Φ a − η)^T Q^{-1} (â^{(i)} − Ψ^T H^{(i)} Φ a − η) + (a − µ_a)^T Λ^{-1} (a − µ_a) }.   (65)

So far, we have shown how to incorporate the statistics of spatial-domain noise and prior information into the low-dimensional face-space reconstruction. In the next section, we estimate the parameters for these assumed models and provide experiments analyzing the recognition performance, effects of feature vector length, sensitivity to noise and motion estimation errors, etc. Before getting to the experimental results, we provide an algorithm to solve equation (65).

One approach to obtain the MAP estimate ã is an iterative steepest descent method. Defining E(a) as the cost function to be minimized, the feature vector a can be updated in the direction of the negative gradient of E(a). That is, at the nth iteration, the feature vector can be updated as follows:

a_n = a_{n−1} − α ∇E(a_{n−1}),   (66)

where α is the step size. From (65), a slightly generalized cost function is chosen as

E(a) = \frac{1−λ}{2} \sum_{i=1}^{M} (â^{(i)} − Ψ^T H^{(i)} Φ a − η)^T Q^{-1} (â^{(i)} − Ψ^T H^{(i)} Φ a − η) + \frac{λ}{2} (a − µ_a)^T Λ^{-1} (a − µ_a),   (67)

where λ is a number, 0 ≤ λ ≤ 1, that controls the relative contribution of the prior information in the reconstruction. When λ is set to zero, the estimator becomes a maximum likelihood (ML) estimator. When λ is one, only the prior information is used, and the noise statistics are discarded. λ = 1/2 corresponds to the original MAP estimator. Taking the derivative of E(a) with respect to a, the gradient of E(a) can be calculated as

∇E(a) = −(1−λ) \sum_{i=1}^{M} Φ^T H^{(i)T} Ψ Q^{-1} (â^{(i)} − Ψ^T H^{(i)} Φ a − η) + λ Λ^{-1} (a − µ_a).   (68)

Although the step size α in (66) can be chosen as fixed, a better way is to update it using the Hessian of E(a). In this case, α is updated at each iteration using the formula

α = \frac{(∇E(a_{n−1}))^T ∇E(a_{n−1})}{(∇E(a_{n−1}))^T H ∇E(a_{n−1})},   (69)

where H is the Hessian matrix found by

H = (1−λ) \sum_{i=1}^{M} Φ^T H^{(i)T} Ψ Q^{-1} Ψ^T H^{(i)} Φ + λ Λ^{-1}.   (70)

In the reconstruction, everything but a, â^{(i)}, and H^{(i)} is known and can be computed in advance. (The details are left to the next section.) For a specific observation sequence y^{(i)}, the feature vectors â^{(i)} and the blur mappings H^{(i)} are computed, and the true feature vector a is reconstructed. The pseudo-code of the complete algorithm is as follows:

1. Choose a reference frame from the video sequence, bilinearly interpolate it, and project it onto the face space to obtain an initial estimate a_0 for the true feature vector.

2. Obtain the feature vector â^{(i)} by projecting each low-resolution frame onto the face space. (That is, â^{(i)} = Ψ^T y^{(i)}.)

3. Estimate the motion between the reference and other frames, and compute H^{(i)}.

4. Determine the maximum number of iterations, MaxIter.

5. For n = 1 to MaxIter,
   (a) Compute ∇E(a) using (68).
   (b) Compute H using (70).
   (c) Compute α using (69).
   (d) Compute a_n using (66).

6. Set the MAP estimate ã to a_{MaxIter}.
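The iteration in steps 4-6 might look as follows in a minimal NumPy sketch, assuming the matrices Ψ, Φ, H^{(i)} and the statistics Q^{-1}, Λ^{-1}, η, µ_a have already been formed; the function name and argument layout are illustrative.

```python
import numpy as np

def face_space_sr(a_obs, H_list, Phi, Psi, Q_inv, Lam_inv, eta, mu_a,
                  lam=0.5, max_iter=7):
    """Steepest-descent solution of (65), following steps 4-6 above (a sketch)."""
    a = mu_a.copy()                                  # stand-in for the initial estimate a_0
    A = [Psi.T @ H @ Phi for H in H_list]            # precompute Psi^T H^(i) Phi
    for _ in range(max_iter):
        grad = lam * Lam_inv @ (a - mu_a)            # prior term of (68)
        for Ai, ai in zip(A, a_obs):
            grad -= (1 - lam) * Ai.T @ (Q_inv @ (ai - Ai @ a - eta))
        Hg = lam * Lam_inv @ grad                    # Hessian (70) applied to the gradient
        for Ai in A:
            Hg += (1 - lam) * Ai.T @ (Q_inv @ (Ai @ grad))
        alpha = grad @ grad / (grad @ Hg)            # adaptive step size (69)
        a = a - alpha * grad                         # update (66)
    return a
```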


4.4 Computational Complexity

We now take a look at the computational complexity of the proposed algorithm compared to a pixel-domain reconstruction. Let P be the total number of pixels in a (high-resolution) face image, Q be the length of the feature vectors, and s be the downsampling factor (0 < s < 1). Excluding the motion estimation stage of the reconstruction, most of the computational cost results from the computation of Ψ^T H^{(i)} Φ a. According to our image acquisition model, H^{(i)} can be represented as the successive application of motion warping, PSF blurring, and downsampling. Since the blurring and downsampling operations are time-invariant, only the motion warping operation needs to be computed for each observation separately. Denoting W^{(i)}, B, and D as the motion warping, blurring, and downsampling matrices, respectively, we need three matrix-vector multiplications to compute Ψ^T H^{(i)} Φ a, where H^{(i)} = D B W^{(i)}. The first one is Φa, which requires approximately 2PQ multiplications and additions. This is then multiplied by W^{(i)}, which requires 2P² multiplications and additions. This is followed by a multiplication with the Q × P matrix Ψ^T B D, which can be precomputed and stored. The total number of multiplications and additions is therefore approximately 2P² + 4PQ. On the other hand, doing these operations in the pixel domain (using the matrix H^{(i)}) requires 2P² + 4sP² operations. Referring to the gradient and Hessian matrix computations (equations (68) and (70)), the eigenface-space reconstruction requires roughly 3M(4sP² − 4PQ) fewer operations per iteration than the spatial-domain reconstruction does. (In our experiments, P is 1600, s is 0.25, Q is 40, and M is 16.)
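As a rough check with these values, 4sP² = 4 · 0.25 · 1600² = 2,560,000 and 4PQ = 4 · 1600 · 40 = 256,000, so the saving is approximately 3M(4sP² − 4PQ) = 48 · 2,304,000 ≈ 1.1 × 10⁸ multiplications and additions per iteration.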

4.5 Experimental Results

We performed a set of experiments to demonstrate the efficacy of the proposed method. We investigated the effect of the face-space dimension and the sensitivity to noise and motion estimation errors. We have also performed a recognition experiment with real video sequences. We explain each step of the experiments in detail below.

A. Obtaining the Face Subspace

In these experiments, we used face images from the Yale face databases A and B [26], the Harvard Robotics Laboratory database [30], the AR database [46], and the CMU database [61].

The images are downsampled to have a size of 40 × 40, and aligned according to the manually located eye and mouth locations. We selected 134 images as training data and 50 images as test data. We applied the KLT to those 134 images and chose the first 60 eigenvectors having the largest eigenvalues to form the face subspace. (These 60 eigenvectors form the columns of the matrix Φ.) We also downsampled the training images by four to obtain 10 × 10 images, applied the KLT to those images, and chose the first 60 of them to construct the eigenface space Ψ.

B. Obtaining Low-Resolution Observations for Synthetic Video

The test images were jittered by a random amount to simulate motion, blurred, and downsampled by a factor of four to produce multiple low-resolution images for each subject. The motion vectors were saved for use in the synthetic video experiments. For blurring, the images were convolved with a point spread function (PSF), which was set to a 5 × 5 normalized Gaussian kernel with zero mean and a standard deviation of one pixel.

C. Estimating the Statistics of Noise and Feature Vectors

From the training image set I_1, ⋯, I_K (K = 134), we estimate the statistics of a and v^{(i)}. The unbiased estimates for the mean and covariance matrix of a are simply obtained from the sample mean and variances:

µ_a ≃ \frac{1}{K} \sum_{j=1}^{K} Φ^T I_j,   (71)

and

Λ ≃ \frac{1}{K} \sum_{j=1}^{K} (Φ^T I_j − µ_a)(Φ^T I_j − µ_a)^T.   (72)

Because of the limited number of training images, for more reliable estimation, we assume a diagonal covariance matrix, so the off-diagonal elements of the matrix Λ are set to zero.
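A short sketch of the sample estimates in (71)-(72) with the diagonal restriction is given below; the training images are assumed to be stacked as rows of a K × P array, and the function name is illustrative.

```python
import numpy as np

def feature_statistics(train_images, Phi):
    """Sample mean and diagonal covariance of the feature vectors, as in (71)-(72)."""
    A = train_images @ Phi                  # row j is (Phi^T I_j)^T
    mu_a = A.mean(axis=0)                   # (71)
    var = ((A - mu_a) ** 2).mean(axis=0)    # diagonal of (72); off-diagonals are dropped
    return mu_a, np.diag(var)
```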

The mean and covariance matrices of v^{(i)} are found similarly. Letting y_j^{(i)} be the ith observation of the jth training image (i = 1, ⋯, M and j = 1, ⋯, K), we estimate the mean and covariance matrices as follows:

µ_v ≃ \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} ( y_j^{(i)} − H^{(i)} Φ Φ^T I_j ),   (73)

and

K ≃ \frac{1}{KM} \sum_{j=1}^{K} \sum_{i=1}^{M} ( y_j^{(i)} − H^{(i)} Φ Φ^T I_j − µ_v )( y_j^{(i)} − H^{(i)} Φ Φ^T I_j − µ_v )^T.   (74)

Again, the off-diagonals of K are set to zero. The mean η and covariance matrix Q for Ψ^T v^{(i)} are found using η = Ψ^T µ_v and Q = Ψ^T K Ψ.

D. Reconstruction for Synthetic Video

One of the frames for each video sequence is chosen as the reference frame, bilinearly interpolated by four, and projected onto the face space Φ to obtain the initial estimate for the true feature vector. It is then updated using the algorithm proposed in the previous section. The mapping H^{(i)} is computed from the known motion vectors and PSF, and 16 low-resolution images are used in the reconstruction. The model parameters µ_a, Λ, η, and Q computed in Step C are used in the reconstruction with λ set to 0.5. The number of iterations MaxIter is set to seven for each sequence.

We also wanted to compare the results of this eigenface-domain super-resolution algorithm with a traditional pixel-domain super-resolution. We applied the pixel-domain super-resolution algorithm given in [50] to the low-resolution video sequences, again using the same 16 low-resolution images and setting the number of iterations to seven. After the high-resolution images are reconstructed, they are projected onto the face space Φ to obtain the feature vectors. The feature vectors obtained from these algorithms are compared with the true feature vectors (which are computed using the 40 × 40 original high-resolution images). For each subject (image sequence), we computed the normalized distance between the true feature vector a and the estimated feature vector ã. The normalized distance D(ã, a) is defined as

D(ã, a) = \frac{‖ã − a‖}{‖a‖} × \frac{1}{Length(a)} × 100,   (75)

where Length(a) is the length of the vector a. Figure 32 shows the results for three cases: (i) feature vectors computed from a single observation (no super-resolution applied); (ii) feature vectors computed after pixel-domain super-resolution is applied; and (iii) feature vectors reconstructed using the proposed eigenface-domain super-resolution. As seen in the figure, eigenface-domain super-resolution achieves a performance similar to that of the pixel-domain super-resolution at less computation.

We also provide an example from the face database. Figures 33 and 34 show the results for Subject 1 and Subject 2 in the test data. In these figures, (a) is the original 40 × 40 image, (b) is one of the observations interpolated using nearest neighbor interpolation, (c) is the bilinearly interpolated observation, which is the initial estimate in the reconstruction, (d) is the result of the pixel-domain super-resolution, (e) is the projection of the result in (d) into the face space, and (f) is the representation of the reconstructed feature vector from the eigenface-domain super-resolution algorithm. As seen, (e) and (f) are almost identical, but (f) is obtained at a lower computational burden.

This experiment was done for a face-space dimension of 60, which brings up the question of how the feature vector length (i.e., the dimension of the face space) affects the performance. This question is addressed in the next experiment. We will also demonstrate that eigenface-domain super-resolution is more robust to noise and motion estimation errors than pixel-domain super-resolution.

E. Effect of Feature Vector Size on Reconstruction

We repeated the experiments for various feature vector sizes to examine the effect of the face-space dimension on reconstruction. The results are given in Figure 35. In that figure, the x-axis is the dimension of the face space, and the y-axis is the normalized distance averaged over 50 subjects. Due to the face-space representation error, pixel-domain super-resolution performs better than the eigenface-domain super-resolution at very low face-space dimensions. As expected, as the feature vector size is increased, the performance of the eigenface-domain super-resolution approaches that of the pixel-domain super-resolution. Note that this is the result for the case where there is no observation noise or motion estimation error. As will be shown shortly, when there is noise or motion estimation error, eigenface-domain super-resolution becomes better than the pixel-domain super-resolution even at the low face-space dimensions. This is because the solution obtained in the eigenface domain is constrained by face-specific priors.

F. Effect of Noise on Reconstruction

In order to examine the effects of observation noise, we added zero-mean Gaussian IID noise to each low-resolution video frame.

The experiment is done for a feature vector size of 40, and repeated for each of the 50 video sequences. Figure 36 shows the results for different noise powers. (The x-axis is the variance of the noise, and the y-axis is the average normalized distance.) As seen in that figure, when the noise power is zero, the pixel-domain super-resolution is better than the eigenface-domain super-resolution. However, as the noise power increases, eigenface-domain super-resolution outperforms pixel-domain super-resolution. The reason is that eigenface-domain super-resolution constrains the solution to lie in the face space, and therefore, it is more robust to noise.

G. Effect of Motion Estimation Error on Reconstruction

In addition to the robustness to observation noise, eigenface-domain super-resolution is also more robust to motion estimation errors than pixel-domain super-resolution. This time, we perturbed each true motion vector with a zero-mean Gaussian IID random vector to simulate the motion estimation error. The face-space dimension for this experiment is again 40. As seen in Figure 37, as the motion estimation error increases, the pixel-domain super-resolution becomes worse than the eigenface-domain super-resolution immediately. It is also observed that the pixel-domain super-resolution becomes even worse than using only one image to get the feature vector. Again, eigenface-domain super-resolution is less sensitive to motion estimation errors because of the face-space regularization.

H. Recognition Experiment with Real Video Sequences

Finally, we tested the proposed algorithm with real video sequences from the CMU database. We performed a recognition experiment with a database of 68 people. For each person, we selected the neutral face image from the facial expression part of the database as the training image. We manually located the positions of the eyes and the mouth in those images, cropped them according to those locations, downsampled them to a size of 40 × 40, and projected them into the eigenspace to get the training feature vector for each person.

To perform recognition, we used the talking video sequences provided in the CMU database. Each sequence contains a single person talking for two seconds. We had a total of 68 such sequences, one for each person in our database. The goal of our recognition experiment is to identify the person who appears in the video. We used 16 consecutive images from each video sequence.

The original sequences are very high resolution, so we downsampled them so that the face is around 40 pixels wide. The resulting sequences form our high-resolution face image sequences, and we use them as the ground truth to evaluate the success of our experiments. We then blurred these face image sequences (using the PSF given in Step B) and downsampled them (by four) to form the low-resolution observations. These low-resolution image sequences are the input images for the recognition experiment. We manually located the positions of the eyes and the mouth in the first frame of these image sequences. Then, we ran three different recognition experiments.

In the first experiment, we used the first image from each low-resolution image sequence for recognition. We cropped the faces from the frames according to the locations of the eyes and the mouth, projected them into the eigenface space, and performed minimum distance classification with the L2 norm. The recognition rate in this case was 44%.

In the second experiment, we again cropped the faces from the first frames of the low-resolution image sequences according to the locations of the eyes and the mouth. Then, we used block-based motion estimation to get the motion vectors from one image frame to the other. In motion estimation, we computed the motion vectors for each pixel with quarter-pixel accuracy. We set the block size and the search range to 8 and ±8, respectively, and we found the motion vectors for each pixel by performing a full search with the mean absolute difference as the matching criterion. Then, we projected all low-resolution face images into the eigenface space, and performed eigenface-space super-resolution to construct an accurate feature vector for each person. The recognition experiment in this case provided a recognition rate of 74%.

In the third experiment, we used the first frame of each high-resolution video sequence to perform recognition. The recognition rate with these high-resolution images was 79%.

The results reported above show that the decrease in the resolution of the face image decreases the recognition rate significantly. (In our experiments, the decrease was from 79% to 44%.) With the super-resolution reconstruction, the recognition rate improved significantly, and got close to the recognition rate obtained with high-resolution images.


4.6 Conclusions

The performance of face recognition systems decreases significantly if the resolution of the face image falls below a certain level. For video sequences, super-resolution techniques can be used to obtain a high-resolution face image by combining the information from multiple low-resolution images. Although super-resolution can be applied as a separate preprocessing block, in this chapter, we propose to apply super-resolution after dimensionality reduction in a face recognition system. In this way, only the necessary information for recognition is reconstructed. We have also shown how to incorporate the model-based information into the face-space reconstruction algorithm. This helps to obtain more robust results when compared to the traditional super-resolution algorithms. In the experiments, we demonstrated robustness to noise and motion estimation error. We have investigated the effect of face-space dimension on the reconstruction, and provided recognition results for real video sequences.

This chapter only examines the case for face images; however, the idea can be extended to other pattern recognition problems easily. One such application is the recognition of car license plates from video. The text on the plates can be reconstructed in a "text space", where letters and numerals are used as the training set. One exciting extension of the face-space super-resolution is the use of 3D models in reconstruction. This extension would improve the recognition performance by providing robustness against changes in pose, expression, and lighting.


Figure 32: Error in feature vector computation.


Figure 33: (a) Original 40×40 image. (b) 10×10 low-resolution observation is interpolated using nearest neighbor interpolation. (c) 10 × 10 low-resolution observation is interpolated using bilinear interpolation. (d) Pixel-domain super-resolution applied. (e) The result of pixel-domain super-resolution reconstruction is projected into the face subspace. (f) Representation of the feature vector reconstructed using the eigenface-domain super-resolution in the face subspace.



Figure 34: (a) Original 40×40 image. (b) 10×10 low-resolution observation is interpolated using nearest neighbor interpolation. (c) 10 × 10 low-resolution observation is interpolated using bilinear interpolation. (d) Pixel-domain super-resolution applied. (e) The result of pixel-domain super-resolution reconstruction is projected into the face subspace. (f) Representation of the feature vector reconstructed using the eigenface-domain super-resolution in the face subspace.


Figure 35: Effect of feature vector length on performance.


Figure 36: Effect of observation noise on performance.


Figure 37: Effect of motion estimation error on performance.


CHAPTER V

JOINT SPATIAL AND GRAY-SCALE ENHANCEMENT

5.1 Introduction

As we have seen in Chapters 1 and 3, there has been considerable work done in the area of super-resolution reconstruction. All of this work assumes that there is no illumination change during the imaging process. However, the observations may provide diverse information in the gray-scale domain due to changes of illumination in the scene or imaging device adjustments such as exposure time, gain, or white balance [44, 52, 10, 11]. All of these gray-scale changes should be modeled and compensated for an effective and robust super-resolution reconstruction.

The straightforward approach would be a two-step scheme. First, assume an illumination model that consists of a gain and an offset parameter, estimate these illumination parameters, and normalize the observations accordingly. Second, apply a super-resolution reconstruction algorithm to those normalized observations. In this chapter, we propose a better approach. We modify the imaging process model such that the gray-scale changes are part of it. This type of imaging model results in an effective reconstruction algorithm that can naturally handle illumination changes, device adjustments (exposure time, gain, etc.), the camera response function, and quantization. Specifically, we define constraint sets based on the quantization bound information, and employ an iterative set-theoretic technique, namely projections onto convex sets, to improve the spatial resolution. Since we remove the restriction on gray-scale diversity, the reconstructed image also has more information in the gray-scale domain than any of the observations does.

The rest of the chapter is organized as follows. In the next section we present an observation model that establishes the connection between a scene and multiple observations of that scene that are limited in spatial and gray-scale resolution/extent. Based on this model, an image fusion algorithm that enhances the imagery in both the gray-scale and spatial domains is proposed in Section 5.3. This algorithm is an iterative set-theoretic method that uses constraint sets derived from the gray-scale quantization information. Section 5.4 presents some experimental results, and Section 5.5 concludes the chapter.

5.2 Imaging Model

In this section, we extend the image acquisition model given in Chapter 3 to include the factors that act on the gray-scale domain (such as exposure time, white balance adjustment, and saturation) in addition to the factors that act on the spatial domain (such as the point spread function of the sensors and sampling). The extended model includes the saturation and digitization processes that are mentioned in Chapter 1, and adds gain and offset blocks to handle changes in illumination. This extended model is illustrated in Figure 38.

Light coming from the spectrally and spatially varying scene q(t; λ; x, y, z) is passed through an optical system to form a two-dimensional real-valued function f(x, y) on a sensor array. It is assumed that the scene is static (time-invariant) during the exposure time. A spectral filter captures a certain portion of the light spectrum. The sensors may have a nonlinear response to the amount of light falling on the sensor surface. This nonlinearity is usually modeled by an exponential function, which is known as the gamma factor. Most commercial cameras have a built-in gamma correction circuit that linearizes the relationship between the impinging light intensity and the image output level.

The rest of the imaging system converts the signal f(x, y), which is continuous in the spatial and gray-scale domains, into a digital image. In practice, the signal is reconstructed on a discrete grid (n_1, n_2) instead of the continuous coordinates (x, y). Incorporating the motion between observations, the super-resolution algorithms model the imaging process as a linear mapping between a high-resolution input signal f(n_1, n_2) and low-resolution observations g_i(l_1, l_2). This mapping includes motion (of the camera or the objects in the scene), blur (caused by the point spread function of the sensor elements and the optical system), and downsampling. This is the observation model that is presented in Chapter 3 and is used by the current super-resolution algorithms.


We repeat the formulation for this model here for convenience:

g_i(l_1, l_2) = \sum_{n_1, n_2} h_i(l_1, l_2; n_1, n_2) f(n_1, n_2),   (76)

where (n_1, n_2) and (l_1, l_2) are the discrete coordinates of the high- and low-resolution images, respectively, i is the observation number, and h_i(l_1, l_2; n_1, n_2) is the linear mapping that incorporates motion, blurring, and downsampling. (Notice that we neglected the additive sensor noise in (76). This leads to a neat formulation in a set-theoretic framework; however, it is still possible to modify the algorithm to include the noise power. We comment on this issue later in this chapter.) Although h_i(l_1, l_2; n_1, n_2) can be a space- and time-varying function, it is usually taken to be invariant with respect to both time and space. In practice, the forward mapping formulated in (76) is implemented in three steps: spatial warping to compensate for motion, convolution with a PSF, and downsampling. (See Chapter 3 on this implementation issue.)

The model formulated in (76) assumes that the observations are captured under the same illumination conditions and camera settings. However, this assumption is not always valid. It is possible that the observations are obtained with different camera settings (such as exposure time, gain, offset, etc.) in addition to potential illumination changes in the scene itself. The pixel intensities are also limited to a certain dynamic range and are quantized to a certain number of bits. To include the effects acting on the gray-scale domain, we can extend the equation given in (76) as follows:

z_i(l_1, l_2) = Q\{ F[ η_i ( \sum_{n_1, n_2} h_i(l_1, l_2; n_1, n_2) f(n_1, n_2) ) + µ_i ] \},   (77)

where F[·] is the dynamic range saturation function, Q{·} is the gray-scale quantizer, and η_i and µ_i are the illumination gain and offset, respectively. (See Figure 38.) In this model, the illumination gain and offset are assumed to be spatially uniform. This is a valid assumption unless there is a nonuniform illumination change within the scene; for the case of nonuniform illumination effects, the parameters η_i and µ_i can be modeled as functions of the spatial coordinates. A typical dynamic range saturation function F[·] is depicted in Figure 39. It is linear at the midrange intensities, and becomes flat at low and high intensities. The quantizer Q{·} is assumed to be uniform with a finite number of gray levels. (Because most modern processors are byte-oriented, 8-bit representations of pixel intensities are widely used.) The joint effect of the dynamic range saturation function and the quantizer is a quantizer with nonuniform step sizes: the quantization noise is larger at low and high pixel intensities than at the midrange.

Figure 38: Imaging model includes effects acting on the gray-scale domain as well as on the spatial domain. In our formulation we will not try to reconstruct q(t; λ; x, y, z) but f(x, y).
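A minimal sketch of the gray-scale portion of (77), applied to an already blurred and downsampled frame, is given below. The logistic curve used for F[·] is a hypothetical stand-in for a measured camera response (its midpoint and slope constants are purely illustrative), and the quantizer is the uniform 8-bit case discussed above.

```python
import numpy as np

def grayscale_observation(low_res, gain, offset, midpoint=128.0, slope=40.0, levels=256):
    """Gray-scale part of (77): gain/offset, saturation F[.], and quantization Q{.}."""
    v = gain * low_res + offset
    # Hypothetical S-shaped saturation curve: flat at low and high intensities,
    # roughly linear at the midrange (stands in for the measured camera response).
    saturated = (levels - 1) / (1.0 + np.exp(-(v - midpoint) / slope))
    return np.round(saturated).astype(np.uint8)     # uniform 8-bit quantizer
```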

5.3 Joint Gray-Scale and Spatial Domain Enhancement

In this section we present a new image fusion algorithm for joint gray-scale and spatial enhancement. We define constraint sets using the quantization bounds of the pixel intensities, and employ a projections onto convex sets (POCS)-based algorithm to produce an image of higher spatial and gray-scale information content.

5.3.1 Constraint Sets from Amplitude Quantization

As formulated in (77), dynamic range compression and quantization introduce quantization error in the pixel intensities of the observations. The quantization error is not uniform because of the nonlinearity of the saturation function: the error is small at the midrange and gets larger towards the ends of the dynamic range. Although the exact value of the quantization error cannot be determined from the measurements, the bounds within which it lies can be determined when the saturation curve of the camera is known. As illustrated in Figure 39, the input intensities are mapped by the nonlinear saturation function and then quantized to a certain number of bits, typically eight. For each intensity level, there is a lower bound T_l(·) and an upper bound T_u(·), and these bounds can be determined from the saturation curve. If there were no nonlinearity in the camera response function, these bounds would be uniform for all intensities. However, cameras have a limited dynamic range, and there is saturation at the two ends of the dynamic range. Therefore, the lower T_l(·) and upper T_u(·) bounds on the quantization error are closer to each other for midrange pixel intensities compared to low- and high-end pixel intensities.

Figure 39: Pixel intensities are compressed and digitized observations of real-valued quantities. For each intensity level, there is a lower bound T_l(·) and an upper bound T_u(·) within which the input intensity lies. These bounds are used to impose constraints on the reconstructed image.

We now define constraint sets using the bounds of the gray-scale data, and employ a POCS-based algorithm. The POCS technique produces solutions that are consistent with the information arising from the observed data or knowledge about the solution.


of information is associated with a constraint set in the solution space; the intersection of these sets represents the space of acceptable solutions [21]. By projecting an initial estimate onto these constraint sets iteratively, a solution closer to the original signal is obtained. In this problem, the constraint sets arise from the quantization bounds of pixel intensities. Let Tl (I) and Tu (I) be the lower and upper quantization bounds for a measured pixel intensity I, and let x(n1 , n2 ) be an estimate of the high-quality image f (n1 , n2 ). Then, we write the constraint set using any observed pixel zi (l1 , l2 ) as follows: C [zi (l1 , l2 )] = {x(n1 , n2 ) : Tl (zi (l1 , l2 )) ≤ zˆi (l1 , l2 ) ≤ Tu (zi (l1 , l2 ))} ,

(78)

where zˆi (l1 , l2 ) is the calculated pixel intensity derived from x(n1 , n2 ): Ã

zˆi (l1 , l2 ) = ηi

X

!

hi (l1 , l2 ; n1 , n2 )x(n1 , n2 ) + µi .

(79)

n1 ,n2
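To make the role of these bounds concrete, the sketch below tabulates Tl(·) and Tu(·) by inverting a monotone response curve on a dense grid of input intensities and then checks the constraint of equation (78) for a single pixel. The S-shaped curve, the function names, and the simplification to unit gain and zero offset are illustrative assumptions, not details taken from this chapter.

```python
import numpy as np

def make_bound_tables(response, n_bits=8, n_grid=4096):
    """Tabulate T_l and T_u for every quantized code by inverting a monotone
    camera response curve on a dense grid of input intensities in [0, 1]."""
    levels = 2 ** n_bits
    grid = np.linspace(0.0, 1.0, n_grid)
    codes = np.clip(np.round(response(grid) * (levels - 1)), 0, levels - 1).astype(int)
    t_l = np.full(levels, np.inf)      # smallest input intensity mapped to each code
    t_u = np.full(levels, -np.inf)     # largest input intensity mapped to each code
    for intensity, code in zip(grid, codes):
        t_l[code] = min(t_l[code], intensity)
        t_u[code] = max(t_u[code], intensity)
    return t_l, t_u

# Illustrative S-shaped saturation curve: steep at mid-gray, flat near the ends,
# so the [T_l, T_u] intervals are narrow at the midrange and wide at the extremes.
response = lambda v: 0.5 * (1.0 + np.tanh(4.0 * (v - 0.5)) / np.tanh(2.0))
t_l, t_u = make_bound_tables(response)
print(t_u[128] - t_l[128], t_u[250] - t_l[250])   # narrow midrange vs. wide high end

def satisfies_constraint(z_hat, z_observed, t_l, t_u):
    """Constraint of Eq. (78), with unit gain and zero offset for simplicity:
    the model-predicted intensity z_hat must lie in the quantization interval
    of the observed code z_observed."""
    return t_l[z_observed] <= z_hat <= t_u[z_observed]
```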

When there are multiple observations, the intersection of the constraint sets may result in finer gray-scale resolution and an extended dynamic range. We illustrate this situation in Figure 40, which shows two observations z1 and z2 captured with different gain and offset settings. When normalized over the input intensity, this corresponds to two different scaled and shifted saturation curves. Each pixel intensity in these observations provides a quantization interval to which the input intensity belongs. Because of the scaling and shift in the saturation curves, the quantization bounds partition the input intensity differently, which results in a higher gray-scale resolution. In addition, an extended dynamic range is spanned.

In [50], a projections onto convex sets (POCS) algorithm for spatial-domain enhancement is presented. In that algorithm, the bounds on the constraint sets are fixed to a heuristically chosen value, and gray-scale effects are ignored. Here, we show that those bounds should actually be a function of the observed pixel intensity and should be chosen using the camera response function. Moreover, the illumination changes and the camera settings should also be considered in the reconstruction.

As a final note in this section, we point out that we did not include additive noise in our formulation. When the additive noise is comparable to the quantization error, the formulation needs to be modified. In the context of POCS reconstruction, one way is to modify the quantization bounds given in equation (78) according to the noise power: assuming the noise variance is σ², the lower and upper quantization bounds, Tl(·) and Tu(·), can be relaxed to Tl(·) − σ and Tu(·) + σ, respectively.

Figure 40: It is possible to increase the gray-scale extent and resolution when there are multiple observations of different range spans.

5.3.2 Projection Operations

For each pixel in the observations, we can define a constraint set and project an initial estimate onto these constraint sets iteratively to update the initial estimate. The result is a reconstructed composite with high gray-scale and spatial information content. The projection operator should update x(n1 , n2 ) in such a way that the constraint given in (78) is satisfied. We design the projection operator so that it projects the estimate x(n1 , n2 ) onto the bounds of the constraint sets. For instance, if zˆi (l1 , l2 ) is larger than the upper bound Tu (zi (l1 , l2 )), then the projection operator updates the estimate x(n1 , n2 ) such that the new zˆi (l1 , l2 ) (obtained from the updated x(n1 , n2 )) is equal to Tu (zi (l1 , l2 )).


With this logic, the projection operator onto a constraint set C[zi(l1, l2)] is

P_{C[z_i(l_1,l_2)]}[x(n_1, n_2)] = x(n_1, n_2) +
\begin{cases}
\dfrac{T_u(z_i(l_1,l_2)) - \hat{z}_i(l_1,l_2)}{\eta_i} \, \bar{h}_i(l_1,l_2;n_1,n_2), & \hat{z}_i(l_1,l_2) > T_u(z_i(l_1,l_2)) \\
0, & T_l(z_i(l_1,l_2)) \le \hat{z}_i(l_1,l_2) \le T_u(z_i(l_1,l_2)) \\
\dfrac{T_l(z_i(l_1,l_2)) - \hat{z}_i(l_1,l_2)}{\eta_i} \, \bar{h}_i(l_1,l_2;n_1,n_2), & \hat{z}_i(l_1,l_2) < T_l(z_i(l_1,l_2))
\end{cases}   (80)

where x(n1, n2) is an estimate of the original scene f(n1, n2), ẑi(l1, l2) is the intensity calculated from the estimate x(n1, n2) as given in equation (79), and h̄i(l1, l2; n1, n2) is the normalized blurring function:

\bar{h}_i(l_1,l_2;n_1,n_2) \equiv \frac{h_i(l_1,l_2;n_1,n_2)}{\sum_{n_1,n_2} |h_i(l_1,l_2;n_1,n_2)|^2}.   (81)

When ẑi(l1, l2) is outside the bounds, equation (80) finds the distance (residual) to the closest bound and updates the contributing pixels in the initial estimate by an amount proportional to their contributions. More on the implementation of the POCS technique can be found in [50]. (The best way to verify the validity of equation (80) is to assign a value to ẑi(l1, l2), such as ẑi(l1, l2) = Tu(zi(l1, l2)) + δ, where δ is a nonzero number, apply the projection operator, and see how it updates x(n1, n2).)
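The following minimal Python sketch implements this projection for one observed pixel under the model of equations (79)-(81); the array-based representation of hi and the function name are illustrative choices, not part of the original formulation.

```python
import numpy as np

def project_pixel(x, h, eta, mu, t_l, t_u):
    """POCS projection of Eq. (80) for one observed pixel.

    x        : current estimate of f(n1, n2), a float array (updated in place)
    h        : h_i(l1, l2; n1, n2) as an array of the same shape as x
               (zero outside the PSF support of this observed pixel)
    eta, mu  : gain and offset of observation i
    t_l, t_u : quantization bounds T_l(z_i(l1, l2)) and T_u(z_i(l1, l2))
    """
    z_hat = eta * np.sum(h * x) + mu        # forward model of Eq. (79)
    h_bar = h / np.sum(h ** 2)              # normalized PSF of Eq. (81)
    if z_hat > t_u:                         # above the upper bound:
        x += ((t_u - z_hat) / eta) * h_bar  # pull the predicted value down to T_u
    elif z_hat < t_l:                       # below the lower bound:
        x += ((t_l - z_hat) / eta) * h_bar  # push the predicted value up to T_l
    return x                                # within the bounds: left unchanged
```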

5.3.3 Complete Algorithm

Here we present the complete algorithm; the details of each step are given in Section 5.3.4, and a simplified sketch of the main loop follows the list.

1. Parameter estimation: Choose a reference frame and estimate the motion and illumination parameters between the reference frame and the other frames.

2. Initial estimate: Construct an initial estimate x(n1, n2) by registering the images in spatial position and range.

3. Constraint sets: Define a constraint set for each pixel in the observed images. (The determination of the saturation curve, from which the quantization bounds are found, is also explained in Section 5.3.4.)

4. Alternating projections:

   (a) Choose an image among the observations, and for each pixel (l1, l2) in the current image,

       i. Compute the pixel intensity ẑi(l1, l2) from the estimate x(n1, n2) by applying the estimated image acquisition model as in equation (79).

       ii. Update the estimate x(n1, n2) using the projection operation given in equation (80).

   (b) Stop if a stopping criterion is reached; otherwise, choose another image and go to Step 4(a).
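The sketch below walks through Step 4 in simplified form. It assumes the observations are already spatially registered and that the PSF reduces to an identity, so each observed pixel constrains a single pixel of the estimate; the dictionary layout and variable names are illustrative.

```python
import numpy as np

def pocs_fuse(observations, n_iter=7):
    """Alternating projections of Step 4 under two simplifying assumptions:
    the observations are already registered to the reference grid, and the
    PSF is an identity, so Eq. (79) reduces to z_hat = eta * x + mu per pixel.

    observations: list of dicts with keys
        'z'        : observed 8-bit image (H x W, integer codes)
        'eta', 'mu': gain and offset of that exposure
        't_l','t_u': length-256 lookup tables of quantization bounds
    """
    ref = observations[0]
    # crude initial estimate: invert the gain/offset of the reference exposure
    x = (ref['z'].astype(float) - ref['mu']) / ref['eta']

    for _ in range(n_iter):                 # fixed iteration count as stopping rule
        for obs in observations:            # Step 4(a): one observation at a time
            z_hat = obs['eta'] * x + obs['mu']      # Eq. (79) with identity PSF
            t_l = obs['t_l'][obs['z']]              # per-pixel lower bound
            t_u = obs['t_u'][obs['z']]              # per-pixel upper bound
            # Eq. (80) with identity PSF: move z_hat onto the violated bound
            x = np.where(z_hat > t_u, x + (t_u - z_hat) / obs['eta'], x)
            x = np.where(z_hat < t_l, x + (t_l - z_hat) / obs['eta'], x)
    return x                                # fused, extended-range estimate
```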

5.3.4 Implementation Details

There are various ways to implement the steps of the proposed algorithm. In particular, Mann [44], Robertson et al. [52], and Candocia [10, 11] propose different methods to determine the saturation curve, the motion and illumination parameters, and the initial estimate. In this section, we clarify which methods we chose in our implementation; these steps could be implemented with other methods as well.

A. Motion and Illumination Parameter Estimation

The first step of the algorithm is the determination of the spatial registration and illumination parameters. In our implementation we assume an affine motion model, which can easily be generalized to other parametric or non-parametric (dense) motion fields. In the affine model the motion vectors are represented by six parameters. Given the affine parameters [a1 a2 a3 a4 a5 a6], the motion vectors are

u(x, y) = a_1 x + a_2 y + a_3,
v(x, y) = a_4 x + a_5 y + a_6.   (82)

Incorporating these equations into the optic flow equation (with the gain η and offset µ factors included), we can relate two images I1 and I2 as follows:

\eta I_1(x, y) + \mu = I_2(x + a_1 x + a_2 y + a_3,\; y + a_4 x + a_5 y + a_6).   (83)

The parameters [a1 a2 a3 a4 a5 a6] and [η µ] can be determined jointly or separately. One method is to apply a Taylor series expansion to (83) and define a cost function from the resulting equation. By taking the partial derivatives of the cost function with respect to the unknown parameters and setting them to zero, we get a set of linear equations that can be solved easily. Defining Ix(x, y) and Iy(x, y) as the horizontal and vertical gradients of I2(x, y), one such cost function is

\Psi = \sum_{x,y} \Big[ I_x(x, y)(a_1 x + a_2 y + a_3) + I_y(x, y)(a_4 x + a_5 y + a_6) + I_2(x, y) - \eta I_1(x, y) - \mu \Big]^2.   (84)

An alternative to the joint estimation is a two-step procedure, which is what we followed in our implementation: we first determine the affine parameters, and then determine the illumination parameters using these affine parameters. To determine the affine parameters, we use the Harris corner detector [33] to select a set of points in the reference image. We then find the corresponding points in the second image using normalized cross correlation with quarter-pixel accuracy. Each correspondence provides two linear equations in the six unknowns, as given in (82). Using all (≥ 3) correspondences, a least-mean-square estimate of the six affine parameters is determined. Once the affine parameters are determined, the images are spatially registered, and the illumination parameters are then estimated by taking the derivative of the cost function defined in (84) with respect to η and µ and solving for the least-mean-square estimates. Note that equation (83) is valid only in the linear region of the saturation curve; since small and large pixel intensities might have been saturated, only the midrange pixels (in the range [10, 230]) are used in the estimation of the illumination parameters η and µ.
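As a rough illustration of this two-step procedure, the sketch below estimates the affine parameters from a set of point correspondences (assumed to be available, e.g., from the corner detection and normalized cross correlation described above) and then the gain/offset pair from the midrange pixels of the registered images. The function names and the exact masking rule are assumptions made for the example.

```python
import numpy as np

def estimate_affine(pts_ref, pts_tgt):
    """Least-squares affine parameters [a1, ..., a6] from point correspondences
    (N x 2 arrays). Each correspondence supplies the two linear equations of
    Eq. (82), with the motion vector taken as the reference-to-target displacement."""
    x = pts_ref[:, 0].astype(float)
    y = pts_ref[:, 1].astype(float)
    u = pts_tgt[:, 0] - x                     # u(x, y) = a1*x + a2*y + a3
    v = pts_tgt[:, 1] - y                     # v(x, y) = a4*x + a5*y + a6
    A = np.column_stack([x, y, np.ones_like(x)])
    a123 = np.linalg.lstsq(A, u, rcond=None)[0]
    a456 = np.linalg.lstsq(A, v, rcond=None)[0]
    return np.concatenate([a123, a456])

def estimate_gain_offset(i_ref, i_reg, lo=10, hi=230):
    """Least-squares gain/offset (eta, mu) of Eq. (83) between the reference
    image and the spatially registered second image, using only midrange
    pixels so that saturated intensities do not bias the fit."""
    mask = (i_ref >= lo) & (i_ref <= hi) & (i_reg >= lo) & (i_reg <= hi)
    A = np.column_stack([i_ref[mask].astype(float), np.ones(int(mask.sum()))])
    eta, mu = np.linalg.lstsq(A, i_reg[mask].astype(float), rcond=None)[0]
    return eta, mu
```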


B. Initial Estimate for f(n1, n2)

In the POCS algorithm, we start with an initial estimate f0(n1, n2) and project it onto the constraint sets iteratively. The image f(n1, n2) has higher spatial resolution and dynamic range than any one of the observations. The initial estimate is obtained by bilinearly interpolating one of the observations and then extending the dynamic range by registering the other images in range. To perform range registration, we used the method proposed in [52]: the images are registered first spatially (by warping onto the reference image using the affine parameters) and then in range by taking a weighted sum of the pixels. The weight function is chosen to be a Gaussian-like function whose mean is set to the middle of the dynamic range, reflecting the relative reliability of the midrange intensities. For an image of dynamic range I ∈ [0, 255], the weight function can be chosen as [52]

w(I) = \exp\left( -W \, \frac{(I - 127.5)^2}{(127.5)^2} \right),   (85)

where W was set to 5 in our experiments. For overlapping pixels, the new pixel values are computed as the weighted average (w_1 I_1 + w_2 I_2)/(w_1 + w_2).
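A minimal sketch of this range registration is given below. The weight of equation (85) is applied to each exposure, and the exposures are brought to a common range by inverting their estimated gain and offset before averaging; that inversion step is an assumption about how the registration in range is carried out, and the function names are illustrative.

```python
import numpy as np

def reliability_weight(i, w_param=5.0):
    """Gaussian-like weight of Eq. (85): close to 1 at mid-gray, small for
    dark or saturated intensities."""
    return np.exp(-w_param * (i - 127.5) ** 2 / 127.5 ** 2)

def range_register(images, etas, mus):
    """Weighted combination of spatially registered exposures, mirroring
    (w1*I1 + w2*I2)/(w1 + w2). Each image is first mapped back through its
    estimated gain/offset (one possible way of bringing the exposures to a
    common range) before the reliability-weighted average is taken."""
    num = np.zeros(images[0].shape, dtype=float)
    den = np.zeros(images[0].shape, dtype=float)
    for img, eta, mu in zip(images, etas, mus):
        img = img.astype(float)
        w = reliability_weight(img)                  # weight from the observed code
        num += w * (img - mu) / eta                  # undo gain and offset
        den += w
    return num / np.maximum(den, 1e-8)               # weighted average per pixel
```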

C. Saturation Curve Estimation

There are various ways to estimate the camera response function. In [10, 11], the response function is linearized, and its parameters are estimated jointly with the motion parameters using least-squares minimization. In [52], Gauss-Seidel relaxation is used to determine the mapping for all pixel intensities. In [44], the response function is determined from range-range plots. The camera response function is image independent; once it is determined, it is used for the rest of the experiments.

We use a simple method to determine the saturation function. We take multiple photographs of the same scene, each with a different exposure time. We then register these images spatially and determine the offset and gain terms between them. (The spatial registration and illumination parameter estimation procedures are explained earlier in this section.) The images are then registered in range to form a composite image of higher dynamic range; for range registration, a weighted sum of the pixel intensities is taken, where the weights are determined according to the reliability of the pixel value. Since any of the observations is more limited in dynamic range than the composite image, the saturation curve can be determined by comparing the observations with the composite image. This is illustrated in Figure 41: the weighted sum of the three observations z1, z2, and z3 is taken to form the composite image zc, and then the pixel intensities in zc and z2 are compared. (Because the midranges of the observations overlap in zc, the information is more reliable in the midranges; there would still be saturation at the low and high ends of zc.) For each pixel intensity in zc, the mean value of the corresponding pixels in z2 is found. This gives an estimate of the saturation curve, since the saturated intensities of z2 correspond to midrange intensities in zc. Once the saturation curve is determined, the quantization bounds are found from the curve. Using the images given in Figures 43(a), (b), and (c), the saturation curve given in Figure 42 is found; if more images were used, the estimate would be more accurate. The simple procedure we followed may not be the best way to estimate the saturation curve, but the idea of finding the bounds from the saturation curve is valid no matter what estimation method is used.

Figure 41: In order to determine the saturation function, the images are first registered in range to form a high-dynamic-range composite image. The saturation function is then determined by comparing the midtones of the composite with the observations.
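The comparison of the composite with one observation can be sketched as a simple binning operation, as below. The number of bins, the handling of empty bins, and the function name are assumptions made for the example; the procedure in the text only requires averaging the observation pixels that fall at each composite intensity.

```python
import numpy as np

def estimate_saturation_curve(z_composite, z_obs, n_bins=256):
    """Non-parametric saturation-curve estimate: bin the pixels of the
    high-dynamic-range composite z_c and record the mean intensity of the
    corresponding pixels in one observation (z_2 in Figure 41) per bin."""
    lo, hi = float(z_composite.min()), float(z_composite.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(z_composite.ravel(), edges) - 1, 0, n_bins - 1)
    obs = z_obs.ravel().astype(float)
    curve_in = 0.5 * (edges[:-1] + edges[1:])    # composite (input) intensity
    curve_out = np.full(n_bins, np.nan)          # mean observed (output) intensity
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            curve_out[b] = obs[sel].mean()
    return curve_in, curve_out
```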

5.4 Experimental Results

We provide the results of an experiment to demonstrate the proposed method. We captured video sequences using a Sony DCR-TRV20 digital camcorder, manually increasing the exposure time while capturing each sequence. We used three images from each sequence in the reconstruction; these images are given in Figures 43(a), 43(b), 43(c) and 44(a), 44(b), 44(c). The images in Figures 43(a) and 44(a) were captured with relatively short exposure times. In the image in Figure 43(a), the buildings outside the window can be seen clearly; in the image in Figure 43(c), which was captured with a longer exposure time, the inside can be seen clearly, but the objects outside the window cannot be seen because of saturation. In Figure 44(a) the low-contrast details in the lighter regions are better preserved, while in Figure 44(c) the tonal fidelity is higher in the darker regions. We applied the proposed algorithm to these image sets. The corners detected in the reference images are depicted in Figures 43(d) and 44(d). The correspondence points are found using normalized cross correlation with quarter-pixel accuracy, a block size of eight pixels, and a search range of 10 pixels. The PSF is taken as a 7 × 7 Gaussian window with a standard deviation of one. Once the initial estimates are obtained, they are projected onto the constraint sets iteratively; the number of iterations is set to seven for both sequences. The reconstructed images for the first and second image sequences are given in Figures 45 and 46, respectively. These images have higher dynamic ranges than any of the observations does by itself (the pixel intensity ranges are shown as side bars), and the low-contrast regions also become clearer in the reconstructed images. After these high-dynamic-range images are obtained, a portion of their range can be chosen for display purposes. Figures 47 and 48 show zoomed regions from the two sequences, scaled in intensity to the range [0-255]. Close examination of these figures shows that the spatial resolution has also been improved during the reconstruction.
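For reference, one common way to construct the blur kernel assumed in these experiments is sketched below; the truncation and normalization details are not specified in the text, so they are assumptions of this example.

```python
import numpy as np

def gaussian_psf(size=7, sigma=1.0):
    """Separable Gaussian point-spread function (7 x 7 with sigma = 1 in the
    experiments), normalized to sum to one."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(r ** 2) / (2.0 * sigma ** 2))
    psf = np.outer(g, g)
    return psf / psf.sum()
```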

5.5 Conclusions

In this chapter, we presented an image fusion algorithm that improves spatial resolution and dynamic range. The proposed algorithm can be considered a generalization of super-resolution reconstruction, in particular of the algorithms based on the projections onto convex sets (POCS) technique. Although the POCS solution has the disadvantage of non-uniqueness, it is a natural fit to the image fusion problem, since quantization bound information can easily be used to define constraint sets. In particular, we showed how the camera response function affects the constraint sets; the proposed idea is valid regardless of the camera response function model (parametric or non-parametric) or the method used to estimate it. Although we have not demonstrated it here, it is also possible to extend the algorithm to increase the spatial extent. The algorithm can likewise improve the gray-scale resolution at the same time; this is straightforward to verify with synthetic data, but difficult to show with real data.

Figure 42: Non-parametric estimation of the camera saturation function. Three images of different exposure times are used to estimate the curve. If more images were used, the estimation would be more accurate and the outliers could be eliminated. Parametric models could also be used with the algorithm.


Figure 43: Images from the first sequence. (a) First image. (b) Second image. (c) Third image. (d) Corners detected in the third image.


Figure 44: Images from the second sequence. (a) First image. (b) Second image. (c) Third image. (d) Corners detected in the third image.


Figure 45: Reconstructed image from the first sequence. Color map on the right shows the extended dynamic range.


Figure 46: Reconstructed image from the second sequence. Color map on the right shows the extended dynamic range.

Figure 47: Zoomed regions from the first sequence. (a) First image (interpolated by pixel replication). (b) Second image (interpolated by pixel replication). (c) Third image (interpolated by pixel replication). (d) Reconstructed image scaled to intensity range [0-255].


Figure 48: Zoomed regions from the second sequence. (a) First image (interpolated by pixel replication). (b) Second image (interpolated by pixel replication). (c) Third image (interpolated by pixel replication). (d) Reconstructed image scaled to intensity range [0-255].


CHAPTER VI

CONCLUSIONS AND FUTURE WORK

The most important questions in the area of multi-frame information fusion are information theoretic: What is the total amount of information distributed over the temporal bandwidth? What is the best way of converting information distributed in time into another form of information, such as spatial resolution? What are the limiting factors on that conversion? Although these are the fundamental questions that ultimately need to be answered, there are also more practical questions that deserve investigation; the development of fast implementation techniques and of more robust algorithms are two such research topics. In this thesis, we addressed several aspects of the resolution enhancement problem, and several follow-up research directions remain in each area.

We first looked into the problem where resolution enhancement is needed because of the patterned sampling of the color channels. We showed that the correlation among the color channels is higher in the high-frequency components, that the red and blue channels are more likely to be aliased, and that the high-frequency components of the green channel can be used to remove the aliasing. Considering the signal-to-noise ratio concerns and the size/cost problems of multi-chip cameras, the demosaicking problem is likely to stay an active area until new sensor technologies are developed. As an important future research problem, we need a mechanism to measure the correlation among different frequency components and to modify the reconstruction accordingly. One approach is the application of the local correlation estimator (proposed in Section 2) to the different frequency components of the color channels, where the frequency components are obtained using a wavelet decomposition. The extent to which the inter-channel correlation is exploited can be controlled by a spatially and spectrally varying threshold function; such an adaptive mechanism is especially critical when the correlation varies considerably at different locations of the image. Although this may bring in additional computational cost, the algorithms can easily be embedded into dedicated hardware. Another issue is the noise inherent in the imaging process. As sensor sizes are made smaller, sensor noise will become a bigger issue. The next step in demosaicking should be the elimination of this noise, and statistical methods are good candidates for this purpose; a Bayesian estimator that exploits the frequency-dependent correlation can be developed readily.

After the color filter array interpolation problem, we looked into the multi-frame resolution enhancement problem for compressed video. We appended the DCT-based compression process to a traditional imaging model and developed a Bayesian super-resolution algorithm. Unlike previous methods, the algorithm uses the quantized DCT coefficients directly in the reconstruction, which helps exploit the quantization step size information in determining the critical parameters of the reconstruction. The algorithm also enables distinct treatment of each DCT coefficient, which can be exploited for better performance: for instance, the information coming from high-frequency DCT coefficients can be discarded, since those coefficients are quantized severely and the information obtained from them is likely to be noise. Using the DCT coefficients directly also suggests a new idea: reconstructing the original image in the transform domain. This can lead to more efficient and robust algorithms, and it deserves further investigation to determine the right model, transform, and so on. The stochastic entities in the problem were assumed to be Gaussian random processes; the investigation of different statistical models is another potential research area. Even if analytical solutions are not available, improved reconstruction may be achievable.

After the DCT-domain super-resolution algorithm, we developed a super-resolution algorithm for face images. Face images can be represented in a lower dimension (the so-called face space) through the application of the Karhunen-Loeve Transform (KLT). We embedded the super-resolution algorithm into the face recognition system so that super-resolution is not performed in the pixel domain, but instead in a reduced-dimensional face space. The resulting algorithm is a face-space image fusion algorithm with reduced computational complexity and higher robustness to noise. One exciting extension of the face-space super-resolution is the use of 3D models in the reconstruction. This requires accurate registration of the facial features to the model, which is an active research area in computer vision. Once the 3D extension is achieved, it becomes possible to develop face recognition systems that are robust to pose, expression, and lighting changes. In this thesis, we examined the case of two-dimensional face images only; however, the idea can easily be extended to other pattern recognition problems. One such application is the recognition of car license plates from video, where the text on the plates can be reconstructed in a “text space” with letters and numerals used as the training set.

Finally, we questioned the sufficiency of the imaging model used in super-resolution algorithms. All of the work done in the area of super-resolution assumes that there are no changes in the gray-scale domain. However, the observations may have gray-scale diversity because of illumination changes in the scene or imaging device adjustments such as exposure time, gain, or white balance. We proposed a modified imaging model to handle such effects and developed a POCS-based reconstruction algorithm. It can be considered a generalization of the super-resolution algorithms in the sense that both spatial and gray-scale enhancement are achieved. One extension of the algorithm is to use space-varying illumination models in the reconstruction, so that local illumination changes in the scene can be handled.

All the algorithms and ideas presented in this thesis are computationally expensive. Although the computational power of computers is increasing exponentially, implementation issues such as algorithm parallelization and hardware implementation are still worth investigating. The development of robust algorithms that include motion-failure modeling, blur identification, and overall video-quality assessment is also a future research topic.


APPENDIX A

CONVEXITY OF THE CONSTRAINT SETS

We outline the convexity proofs of the observation and detail constraint sets that are given in equations (7) and (13), respectively.

1. Observation constraint set Co: Let S1(n1, n2) and S2(n1, n2) be any two points in the set Co. That is,

S_1(n_1, n_2) = O(n_1, n_2)   \forall (n_1, n_2) \in \Lambda_S,   (86)

and

S_2(n_1, n_2) = O(n_1, n_2)   \forall (n_1, n_2) \in \Lambda_S.   (87)

For convexity, we need to show that all points of the line segment connecting S1(n1, n2) and S2(n1, n2) remain in the set Co. Let S3(n1, n2) ≡ αS1(n1, n2) + (1 − α)S2(n1, n2), with 0 ≤ α ≤ 1, be this line segment. Using equations (86) and (87), we get

S_3(n_1, n_2) = \alpha S_1(n_1, n_2) + (1 - \alpha) S_2(n_1, n_2)
             = \alpha O(n_1, n_2) + (1 - \alpha) O(n_1, n_2)
             = O(n_1, n_2)   \forall (n_1, n_2) \in \Lambda_S.   (88)

That is, S3(n1, n2) ∈ Co. □

2. Detail constraint set Cd: Let S1(n1, n2) and S2(n1, n2) be any two points in the set Cd. Referring to equations (10), (11), and (12) in the manuscript, we write

|h_x(n_1) * [h_y(n_2) * S_1(n_1, n_2)] - (W_k G)(n_1, n_2)| \le T(n_1, n_2),   (89)

and

|h_x(n_1) * [h_y(n_2) * S_2(n_1, n_2)] - (W_k G)(n_1, n_2)| \le T(n_1, n_2),   (90)

where the subscripts x and y are chosen according to the value of k as in equations (10) to (12). Again, for convexity we need to show that all points of the line segment connecting S1(n1, n2) and S2(n1, n2) remain in the set Cd. Let S3(n1, n2) ≡ αS1(n1, n2) + (1 − α)S2(n1, n2) for 0 ≤ α ≤ 1. We will now show that S3(n1, n2) is in Cd, omitting the indices n1 and n2 to simplify the equations:

|h_x * [h_y * S_3] - (W_k G)| = |h_x * [h_y * (\alpha S_1 + (1 - \alpha) S_2)] - (W_k G)|
                              = |\alpha h_x * [h_y * S_1] + (1 - \alpha) h_x * [h_y * S_2] - (W_k G)|.   (91)

Add and subtract α(Wk G) inside equation (91) and regroup the terms to get

|h_x * [h_y * S_3] - (W_k G)| = |\alpha h_x * [h_y * S_1] + (1 - \alpha) h_x * [h_y * S_2] - (W_k G) + \alpha (W_k G) - \alpha (W_k G)|
                              = |\alpha \{h_x * [h_y * S_1] - (W_k G)\} + (1 - \alpha) \{h_x * [h_y * S_2] - (W_k G)\}|.   (92)

Use the triangle inequality and the inequalities given in equations (89) and (90) to get

|h_x * [h_y * S_3] - (W_k G)| \le \alpha |h_x * [h_y * S_1] - (W_k G)| + (1 - \alpha) |h_x * [h_y * S_2] - (W_k G)|
                              \le \alpha T + (1 - \alpha) T = T.   (93)

Therefore, S3(n1, n2) ∈ Cd. □

REFERENCES

[1] Adams, J. E., “Interactions between color plane interpolation and other image processing functions in electronic photography,” Proc. SPIE Cameras and Systems for Electronic Photography and Scientific Imaging, vol. 2416, pp. 144–151, February 1995.
[2] Adams, J. E., “Design of color filter array interpolation algorithms for digital cameras, part 2,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 488–492, 1998.
[3] Adams, J. E. and Hamilton, J. F., “Adaptive color plane interpolation in single color electronic camera,” U.S. Patent 5,506,619, April 1996.
[4] Adams, J. E. and Hamilton, J. F., “Design of practical color filter array interpolation algorithms for digital cameras,” Proc. SPIE Real Time Imaging II, vol. 3028, pp. 117–125, February 1997.
[5] Baker, S. and Kanade, T., “Hallucinating faces,” in Fourth International Conf. Automatic Face and Gesture Recognition, March 2000.
[6] Berthod, M., Shekarforoush, H., Werman, M., and Zerubia, J., “Reconstruction of high resolution 3D visual information using subpixel camera displacements,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 654–657, 1994.
[7] Borman, S. and Stevenson, R. L., “Simultaneous multi-frame MAP super-resolution video enhancement using spatio-temporal priors,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 469–473, 1999.
[8] Bose, N. K., Kim, H. C., and Valenzuela, H. M., “Recursive total least squares algorithm for image reconstruction from noisy, undersampled frames,” Multidimensional Systems and Signal Processing, vol. 4, pp. 253–268, 1993.
[9] Boult, T. E., Chiang, M.-C., and Micheals, R. J., Super-Resolution via Image Warping. Boston: Kluwer Academic Publishers, 2001.
[10] Candocia, F. M., “A least squares approach for the joint domain and range registration of images,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 4, pp. 3237–3240, May 2002.
[11] Candocia, F. M., “Synthesizing a panoramic scene with a common exposure via the simultaneous registration of images,” in FCRAR 2002, May 2002.
[12] Capel, D. and Zisserman, A., “Automated mosaicing with super-resolution zoom,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 885–891, June 1998.
[13] Capel, D. and Zisserman, A., “Super-resolution enhancement of text image sequences,” in Proc. IEEE Int. Conf. Pattern Recognition, vol. 1, pp. 600–605, 2000.


[14] Capel, D. and Zisserman, A., “Super-resolution from multiple views using learnt image models,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, December 2001.
[15] Chang, E., Cheung, S., and Pan, D. Y., “Color filter array recovery using a threshold-based variable number of gradients,” SPIE, vol. 3650, pp. 36–43, 1999.
[16] Cheeseman, P., Kanefsky, B., and Hanson, R., “Super-resolved surface reconstruction from multiple images,” Technical Report, NASA, January 1993.
[17] Chen, D. and Schultz, R. R., “Extraction of high-resolution video stills from MPEG image sequences,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 465–469, 1998.
[18] Cok, D. R., “Color imaging array,” U.S. Patent 3,971,065, July 1976.
[19] Cok, D. R., “Signal processing method and apparatus for sampled image signals,” U.S. Patent 4,630,307, February 1986.
[20] Cok, D. R., “Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal,” U.S. Patent 4,642,678, February 1987.
[21] Combettes, P. L., “The foundations of set theoretic estimation,” Proc. of the IEEE, vol. 81, pp. 182–208, February 1993.
[22] Debevec, P. E., Taylor, C. J., and Malik, J., “Modeling and rendering architecture from photographs,” in SIGGRAPH, pp. 11–20, 1996.
[23] Elad, M. and Feuer, A., “Restoration of a single superresolution image from several blurred, noisy and undersampled measured images,” IEEE Trans. Image Processing, vol. 6, pp. 1646–1658, December 1997.
[24] Elad, M. and Feuer, A., “Super-resolution reconstruction of image sequences,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 21, pp. 817–834, September 1999.
[25] Elad, M. and Feuer, A., “Superresolution restoration of an image sequence: adaptive filter approach,” IEEE Trans. Image Processing, vol. 8, pp. 387–395, March 1999.
[26] Georghiades, A. S., Belhumeur, P. N., and Kriegman, D. J., “From few to many: Generative models for recognition under variable pose and illumination,” in Proc. IEEE Conf. Face and Gesture Recognition, pp. 277–284, 2000.
[27] Ghassemian, H., “Multi-sensor image fusion using multirate filter banks,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 846–849, 2001.
[28] Glotzbach, J. W., Schafer, R. W., and Illgner, K., “A method of color filter array interpolation with alias cancellation properties,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 141–144, 2001.
[29] Gomez, R. B., Jazaeri, A., and Kafatos, M., “Wavelet-based hyperspectral and multispectral image fusion,” in 2001 SPIE OE/Aerospace Sensing, Geo-Spatial Image and Data Exploitation II, April 2001.

[30] Hallinan, P., “A low dimensional representation of human faces for arbitrary lighting conditions,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 995–999, 1994.
[31] Hamilton, J. F. and Adams, J. E., “Adaptive color plane interpolation in single sensor color electronic camera,” U.S. Patent 5,629,734, May 1997.
[32] Hardie, R. C., Barnard, K. J., and Armstrong, E. E., “Joint MAP registration and high-resolution image estimation using a sequence of undersampled images,” IEEE Trans. Image Processing, vol. 6, pp. 1621–1633, December 1997.
[33] Harris, C. J. and Stephens, M., “A combined corner and edge detector,” in Proceedings of the 4th Alvey Vision Conference, pp. 147–151, 1988.
[34] Hibbard, R. H., “Apparatus and method for adaptively interpolating a full color image utilizing luminance gradients,” U.S. Patent 5,382,976, January 1995.
[35] Irani, M. and Peleg, S., “Improving resolution by image registration,” CVGIP: Graphical Models and Image Processing, vol. 53, pp. 231–239, May 1991.
[36] Irani, M. and Peleg, S., “Motion analysis for image enhancement: Resolution, occlusion, and transparency,” J. of Visual Communications and Image Representation, vol. 4, pp. 324–335, December 1993.
[37] Jebara, T., Azarbayejani, A., and Pentland, A., “3D structure from 2D motion,” IEEE Signal Processing Magazine, vol. 16, pp. 66–84, May 1999.
[38] Kim, S. P., Bose, N. K., and Valenzuela, H. M., “Recursive reconstruction of high resolution image from noisy undersampled multiframes,” IEEE Trans. Acoust. Speech Sign. Proc., vol. 38, pp. 1013–1027, June 1990.
[39] Kim, S. P. and Su, W., “Recursive high-resolution reconstruction of blurred multiframe images,” IEEE Trans. Image Processing, vol. 2, pp. 534–539, October 1993.
[40] Kimmel, R., “Demosaicing: image reconstruction from CCD samples,” IEEE Trans. Image Processing, vol. 8, pp. 1221–1228, 1999.
[41] Komatsu, T., Aizawa, K., and Saito, T., “Very high resolution imaging scheme with multiple different-aperture cameras,” Signal Processing: Image Communications, vol. 4, pp. 324–335, December 1993.
[42] Laroche, C. A. and Prescott, M. A., “Apparatus and method for adaptively interpolating a full color image utilizing chrominance gradients,” U.S. Patent 5,373,322, December 1994.
[43] Liu, C., Shum, H.-Y., and Zhang, C.-S., “A two-step approach to hallucinating faces: Global parametric model and local nonparametric model,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, December 2001.
[44] Mann, S., “Comparametric equations with practical applications in quantigraphic image processing,” IEEE Trans. Image Processing, vol. 9, pp. 1389–1406, August 2000.
[45] Mann, S. and Picard, R. W., “Virtual bellows: constructing high quality stills from video,” in Proc. IEEE Int. Conf. Image Processing, pp. 13–16, November 1994.

[46] Martinez, A. M. and Benavente, R., “The AR face database,” CVC Technical Report No. 24, July 1998.
[47] Mukherjee, J., Parthasarathi, R., and Goyal, S., “Markov random field processing for color demosaicing,” Pattern Recognition Letters, vol. 22, pp. 339–351, 2001.
[48] Patti, A. J. and Altunbasak, Y., “Super-resolution image estimation for transform coded video with application to MPEG,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 179–183, 1999.
[49] Patti, A. J. and Altunbasak, Y., “Artifact reduction for set theoretic super resolution image reconstruction with edge adaptive constraints and higher-order interpolants,” IEEE Trans. Image Processing, vol. 10, pp. 179–186, January 2001.
[50] Patti, A. J., Sezan, M. I., and Tekalp, A. M., “Superresolution video reconstruction with arbitrary sampling lattices and nonzero aperture time,” IEEE Trans. Image Processing, vol. 6, pp. 1064–1076, August 1997.
[51] Pedersini, F., Sarti, A., and Tubaro, S., “Multi-camera systems,” IEEE Signal Processing Magazine, vol. 16, pp. 55–65, May 1999.
[52] Robertson, M. A., Borman, S., and Stevenson, R. L., “Dynamic range improvement through multiple exposures,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 159–163, 1999.
[53] Robertson, M. A. and Stevenson, R. L., “DCT quantization noise in compressed images,” in Proc. IEEE Int. Conf. Image Processing, vol. 1, pp. 185–188, 2001.
[54] Robertson, M. A. and Stevenson, R. L., “DCT quantization noise in compressed images,” in SPIE Visual Communications and Image Processing, vol. 4310, pp. 21–29, 2001.
[55] Robertson, M. A. and Stevenson, R. L., “Temporal resolution enhancement in compressed video sequences,” in EURASIP Journal on Applied Signal Processing: Special Issue on Nonlinear Signal Processing, December 2001.
[56] Schultz, R. R., “Extraction of high-resolution video stills from MPEG image sequences,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 465–469, October 1998.
[57] Schultz, R. R. and Stevenson, R. L., “Extraction of high-resolution frames from video sequences,” IEEE Trans. Image Processing, vol. 5, pp. 996–1011, June 1996.
[58] Segall, C. A., Molina, R., Katsaggelos, A. K., and Mateos, J., “Reconstruction of high-resolution image frames from a sequence of low-resolution and compressed observations,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 1701–1704, 2002.
[59] Shechtman, E., Caspi, Y., and Irani, M., “Increasing space-time resolution in video,” in European Conference on Computer Vision (ECCV), May 2002.
[60] Shekarforoush, H. and Chellappa, R., “Data-driven multichannel superresolution with application to video sequences,” J. of the Optical Society of America, vol. 16, pp. 481–492, March 1999.

[61] Sim, T., Baker, S., and Bsat, M., “The CMU Pose, Illumination, and Expression (PIE) database of human faces,” Tech. Rep. CMU-RI-TR-01-02, The Robotics Institute, Carnegie Mellon University, 2001.
[62] Sirovich, L. and Kirby, M., “Low dimensional procedure for the characterization of human faces,” J. of the Optical Society of America, vol. 4, no. 3, pp. 519–524, 1987.
[63] Srinivas, C. and Srinath, M. D., “A stochastic model-based approach for simultaneous restoration of multiple misregistered images,” SPIE, vol. 1360, pp. 1416–1427, 1990.
[64] Stark, H. and Oskoui, P., “High-resolution image recovery from image-plane arrays, using convex projections,” J. of the Optical Society of America, vol. 6, no. 11, pp. 1715–1726, 1989.
[65] Stark, H. and Woods, J. W., Probability, Random Processes, and Estimation Theory for Engineers. Englewood Cliffs: Prentice-Hall Inc., 1986.
[66] Szeliski, R., “Video mosaics for virtual environments,” in Computer Graphics Applications, pp. 22–30, March 1996.
[67] Taubman, D., “Generalized Wiener reconstruction of images from colour sensor data using a scale invariant prior,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 801–804, 2000.
[68] Tekalp, A. M., Digital Video Processing. Prentice Hall, 1995.
[69] Tekalp, A. M., Ozkan, M. K., and Sezan, M. I., “High-resolution image reconstruction from lower-resolution image sequences and space-varying image restoration,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, pp. 169–172, March 1992.
[70] Tom, B. C. and Katsaggelos, A. K., “Reconstruction of a high resolution image by simultaneous registration, restoration, and interpolation of low-resolution images,” in Proc. IEEE Int. Conf. Image Processing, vol. 2, pp. 539–542, October 1995.
[71] Tom, B. C. and Katsaggelos, A. K., “Resolution enhancement of monochrome and color video using motion compensation,” IEEE Trans. Image Processing, vol. 10, pp. 278–287, February 2001.
[72] Tom, B. C., Katsaggelos, A. K., and Galatsanos, N. P., “Reconstruction of a high resolution image from registration and restoration of low-resolution images,” in Proc. IEEE Int. Conf. Image Processing, vol. 3, pp. 553–557, November 1994.
[73] Tom, B. C., Lay, K. T., and Katsaggelos, A. K., “Multi-channel image identification and restoration using the expectation-maximization algorithm,” Optical Engineering, Special Issue on Visual Communications and Image Processing, vol. 35, pp. 241–254, January 1996.
[74] Trussell, H. J. and Hartwig, R. E., “Mathematics for demosaicking,” IEEE Trans. Image Processing, vol. 3, pp. 485–492, April 2002.


[75] Tsai, R. Y. and Huang, T. S., Multiframe Image Restoration and Registration, in: Advances in Computer Vision and Image Processing, ed. T. S. Huang. Greenwich, CT: JAI Press, 1984.
[76] Turk, M. A. and Pentland, A. P., “Face recognition using eigenfaces,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, pp. 586–591, 1991.
[77] Ur, H. and Gross, D., “Improved resolution from subpixel shifted pictures,” CVGIP: Graphical Models and Image Processing, vol. 54, pp. 181–186, March 1992.
[78] Weldy, J. A., “Optimized design for a single-sensor color electronic camera system,” SPIE, vol. 1071, pp. 300–307, 1988.
[79] Wu, X., Choi, W. K., and Bao, P., “Color restoration from digital camera data by pattern matching,” SPIE, vol. 3018, pp. 12–17, 1997.


VITA

Bahadir K. Gunturk was born in Adana, Turkey, in 1976. He received his B.S. degree in Electrical Engineering from Bilkent University, Ankara, Turkey, in 1999. In Fall 1999, he began full-time graduate studies as a Ph.D. candidate at Georgia Institute of Technology, where he received his M.S. degree in Electrical Engineering in 2001. Since Summer 2000, he has been working in the area of image processing and computer vision under the guidance of Prof. Yucel Altunbasak. He spent the summer of 2001 as an intern at the Imaging Science Research Laboratory at Eastman Kodak Company, Rochester, New York. He received the Outstanding Research Award from the Center for Signal and Image Processing, Georgia Institute of Technology, in 2001.


Multi-Frame Information Fusion For Image And Video Enhancement

by Bahadir K. Gunturk

116 Pages

Directed by Professor Yucel Altunbasak

Abstract—The need to enhance the resolution of a still image or of a video sequence arises frequently in a wide variety of applications. In this thesis, we address several aspects of the resolution-enhancement problem. We first look into the color filter array (CFA) interpolation problem, which arises because of the patterned sampling of color channels. We demonstrate that the correlation (among different color channels) differs at different frequency components, and propose an iterative CFA interpolation algorithm that exploits the frequency-dependent inter-channel correlation. The algorithm defines constraint sets based on the observed data and the inter-channel correlation, and employs the projections onto convex sets (POCS) technique to estimate the missing samples.

To increase the resolution further to subpixel levels, we need to use multiple frames. Such a multi-frame reconstruction process is usually called super-resolution (SR) reconstruction. We develop an SR algorithm for compressed video, where the quantization noise as well as other stochastic information is incorporated in a Bayesian framework. We also present a model-based SR algorithm for face images. Face images can be represented in a lower dimension through the application of the Karhunen-Loeve Transform (KLT). By embedding the KLT into an observation model, we arrive at an SR algorithm in which the reconstruction is performed in a reduced-dimensional face space instead of the pixel domain; the resulting algorithm has lower computational complexity and is more robust to noise.

Finally, we generalize the SR algorithms by extending the image acquisition model to include illumination changes and internal camera adjustments, such as exposure time, gain, or white balance. Using this extended model, we develop a POCS-based reconstruction algorithm that improves both spatial and gray-scale resolution.