SELF-SUPERVISED LEARNING IN COOPERATIVE STEREO VISION CORRESPONDENCE

Benoît DECOUX
Laboratoire Capteurs, Instrumentation et Analyse, Institut National des Sciences Appliquées de Rouen, B. P. 8, Place Emile Blondel, 76131 Mont-Saint-Aignan Cedex, France
[email protected]

Abstract: This paper presents a neural network model of stereoscopic vision, in which a process of fusion seeks the correspondence between points of stereo inputs. Stereo fusion is obtained after a self-supervised learning phase, so called because the learning rule is a supervised-learning rule in which the supervisory information is autonomously extracted from the visual inputs by the model. This supervisory information arises from a global property of the potential matches between the points. The proposed neural network, which is of the cooperative type, and the learning procedure are tested with random-dot stereograms (RDS) and with feature points extracted from real-world images. Those feature points are extracted by a technique based on the use of sigma-pi units. The matching performance and the generalization ability of the model are quantified. The relationship between what has been learned by the network and the constraints used in previous cooperative models of stereo vision is discussed.

Keywords: stereoscopic vision, self-supervised learning, cooperative model, stereo correspondence problem, unsupervised learning, self-organization

1. Introduction

Computational models of stereoscopic vision are aimed at finding the depth of a scene, and then its three-dimensional (3D) structure, from two images of this scene taken from two nearby viewpoints. The depth information is conveyed in part by the disparity existing between the two stereo images, that is, the difference in the projection of surfaces and objects onto those images. But before disparity can be interpreted, the correspondence must be found between feature points or small regions of the two stereo images. Many matches are possible, and among them some are true and the others are false. Correspondence is ambiguous, and establishing it is a difficult problem. Knowledge of the structure of the visual world and of its projection onto images is often needed and must be transformed into algorithmic form, to constrain the search for a solution [1]. Perhaps the most important constraints for disambiguating stereo matching are (1) uniqueness, which exploits the fact that a given feature point taken from one of the two stereo images corresponds to no more than one point of the other image (in most cases), and (2) smoothness (or continuity) of disparity, which tends to favour fronto-parallel surfaces, as in stereo pairs of images disparity varies smoothly almost everywhere, except at a few surface boundaries [2]. Cross-spatial-scale presence and figural continuity of some features [3], and a disparity-gradient limit allowing matching to be extended to surfaces projected with varying disparities within this limit [4], are three other constraints which have been shown to have a disambiguating power.

Various neural network (or connectionist) approaches to the stereo correspondence problem have been shown to be powerful. In addition, some of those models provide insights into possible mechanisms underlying natural stereo vision [5]. One class of models is that of the cooperative models, often composed of disparity neurons. A disparity neuron is active when similar features are simultaneously present on the left and right images with a given amount of disparity between their locations on the two images, and thus codes a potential match between those features. Cooperative models implement constraints for matching in the form of excitatory and inhibitory connections between disparity neurons. This implementation can be done in various ways [2,3,4,6,7,8,9]. The existence of cooperativity in human binocular vision is suggested by psychophysical experiments (see [10] for a review of several experiments). Stereo matching can also be viewed as an optimization process: a cost function can be defined from constraints for matching and implemented in the synaptic weights of a Hopfield network, which then converges towards a state that represents a valid solution to the correspondence problem [11]. Another class of methods comprises the coarse-to-fine models, which exploit a property of the spatial filtering of images: the probability of having false matches between features at a given disparity is linked to the spatial scale at which those features are extracted. The matching process can thus operate at several spatial scales, with false matches at each scale being avoided, and with the processing at one scale constraining the search at the immediately finer scale [12,13].
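For concreteness, the following Python fragment sketches a hand-wired cooperative update in the spirit of the classic algorithm of Marr and Poggio [2], with excitation between same-disparity neighbours (smoothness) and inhibition between competing disparities (uniqueness). The array shapes, the restriction of inhibition to a single line of sight, and the values of epsilon and theta are our own simplifications for illustration; this is not the model proposed in the present paper.

```python
import numpy as np

def marr_poggio_step(C, C0, epsilon=2.0, theta=3.0, radius=1):
    """One cooperative update on a single image row (rough sketch).

    C, C0 : arrays of shape (D, N) of 0/1 match states, C0 being the
            initial potential matches. Returns the next state.
    """
    D, N = C.shape
    C_next = np.zeros_like(C, dtype=float)
    for k in range(D):
        for i in range(N):
            # excitation: active neighbours coding the same disparity
            lo, hi = max(0, i - radius), min(N, i + radius + 1)
            excit = C[k, lo:hi].sum() - C[k, i]
            # inhibition: competing matches for the same image point
            inhib = C[:, i].sum() - C[k, i]
            s = excit - epsilon * inhib + C0[k, i]
            C_next[k, i] = 1.0 if s >= theta else 0.0
    return C_next
```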

In addition to cooperativity and coarse-to-fine processing, learning can take place in the matching process, in which case it can be referred to as "learning the constraints". Supervised learning with an external (human) teacher, as in the back-propagation algorithm, avoids the empirical adjustment of parameters to find their optimal values [14], but is difficult to use with real-world images, as supervision means indicating to the model all the true point-to-point correspondences. However, internal supervision signals can be generated, for example by maximizing the mutual information of two modules that look at nearby parts of the inputs [15]. A correlation matrix can also be used to associate the disparity information with the real depth of patterns [16]. Our present study concentrates on how some stereo matches can be found by self-organization. The architecture of the proposed model is of the cooperative type, but instead of adjusting the weights of the interconnections between disparity neurons ourselves, we are interested in finding a learning procedure that allows the network to find such weights autonomously. Simulations with RDS of different dot densities and with small stereo images extracted from larger images of real-world scenes are described.

2. Network architecture and operation

The neural network that we have defined for stereo fusion is composed of two input layers (the stereo visual inputs) and D layers of disparity neurons, where D is the number of disparity levels to be processed (fig. 1). Each disparity neuron has two input links, connecting one unit of each input layer to it; the disparity it processes is defined by the shift, along the abscissa axis, between the locations of the two input units connected to it in their respective layers. Each layer of disparity neurons extracts a single value of disparity: we therefore call them Constant-Disparity Maps (CDMs). The disparity neurons also have one receptive field on each CDM, square-shaped and centred at the same location on all of them (the disparity receptive fields). The network is thus of the recurrent and cooperative type.

Fig. 1. (a): Architecture of the network used for stereo fusion. The indices of the CDMs correspond to the disparity they process. A few input links and receptive fields are shown: the stereo inputs of two disparity neurons (one of CDM -2 and one of CDM +1), together with the disparity receptive fields of two other neurons (one of CDM 0 and one of CDM +2). In this example the network is composed of 5 CDMs. (b): Detail of the stereo visual inputs of some disparity neurons. The CDMs are represented by the dashed lines. (c): Internal structure of a disparity-processing neuron.

The output of the neuron processing the disparity k at the image location (i, j) is defined by:

y_{i,j,k} = x_{r,i,j} \cdot x_{l,i+k,j} \cdot X

with

X = \sigma\left( \sum_{k'=-D/2}^{+D/2} \; \sum_{i'=i-n/2}^{i+n/2} \; \sum_{j'=j-n/2}^{j+n/2} w_{i',j',k';\,i,j,k} \cdot y_{i',j',k'} \right)

where i and j are discrete and vary between 0 and N, and k between -D/2 and +D/2. x_{r,i,j} is the right visual input of the disparity neuron, at the image location (i, j), and x_{l,i+k,j} the left visual input, at the image location (i+k, j); they are also the output values of the right and left input units connected to this disparity neuron, respectively. The ordinates of these two input units in the input layers are the same; the abscissa of the left input unit is shifted, with respect to the abscissa of the right input unit, by an amount equal to the disparity coded by the disparity neuron. σ is a threshold function: σ(x) = 1 if x > 0, 0 otherwise; w_{i',j',k';i,j,k} is the synaptic weight of the link connecting the neuron (i', j', k') (with output y_{i',j',k'}) to the neuron (i, j, k); n is the number of links per side of the disparity receptive fields. The network we have simulated was composed of D = 7 CDMs of N×N disparity neurons, with N = 25, 32 or 50, and of disparity receptive fields of n×n links, with n = 7.

Another group of D neurons is used to code the global activities of the CDMs: the global-activity neurons, each having a receptive field covering the whole of one CDM and computing the sum of its inputs. Their goal is to detect in which CDM the density of matches is maximal, information that is needed in the self-supervised learning process described in the next section. This CDM can be identified by detecting the global-activity neuron of maximal output (which can be achieved, for example, by a winner-take-all process).

Initially, the disparity neurons code all the potential matches between the points present in the two input layers: at t = 0, we thus have y_{i,j,k} = x_{r,i,j} · x_{l,i+k,j} (X = 1). The operation of the disparity neurons is then analogous to that of some binocular neurons observed in the visual cortex, whose response as a function of the binocular inputs is non-linear [17].
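As an illustration, the Python sketch below implements this operation for binary inputs. The array layout (a (D, N, N) array of outputs and a (D, N, N, D, n, n) weight array) and the handling of image borders are our own conventions, not the author's implementation.

```python
import numpy as np

def potential_matches(x_left, x_right, D):
    """p[k, i, j] = x_r[i, j] * x_l[i + (k - D//2), j] for binary input maps.

    The paper indexes image points as (i, j) with i the abscissa; here axis 0
    of the arrays is taken as that horizontal axis. The D CDMs code the
    disparities -D//2 ... +D//2 (e.g. -3 ... +3 for D = 7).
    """
    N = x_right.shape[0]
    p = np.zeros((D, N, N))
    for k in range(D):
        d = k - D // 2                      # disparity coded by this CDM
        for i in range(N):
            il = i + d
            if 0 <= il < N:
                p[k, i, :] = x_right[i, :] * x_left[il, :]
    return p

def activate(y, p, w, n=7):
    """One recurrent activation pass: y_new = p * sigma(weighted sum over
    the disparity receptive fields of the current outputs y)."""
    D, N, _ = y.shape
    h = n // 2
    y_new = np.zeros_like(y)
    for k in range(D):
        for i in range(N):
            for j in range(N):
                s = 0.0
                for kp in range(D):                      # source CDM
                    for di in range(-h, h + 1):          # receptive-field offsets
                        for dj in range(-h, h + 1):
                            ip, jp = i + di, j + dj
                            if 0 <= ip < N and 0 <= jp < N:
                                s += w[k, i, j, kp, di + h, dj + h] * y[kp, ip, jp]
                # sigma is a simple threshold; p holds the visual product x_r * x_l
                y_new[k, i, j] = p[k, i, j] * (1.0 if s > 0.0 else 0.0)
    return y_new
```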

3. Stereo correspondence and self-supervised learning

The goal of learning here is to find weights between the disparity neurons such that the activation of the network inactivates the false matches and keeps the true matches active. The learning principle is based on a global property of the potential matches between two stereo images representing a surface located at a single depth [18]: the density of the true matches is the same as the density of the elements to be matched, whereas the density of the false matches is lower. This global information can be used to derive a supervisory signal in a learning process, which then becomes self-supervised. The learning rule that we have used is an error-correcting rule, also utilized in supervised learning and in learning by punish/reward [19]. The weights of the links interconnecting the disparity neurons are modified by the small quantities:

\Delta w_{ij} = \alpha \cdot y_i \cdot (y_j^d - y_j)

where i is the index of the source neuron and j that of the target neuron, w_{ij} the weight of the link between those two neurons, α a learning rate (lower than 1), y the output of a neuron and y^d the desired value of this output. If, during learning, the stereo inputs represent surfaces located at the same depth within the range (-D/2, +D/2), the desired outputs of the disparity neurons can be known by means of the global-activity neurons, which detect in which CDM the density of matches is maximal: the actual outputs of the disparity neurons are the result of their activation; the desired outputs are the initial outputs in the CDM of highest global activity (true matches), and zero in the others. The presentation of a new input to the network (points of RDS or feature points of real-world images), followed by the activation of the disparity neurons and the adaptation of the weights of the disparity receptive fields by the error-correcting learning rule, makes up one learning step.
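A minimal sketch of such a learning step, reusing potential_matches() and activate() from the listing above, might look as follows. The learning rate, the choice of the presynaptic term, and the optional constrain flag (which anticipates the learning constraint introduced in section 5) are assumptions on our part; the excerpt does not fix these details.

```python
import numpy as np

def learning_step(x_left, x_right, w, D=7, n=7, alpha=0.05, constrain=False):
    """One self-supervised learning step (sketch, not the author's code)."""
    p = potential_matches(x_left, x_right, D)     # outputs at t = 0 (X = 1)
    y = activate(p, p, w, n)                      # actual outputs after activation

    # Self-generated supervision: the CDM with the highest global activity
    # (densest potential matches) is taken to contain the true matches.
    k_star = int(np.argmax(p.sum(axis=(1, 2))))

    # Desired outputs: initial outputs in that CDM, zero in all the others.
    y_desired = np.zeros_like(p)
    y_desired[k_star] = p[k_star]

    # Error-correcting rule: delta_w = alpha * y_source * (y_desired - y_actual).
    N = p.shape[1]
    h = n // 2
    for k in range(D):
        for i in range(N):
            for j in range(N):
                err = y_desired[k, i, j] - y[k, i, j]
                if err == 0.0:
                    continue
                for kp in range(D):
                    # optional learning constraint (section 5): only adapt links
                    # whose source lies in the CDM of maximal activity
                    if constrain and kp != k_star:
                        continue
                    for di in range(-h, h + 1):
                        for dj in range(-h, h + 1):
                            ip, jp = i + di, j + dj
                            if 0 <= ip < N and 0 <= jp < N:
                                # presynaptic term: we use the initial outputs p;
                                # this detail is our assumption
                                w[k, i, j, kp, di + h, dj + h] += alpha * p[kp, ip, jp] * err
    return w
```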

4. Information processed by the network

Two sets of simulations have been carried out: the first uses RDS composed of black dots whose density varies from 0.1 to 0.5, on a white background; the second uses feature elements extracted from real-world images. In the first set, the stereo inputs represent a surface located at a single depth; in the second set, they are made of two regions of a single image, shifted along the abscissa axis by an amount within the range of allowable disparity (-D/2, +D/2). In both cases the stereo inputs are mapped onto the input layers of the network.
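As an illustration, a stereo pair of the first type can be generated as in the sketch below. The exact generation procedure (in particular how the uncovered border is filled) is not given in the paper, so this is only one plausible convention. With the conventions of the earlier listings, make_rds(disparity=+2) produces dots whose true matches all fall in the CDM coding disparity +2.

```python
import numpy as np

def make_rds(N=32, density=0.3, disparity=2, rng=None):
    """Generate a random-dot stereogram showing one fronto-parallel surface.

    Returns binary maps (x_left, x_right) of shape (N, N) in which 1 codes a
    black dot. The pattern is shifted horizontally by `disparity` pixels
    between the two views, so every dot has the same disparity. Axis 0 is
    the abscissa (horizontal) axis, as in the paper's indexing.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_right = (rng.random((N, N)) < density).astype(float)
    x_left = np.zeros_like(x_right)
    if disparity >= 0:
        x_left[disparity:, :] = x_right[:N - disparity, :]
        # fill the uncovered border with fresh random dots (our convention)
        x_left[:disparity, :] = (rng.random((disparity, N)) < density)
    else:
        d = -disparity
        x_left[:N - d, :] = x_right[d:, :]
        x_left[N - d:, :] = (rng.random((d, N)) < density)
    return x_left, x_right
```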

In the case of real-world images, the features which are matched are the zero-crossings of the convolution of the images with a Laplacian-of-Gaussian operator (Mexican hat). Convolution with such a filter detects the intensity changes in an image. The zero-crossings of the convolved images, for several spatial scales of the filter, can be combined into a single representation which corresponds to the edges in the scene [20]. In our model, the detection of zero-crossings is made at a single scale, for a given orientation (horizontal or vertical) and a given sign of contrast (positive or negative intensity change). We have defined a detection scheme based on sigma-pi units, which can compute the product of some of their inputs [21] (fig. 2(a)). Our detectors of zero-crossings in one direction are composed of two linear units and one sigma-pi unit. Four layers of such detectors are needed to code the horizontal and vertical zero-crossings and the two contrast signs. The bottom-left image of figure 2(b) shows the state of a map of detectors for one orientation and one contrast sign. The features thus extracted correspond roughly to the contours of the surfaces and of the textural elements of the scene. The detection of horizontal (resp. vertical) zero-crossings gives rise to vertical (resp. horizontal) and oblique segments, and to isolated points. The simulations of stereo fusion with real-world scenes use only the horizontal zero-crossings of the convolution and one contrast sign.

With the two types of input information (RDS and feature points of real-world images), the stereo visual inputs of the disparity neurons are binary. Their output is then binary too, as it is the product of three binary terms. The disparity neurons then act as logical AND gates, but the advantage of using the product instead of the AND operation is that it allows the further extension of the model to the processing of analog inputs and/or the use of other activation functions.

Fig. 2. (a): Zero-crossing detectors based on the use of sigma-pi units. im(x, y) is the pixel of the convolved image located at (x, y). The orientation of the detected contour segments is indicated at the output of the detectors, together with the sign of the corresponding intensity contrast. (b): Extraction of features from a real-world scene: example of source image (top-left), result of its convolution with a Mexican hat (top-right), horizontal zero-crossings for one sign of contrast (bottom-left) and sum of the two orientations and the two contrast signs (bottom-right).
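The sketch below gives one possible reading of the detectors of fig. 2(a): the filtered image is compared at two adjacent pixels along one direction, and the detector responds when the sign changes in a given order. The filter scale and the exact sign convention are assumptions, and SciPy's gaussian_laplace is used in place of an explicit Mexican-hat kernel (the zero-crossing positions are the same, up to a sign flip).

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def horizontal_zero_crossings(image, sigma=2.0, positive_sign=True):
    """Detect zero-crossings along the horizontal direction of a
    Mexican-hat-filtered image, for one sign of contrast (sketch).

    Uses the usual image convention image[row, col], so the horizontal
    direction is axis 1. Returns a binary map of the same shape.
    """
    # Mexican hat = negative Laplacian-of-Gaussian; only the sign of
    # contrast flips between the two conventions.
    conv = -gaussian_laplace(image.astype(float), sigma=sigma)

    # Two "linear units" look at horizontally adjacent pixels; the
    # "sigma-pi unit" fires when both rectified outputs are non-zero,
    # i.e. when the filtered image changes sign between the two pixels.
    a = conv[:, :-1]
    b = conv[:, 1:]
    zc = (a > 0) & (b < 0) if positive_sign else (a < 0) & (b > 0)

    out = np.zeros_like(conv)
    out[:, :-1] = zc
    return out
```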

5. Stereo fusion

In order to evaluate the performance of learning, we have defined two rates of matches which remain active after M successive activations of the network: the rate of true matches (tm) and the rate of false matches (fm):

tm(M) = \frac{n_{tm}(M)}{n_{tm}(0)} \qquad\qquad fm(M) = \frac{n_{fm}(M)}{n_{fm}(0)}

where n(M) is the number of active matches after the Mth iteration of activation. M = 0 corresponds to the initial state of the network, i.e. to all the potential matches being active. In the ideal case, tm would converge towards 100% and fm towards 0%. Figure 3 shows the progression of those two rates as a function of the number of learning steps, for different input information, with M = 1. If all the weights of the disparity neurons are modified at each learning step, the progression of learning is somewhat slow, especially for low densities of points (fig. 3, column (a)). A way of accelerating this progression is to modify, for a given disparity neuron, only the weights of its input links connected to the CDM of maximal activity at each learning step. This learning constraint allows a faster progression of the learning rates towards the desired values (fig. 3, column (b)). Figure 4 shows the effect of the activation of the network after learning, in the case of RDS and in that of real-world scenes.
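For reference, the two rates can be computed as follows when the true disparity of each point is known, as it is for the RDS used here; the ground-truth map true_cdm is our own bookkeeping device, not part of the network.

```python
import numpy as np

def match_rates(y, p, true_cdm):
    """Rates (in %) of true and false matches that remain active (sketch).

    y        : (D, N, N) outputs after M activations
    p        : (D, N, N) potential matches (state at M = 0)
    true_cdm : (N, N) integer map giving, for each right-image point, the
               index k of the CDM containing its true match
    """
    D = p.shape[0]
    k_idx = np.arange(D).reshape(D, 1, 1)
    true_mask = (k_idx == true_cdm) & (p > 0)    # true potential matches
    false_mask = (k_idx != true_cdm) & (p > 0)   # false potential matches
    tm = 100.0 * (y[true_mask] > 0).sum() / max(true_mask.sum(), 1)
    fm = 100.0 * (y[false_mask] > 0).sum() / max(false_mask.sum(), 1)
    return tm, fm
```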

[Figure 3: plots of tm and fm (in %) versus the number of learning steps, for RDS of density 0.1, RDS of density 0.5 and real-world scenes; column (a) runs to 10,000-15,000 learning steps and column (b) to 5,000-10,000.]

Fig. 3. Progression of the rates tm and fm as a function of the number of learning steps, in the case of RDS for two different densities of black dots (0.1 and 0.5) and in the case of vertical contour elements extracted from a real-world scene, for a given sign of contrast. Column (a) corresponds to learning without the learning constraint, and column (b) to learning with the learning constraint defined in the text.


Fig. 4. Results of activation of the network after learning (stereo fusion). (a): Case of input layers and CDMs of size 32×32, with RDS displaying a surface located at a single depth (from top to bottom: stereo inputs, and state of the CDMs before (potential matches) and after activation). (b): Case of input layers and CDMs of size 25×25, with RDS displaying three surfaces located at different depths (from top to bottom: stereo inputs, state of the network before activation, after activation, after two successive activations, and good responses). (c): Case of CDMs of size 50×50, in which the mean weights obtained by learning with inputs like those of (a) have been copied. The stereo inputs represent an RDS with a "wedding-cake" 3D structure. (d): Stereo images of size 256×256 used for simulations with real-world scenes. (e): Case of stereo images of size 25×25 made up of regions of a single image of (d) (from left to right and from top to bottom: stereo images, result of their convolution with the Mexican hat and detection of the horizontal zero-crossings of the convolution for one contrast sign, and state of the CDMs before and after activation of the network). (f): Generalization of stereo fusion to a real stereo case (for corresponding regions of the stereo images of (d)).

The RDS used are of two types: the first is composed of one surface located at a single depth (of the same type as those used for learning), and the second of three surfaces located at different depths. With this second type (fig. 4(b)), after one activation of the network, the rates tm and fm are respectively 78.97% and 0.62%, and after two successive activations, 99.16% and 4.66%. The model thus has the ability to generalize its stereo fusion, to a certain extent, to inputs that it has not learned.

6. Weights resulting from learning

After learning, the distribution of the weights of the links interconnecting the disparity neurons (fig. 5) shows that an inhibition has appeared between the neurons processing the same location in one of the two stereo input layers (but different disparities, in the case of learning with a constraint), and an excitation between the neurons processing the same disparity but different locations in the inputs. The interactions are similar in the case of the real-world images, but have an elongated vertical form, as the activity of a disparity neuron is often accompanied by the activity of its neighbours on the vertical axis, and never by the activity of its nearest neighbours on the horizontal one, owing to the way the features are extracted. Those interactions could correspond to the uniqueness constraint (inhibition) and the constraint of smoothness of disparity (excitation), used in previous cooperative models [2]. The shape of the interactions is roughly the same in the case of learning with a constraint (described in section 5) as in the case of learning without constraints, except that in the second case the recurrent connection between a disparity neuron and itself has become strongly inhibitory.
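A fig. 5-style summary of what has been learned can be obtained by averaging the receptive-field weights over all neurons of each CDM. This is only an analysis sketch, using the array layout assumed in the earlier listings.

```python
import numpy as np

def mean_receptive_fields(w):
    """Mean n x n weight map per (target CDM, source CDM) pair.

    w : (D, N, N, D, n, n) weights indexed by target neuron (k, i, j),
        source CDM k' and receptive-field offset (di, dj).
    Returns an array of shape (D_target, D_source, n, n), obtained by
    averaging over the spatial positions (i, j) of the target neurons.
    """
    return w.mean(axis=(1, 2))
```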


Fig. 5. Mean weights obtained by learning, in the disparity receptive fields (7 receptive fields of size 7×7 links), in the case of RDS (left of (a) and (b)) and in that of real-world images (right). The horizontal index is that of the source CDM of the receptive fields and the vertical index that of the destination CDM. Light grey represents negative values and dark grey positive values. (a): Learning with a constraint. (b): Learning without constraints.

7. Conclusion

A neural network model of stereo vision of the cooperative type has been presented in this paper. This model is able to find matches in random-dot stereograms and in stereo images of real-world scenes by unsupervised learning. It appears that two constraints for matching have been learned by the network: the uniqueness constraint and the continuity-of-disparity constraint. The learning rule is an error-correcting rule, used to adapt the synaptic weights of the interconnections between the disparity neurons of the network. The learning procedure is self-supervised: an internal supervisory signal is generated from the fact that the density of the true matches is greater than the density of the false matches when the inputs represent surfaces located at the same depth, that is, in a single fronto-parallel plane.

Only small images and small disparities are processed in the present simulations, because the numerous links of the network use a lot of computer memory. For the processing of larger disparities, which is useful especially in the case of real-world images, the model has to be adapted, for example by means of a complementary coarse-to-fine vergence process, based on the minimization of a global disparity, which allows the axes of view of the two cameras to converge on the same point of a scene [22].

Simulations show that very good rates of match determination are obtained after the learning phase, with inputs of the same type as those used for learning. The model also shows a generalization ability: the rates are still good in the case of RDS displaying several fronto-parallel surfaces located at different depths, which may be considered as cases where small disparity gradients are present in the inputs. In this model, some matches in stereo vision correspondence are found by self-organization. But in order to increase the self-organization properties of such a network, it would be interesting to find a way to still generate internal supervisory signals when disparity gradients are present in the inputs, so that such inputs could be learned. From the engineering viewpoint, the benefits (or disadvantages) of bringing the type of self-organization proposed here to stereo vision can be quantified by comparing the matching performance of this model with that of other approaches, especially in the case of real-world images, as this concerns a broad range of applications.

8. References

1. U. R. Dhond, J. K. Aggarwal, "Structure from stereo: a review", IEEE Trans. on Syst., Man, and Cybern., 19(6), 1489-1510 (1989).
2. D. Marr, T. Poggio, "Cooperative computation of stereo disparity", Science, 194, 283-287 (1976).
3. J. E. W. Mayhew, J. P. Frisby, "Psychophysical and computational studies towards a theory of human stereopsis", Artif. Intell., 17, 349-385 (1981).
4. S. B. Pollard, J. E. W. Mayhew, J. P. Frisby, "PMF: a stereo correspondence algorithm using a disparity gradient limit", Perception, 14, 449-470 (1985).
5. R. Blake, H. R. Wilson, "Neural models of stereoscopic vision", Trends in Neurosciences, 14(10), 445-452 (1991).
6. P. Dev, "Perception of depth surfaces in random-dot stereograms: a neural model", Int. J. Man-Machine Studies, 7, 511-528 (1975).
7. J. I. Nelson, "Globality and stereoscopic fusion in binocular vision", J. Theor. Biol., 49, 1-88 (1975).
8. N. Sugie, M. Suwa, "A scheme for binocular depth perception suggested by neurophysiological evidence", Biological Cybernetics, 26, 1-15 (1977).
9. S. Grossberg, J. A. Marshall, "Stereo boundary fusion by cortical complex cells: a system of maps, filters, and feedback networks for multiplexing distributed data", Neural Networks, 2, 29-51 (1989).
10. B. Julesz, "Early vision and focal attention", Reviews of Modern Physics, 63(3), 735-772 (1991).
11. J. J. Lee, J. C. Shim, Y. H. Ha, "Stereo correspondence using the Hopfield neural network of a new energy function", Pattern Recognition, 27(11), 1513-1522 (1994).
12. D. Marr, T. Poggio, "A computational theory of human stereo vision", Proc. R. Soc. London B, 204, 301-328 (1979).
13. W. E. L. Grimson, "Computational experiments with a feature based stereo algorithm", IEEE Trans. on Pattern Anal. and Mach. Intell., 7(1), 17-34 (1985).
14. A. Khotanzad, A. Bokil, Y.-W. Lee, "Stereopsis by constraint learning feed-forward neural networks", IEEE Trans. on Neural Networks, 4(2), 332-342 (1993).
15. S. Becker, G. E. Hinton, "Self-organizing neural network that discovers surfaces in random-dot stereograms", Nature, 355, 161-163 (1992).
16. A. J. O'Toole, "Structure from stereo by associative learning of the constraints", Perception, 18, 767-782 (1989).
17. P. O. Bishop, G. H. Henry, C. J. Smith, "Binocular interaction fields of simple units in the cat striate cortex", J. Physiol., 216, 39-68 (1971).
18. D. Marr, "Vision", Freeman, New York (1982).
19. B. Widrow, N. K. Gupta, S. Maitra, "Punish/reward: learning with a critic in adaptive threshold systems", IEEE Trans. on Syst., Man and Cybern., 3(5), 455-465 (1973).
20. D. Marr, E. Hildreth, "Theory of edge detection", Proc. R. Soc. London B, 207, 187-217 (1980).
21. D. E. Rumelhart, J. L. McClelland and the PDP Research Group, "Parallel distributed processing: explorations in the microstructure of cognition. Volume 1: Foundations", The MIT Press, Cambridge, MA (1986).
22. B. Decoux, R. Debrie, "Vergence and stereo fusion for 3-D reconstruction", Proc. NEURAP'95/96 (Neural Networks and their Applications), pp. 299-305, Marseille, France (1996).