Deep Learning Compact and Invariant Image Descriptors for Instance Retrieval
Olivier Morère
08/06/2016
About this thesis
Antoine VEILLARD Daniel RACOCEANU
Vijay CHANDRASEKHAR Hanlin GOH
Image Instance Retrieval
• Existing image database
• New query image
• Retrieve database images depicting the same object
A Wide Range of Applications

A Challenging Problem
Comparing Image Global Descriptors for Retrieval
[Diagram: an image global descriptor (e.g. a binary vector) is computed for the new query image and for every image in the existing database; retrieval ranks the database images by their pairwise distances to the query descriptor]
SIFT Descriptor (Local Descriptor)
• Blob response: maxima of the filter response Dxx·Dyy − (0.9·Dxy)², where Dxx, Dyy and Dxy are filter responses along x and y
• Orient the detected patch along the dominant gradient
• Describe the oriented patch from its gradient field
Global Descriptor: VLAD/FV
• Step 1: train K clusters of local descriptors on the image training set
• Step 2: extract 128-dimensional local descriptors from the new image
• Step 3: compute local descriptor residual statistics in each bin
• Step 4: concatenate the residuals into the K × 128 global descriptor
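The four steps can be sketched in a few lines of NumPy (a minimal illustrative sketch: Step 1, training the K centroids with k-means on a separate descriptor set, is assumed already done; all names are illustrative):

```python
import numpy as np

def vlad(descs, centroids):
    """Aggregate local descriptors into a VLAD global descriptor.

    descs: (N, 128) local descriptors from a new image (Step 2).
    centroids: (K, 128) cluster centers trained beforehand (Step 1).
    """
    k, d = centroids.shape
    # Step 3: assign each descriptor to its nearest cluster and
    # accumulate the residuals (descriptor - centroid) per bin.
    assign = ((descs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
    v = np.zeros((k, d))
    for j in range(k):
        members = descs[assign == j]
        if len(members):
            v[j] = (members - centroids[j]).sum(axis=0)
    # Step 4: concatenate the K residual vectors into one K*d vector.
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)  # L2-normalize for comparison
```

The L2 normalization at the end makes descriptors comparable with a dot product or Euclidean distance at retrieval time.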
ImageNet Results over the Years
Deep Convolutional Neural Networks (CNN)
[Krizhevsky, 2012]
Convolution & Pooling
• Convolution: local connectivity; stationarity of the signal
• Pooling: dimensionality reduction; local invariance
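These two properties can be made concrete with a minimal single-channel NumPy sketch (illustrative only: no strides, padding or channels):

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Convolution: each output value depends only on a small local
    # window (local connectivity), and the same kernel is reused at
    # every position (stationarity of the signal).
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2(x):
    # Pooling: 2x2 max-pooling halves each spatial dimension
    # (dimensionality reduction) and keeps only the strongest local
    # response, tolerating small shifts (local invariance).
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).max(axis=(1, 3))
```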
Learning Abstract Visual Representations [Zeiler, 2013]
[Figure: feature visualizations for layers 1 through 5]
Original Contributions
• Image classification with Convolutional Neural Networks (CNN), ImageNet 2014
• Thorough comparison study of FV and CNN for image instance retrieval
• Hashing CNN descriptors: unsupervised dimensionality reduction; descriptor fine-tuning with unsupervised and semi-supervised metric learning
• Robust CNN descriptors with i-theory
List of Publications (1/2)
Conference papers
• O. Morère, J. Lin, A. Veillard, V. Chandrasekhar, T. Poggio. Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval. Submitted to European Conference on Computer Vision (ECCV) 2016
• O. Morère, J. Lin, V. Chandrasekhar, A. Veillard, H. Goh. Co-Sparsity Regularized Deep Hashing for Image Instance Retrieval. Accepted in International Conference on Image Processing (ICIP) 2016
• O. Morère, J. Lin, J. Petta, V. Chandrasekhar and A. Veillard. Tiny Descriptors for Image Retrieval with Unsupervised Triplet Hashing. Data Compression Conference (DCC) 2016
• C. Dao-Duc, H. Xiaohui and O. Morère. Maritime Vessel Images Classification Using Deep Convolutional Neural Networks. Symposium on Information and Communication Technology (SOICT) 2015
• V. Chandrasekhar, J. Lin, O. Morère, A. Veillard and H. Goh. Compact Global Descriptors for Visual Search. Data Compression Conference (DCC) 2015

List of Publications (2/2)
Technical reports
• O. Morère, A. Veillard, J. Lin, J. Petta, V. Chandrasekhar and T. Poggio. Group Invariant Deep Representations for Image Instance Retrieval. Center for Brains, Minds and Machines (CBMM) 2016
• O. Morère, J. Lin, V. Chandrasekhar, A. Veillard and H. Goh. DeepHash: Getting Regularization, Depth and Fine-tuning Right. arXiv preprint arXiv:1501.04711 2015

Contests and workshops
• O. Morère, A. Veillard, H. Goh. Team "LateFusion". Kaggle National Data Science Bowl Challenge 2015
• O. Morère, H. Goh, A. Veillard, V. Chandrasekhar. Large Scale Image Classification on a Shoe String. ImageNet Large Scale Visual Recognition Challenge, European Conference on Computer Vision (ECCV) 2014

Journal articles
• O. Morère, V. Chandrasekhar, J. Lin, H. Goh and A. Veillard. A Practical Guide to CNNs and Fisher Vectors for Image Instance Retrieval. Accepted in Signal Processing (SIGPRO) 2016
PART 1
1. Thorough comparison study of FV and CNN for image instance retrieval
2. Hashing CNN descriptors
3. Robust CNN descriptors with i-theory
CNN vs Fisher Vector
[Diagram: the same input image is processed by a Fisher Vector pipeline to produce the FV global descriptor, and by a deep convolutional neural network to produce the CNN global descriptor]
How to Extract CNN Descriptors for Image Instance Retrieval?
[Diagram: the CNN layer hierarchy spans from visual (lower layers) to semantic (upper layers, ending in class labels such as "CAT"); which layer to extract from is the open question]
Retrieval Data Sets
• Stanford Mobile Visual Search [Chandrasekhar, 2011] (object centric)
• University of Kentucky Benchmark [Nister, 2006] (object centric)
• Oxford Buildings [Philbin, 2007] (scene centric)
• INRIA Holidays [Jégou, 2008] (scene centric)
Evaluation Metrics

Recall = |{RelevantIndividuals} ∩ {RetrievedIndividuals}| / |{RelevantIndividuals}|

AP = ( Σ_{k=1}^{n} Precision(k) × isRelevant(k) ) / |{RelevantIndividuals}|
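Both metrics compute directly from a ranked result list (a minimal sketch; the argument names are illustrative):

```python
def recall(relevant, retrieved):
    # Recall = |Relevant ∩ Retrieved| / |Relevant|
    relevant, retrieved = set(relevant), set(retrieved)
    return len(relevant & retrieved) / len(relevant)

def average_precision(relevant, ranked):
    # AP = (sum over ranks k of Precision(k) * isRelevant(k)) / |Relevant|
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k  # Precision(k), evaluated at relevant ranks
    return total / len(relevant)
```

Mean Average Precision (mAP), reported in the plots that follow, is the mean of AP over all queries.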
Best Practices for CNN Descriptors
Best Practices for FV Interest Points
• Multi-scale improves performance over single-scale
• DoG is required when there are big scale and rotation changes (e.g. Graphics)
• Dense works better with highly textured images (e.g. Holidays)
Impact of Rotation
[Diagram: query images of unknown objects arrive with unknown orientations; feature extraction on rotated queries produces query image vectors, which are matched against the database image vectors]
Impact of Rotation - CNN (Holidays)
[Plot: mean average precision vs. query rotation angle (−200° to 200°) for layers pool5, fc6, fc7 and fc8]
• Very limited invariance to rotation
• Invariance to rotation does not increase with depth
Impact of Rotation - CNN vs FV (Holidays)
[Plot: mean average precision vs. query rotation angle (−200° to 200°) for OxfordNet-FC6 and the FV variants FV-DoG, FV-DS and FV-DM]
Further Readings O. Morère, V. Chandrasekhar, J. Lin, H. Goh and A. Veillard. A Practical Guide to CNNs and Fisher Vectors for Image Instance Retrieval. Accepted in Signal Processing (SIGPRO) 2016
Part 1: Summary • CNN performs well for image instance retrieval • Two problems remain to be addressed: • Descriptor dimensionality • Lack of robustness
PART 2
1. Thorough comparison study of FV and CNN for image instance retrieval
2. Hashing CNN descriptors
3. Robust CNN descriptors with i-theory
Why 64-bit Hash?
• Motivation: billions of images can be stored in RAM; fast matching with ultra-fast Hamming distance
• Challenges: global descriptors are very high dimensional (uncompressed: 4K-25K floating point numbers)
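The motivation is easy to see in code: a 64-bit hash is 8 bytes per image, so a billion images need about 8 GB of RAM before index overhead, and each comparison is one XOR plus a popcount (a minimal sketch; the function names are illustrative):

```python
def hamming64(a, b):
    # Hamming distance between two 64-bit hashes:
    # popcount of the bitwise XOR.
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

def nearest(query, database):
    # Brute-force linear scan; each comparison costs a few CPU
    # instructions, so even very large databases scan quickly.
    return min(database, key=lambda h: hamming64(query, h))
```

On Python 3.10+ the popcount can use the faster built-in `int.bit_count()`.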
Hashing Outline
[Pipeline diagram]
• Global feature extraction: a Fisher Vector (8K-64K dim.) or a deep convolutional neural network (4K dim.) maps the input image to a high-dimensional image descriptor
• Training phase 1: unsupervised stacked regularized RBMs (weights W1, W2, ..., WL) are trained on these descriptors
• Training phase 2 (fine-tuning): the model is transferred into a deep Siamese network trained on matching & non-matching pairs, with one loss per layer (Loss1, Loss2, ..., LossL)
• Testing: the trained DeepHash model hashes an image descriptor into a compact binary hash (64-1K bits)
Hashing Outline
1. Dimensionality reduction with stacked RBM
2. Semi-supervised fine-tuning with Siamese networks
3. Unsupervised fine-tuning with triplet networks
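For step 2, the idea of fine-tuning on matching and non-matching pairs can be illustrated with a standard contrastive (Siamese) objective. This is a generic textbook formulation, not necessarily the exact per-layer loss used in the thesis:

```python
def contrastive_loss(d, is_match, margin=1.0):
    # d: distance between the two descriptors produced by the twin
    # networks. Matching pairs are pulled together; non-matching
    # pairs are pushed apart until at least `margin` away.
    if is_match:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```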
Restricted Boltzmann Machine (RBM)
• Bipartite graph model with input units x and latent units z, connected by weights w
• Closed-form expression from one to the other:

Generate latent: P(z_j = 1 | x) = sigmoid( Σ_i w_ij · x_i )
Reconstruct input: P(x_i = 1 | z) = sigmoid( Σ_j w_ij · z_j )
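The two closed-form conditionals translate directly to code (a minimal sketch; bias terms are omitted, matching the equations above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def p_latent_given_input(x, w):
    # P(z_j = 1 | x) = sigmoid(sum_i w_ij x_i)
    return sigmoid(x @ w)

def p_input_given_latent(z, w):
    # P(x_i = 1 | z) = sigmoid(sum_j w_ij z_j)
    return sigmoid(z @ w.T)
```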
Contrastive Divergence [Hinton, 2002]
• From original training data x, sample the latent representation z with P(z|x)
• Reconstruct data x' with P(x|z), then compute the latent representation z' with P(z|x')
• Positive_ij = x_i · z_j
• Negative_ij = x'_i · z'_j
• Weight update: ΔW = ε · (Positive − Negative)
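One CD-1 update step, following the quantities above (a sketch: biases are omitted, and reconstruction probabilities are used in place of samples where that is common practice):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cd1_update(x, w, lr=0.1):
    # Contrastive divergence with a single Gibbs step (CD-1).
    z = sigmoid(x @ w)                      # P(z | x)
    z_sample = (rng.random(z.shape) < z) * 1.0
    x_recon = sigmoid(z_sample @ w.T)       # P(x | z): reconstruction x'
    z_recon = sigmoid(x_recon @ w)          # P(z | x'): latent z'
    positive = np.outer(x, z)               # Positive_ij = x_i z_j
    negative = np.outer(x_recon, z_recon)   # Negative_ij = x'_i z'_j
    return w + lr * (positive - negative)   # Delta W = eps (Pos - Neg)
```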
Greedily Training Stacked RBM [Hinton, 2006; Bengio, 2006]
Stacked RBM: 8192 units → RBM 1 → 1024 units → RBM 2 → 64 units
• Step 1: train RBM 1
• Step 2: freeze RBM 1's weights; train RBM 2 using RBM 1's latent layer as its input layer
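The greedy schedule can be sketched end to end (illustrative sketch with CD-1 and no biases; the thesis's stack uses sizes 8192 → 1024 → 64):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_rbm(data, n_hidden, epochs=3, lr=0.1):
    # Minimal CD-1 trainer for a single RBM layer.
    w = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        for x in data:
            z = sigmoid(x @ w)
            x_recon = sigmoid(((rng.random(z.shape) < z) * 1.0) @ w.T)
            z_recon = sigmoid(x_recon @ w)
            w += lr * (np.outer(x, z) - np.outer(x_recon, z_recon))
    return w

def train_stack(data, layer_sizes):
    # Greedy layer-wise training: train RBM 1, freeze its weights,
    # then train RBM 2 on RBM 1's latent activations, and so on.
    weights = []
    for n_hidden in layer_sizes:
        w = train_rbm(data, n_hidden)
        weights.append(w)
        data = sigmoid(data @ w)  # latent layer becomes next input layer
    return weights
```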
Regularization [Hinton, 2009]
• It is often desirable that the latent variables follow certain distributions
• For example, sparse distributions work well for classification
• Hinton proposes a regularization that encourages each unit activation q towards a target value p
i-theory: pooling the responses of a feature f_i over the orbit of a transformation group G = {g_0, ..., g_{m-1}} yields invariant statistics:

X_{G,i,n}(x) = ( (1/m) · Σ_{j=0}^{m-1} f_i(g_j · x)^n )^(1/n)
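The pooling formula translates directly to code (a minimal NumPy sketch; `group` is a list of transformations g_j and `f` a feature map, both illustrative):

```python
import numpy as np

def group_invariant_pool(x, group, f, n=2):
    # X_{G,i,n}(x) = ((1/m) * sum_{j=0}^{m-1} f(g_j . x)^n)^(1/n):
    # the n-th moment of the feature responses over the orbit of x.
    # n = 1 gives mean pooling; n -> infinity approaches max pooling.
    responses = np.array([f(g(x)) for g in group])  # shape (m, features)
    return np.mean(responses ** n, axis=0) ** (1.0 / n)

# Example group: the four 90-degree rotations; pooling over them makes
# the result invariant to those rotations of the input.
rotations = [lambda a, k=k: np.rot90(a, k) for k in range(4)]
```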
Nested Invariance Pooling (NIP)
• Translation invariance
• Rotation invariance
• Scale invariance
Distances for Three Matching Pairs
[Figure: panels (a), (b) and (c)]
NIP Evaluation
NIP + RBMH
• NIP transformation-invariant descriptors can be hashed into compact binary codes using RBMH
[Pipeline diagram]
1. CNN feature extraction from the input image
2. Nested Invariance Pooling over group transformations (G_S, n = 1; G_T, n = 2; G_R, n → ∞) produces a 512-dim. invariant descriptor
3. RBM for Hashing (RBMH), trained with a batch regularizer, produces a compact hash of 32-256 bits
NIP + RBMH - Evaluation
Conclusion
• Thorough comparison study of FV and CNN for image instance retrieval
• Hashing CNN descriptors: unsupervised dimensionality reduction; descriptor fine-tuning with unsupervised and semi-supervised metric learning
• Robust CNN descriptors with i-theory
Acknowledgements Antoine VEILLARD Daniel RACOCEANU
Vijay CHANDRASEKHAR Hanlin GOH Jie Lin