Semi-supervised clustering in image databases
Nizar Grira NII 1-11-06
Main Interest at NII
Data Mining (Knowledge Discovery in Databases)
Why do we need Data Mining?
Technologies such as the Internet have given a great boost to information collection and sharing
We are drowning in data, but starving for knowledge
How can I analyze this data?
Thursday, November 02, 2006
Semi-supervised clustering in image databases
2
Data Mining
Data Mining: discovering interesting (non-trivial, useful, unknown) patterns in data.
Database → Data preprocessing → Data mining (Classification, Clustering…) → Pattern presentation and evaluation → Knowledge
Outline
Workspace setting
Categorization by Pairwise Constraints
  Background: Competitive Agglomeration
  Contributions: Fuzzy Clustering with Pairwise Constraints
Results and discussion
Conclusion
Image Database Mining problem
Challenging situation: a huge number of images in an unexplored database
How to explore the content of such a collection?
Generate a good overview of the image database
Purpose
Provide effective access to the image database through a meaningful categorization
Rely on a machine learning method working on image features
Benefit from external knowledge the user can provide
Test databases
CO43: the 43 most difficult classes of the COIL-100 database
BS8: high-level semantic database (8 categories)
Image feature space
Intensity/color: basic and weighted histograms (hsv_hist 120 dimensions, lapl_rgb 216, prob_rgb 216)
Shape and structure: histogram based on the Hough transform (hough_sig 49 dimensions)
Texture: feature vectors based on the Fourier transform (four_sig 40 dimensions)
Discriminant signatures, complementary features
Challenging conditions
High-dimensional feature space
Unknown number of categories
Natural categories have various shapes and overlap
Visual descriptors sometimes fail to retrieve user expected categories
Pre-processing of feature space
We reduce dimensionality with linear principal component analysis (PCA)
[Figure: precision-recall curves (Precision vs. Recall) on CO43 and BS8, comparing "pca" with "all"]
After reducing the dimension by a factor of about 5, the overall loss of quality in the precision-recall diagrams stays within 5%.
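The reduction step can be sketched as follows. This is a minimal NumPy sketch, not the code used for the slides: the function name `pca_reduce` and the random stand-in data are assumptions, and it further assumes that the five signatures are concatenated into one 641-dimensional vector (120 + 216 + 216 + 49 + 40), which the slides do not state.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (linear PCA)."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # center each feature
    # Rows of Vt are principal directions, sorted by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance kept by the retained components.
    explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return Xc @ components.T, components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 641))            # random stand-in for image signatures (assumption)
Z, components, explained = pca_reduce(X, 641 // 5)   # about 5 times fewer dimensions
print(Z.shape, round(float(explained), 3))
```

On real signatures, the retained-variance fraction is one way to choose the reduced dimension before checking the effect on the precision-recall curves.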
Categorization methods: which one is more appropriate?
In many cases, knowledge in the form of labels for the relevant categories is not available (labels are hard to generate): classification is not appropriate
In clustering, we cannot incorporate specific knowledge, and the structure found in the data often does not reflect user expectations: clustering is not appropriate
Therefore, semi-supervised learning is important as a balance between these two approaches
Semi-supervised Learning
Use a few labeled items and the unlabeled ones to find frontiers between classes (Transduction)
What if there are many classes, most of them unknown a priori?
Add simple supervision to a clustering algorithm working on image features: simple in nature, limited in amount
Pairwise constraints
Usually, image classes are unknown a priori
The user cannot provide class labels but is able to tell whether two images should belong to the same class (must-link constraint) or to different classes (cannot-link constraint)
[Figure: example images (Cells, Castles, Falcons, Sunrise, Rhinoceros); legend: must-link, cannot-link]
Constraints let us indicate how the similarity space differs from the feature space
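To make the two kinds of constraints concrete, here is a small sketch of how they might be stored and checked against a candidate partition. The index pairs, the label list, and the helper name `count_violations` are all hypothetical, chosen only for illustration.

```python
def count_violations(labels, must_link, cannot_link):
    """Count the constraints broken by a hard partition given as a label list."""
    broken_ml = sum(labels[i] != labels[j] for i, j in must_link)
    broken_cl = sum(labels[i] == labels[j] for i, j in cannot_link)
    return broken_ml, broken_cl

# Hypothetical user feedback on six images, identified by index.
must_link = [(0, 1), (2, 3)]      # "these two images belong to the same class"
cannot_link = [(1, 4), (3, 5)]    # "these two images belong to different classes"

labels = [0, 0, 1, 2, 1, 2]       # one candidate clustering of the six images
print(count_violations(labels, must_link, cannot_link))   # prints (1, 1)
```

Here the partition breaks the must-link pair (2, 3) and the cannot-link pair (3, 5); a constrained clustering method would penalize both.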
K-Means like approaches
K-Means-like algorithms are the most widely used heuristics
[Figure: Initialization, Expectation step, Maximization step, Final partition]
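The four stages named on this slide can be sketched in a few lines of NumPy. This is a plain K-Means written for illustration, not the implementation behind the slides; the function name `kmeans`, its parameters, and the toy data are assumptions.

```python
import numpy as np

def kmeans(X, n_clusters, n_iter=100, seed=0):
    """Plain K-Means: alternate assignment and prototype update until stable."""
    rng = np.random.default_rng(seed)
    # Initialization: pick distinct data points as starting prototypes.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Expectation-like step: assign each vector to its nearest prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Maximization-like step: each prototype becomes the mean of its vectors
        # (an empty cluster keeps its previous prototype).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break                              # final partition reached
        centers = new_centers
    return labels, centers

# Toy data (assumption): two well-separated groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels, centers = kmeans(X, n_clusters=2)
print(labels, centers.ravel())
```

Note that the number of clusters has to be passed in explicitly, which is the limitation competitive agglomeration is meant to lift.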
K-Means like approaches
K-Means
Aim: Categorize the vectors to get cluster prototypes
Method: Minimize the squared error criterion [equation not recovered]
Fuzzy C-Means
Aim: Categorize the vectors to get cluster prototypes, with each vector being a member of each cluster with a certain degree
Method: Minimize [equation not recovered]
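For reference, the two criteria in their standard textbook form are given below. The notation is an assumption, since the slide's own symbols did not survive: x_i (i = 1..N) are the vectors, \mu_j (j = 1..C) the cluster prototypes, u_{ij} the membership degree of x_i in cluster j, and m > 1 the fuzzifier.

```latex
% K-Means: squared error criterion (each vector belongs to exactly one cluster)
J_{\text{K-Means}} = \sum_{j=1}^{C} \; \sum_{x_i \in \text{cluster } j} \lVert x_i - \mu_j \rVert^2

% Fuzzy C-Means: memberships are degrees in [0, 1] that sum to 1 for each vector
J_{\text{FCM}} = \sum_{j=1}^{C} \sum_{i=1}^{N} (u_{ij})^{m} \, \lVert x_i - \mu_j \rVert^2,
\qquad u_{ij} \in [0, 1], \quad \sum_{j=1}^{C} u_{ij} = 1 \ \text{ for every } i
```

With hard memberships (u_{ij} equal to 0 or 1), the second criterion reduces to the first.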
Background: Competitive agglomeration
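CA [Frigui & Krishnapuram, 97]
Aim: Categorize the vectors to get cluster prototypes without specifying the number of clusters a priori
Method: Minimize [equation not recovered]
constrained by: [equation not recovered]
with, at iteration k: [equation not recovered]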
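The lost equations can be sketched from the published CA formulation, to the best of my recollection; they should be checked against the cited paper. The notation is assumed: x_i (i = 1..N) are the vectors, \mu_j (j = 1..C) the prototypes, u_{ij} the memberships, d the distance, \beta the weight of the agglomeration term (the symbol is my choice), and \eta_0, \tau the amplitude and time constant of its decay.

```latex
% Objective: FCM-like compactness term minus a term that rewards large clusters
J_{\text{CA}} = \sum_{j=1}^{C} \sum_{i=1}^{N} (u_{ij})^{2} \, d^{2}(x_i, \mu_j)
              \;-\; \beta \sum_{j=1}^{C} \Bigl( \sum_{i=1}^{N} u_{ij} \Bigr)^{2}

% Constraint on the memberships
\sum_{j=1}^{C} u_{ij} = 1 \quad \text{for } i = 1, \dots, N

% Weight at iteration k, computed from the memberships and distances of iteration k-1
\beta(k) = \eta_0 \, e^{-k/\tau} \;
  \frac{\sum_{j=1}^{C} \sum_{i=1}^{N} (u_{ij})^{2} \, d^{2}(x_i, \mu_j)}
       {\sum_{j=1}^{C} \bigl( \sum_{i=1}^{N} u_{ij} \bigr)^{2}}
```

The second term makes clusters compete for vectors: clusters that lose the competition shrink and can be dropped, which is how the number of clusters is found rather than specified.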
Competitive agglomeration
Assets
Number of clusters no longer needs to be specified explicitly
Can be adapted to retrieve variously shaped clusters by using different metrics
Low computational complexity (suitable for real-world clustering applications)
Weaknesses
Cannot incorporate semantic information
Resulting categories often do not reflect user expectations on real databases
Pairwise Constrained Competitive Agglomeration
Consider the set of must-link pairs and the set of cannot-link pairs
Using the same notations as for CA, PCCA minimizes: [equation not recovered]
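As a sketch, the PCCA objective as I recall it from the published formulation is given below; it should be verified against the PCCA paper cited in the results section. The notation is assumed: \mathcal{M} and \mathcal{C} denote the must-link and cannot-link sets, x_i (i = 1..N) the vectors, \mu_j (j = 1..C) the prototypes, u_{ij} the memberships, \alpha the weight of the constraint term, and \beta the weight of the agglomeration term.

```latex
J_{\text{PCCA}} =
  \sum_{j=1}^{C} \sum_{i=1}^{N} (u_{ij})^{2} \, d^{2}(x_i, \mu_j)
  \;+\; \alpha \Biggl(
      \sum_{(x_i, x_l) \in \mathcal{M}} \; \sum_{j=1}^{C} \sum_{\substack{j'=1 \\ j' \neq j}}^{C} u_{ij} \, u_{lj'}
      \;+\; \sum_{(x_i, x_l) \in \mathcal{C}} \; \sum_{j=1}^{C} u_{ij} \, u_{lj}
    \Biggr)
  \;-\; \beta \sum_{j=1}^{C} \Bigl( \sum_{i=1}^{N} u_{ij} \Bigr)^{2},
\qquad \sum_{j=1}^{C} u_{ij} = 1 \ \text{ for every } i
```

The middle term is the only addition with respect to CA: it grows when a must-link pair has membership mass in different clusters or a cannot-link pair has membership mass in the same cluster.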
PCCA: memberships update
The membership of a vector to a cluster is: [equation not recovered]
Clusters with low cardinality are discarded.
Performance measures
Traditional information retrieval measures are adapted for evaluating clustering by considering pairs of points
The F-measure (Van Rijsbergen, 79) is then the harmonic mean of pairwise precision P and recall R: F = 2PR / (P + R), where P and R are computed over pairs of points
Methods compared: PCCA [Grira & al, IEE Proceedings - Vision, Image & Signal Processing, 2005], the basic CA algorithm, and PCKmeans [Basu & al 2001]
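The pairwise measures can be sketched as follows. The slide's own definitions of P and R were not preserved, so this assumes the usual pairwise definitions: P is the fraction of pairs placed in the same cluster that also share a ground-truth class, and R is the fraction of same-class pairs that end up in the same cluster. The function name and the toy labels are hypothetical.

```python
from itertools import combinations

def pairwise_scores(true_labels, pred_labels):
    """Pairwise precision, recall and F-measure of a clustering against ground truth."""
    same_both = same_pred = same_true = 0
    for i, j in combinations(range(len(true_labels)), 2):
        in_true = true_labels[i] == true_labels[j]   # pair shares a class
        in_pred = pred_labels[i] == pred_labels[j]   # pair shares a cluster
        same_true += in_true
        same_pred += in_pred
        same_both += in_true and in_pred
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

true_labels = [0, 0, 0, 1, 1, 1]   # toy ground truth (assumption)
pred_labels = [0, 0, 1, 1, 1, 1]   # toy clustering with one misplaced point
p, r, f = pairwise_scores(true_labels, pred_labels)
print(p, r, f)
```

On this toy example, 7 pairs share a cluster, 6 pairs share a class, and 4 pairs do both, so P = 4/7, R = 4/6 and F = 8/13.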
PCCA results on IRIS
IRIS benchmark: 3 non-spherical classes of 50 instances each; only one class is linearly separable from the other two
[Figure panels: 0 constraints, 20 constraints, 50 constraints]
PCCA results on IRIS
[Figure: comparison plot; recoverable labels: CA, Kmeans]
PCCA results on BS8
[Figure panels: 0 constraints, 100 constraints, 500 constraints]
PCCA results on BS8
[Figure: comparison plot; recoverable labels: CA, Kmeans]
PCCA results on BS8
Without supervision, the «Castle» cluster is split into two clusters
Some of the collected constraints succeeded in merging these two clusters
AFCC results on BS8
[Figure: comparison plot; recoverable labels: CA, Kmeans]
Application to a scientific database
Arabidopsis database
[Figure: comparison plot; recoverable labels: CA, Kmeans]
Images provided by NASC (http://arabidopsis.info), ground truth by INRA (http://www.inra.fr)
Application to a scientific database
Arabidopsis database
[Figure panels: 0 constraints, 50 constraints (AFCC)]
Application to a scientific database
Visual summary with 50 constraints
Conclusion
I have presented a methodology for providing relevant image overviews, based on semi-supervised clustering algorithms that:
Bring categorization much closer to user expectations by taking into account simple semantic information
Keep the number of constraints low enough to be an interesting alternative for the categorization of image databases