Semi-supervised clustering in image databases

Nizar Grira NII 1-11-06

Main interest at NII

Data Mining (Knowledge Discovery in Databases)

Why do we need data mining?

- Technologies such as the Internet have given a great boost to information collection and sharing
- We are drowning in data, but starving for knowledge
- How can I analyze this data?

Thursday, November 02, 2006


Data Mining

- Data mining: discovering interesting (non-trivial, useful, previously unknown) patterns in data

[Figure: stages of the knowledge discovery process: Database, Data preprocessing, Data mining (classification, clustering, …), Pattern presentation and evaluation, Knowledge]

Outline

- Workspace setting
- Categorization by Pairwise Constraints
  - Background: Competitive Agglomeration
  - Contributions: Fuzzy Clustering with Pairwise Constraints
- Results and discussion
- Conclusion


Image database mining problem

- Challenging situation: a huge number of images in an unexplored database
- How can the content of such a collection be explored?
  → Generate a good overview of the image database


Purpose

- Provide effective access to the image database through a meaningful categorization
  → Rely on a machine learning method working on image features
  → Benefit from external knowledge that the user can provide


Outline

- Workspace setting
- Categorization by Pairwise Constraints
  - Background: Competitive Agglomeration
  - Contributions: Fuzzy Clustering with Pairwise Constraints
- Results and discussion
- Conclusion


Test databases

- CO43: the 43 most difficult classes from the COIL-100 database
- BS8: high-level semantic database (8 categories)


Image feature space

- Intensity/color: basic and weighted histograms (hsv_hist 120 dimensions, lapl_rgb 216, prob_rgb 216)
- Shape and structure: histogram based on the Hough transform (hough_sig 49 dimensions)
- Texture: feature vectors based on the Fourier transform (four_sig 40 dimensions)

→ Discriminant signatures, complementary features


Challenging conditions

- High-dimensional feature space
- Unknown number of categories
- Natural categories have various shapes and overlap
- Visual descriptors sometimes fail to retrieve the categories the user expects


Pre-processing of the feature space

- We reduce the dimensionality by linear principal component analysis (PCA)

[Figure: precision-recall diagrams for CO43 and BS8, each with two curves, "pca" and "all"; axes: Recall, Precision]

→ After reducing the dimension by a factor of about 5, we remain within a 5% overall loss of quality in the precision-recall diagrams.
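The dimensionality reduction described on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used for the talk: the function name `pca_reduce`, the random stand-in data and the target dimension of 24 (about one fifth of 120) are assumptions made for the example.

```python
import numpy as np

def pca_reduce(features, target_dim):
    """Project row vectors onto their first `target_dim` principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:target_dim].T
    variance = singular_values ** 2
    retained = variance[:target_dim].sum() / variance.sum()
    return projected, float(retained)

rng = np.random.default_rng(0)
# Hypothetical stand-in for image signatures: 200 images, 120 dimensions.
features = rng.normal(size=(200, 120)) @ rng.normal(size=(120, 120))
reduced, retained = pca_reduce(features, target_dim=24)
print(reduced.shape, "fraction of variance retained:", round(retained, 3))
```

The retained-variance ratio is only a proxy; the slide measures the loss on precision-recall diagrams instead.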


Outline

- Workspace setting
- Categorization by Pairwise Constraints
  - Background: Competitive Agglomeration
  - Our Contributions: Fuzzy Clustering with Pairwise Constraints
- Conclusion and perspectives


Categorization methods: which one is more appropriate?

- In many cases, knowledge in the form of labels for the relevant categories is not available (labels are hard to generate)
  → Classification is not appropriate
- In clustering, we cannot incorporate specific knowledge, and the structure found in the data often does not reflect user expectations
  → Clustering is not appropriate
- Therefore, semi-supervised learning is important as a balance between these two approaches


Semi-supervised learning

- Use a few labeled items together with the unlabeled ones to find the boundaries between classes (transduction)
- What if there are many classes, most of them unknown a priori?
  → Add simple supervision to a clustering algorithm working on image features
    - simple in nature
    - limited in amount


Pairwise constraints

- Usually, image classes are unknown a priori
- The user cannot provide class labels but is able to tell whether two images should belong to the same class (must-link constraint) or to different classes (cannot-link constraint)
  → Constraints let us indicate how the similarity space differs from the feature space

[Figure: example images (Cells, Castles, Falcons, Sunrise, Rhinoceros) connected by must-link and cannot-link constraints]
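One simple way to represent must-link and cannot-link constraints in code is as lists of index pairs. The sketch below counts how many constraints a given hard labeling breaks; the function name `count_violations` and the toy data are hypothetical and only illustrate the idea.

```python
def count_violations(labels, must_link, cannot_link):
    """Count the constraints that a hard cluster labeling fails to satisfy."""
    # A must-link pair is broken when the two images land in different clusters.
    broken_must = sum(1 for i, j in must_link if labels[i] != labels[j])
    # A cannot-link pair is broken when the two images share a cluster.
    broken_cannot = sum(1 for i, j in cannot_link if labels[i] == labels[j])
    return broken_must, broken_cannot

labels = [0, 0, 1, 1, 2]            # cluster index of each image
must_link = [(0, 1), (1, 2)]        # pairs the user wants together
cannot_link = [(0, 4), (2, 3)]      # pairs the user wants apart
print(count_violations(labels, must_link, cannot_link))  # (1, 1)
```

Here the pair (1, 2) breaks a must-link constraint and the pair (2, 3) breaks a cannot-link constraint.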

K-Means-like approaches

- K-Means-like algorithms are the most widely used heuristics

[Figure: four panels showing Initialization, Expectation step, Maximization step, Final partition]
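The alternation between the two steps named on this slide can be sketched with plain K-Means. This is a generic illustration under assumed choices (random initialization from the data, squared Euclidean distance, helper names `assign` and `kmeans`), not the exact variant used in the talk.

```python
import numpy as np

def assign(points, prototypes):
    """Assignment ("expectation") step: index of the nearest prototype."""
    d2 = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(points, n_clusters, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: pick distinct data points as starting prototypes.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    prototypes = points[start].astype(float)
    for _ in range(n_iter):
        labels = assign(points, prototypes)
        # Update ("maximization") step: move each prototype to its members' mean.
        updated = prototypes.copy()
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:        # leave an empty cluster where it is
                updated[k] = members.mean(axis=0)
        if np.allclose(updated, prototypes):
            break
        prototypes = updated
    # Final partition, consistent with the returned prototypes.
    return prototypes, assign(points, prototypes)

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(10, 0.3, (30, 2))])
prototypes, labels = kmeans(points, n_clusters=2)
print(prototypes.round(1), np.bincount(labels))
```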

K-Means-like approaches

K-Means
- Aim: categorize the vectors to obtain cluster prototypes
- Method: minimize the squared error criterion [equation not recovered]

Fuzzy C-Means
- Aim: categorize the vectors to obtain cluster prototypes, with each vector being a member of each cluster to a certain degree
- Method: minimize [equation not recovered]
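The slide's own equations are not preserved here. As a reminder, the textbook Fuzzy C-Means criterion is the sum over vectors i and clusters j of u_ij^m d²(x_i, c_j), with the memberships of each vector summing to 1. The sketch below implements that standard form; the fuzzifier `m = 2`, the squared Euclidean distance and the function name are assumptions, not details taken from the talk.

```python
import numpy as np

def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, seed=0, eps=1e-12):
    """Standard FCM: alternate prototype and membership updates."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalised to sum to 1.
    u = rng.random((len(points), n_clusters))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        weights = u ** m
        # Prototypes are membership-weighted means of the vectors.
        prototypes = (weights.T @ points) / weights.sum(axis=0)[:, None]
        d2 = ((points[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2) + eps
        # u_ij is proportional to d_ij^(-2/(m-1)), normalised over clusters.
        inv = d2 ** (-1.0 / (m - 1.0))
        new_u = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(new_u - u).max() < 1e-6
        u = new_u
        if converged:
            break
    return prototypes, u

rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(10, 0.3, (30, 2))])
prototypes, u = fuzzy_c_means(points, n_clusters=2)
print(prototypes.round(1))
```

Unlike K-Means, each vector keeps a graded membership in every cluster, which is what the constraint terms of the later slides act on.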


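Background: Competitive agglomeration

CA [Frigui & Krishnapuram, 97]

- Aim: categorize the vectors to obtain cluster prototypes without specifying the number of clusters a priori
- Method: minimize [equation not recovered], constrained by [equation not recovered], with, at iteration k, [equation not recovered]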
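The equations of this slide are not preserved. For orientation, the form in which the CA criterion is usually stated in the literature is given below; the symbols (N vectors x_i, C prototypes β_j, memberships u_ij, constants η_0 and τ) are assumed notation and may differ from the slide's.

```latex
\begin{align}
J &= \sum_{j=1}^{C} \sum_{i=1}^{N} u_{ij}^{2}\, d^{2}(x_i, \beta_j)
     \;-\; \alpha \sum_{j=1}^{C} \Big[ \sum_{i=1}^{N} u_{ij} \Big]^{2},
  \qquad \text{subject to } \sum_{j=1}^{C} u_{ij} = 1 \ \text{ for } i = 1, \dots, N \\
\alpha(k) &= \eta_0 \exp(-k/\tau)\;
  \frac{\sum_{j=1}^{C} \sum_{i=1}^{N} \big(u_{ij}^{(k-1)}\big)^{2}\, d^{2}\big(x_i, \beta_j^{(k-1)}\big)}
       {\sum_{j=1}^{C} \Big[ \sum_{i=1}^{N} u_{ij}^{(k-1)} \Big]^{2}}
\end{align}
```

The first term is the fuzzy within-cluster error; the second rewards large cluster cardinalities, so clusters compete for vectors and small ones fade out. The decaying factor α(k) balances the two terms over the iterations.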


Competitive agglomeration

Assets
- The number of clusters no longer needs to be specified explicitly
- Can be adapted to retrieve variously shaped clusters by using different metrics
- Low computational complexity (suitable for real-world clustering applications)

Weaknesses
- Cannot incorporate semantic information
  → on real databases, the resulting categories often do not reflect user expectations


Pairwise Constrained Competitive Agglomeration (PCCA)

- Two sets of pairs are given: the set of must-link pairs and the set of cannot-link pairs
- Using the same notation as for CA, PCCA minimizes: [equation not recovered]
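The PCCA objective itself is not preserved on this slide. One common way to penalize constraint violations with fuzzy memberships, in the spirit of the cited PCCA paper, is sketched below. The exact weighting inside the full objective is not reproduced, and the function name and toy memberships are hypothetical.

```python
import numpy as np

def constraint_penalty(u, must_link, cannot_link):
    """Fuzzy cost of violated constraints; u has shape (n_points, n_clusters)."""
    penalty = 0.0
    for i, j in must_link:
        # Membership mass that puts i and j in *different* clusters:
        # sum over k != l of u[i, k] * u[j, l].
        penalty += u[i].sum() * u[j].sum() - u[i] @ u[j]
    for i, j in cannot_link:
        # Membership mass that puts i and j in the *same* cluster.
        penalty += u[i] @ u[j]
    return float(penalty)

# Three points, two clusters, crisp memberships for readability.
u = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
print(constraint_penalty(u, must_link=[(0, 1)], cannot_link=[(0, 2)]))  # 0.0
print(constraint_penalty(u, must_link=[(0, 2)], cannot_link=[(0, 1)]))  # 2.0
```

A term of this kind, added to the CA criterion with a weighting factor, pulls the memberships toward partitions that respect the user's constraints.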


PCCA: memberships update

- The membership of a vector to a cluster is: [equation not recovered]
- Clusters with low cardinality are discarded.
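The discarding step can be sketched as follows, taking the cardinality of a cluster to be the sum of its fuzzy memberships. The threshold value, the renormalization of the remaining memberships and the function name `discard_small_clusters` are assumptions made for this illustration.

```python
import numpy as np

def discard_small_clusters(u, min_cardinality):
    """Drop clusters whose fuzzy cardinality (column sum of u) is too low."""
    cardinality = u.sum(axis=0)
    keep = cardinality >= min_cardinality
    if not keep.any():
        # Always keep at least the largest cluster.
        keep[cardinality.argmax()] = True
    kept = u[:, keep]
    row_sums = kept.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Renormalise so each vector's memberships sum to 1 again; a vector left
    # with no membership at all gets uniform memberships.
    kept = np.where(row_sums > 0, kept / safe_sums, 1.0 / kept.shape[1])
    return kept, keep

u = np.array([[0.90, 0.05, 0.05],
              [0.80, 0.10, 0.10],
              [0.10, 0.10, 0.80],
              [0.05, 0.05, 0.90]])
new_u, keep = discard_small_clusters(u, min_cardinality=0.5)
print(keep, new_u.round(3))
```

In this toy example the middle cluster has cardinality 0.3 and is removed, so the algorithm continues with two clusters.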

Outline

- Workspace setting
- Categorization by Pairwise Constraints
  - Background: Competitive Agglomeration
  - Contributions: Fuzzy Clustering with Pairwise Constraints
- Results and discussion
- Conclusion


Performance measures

- Traditional information retrieval measures are adapted to the evaluation of clustering by considering pairs of points
- The F-measure (Van Rijsbergen, 79) is then the harmonic mean of pairwise precision P and pairwise recall R, where: [definitions of P and R not recovered]
- Methods compared: PCCA [Grira & al, IEE Proceedings - Vision, Image & Signal Processing, 2005], the basic CA algorithm, and PCKmeans [Basu & al 2001]
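The slide's definitions of P and R are not preserved. The sketch below uses the usual pairwise definitions, which are an assumption here: precision is the fraction of same-cluster pairs that also share a class, and recall is the fraction of same-class pairs that also share a cluster.

```python
from itertools import combinations

def pairwise_f_measure(true_labels, cluster_labels):
    """Pairwise precision, recall and F-measure of a clustering."""
    same_cluster = same_class = both = 0
    for i, j in combinations(range(len(true_labels)), 2):
        in_cluster = cluster_labels[i] == cluster_labels[j]
        in_class = true_labels[i] == true_labels[j]
        same_cluster += in_cluster
        same_class += in_class
        both += in_cluster and in_class
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_class if same_class else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

true_labels = [0, 0, 1, 1]
cluster_labels = [0, 0, 0, 1]
precision, recall, f_measure = pairwise_f_measure(true_labels, cluster_labels)
print(precision, recall, f_measure)
```

In this example three pairs share a cluster, two pairs share a class, and one pair does both, so P = 1/3, R = 1/2 and F = 0.4.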


PCCA results on IRIS

- IRIS benchmark: 3 non-spherical classes of 50 instances each; only one class is linearly separable from the other two

[Figure: PCCA results on IRIS with 0, 20 and 50 constraints]


PCCA results on IRIS

[Figure: results plot; legend entries recovered: CA, Kmeans]


PCCA results on BS8

[Figure: PCCA results on BS8 with 0, 100 and 500 constraints]


PCCA results on BS8

[Figure: results plot; legend entries recovered: CA, Kmeans]


PCCA results on BS8

- Without supervision, the "Castle" cluster is split into two clusters
- Some of the collected constraints succeeded in merging these two clusters


AFCC results on BS8

[Figure: results plot; legend entries recovered: CA, Kmeans]


Application to a scientific database

- Arabidopsis database

[Figure: results plot; legend entries recovered: CA, Kmeans]

Images provided by NASC (http://arabidopsis.info), ground truth by INRA (http://www.inra.fr)


Application to a scientific database

- Arabidopsis database

[Figure: results on the Arabidopsis database with 0 constraints and with 50 constraints (AFCC)]

Application to a scientific database

[Figure: visual summary with 50 constraints]


Outline

- Workspace setting
- Categorization by Pairwise Constraints
  - Background: Competitive Agglomeration
  - Contributions: Fuzzy Clustering with Pairwise Constraints
- Results and discussion
- Conclusion


Conclusion

I have presented a methodology for providing relevant image overviews, based on semi-supervised clustering algorithms that:

- bring the categorization much closer to user expectations by taking simple semantic information into account
- keep the number of constraints low enough to be an interesting alternative for the categorization of image databases
