A Combinatorial Approach to Search and Clustering

Michael Houle National Institute of Informatics

26 April 2007


Overview

I.   Shared-Neighbor Clustering
II.  The Relevant-Set Correlation Model
     a. Association Measures
     b. Significance of Association
     c. Partial Significance and Reshaping
III. The GreedyRSC Heuristic
IV.  Experimental Results
V.   Extensions and Applications
     a. Query Result Clustering
     b. Outlier Detection
     c. Feature Set Evaluation and Selection

What is Clustering?

Clustering is:
- The organization of data into well-differentiated groups of highly-similar items.
- A form of unsupervised learning.
- A fundamental operation in data mining & knowledge discovery.
- An important tool in the design of efficient algorithms and heuristics.
- Closely related to search & retrieval.

Clustering Paradox

Clustering models and methods traditionally make assumptions about the nature of the data:
- Data representation.
- Similarity measures.
- Data distribution.
- Cluster numbers, sizes and/or densities.
- Definition of noise.

… but cluster analysis seeks to discover the nature of the data!

Shared-Neighbor Clustering

Similarity measures not fully trusted?
- Curse of dimensionality: the concentration effect.
- Variations in density.
- Lack of objective meaning.

Shared-neighbor information:
- "If two items have many neighbors in common, they are probably closely related."
- The similarity measure is used primarily for ranking.
- Adaptive to variations in density.

Shared-Neighbor Clustering Methods (1)

Jarvis-Patrick (1973):
- Hierarchical clustering heuristic.
- Single-linkage merge criterion.
- Fixed-cardinality neighborhoods.
- Merge threshold t.
- Merge if there exists a pair a, b such that:
  - a and b are k-NNs of one another; and
  - the intersection of their k-NN lists contains at least tk items.

(A merge test in this style is sketched below.)
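A minimal sketch of this merge test in Python, assuming the k-NN lists have been precomputed as sets; the names jp_should_merge and knn are illustrative, not from the talk:

    def jp_should_merge(a, b, knn, t):
        """Jarvis-Patrick merge test: a and b must appear in one
        another's k-NN lists, and those lists must share at least
        t*k items (k is the fixed neighbourhood cardinality)."""
        k = len(knn[a])
        if b not in knn[a] or a not in knn[b]:
            return False
        return len(knn[a] & knn[b]) >= t * k

    # Toy usage: four items, k = 3.
    knn = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
    print(jp_should_merge(0, 1, knn, t=0.5))   # True: lists share {2, 3}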

Shared-Neighbor Clustering Methods (2)

ROCK (Guha, Rastogi, Shim 2000):
- Hierarchical clustering heuristic.
- Fixed-radius neighborhoods.
- Pairwise linkage defined as the size of the intersection of neighborhoods.
- Merge if the total (size-weighted) inter-cluster linkage is maximized.

Shared-Neighbor Clustering Methods (3)

SNN (Ertöz, Steinbach, Kumar 2003), based on DBSCAN (1996):
- DBSCAN: density over fixed-radius neighborhoods.
- Core points: density exceeding a supplied threshold.
- Merging: if one core point is contained in the neighborhood of another.
- SNN: DBSCAN with fixed-cardinality neighborhoods.
- Similarity: intersection size of fixed-cardinality neighborhoods.

Drawbacks of Shared-Neighbor Clustering

Fixed k-NNs:
- Bias towards clusters of size on the order of k.
- Examples: Jarvis-Patrick, SNN.
- How to choose k?

Fixed-radius neighborhoods:
- Bias towards clusters of larger density.
- Example: ROCK.
- How to choose the radius?

The clustering thus depends on parameters that make implicit assumptions regarding the data.

Desiderata for Clustering

Fully automated clustering:
- A similarity measure, but used strictly for ranking.
- Otherwise, no knowledge of the data distribution.
- Parameters must have a domain-independent interpretation.
- Automatic determination of the number of clusters and of cluster sizes.

Other desiderata:
- Scalable heuristics.
- Adaptivity to variations in density.
- Handling of cluster overlap.


Query-Based Clustering

How can we cluster when the nature of the data is hidden?
- No pairwise (dis)similarity measure?
- Only assumption: relevancy rankings for queries-by-example.

Q(q, k): the ranked relevant set for query item q, with |Q(q, k)| = k.

Clusters will be patterned on the query relevant sets Q(q, k) for some q in S.

Confidence

Two sets A and B are related according to their degree of overlap.

A natural measure is confidence (inspired by Association Rule Mining, and related to the Jarvis-Patrick merge criterion):

    0 \le \mathrm{conf}(A, B) = \frac{|A \cap B|}{|A|} \le 1

Interpretation of conf: precision & recall (as in IR).
- Query result Q for concept set C.
- Precision is conf(Q, C).
- Recall is conf(C, Q).

Mutual Confidence

A symmetric measure, mutual confidence:

    0 \le \mathrm{MC}(A, B) = \sqrt{\mathrm{conf}(A, B) \cdot \mathrm{conf}(B, A)} = \frac{|A \cap B|}{\sqrt{|A| \cdot |B|}} \le 1

Interpretation of MC: the cosine of the angle between set vectors.
- If item j is a member, the j-th coordinate equals 1.
- Otherwise, the j-th coordinate equals 0.
- cos^{-1}(MC(A, B)) is a distance metric.

Set Correlation

The Pearson correlation formula:

    r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}}

Applying this to the coordinate pairs of the set vectors for A, B ⊂ S gives the set correlation between A and B:

    R(A, B) = \frac{|S| \, |A \cap B| - |A| \cdot |B|}{\sqrt{|A| \cdot |B| \, (|S| - |A|)(|S| - |B|)}}

R tends to the cosine similarity measure when A and B are small relative to S.
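These three association measures translate directly into code over Python sets; a minimal sketch (the function names are mine, not from the talk):

    import math

    def conf(A, B):
        """Confidence: the fraction of A that also lies in B."""
        return len(A & B) / len(A)

    def mutual_conf(A, B):
        """Mutual confidence: cosine of the angle between set vectors."""
        return len(A & B) / math.sqrt(len(A) * len(B))

    def set_corr(A, B, S):
        """Set correlation: Pearson correlation of the 0-1 set
        vectors of A and B over the domain S."""
        s, a, b = len(S), len(A), len(B)
        return (s * len(A & B) - a * b) / math.sqrt(a * b * (s - a) * (s - b))

    S = set(range(10))
    A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
    print(conf(A, B))         # 0.5
    print(mutual_conf(A, B))  # 0.5
    print(set_corr(A, B, S))  # 0.1666...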

Intra-Set Association

How do we measure the goodness of a cluster candidate C?
- No pairwise similarity measure is available!

First-order criterion: if v belongs to C, then:
- The items relevant to v should belong to C.
- That is, R(C, Q(v, |C|)) should be high.

Second-order criterion: if v and w belong to C, then:
- The items relevant to v should coincide with those relevant to w.

(Only the first-order criterion is discussed here.)

Self-Confidence

Measure: self-confidence, the average mutual confidence between a set and the (same-sized) relevant sets of its members. Denoted SC(A), where

    \mathrm{SC}(A) = \frac{1}{|A|} \sum_{v \in A} \mathrm{MC}(A, Q(v, |A|)) = \frac{1}{|A|^2} \sum_{v \in A} |A \cap Q(v, |A|)|

Related to the SNN density criterion.

Self-Correlation

Measure: self-correlation, the average set correlation between a set and the (same-sized) relevant sets of its members. Denoted SR(A), where

    \mathrm{SR}(A) = \frac{1}{|A|} \sum_{v \in A} R(A, Q(v, |A|)) = \frac{|S| \cdot \mathrm{SC}(A) - |A|}{|S| - |A|}
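Both measures need only the relevant-set oracle Q; a minimal sketch, where Q(v, k) is assumed to return the k-item relevant set of v:

    def self_confidence(A, Q):
        """SC(A) = (1/|A|^2) * sum over v in A of |A ∩ Q(v, |A|)|."""
        a = len(A)
        return sum(len(A & Q(v, a)) for v in A) / a ** 2

    def self_correlation(A, Q, S):
        """SR(A) = (|S| * SC(A) - |A|) / (|S| - |A|)."""
        s, a = len(S), len(A)
        return (s * self_confidence(A, Q) - a) / (s - a)

    # Toy oracle: items are points on a line; Q(v, k) returns the k
    # nearest items (an illustrative stand-in for a real ranking).
    points = list(range(10))
    def Q(v, k):
        return set(sorted(points, key=lambda u: abs(u - v))[:k])

    A, S = {0, 1, 2, 3}, set(points)
    print(self_confidence(A, Q), self_correlation(A, Q, S))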

Significance & Size

Which aggregation of points is more significant?
- SC(A) = 0.8525, SR(A) = 0.815625.
- SC(B) = 1.0, SR(B) = 1.0.
- SC(C) = 0.45, SR(C) ≈ 0.3888889.

Set size must be considered. Note that the proper interpretation of a Pearson correlation requires a test of significance.


Randomness Hypothesis

What if every query relevant set (QRS) were independently selected uniformly at random?
- The size of the intersection between a QRS and a fixed set is distributed hypergeometrically.
- Its expectation and variance can be determined exactly.

If X = |A ∩ B|, where A is fixed and B is selected randomly from S, then:

    E(X) = \frac{|A| \cdot |B|}{|S|}
    \qquad
    \mathrm{Var}(X) = \frac{|A| \cdot |B| \, (|S| - |A|)(|S| - |B|)}{|S|^2 (|S| - 1)}
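These closed forms can be checked against the hypergeometric distribution in scipy (assuming scipy is installed); the parameter choices below are arbitrary:

    from scipy.stats import hypergeom

    s, a, b = 1000, 50, 80          # |S|, |A|, |B|
    X = hypergeom(M=s, n=a, N=b)    # |A ∩ B| for a random size-|B| set

    print(X.mean(), a * b / s)      # both 4.0
    print(X.var(),  a * b * (s - a) * (s - b) / (s ** 2 * (s - 1)))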

Significance & Standard Scores

Standard scores under the randomness hypothesis measure the deviation from randomness:
- Z_SC(A): the number of standard deviations of SC(A) from its expectation.
- Z_SR(A): the number of standard deviations of SR(A) from its expectation.
- The greater the standard score, the more significant the aggregation.

    Z_{\mathrm{SC}}(A) = \frac{\mathrm{SC}(A) - E(\mathrm{SC}(A))}{\sqrt{\mathrm{Var}(\mathrm{SC}(A))}}
    \qquad
    Z_{\mathrm{SR}}(A) = \frac{\mathrm{SR}(A) - E(\mathrm{SR}(A))}{\sqrt{\mathrm{Var}(\mathrm{SR}(A))}}

Intra-Set Significance E( A ∩ B )  E ( R ( A ,B ) ) = −  (S − A)(S − B) A ⋅B

S

  = ( S − A ) ( S − B ) 

S

A ⋅ B  S   A ⋅ B  =0 S  

A⋅B ⋅ − S A⋅B 1

2

S Var ( R ( A ,B ) ) = Var ( A ∩ B ) (S − A)(S − B) A ⋅B

(S − A)(S − B) A ⋅B = 1 S = ⋅ 2 (S − A)(S − B) A ⋅B S −1 S ( S − 1) 2

SR (A )− E ( SR (A )) Z SR (A ) = = Var ( SR (A ))

SR (A )− 1

A 26 April 2007

2

1 A

∑ E ( R ( A ,Q ( v, A ) ) ) v∈ A

∑ Var ( R ( A ,Q ( v, A ) ) )

=

A ( S − 1) SR (A )

v∈ A

M E Houle @ NII

23

Intra-Set Significance (continued)

Similarly for self-confidence:

    E(\mathrm{SC}(A)) = \frac{1}{|A|^2} \sum_{v \in A} E(|A \cap Q(v, |A|)|) = \frac{|A|}{|S|}

    \mathrm{Var}(\mathrm{SC}(A)) = \frac{1}{|A|^4} \sum_{v \in A} \mathrm{Var}(|A \cap Q(v, |A|)|) = \frac{(|S| - |A|)^2}{|A| \cdot |S|^2 (|S| - 1)}

One can show:

    Z_{\mathrm{SR}}(A) = Z_{\mathrm{SC}}(A) = \sqrt{|A| \, (|S| - 1)} \; \mathrm{SR}(A) \triangleq Z(A)

Z(A) is called the intra-set significance of the set A.
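The final closed form is a one-liner on top of SR; a minimal sketch, restating self_correlation from the earlier sketch so the block stands alone:

    import math

    def self_correlation(A, Q, S):
        """SR(A), as defined earlier."""
        s, a = len(S), len(A)
        sc = sum(len(A & Q(v, a)) for v in A) / a ** 2
        return (s * sc - a) / (s - a)

    def intra_set_significance(A, Q, S):
        """Z(A) = sqrt(|A| * (|S| - 1)) * SR(A): the number of
        standard deviations of SR(A) above its expectation (0)
        under the randomness hypothesis."""
        return math.sqrt(len(A) * (len(S) - 1)) * self_correlation(A, Q, S)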

Example  Set significances of A, B, C:  SC (A) = 0.8525,

SR (A) Z (A)  SC (B) SR (B) Z (B)  SC (C) SR (C) Z (C)

= ≈ = = ≈ = ≈ ≈

0.815625, 36.29. 1.0, 1.0, 22.25. 0.45, 0.3888889, 12.24.

 Z (A) > Z (B) > Z (C). 26 April 2007

M E Houle @ NII

25

Inter-Set Significance

Z(A, B): the inter-set significance of (the relationship between) A and B. Under the randomness hypothesis:

    E(R(A, B)) = 0
    \qquad
    \mathrm{Var}(R(A, B)) = \frac{1}{|S| - 1}

    E(\mathrm{MC}(A, B)) = \frac{\sqrt{|A| \cdot |B|}}{|S|}
    \qquad
    \mathrm{Var}(\mathrm{MC}(A, B)) = \frac{(|S| - |A|)(|S| - |B|)}{|S|^2 (|S| - 1)}

    Z(A, B) = \frac{R(A, B) - E(R(A, B))}{\sqrt{\mathrm{Var}(R(A, B))}} = \frac{\mathrm{MC}(A, B) - E(\mathrm{MC}(A, B))}{\sqrt{\mathrm{Var}(\mathrm{MC}(A, B))}} = \sqrt{|S| - 1} \; R(A, B)

For fixed S, inter-set significance is equivalent to the set correlation R(A, B).


Contributions to Significance

Some members contribute more than others towards the set significance Z(A).

The contribution of member v to SR(A):

    t(v \mid A) \triangleq \frac{1}{|A|} R(A, Q(v, |A|))

Potential contributions can also be considered for items v ∉ A.

Partial Significance

The standard score of t(v|A) with respect to the randomness hypothesis:

    Z(v \mid A) = \sqrt{|S| - 1} \; R(A, Q(v, |A|))

The significance of A can be expressed in terms of these partial significances:

    Z(A) = \frac{1}{\sqrt{|A|}} \sum_{v \in A} Z(v \mid A)

Set Reshaping (1)

Idea: modify A so as to boost its significance.

A new set A′ has average mutual correlation to A:

    \mathrm{SR}(A' \mid A) = \frac{1}{|A'|} \sum_{v \in A'} R(A, Q(v, |A|))

The significance of SR(A′|A) with respect to the randomness hypothesis:

    Z(A' \mid A) = \frac{1}{\sqrt{|A'|}} \sum_{v \in A'} Z(v \mid A)

Set Reshaping (2)

For a fixed size |A′|, Z(A′|A) is maximized by taking the items with the largest values of Z(v|A).

For this example:
- Z(A|A) = Z(A) = 36.29.
- Z(A′|A) = 37.18, the maximum, achieved at A′.

A thus serves as a pattern for the discovery of A′. (A sketch of this reshaping step appears below.)
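A minimal sketch of the reshaping step: since sqrt(|S| - 1) is a common factor, ranking items by R(A, Q(v, |A|)) is equivalent to ranking by Z(v|A); set_corr is restated from the earlier sketch:

    import math

    def set_corr(A, B, S):
        s, a, b = len(S), len(A), len(B)
        return (s * len(A & B) - a * b) / math.sqrt(a * b * (s - a) * (s - b))

    def reshape(A, Q, S, new_size):
        """Reshape pattern A: keep the new_size items of S with the
        largest partial significance Z(v|A), i.e. the largest
        R(A, Q(v, |A|)).  This maximizes Z(A'|A) for fixed |A'|."""
        a = len(A)
        return set(sorted(S, key=lambda v: set_corr(A, Q(v, a), S),
                          reverse=True)[:new_size])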

Partitioning

What if we need to assign item v to a single group? We want a group C for which both:
- the set significance is high; and
- the significance of the relationship with v is high.

In practice, one can choose the group C (reshaped from pattern C*) satisfying:

    \operatorname*{maximize}_{C^*} \; Z(v \mid C^*) \cdot Z(C \mid C^*)


Cluster Map Generation

- Nodes are sets having sufficiently high significance scores.
- Edges appear between set nodes having sufficiently high inter-set significances.
- Retained cluster candidates should not be too highly correlated with other retained candidates.
- The final clusters form an "independent set" within an initial candidate cluster map.

Candidate Map

- Nodes meet a minimum threshold on set significance.
- Edges meet a minimum threshold on inter-set significance (correlation).

- Nodes are ranked by significance (red is highest).
- Thick edges join nodes whose set correlations are too high.

- An independent node set is needed within the thick-edged subgraph.
- Heuristic: greedy by significance.

Cluster Map

- The final cluster map is typically disconnected.
- The cluster nodes form a rough cover of the data set.

(A sketch of the greedy selection step appears below.)
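A minimal sketch of the greedy independent-set step, assuming each candidate carries a precomputed significance and a pairwise correlation function (all names are mine):

    def select_clusters(candidates, significance, correlation, max_corr):
        """Scan candidates in decreasing order of significance; keep a
        candidate only if its correlation with every already-kept
        candidate stays at or below the 'thick edge' threshold."""
        kept = []
        for c in sorted(candidates, key=significance, reverse=True):
            if all(correlation(c, d) <= max_corr for d in kept):
                kept.append(c)
        return kept

With max_corr = 0.5, the threshold plays the role of the maximum normalized inter-set significance quoted in the experimental settings later.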

GreedyRSC

[Flowchart] DB Queries → Sample Relevant Sets → Candidate Patterns → Pattern Pruning → Member Assignment → Cluster Pruning → Cluster Map Generation

Scalability Issues

Problems:
- Curse of dimensionality: computing queries Q(q, k) can take time linear in |S| even when k is small.
- Computing Z(A) takes time quadratic in |A|.

Workarounds:
- Use approximate neighborhoods computed with the SASH search structure [H. '03, H. & Sakuma '05].
- Pattern generation over samples of varying sizes, with a fixed range for |A|.

Sampling Strategy

- Create bands of samples of sizes |S|, |S|/2, |S|/4, ….
- Within each sample, compute a candidate patch for each element via maximization of set significance.
- Use a fixed range of patch sizes a < k < b.
- Within each sample, select patterns greedily and eliminate duplicates.
- Reshape the patterns to form cluster candidates.
- Eliminate duplicate candidates to form the final cluster set.
- Generate the cluster map edges.

Overall Time Complexity

Without data partitioning:
- Precompute relevant sets: O(n log n) queries, where n = |S|.
- Compute a pattern for each of O(n log n) neighborhoods: O(b²) time each.
- Compute all patterns: O(b² n log n) time.
- Eliminate duplicate patterns: O((b² + σ²) n log n), where σ² is the average variance of inverted member list sizes over each sample.
- Form candidate clusters: bounded by O(b n log² n).
- Eliminate duplicate clusters & create the map: O((b² + τ²) n log n), where τ² is the average variance of inverted member list sizes over each sample.
- Total, excluding relevant set computation: O((b² + σ² + τ²) n log n + b n log² n).

Data partitioning (details omitted!) introduces a factor of c, the number of data chunks.


Clustering Parameters

For comparison of significance values, the scores can be normalized for convenience:
- A common factor dependent on |S| can be dropped.
- Normalizing inter-set significance yields the set correlation.
- Normalizing the square of the set significance yields:

    0 \le \frac{Z^2(A)}{|S| - 1} = |A| \cdot \mathrm{SR}^2(A) \le |A|

For all experiments:
- Minimum normalized squared set significance = 4.
- Maximum normalized inter-set significance = 0.5.
- Minimum normalized inter-set significance = 0.1 (for the cluster map).

Images

Amsterdam Library of Object Images (ALOI):
- Dense feature vectors, colour & texture histograms (prepared by INRIA-Rocquencourt).
- Number of vectors: 110,250.
- 641 features per vector.
- 5322 clusters computed in < 4 hours on a desktop (older, slower implementation).
- Maximum cluster size: 7201.
- Median cluster size: 12.
- Minimum cluster size: 4.
- SASH accuracy of ~96%.

Journal Abstracts

Medline medical journal abstracts, 1996 to mid-2003:
- Vectors with TF-IDF term weighting, NO dimensional reduction (prepared by IBM TRL).
- Number of vectors: 1,055,073.
- ~75 non-zero attributes per vector.
- Representational dimension: 1,101,003.
- 9789 clusters computed in < 24 hours on a 3.0 GHz desktop.
- Maximum cluster size: 15,255.
- Median cluster size: 45.
- Minimum cluster size: 4.
- SASH accuracy of 51%, 115× faster than sequential search.

Protein Sequences

Bacterial ORFs (protein sequences):
- Vectors of gapped BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003) (prepared by the DNA Data Bank of Japan).
- Number of vectors: 378,659.
- ~125 non-zero attributes per vector.
- Representational dimension: 40,000.
- Vector preparation: < 1 day on a 16-node PC cluster.
- 8907 clusters computed in 7 hours on a 3.0 GHz desktop.
- Maximum cluster size: 69,859.
- Median cluster size: 20.
- Minimum cluster size: 4.
- SASH accuracy of 75%, 34× faster than sequential search.

Demos



Query Result Clustering

"Pure" shared-neighbor clustering can be used to cluster the results of queries:
- Produce long ranked query result lists.
- The ranking function can remain hidden.
- The database must support queries-by-example.
- Otherwise, clustering can be performed without the cooperation of the database manager.

Example at NII:
- WEBCAT Plus library database.
- GETA search engine.
- A WEBCAT Plus QRC tool is currently under development (with N. Grira).

Query Result Clustering (continued)

Adapting RSC for QRC:
- The database size may not be known.
- The self-correlation and significance formulas are then undefined.
- Approximation: assume an infinite database size.
- As |S| tends to infinity, the normalized squared significance tends to:

    \frac{Z^2(A)}{|S| - 1} = |A| \cdot \mathrm{SR}^2(A) \to |A| \cdot \mathrm{SC}^2(A)

- This limit formula does not depend on |S|.
- Computation proceeds as per the GreedyRSC heuristic. (A sketch of the limit score appears below.)
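A minimal sketch of the limit score; it needs only the relevant-set oracle Q, not the database size:

    def qrc_significance(A, Q):
        """Normalized squared significance in the |S| -> infinity
        limit: |A| * SC(A)^2, independent of |S|."""
        a = len(A)
        sc = sum(len(A & Q(v, a)) for v in A) / a ** 2
        return a * sc * sc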


Outlier Detection

In the RSC model:
- Patterns of low significance can indicate the presence of outliers.
- These can be detected as per the initial stages of GreedyRSC.
- Many potential definitions are possible.

Some possible formulations (to be minimized over v):

    Z(Q(v, k)) \qquad \mathrm{SR}(Q(v, k)) \qquad \max_{1 \le i \le k} Z(Q(v, i)) \qquad \max_{1 \le i \le k} \mathrm{SR}(Q(v, i))

Work in progress with M. Gebski, NICTA. (A sketch of one such score appears below.)
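A minimal sketch of the third formulation, the maximum of Z(Q(v, i)) over prefix sizes; the helper SR is inlined as before, and starting at i = 2 is my choice (the i = 1 case is degenerate):

    import math

    def _sr(A, Q, S):
        s, a = len(S), len(A)
        sc = sum(len(A & Q(w, a)) for w in A) / a ** 2
        return (s * sc - a) / (s - a)

    def outlier_score(v, Q, S, k):
        """max over 2 <= i <= k of Z(Q(v, i)); the items minimizing
        this score over the data set are outlier candidates."""
        n = len(S)
        return max(math.sqrt(i * (n - 1)) * _sr(Q(v, i), Q, S)
                   for i in range(2, k + 1))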


Feature Set Evaluation

Feature selection methods:
- To our knowledge, current feature selection techniques have been developed only for supervised learning.
- A training set is needed to guide the process.
- Work with N. Grira: unsupervised feature set evaluation and selection.

Assumptions:
- The similarity measure and candidate features are unknown.
- Assess the effectiveness of the features & similarity measure for search and clustering.

Good Feature Sets

For any given 'true' cluster C:
- For any item v in C, any query based at v should ideally rank the items of C ahead of any other items.
- This is two-set classification.
- The best result occurs when there exists a partition of the data set into clusters such that the two-set classification property holds.

Distinctiveness

The distinctiveness criterion:
- A variant of self-correlation.
- External items that are well-correlated with A are penalized.
- Equals 1 if A is perfectly associated (SR(A) = 1) and external points are uncorrelated with A.

    \mathrm{DR}(A) = \frac{1}{|A|} \sum_{v \in A} R(A, Q(v, |A|)) - \frac{1}{|S| - |A|} \sum_{v \notin A} R(A, Q(v, |A|))
    = \mathrm{SR}(A) - \frac{1}{|S| - |A|} \sum_{v \notin A} R(A, Q(v, |A|))

(A sketch appears below.)
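A minimal sketch of DR; the sum over external items is what makes it expensive in practice (set_corr restated from the earlier sketch):

    import math

    def set_corr(A, B, S):
        s, a, b = len(S), len(A), len(B)
        return (s * len(A & B) - a * b) / math.sqrt(a * b * (s - a) * (s - b))

    def distinctiveness(A, Q, S):
        """DR(A) = SR(A) minus the mean correlation of A with the
        relevant sets of the items outside A."""
        a = len(A)
        inside = sum(set_corr(A, Q(v, a), S) for v in A) / a
        ext = sum(set_corr(A, Q(v, a), S) for v in S - A) / (len(S) - a)
        return inside - ext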

Significance

Significance can be derived as under RSC, via the randomness hypothesis:

    Z_{\mathrm{DR}}(A) = \sqrt{\frac{|A| \, (|S| - |A|)(|S| - 1)}{|S|}} \; \mathrm{DR}(A)

- Unlike the self-correlation significance, the distinctiveness significance tends to 0 as |A| tends to |S|.
- Distinctiveness is expensive to compute in practice.
- SR(A) can be used to approximate DR(A).

Feature Set Evaluation

Criterion for feature set selection:
- For each item q, estimate the most significant relevant set based at q.
- Average the self-correlations of the most significant relevant sets identified.
- This serves as the basis for search, e.g. local improvement methods such as Tabu Search.

    \operatorname*{maximize} \; \frac{1}{|S|} \sum_{q \in S} \mathrm{SR}(Q(q, k_q)), \qquad k_q = \operatorname*{argmax}_{1 \le k \le |S|} \mathrm{DR}(Q(q, k))

Feature Set Evaluation (continued)

    \operatorname*{maximize} \; \frac{1}{|S|} \sum_{q \in S} \mathrm{SR}(Q(q, k_q)), \qquad k_q = \operatorname*{argmax}_{1 \le k \le |S|} \mathrm{DR}(Q(q, k))

Properties:
- If all relevant sets are randomly generated, the criterion is 0.
- If all relevant sets are identical, the criterion is 0.
- If two-set classification holds for all q, the criterion is 1 (the maximum possible).
- Conjectured: for any disjoint partition of the data into clusters of size at least some constant, there exists a set of rankings for which the maximum of 1 is achieved.

(A sketch of the criterion appears below.)
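A minimal sketch of the criterion; as the previous slide suggests, DR is approximated by SR here, and the search over k is capped at kmax < |S| for practicality (both choices are assumptions of this sketch):

    def _sr(A, Q, S):
        s, a = len(S), len(A)
        sc = sum(len(A & Q(w, a)) for w in A) / a ** 2
        return (s * sc - a) / (s - a)

    def feature_set_score(Q, S, kmax):
        """For each q, pick the prefix size k maximizing SR(Q(q, k))
        (standing in for DR), then average those maxima over S."""
        total = 0.0
        for q in S:
            total += max(_sr(Q(q, k), Q, S) for k in range(2, kmax + 1))
        return total / len(S)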


Background: Protein Sequence Analysis

Applications of clustering:
- Classification of sequences of unknown functionality with respect to clusters of sequences of known functionality.
- Discovery of new motifs from clusters of sequences of previously unknown function.

Problems:
- Single-linkage (agglomerative) clustering techniques produce clusters with poor internal association.
- Traditional clustering techniques do not scale well to large set sizes.
- Protein sequence data is "inherently" high-dimensional.

Protein Sequence Similarity

Pairwise (gapped) sequence alignment scoring: the BLAST heuristic [Altschul et al. '97].
- Bonus for matching and near-matching symbols.
- Penalty for non-matching symbols.
- Penalty for gaps (increasing with gap length).
- Dynamic programming (expensive!).
- Faster heuristics exist.

Example alignment:

MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDEREIKKRDIFSLLLGVA 60
M M++K+L+PTDFSE A A++ + ++ EVILLHVIDE +++ L+ G +
MIFMFRKVLFPTDFSEGAYRAVEVFEKRNKMEVGEVILLHVIDEGTLEE-----LMDGYS 55

Alignment-based Similarity

Problem: direct use of BLAST scores fails!
- Not transitive, since alignments are incomplete.
- A SASH index for approximate neighbourhood computation achieves a poor accuracy vs. time trade-off.
- Sequential search would work, but is too expensive.

Example: pairs (A, B) and (B, C) have high BLAST scores, but pair (A, C) has a score of zero.

Reference Set Similarity

Solution (with Å. J. Västermark):
- Vectors of gapped-alignment BLAST scores with respect to a fixed sample of 1/10th size (as per Liao & Noble, 2003).
- Conversion of BLAST scores to E-values.
- Vector sparsification via thresholding to zero.
- Vector angle distance metric for neighbourhood computation.

[Figure: matrix of BLAST E-values, full-set sequences vs. sample sequences.]

(A sketch of this vectorization appears below.)
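A minimal sketch of the vectorization and the angle distance; the -log transform and the threshold value are illustrative choices, not prescribed by the talk:

    import math

    def evalue_vector(evalues, threshold=1e-3):
        """One coordinate per reference-sample sequence: keep a
        -log(E) weight for significant BLAST hits, zero elsewhere
        (sparsification by thresholding)."""
        return [-math.log(e) if e < threshold else 0.0 for e in evalues]

    def angle_distance(x, y):
        """Vector-angle distance used for neighbourhood computation."""
        dot = sum(p * q for p, q in zip(x, y))
        norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
        if norm == 0.0:
            return math.pi / 2   # treat an all-zero vector as orthogonal
        return math.acos(max(-1.0, min(1.0, dot / norm)))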

Analogy to Text

- Each sequence is analogous to a document.
- The reference sequences are analogous to terms.
- Sparse vectorization.
- Significant BLAST scores are analogous to terms appearing in a document.
- Strong BLAST scores are analogous to the dominant terms of a document (as per TF-IDF weighting).
