On-line Construction of Compact Suffix Vectors and ... - Elise Prieur

On-line Construction of Compact Suffix Vectors and Maximal Repeats

Élise Prieur and Thierry Lecroq
Laboratoire d’Informatique de Traitement de l’Information et des Systèmes

Journées Montoises, August 30th, 2006, Rennes

Introduction

Suffix Vectors

Computing maximal repeats

Plan

1 Introduction
    Motivation
    Suffix trees
    Ukkonen’s algorithm
2 Suffix Vectors
    Introduction
    Compact Suffix Vectors
    On-line construction of a compact suffix vector
3 Computing maximal repeats
4 Conclusion


Motivation

Detecting repeats in long biological sequences. Adapted index structure.


Notations

y is a sequence of length n on the alphabet A.
$ is a terminator symbol.

Suffix tree

index structure;
all substrings represented;
edges labeled (begin position, length);
leaves represent suffixes.

[Figure: suffix tree of tata$. The root has the edges (0,2) ta, (1,1) a and (4,1) $. Below ta, the edge (2,3) ta$ leads to leaf 0 and the edge (4,1) $ to leaf 2. Below a, the edge (2,3) ta$ leads to leaf 1 and the edge (4,1) $ to leaf 3. The root edge (4,1) $ leads to leaf 4.]
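The (begin position, length) edge labels can be reproduced with a short sketch. This is a naive quadratic construction that inserts one suffix after another, not the algorithm of the talk; the names `Node`, `build_suffix_tree` and `edges` are illustrative only.

```python
class Node:
    def __init__(self):
        self.children = {}  # first letter -> (begin, length, child Node)
        self.leaf = None    # start position of the suffix, for leaves only


def build_suffix_tree(y):
    """Naive construction: insert every suffix of y in turn."""
    root = Node()
    n = len(y)
    for start in range(n):
        node, pos = root, start
        while pos < n:
            edge = node.children.get(y[pos])
            if edge is None:
                # No edge starts with y[pos]: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start
                node.children[y[pos]] = (pos, n - pos, leaf)
                break
            begin, length, child = edge
            # Count how many letters of the edge label match the suffix.
            k = 0
            while k < length and pos + k < n and y[begin + k] == y[pos + k]:
                k += 1
            if k == length:
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split the edge at offset k.
            middle = Node()
            middle.children[y[begin + k]] = (begin + k, length - k, child)
            node.children[y[pos]] = (begin, k, middle)
            node, pos = middle, pos + k
    return root


def edges(node, y):
    """Edges leaving node, as sorted (begin, length, label) triples."""
    return sorted((b, l, y[b:b + l]) for b, l, _ in node.children.values())
```

For `y = "tata$"`, `edges(build_suffix_tree(y), y)` gives `[(0, 2, 'ta'), (1, 1, 'a'), (4, 1, '$')]`, the three root edges of the figure.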


Ukkonen’s algorithm

On-line algorithm.
The construction is split into n phases, each of which is split into extensions.
During phase i, the implicit tree of y[0..i] is built from that of y[0..i − 1].
During extension j of phase i, the suffix y[j + 1..i] is added to the tree.
The previously added substring is w = y[j + 1..i − 1].
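The split into phases and extensions can be illustrated on an uncompressed suffix trie. The sketch below is quadratic and only mirrors the loop structure, not Ukkonen's linear-time algorithm; `start` plays the role of j + 1, and the function names are illustrative.

```python
def online_suffix_trie(y):
    """Build the suffix trie of y phase by phase (quadratic, for illustration)."""
    root = {}
    log = []  # (phase i, start of the suffix, substring handled)
    for i in range(len(y)):            # phase i handles the prefix y[0..i]
        for start in range(i + 1):     # one extension per suffix y[start..i]
            node = root
            for c in y[start:i]:       # walk down w = y[start..i-1]
                node = node[c]
            node.setdefault(y[i], {})  # extend w by y[i] if it is not there yet
            log.append((i, start, y[start:i + 1]))
    return root, log


def contains(root, s):
    """True if s labels a path from the root, i.e. s is a substring seen so far."""
    node = root
    for c in s:
        if c not in node:
            return False
        node = node[c]
    return True
```

On `tata$`, phase 1 handles `ta` and then `a`, and the whole run performs 1 + 2 + 3 + 4 + 5 = 15 extensions.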


The 3 rules

Ukkonen’s algorithm is based on 3 rules expressed by Gusfield:

Rule 1: w ends at a leaf; the leaf edge is extended by y[i], so that w y[i] = y[j + 1..i].
Rule 2: w is followed only by letters x different from y[i]; a new leaf edge beginning with y[i] is added after w.
Rule 3: w y[i] is already in the tree; nothing is done.

[Figures: Rule 1 shows w y[i] = y[j + 1..i]; Rule 2 shows w followed by x, before and after the edge y[i] is added.]
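A minimal sketch of how the three rules can be told apart, assuming an uncompressed suffix trie instead of a suffix tree: Rule 1 applies when the node of w is a leaf, Rule 2 when it has children but none labeled y[i] (or when w is empty and the root lacks y[i]), and Rule 3 when the child y[i] already exists. The function name is illustrative and the construction is quadratic.

```python
def classify_extensions(y):
    """Return (phase, suffix start, rule) for every extension of every phase."""
    root = {}
    trace = []
    for i, c in enumerate(y):
        for start in range(i + 1):
            node = root
            for letter in y[start:i]:   # locate the end of w = y[start..i-1]
                node = node[letter]
            if c in node:
                rule = 3                # w y[i] is already present
            elif node or start == i:
                rule = 2                # new branch (or new leaf under the root)
            else:
                rule = 1                # w ends at a leaf, which is extended
            node.setdefault(c, {})
            trace.append((i, start, rule))
    return trace
```

On `tata$` the rules applied in phases 0 to 4 are `[2]`, `[1, 2]`, `[1, 1, 3]`, `[1, 1, 3, 3]` and `[1, 1, 2, 2, 2]`: within a phase, Rule 1 extensions come first and Rule 3 extensions last.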

Some properties

leaves are added in increasing order;
rule 1 does not need any treatment;
phase i begins at extension j_ℓ + 1, where j_ℓ is the number of the last created leaf;
phase i ends at the first extension j > j_ℓ such that rule 3 is applied.


Introduction to suffix vectors

[Figure: the suffix tree of tata$ next to its suffix vector. The vector lists the positions 0 to 4 over the letters t a t a $; the root holds the edges (0, 2) − (1, 1) − (4, 1), and one box holds the lines 2|3|(4, 1) and 1|3|(4, 1).]

Alternative data structure to suffix trees:
same information in reduced space;
introduced by K. Monostori in 2001.

[Figure: suffix tree of y = aatttatttatta$ and the corresponding suffix vector.]

Suffix vector of y = aatttatttatta$:

Position:  0  1  2  3  4  5  6  7  8  9  10 11 12 13
Letter:    a  a  t  t  t  a  t  t  t  a  t  t  a  $

Root: (0, 1) − (2, 1) − (13, 1)
B0:   1|13|(2, 2) − (13, 1)
B2:   1|2|(5, 1)
B3:   3|4|(12, 2)
      2|4|(5, 1)
B5:   3|2|(13, 1)
      2|2|(13, 1)
B7:   7|6|(12, 2)
      6|6|(12, 2)
      5|6|(12, 2)
      4|6|(12, 2)

Definition
A suffix vector is a succession of boxes whose lines contain:
the depth of the node;
the natural edge;
the edge list.
The root is a special box.

Example
Is tatt a substring of y?
The root contains the edge (2, 1), beginning with t, which leads to B2.
The edge (5, 1), beginning with a, leads to B5.
The natural edge begins with tt.

[Figure: suffix vector of y = aatttatttatta$ with root (0, 1) − (2, 1) − (13, 1); on this slide the line of box B2 reads 1|1|(5, 1).]
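The lookup of tatt can be checked by decoding the (begin, length) labels against y. The sketch below assumes that the natural edge of box B5 starts at position 6 and has the length 2 given in its lines; `label` is an illustrative helper, not part of the original structure.

```python
y = "aatttatttatta$"


def label(edge):
    """Letters spelled by an edge given as (begin position, length)."""
    begin, length = edge
    return y[begin:begin + length]


root_edges = [(0, 1), (2, 1), (13, 1)]
root_labels = [label(e) for e in root_edges]  # first letters available at the root

step1 = label((2, 1))   # root edge beginning with t, leading to B2
step2 = label((5, 1))   # edge of B2 beginning with a, leading to B5
step3 = label((6, 2))   # natural edge of B5 (assumed to start right after position 5)
path = step1 + step2 + step3
```

Here `path` spells `tatt`, which is indeed a substring of y (it occurs at position 4).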

Compact a vector

Definition A group of nodes is a set of nodes which are in the same box and have exactly the same edges.
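As a sketch of this definition, the boxes of the example vector of y = aatttatttatta$ can be split into groups by comparing edge lists. The assignment of lines to box positions, the assumption that nodes of a group are adjacent in the box, and the idea of keeping the group size next to the deepest line are all illustrative choices here.

```python
from itertools import groupby

# Lines of the extended boxes, deepest node first:
# (depth, natural edge, edge list).
boxes = {
    0: [(1, 13, ((2, 2), (13, 1)))],
    2: [(1, 2, ((5, 1),))],
    3: [(3, 4, ((12, 2),)), (2, 4, ((5, 1),))],
    5: [(3, 2, ((13, 1),)), (2, 2, ((13, 1),))],
    7: [(7, 6, ((12, 2),)), (6, 6, ((12, 2),)),
        (5, 6, ((12, 2),)), (4, 6, ((12, 2),))],
}


def groups(lines):
    """Split the lines of one box into runs of nodes with exactly the same edges."""
    return [list(run) for _, run in groupby(lines, key=lambda line: line[2])]


def compact(lines):
    """Keep the deepest line of each group together with the group size."""
    return [(run[0], len(run)) for run in groups(lines)]
```

With this data, box 7 collapses to the single line 7|6|(12, 2) with a group of 4 nodes, box 5 to 3|2|(13, 1) with 2 nodes, and box 3 keeps its two lines because their edges differ.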


Compact suffix vectors

3 rules of compaction of a box:
Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1;
Rule B: the node with depth d − 1 has the same edges as the node with depth d and some extra edges;
Rule C: the node with depth d − 3 has different edges from the node with depth d − 2.

[Figure: a box whose lines have depths d, d − 1, d − 2 and d − 3, annotated with Rules B, A and C.]

Compacting V(aatttatttatta$)

Position:  0  1  2  3  4  5  6  7  8  9  10 11 12 13
Letter:    a  a  t  t  t  a  t  t  t  a  t  t  a  $

Extended vector ⟹ compact vector

Root: (0, 1) − (2, 1) − (13, 1) ⟹ (0, 1) − (2, 1) − (13, 1)
B0:   1|13|(2, 2) − (13, 1) ⟹ 1|13|(2, 2) − (13, 1)
B2:   1|2|(5, 1) ⟹ 1|2|(5, 1)
B3:   3|4|(12, 2), 2|4|(5, 1) ⟹ 3|4|(12, 2), 2|4|(5, 1)
B5:   3|2|(13, 1), 2|2|(13, 1) ⟹ 3|2|(13, 1), marked 2
B7:   7|6|(12, 2), 6|6|(12, 2), 5|6|(12, 2), 4|6|(12, 2) ⟹ 7|6|(12, 2), marked 4

Monostori’s construction: y → extended vector → compact vector, each step in O(n).

On-line construction of a compact vector

Proposition
When an edge is added to the node w of depth d in a box B_p, this edge will also be added to all the nodes of B_p of depth smaller than d in the group of nodes of w.

[Figure: occurrences of v and w in y around positions p + 1 and p′ + 1, with the extension j, the phase i and the letter a marked.]

Skip k − 1 extensions, where k is the number of nodes in the group to which the edge is added.
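A minimal sketch of the skipping step, assuming a hypothetical `Group` record that stores one shared edge list for its k nodes; neither the record nor `add_edge` comes from the talk, and the edge added in the usage note is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    deepest: int                                # depth of the deepest node of the group
    size: int                                   # number of nodes in the group (k)
    edges: list = field(default_factory=list)   # edge list shared by the k nodes


def add_edge(group, edge, j):
    """Add edge to the whole group during extension j.

    Returns the number of the next extension to run: the k - 1 extensions
    that would handle the shallower nodes of the group are skipped.
    """
    group.edges.append(edge)
    return j + group.size  # j + 1 + (k - 1)
```

For a group of 4 nodes handled at extension 1, `add_edge` updates the shared edge list once and returns 5, so extensions 2, 3 and 4 are skipped.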


Definition
A maximal repeat in a string is a substring u such that there exist at least 2 occurrences a1 u b1 and a2 u b2 with a1 ≠ a2, b1 ≠ b2 and a1, a2, b1, b2 ∈ A.

Example
y = aatttatttatta$
tta is a maximal repeat, with occurrences ending at positions 5 and 12.
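The definition can be checked by brute force on short strings. The sketch below is cubic and is not the linear method of the talk; it assumes that the missing left neighbor of position 0 counts as a letter of its own (`None`) and that the terminator $ counts as a right neighbor.

```python
def maximal_repeats(y):
    """Substrings u with two occurrences a1 u b1, a2 u b2 where a1 != a2 and b1 != b2."""
    n = len(y)
    contexts = {}  # u -> set of (left letter, right letter) pairs of its occurrences
    for start in range(n):
        for end in range(start + 1, n):  # u = y[start:end], right letter y[end]
            left = y[start - 1] if start > 0 else None
            contexts.setdefault(y[start:end], set()).add((left, y[end]))
    result = set()
    for u, pairs in contexts.items():
        # Look for one pair of occurrences differing on both sides.
        if any(a1 != a2 and b1 != b2 for a1, b1 in pairs for a2, b2 in pairs):
            result.add(u)
    return result
```

For y = aatttatttatta$ this returns the six strings a, t, tt, att, tta and atttatt.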


Applying to suffix vectors

Proposition The deepest node of each group of nodes represents a maximal repeat.

[Figure: compact suffix vector of y = aatttatttatta$, with root (0, 1) − (2, 1) − (13, 1) and the lines 3|2|(13, 1); 3|4|(12, 2), 2|4|(5, 1); 1|2|(5, 1); 7|6|(12, 2); 1|13|(2, 2) − (13, 1).]

Example
Boxes 0, 2, 5 and 7 are reduced: a, t, tta, atttatt are maximal repeats.
Box B3 is extended, the 2 lines have different edges: att, tt are maximal repeats.
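The example can be replayed by reading each repeat off the vector, under the assumption that a node of depth d in box B_p represents the substring y[p − d + 1..p]. The table of deepest depths per group is transcribed from the compact vector of the example; the names are illustrative.

```python
y = "aatttatttatta$"

# Box position p -> depth of the deepest node of each group in B_p.
deepest_per_group = {0: [1], 2: [1], 3: [3, 2], 5: [3], 7: [7]}


def node_string(p, depth):
    """Substring represented by the node of the given depth in box B_p."""
    return y[p - depth + 1:p + 1]


repeats = {
    node_string(p, d)
    for p, depths in deepest_per_group.items()
    for d in depths
}
```

The resulting set contains a, t, att, tt, tta and atttatt, the maximal repeats listed in the example.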


Conclusion

More economical construction of the compact suffix vector. Linear method to compute maximal repeats with a compact suffix vector.
