On-line Construction of Compact Suffix Vectors and Maximal Repeats
Élise Prieur and Thierry Lecroq
[email protected]
Laboratoire d’Informatique de Traitement de l’Information et des Systèmes.
Journées Montoises, August 30th, 2006, Rennes
Plan
1 Introduction
2 Suffix Vectors
3 Computing maximal repeats
4 Conclusion
1 Introduction: Motivation; Suffix trees; Ukkonen’s algorithm
2 Suffix Vectors: Introduction; Compact Suffix Vectors; On-line construction of a compact suffix vector
3 Computing maximal repeats
4 Conclusion
Introduction
Motivation
Detecting repeats in long biological sequences.
This calls for an adapted index structure.
Notations
y is a sequence of length n over the alphabet A.
$ is a terminator symbol.
Suffix tree
index structure;
all substrings are represented;
edges are labeled (begin position, length);
leaves represent suffixes.
[Figure: suffix tree of tata$. Edges from the root: (0,2) ta, (1,1) a and (4,1) $. Below ta and below a: (2,3) ta$ and (4,1) $. Leaves are numbered 0 to 4.]
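To make the (begin position, length) edge labels concrete, here is a minimal Python sketch that builds the suffix tree of tata$ by inserting the suffixes one by one. It is a naive quadratic construction for illustration only, not the on-line algorithm discussed next. The names `Node`, `naive_suffix_tree` and `edges` are hypothetical, and the input is assumed to end with a unique terminator.

```python
class Node:
    """Suffix tree node (hypothetical helper class)."""

    def __init__(self):
        self.children = {}  # first letter -> (begin, length, child node)
        self.leaf = None    # starting position of the suffix, for leaves only


def naive_suffix_tree(y):
    """Insert the suffixes of y one by one; y must end with a unique terminator."""
    root = Node()
    for start in range(len(y)):
        node, pos = root, start
        while True:
            first = y[pos]
            if first not in node.children:
                # No edge starts with this letter: hang a new leaf here.
                leaf = Node()
                leaf.leaf = start
                node.children[first] = (pos, len(y) - pos, leaf)
                break
            begin, length, child = node.children[first]
            k = 0
            while k < length and y[begin + k] == y[pos + k]:
                k += 1
            if k < length:
                # Mismatch inside the edge: split it after k letters.
                middle = Node()
                middle.children[y[begin + k]] = (begin + k, length - k, child)
                node.children[first] = (begin, k, middle)
                child = middle
            node, pos = child, pos + k
    return root


def edges(y, node):
    """Yield (begin, length, label) for every edge below node."""
    for begin, length, child in node.children.values():
        yield begin, length, y[begin:begin + length]
        yield from edges(y, child)


tree = naive_suffix_tree("tata$")
for begin, length, label in sorted(edges("tata$", tree)):
    print((begin, length), label)
```

The printed labels can be compared with the figure: (0,2) ta and (1,1) a leave the root, and (2,3) ta$ appears once below each of them.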
Ukkonen’s algorithm
On-line algorithm.
Construction is split into n phases, each of which is split into extensions.
During phase i, the implicit tree of y[0..i] is built from that of y[0..i − 1].
During extension j of phase i, the suffix y[j + 1..i] is added to the tree.
The last added substring is w = y[j + 1..i − 1].
The 3 rules
Ukkonen’s algorithm is based on 3 rules expressed by Gusfield:
Rule 1: w ends at a leaf; the leaf edge is extended by y[i], so that it spells w y[i] = y[j + 1..i].
Rule 2: w continues only with letters other than y[i] (path wx in the figure); a new leaf edge labeled y[i] is created at the end of w, next to x, splitting the edge if w ends inside it.
Rule 3: w y[i] is already in the tree; nothing is done.
Some properties
leaves are added in increasing order;
rule 1 does not need any treatment;
phase i begins at extension j_ℓ + 1, where j_ℓ is the number of the last created leaf;
phase i ends at the first extension j > j_ℓ such that rule 3 is applied.
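The phases, extensions and rules can be observed with a small Python sketch. It is a deliberate simplification and not Ukkonen’s linear-time algorithm: it works on an uncompacted trie (one letter per edge), uses no suffix links, and still performs the rule 1 extensions explicitly. Extension j inserts y[j..i] with 0-based j, which is shifted by one with respect to the y[j + 1..i] convention used on the slides. The function name `online_suffix_trie` is hypothetical.

```python
def online_suffix_trie(y):
    """Build a suffix trie of y on-line and log the rule used by each extension.

    The trie is a nested dict (letter -> child dict). The log contains one list
    of rule numbers per phase.
    """
    root = {}
    log = []
    for i, letter in enumerate(y):          # phase i: add the letter y[i]
        rules = []
        for j in range(i + 1):              # extension j: insert y[j..i]
            node = root
            for c in y[j:i]:                # walk down w = y[j..i-1]
                node = node[c]
            if letter in node:
                rules.append(3)             # rule 3: already there, phase ends
                break
            is_leaf = not node and node is not root
            rules.append(1 if is_leaf else 2)
            node[letter] = {}               # rule 1 extends a leaf, rule 2 branches
        log.append(rules)
    return root, log


trie, log = online_suffix_trie("tata$")
for phase, rules in enumerate(log):
    print(phase, rules)
```

On tata$ the log shows the properties listed above: within a phase the rule 1 extensions come first, new leaves (rule 2) are created for increasing j, and the phase stops at the first rule 3.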
Suffix Vectors
Introduction to suffix vectors
[Figure: suffix tree of tata$ and the corresponding suffix vector.]
Root: (0, 2) − (1, 1) − (4, 1)
Position: 0 1 2 3 4
y: t a t a $
Box at position 1: 2|3|(4, 1), 1|3|(4, 1)
Introduction to suffix vectors
[Figure: suffix tree of y = aatttatttatta$ (positions 0 to 13), with edges labeled (begin position, length), and the corresponding suffix vector with root box (0, 1) − (2, 1) − (13, 1).]
Alternative data structure to suffix trees:
same information in reduced space;
introduced by K. Monostori in 2001.
Suffix vector of aatttatttatta$:
Root: (0, 1) − (2, 1) − (13, 1)
Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
y: a a t t t a t t t a t t a $
B0: 1|13|(2, 2) − (13, 1)
B2: 1|2|(5, 1)
B3: 3|4|(12, 2), 2|4|(5, 1)
B5: 3|2|(13, 1), 2|2|(13, 1)
B7: 7|6|(12, 2), 6|6|(12, 2), 5|6|(12, 2), 4|6|(12, 2)
Definition
A suffix vector is a succession of boxes whose lines contain:
the depth of the node;
the natural edge;
the edge list.
The root is a special box.
Notations
B_p denotes the box at position p.
Example
Is tatt a substring of y?
The root contains the edge (2, 1), beginning with t, which leads to B2.
The edge (5, 1), beginning with a, leads to B5.
The natural edge begins with tt.
Hence tatt is a substring of y.
[Figure: suffix vector of aatttatttatta$.]
Compacting a vector
Definition
A group of nodes is a set of nodes which are in the same box and have exactly the same edges.
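The definition of a group can be sketched in Python as follows. This is an illustration under assumptions, not Monostori’s compaction procedure: a box line written d|n|edges on the slides is modeled as a tuple (depth, natural edge length, edge list), the lines of a box are assumed to be sorted by decreasing depth, and only nodes with exactly the same edges are merged. The function name `compact_box` is hypothetical.

```python
from itertools import groupby


def compact_box(lines):
    """Keep the deepest line of each group together with the group size.

    lines: list of (depth, natural_edge_length, edges) tuples sorted by
    decreasing depth, where edges is a tuple of (begin, length) pairs.
    """
    compacted = []
    # groupby merges consecutive lines that carry exactly the same edge list.
    for _, group in groupby(lines, key=lambda line: line[2]):
        members = list(group)
        compacted.append((members[0], len(members)))
    return compacted


# Three boxes of the suffix vector of aatttatttatta$, as read from the slides.
box3 = [(3, 4, ((12, 2),)), (2, 4, ((5, 1),))]
box5 = [(3, 2, ((13, 1),)), (2, 2, ((13, 1),))]
box7 = [(d, 6, ((12, 2),)) for d in (7, 6, 5, 4)]

for name, box in (("B3", box3), ("B5", box5), ("B7", box7)):
    print(name, compact_box(box))
```

B5 and B7 each collapse to a single line (groups of 2 and 4 nodes), while the two lines of B3 have different edges and stay separate.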
Compact suffix vectors
3 rules of compaction of a box:
Rule A: the node with depth d − 2 has the same edges as the node with depth d − 1;
Rule B: the node with depth d − 1 has the same edges as the node with depth d and some extra edges;
Rule C: the node with depth d − 3 has different edges from the node with depth d − 2.
[Figure: a box with nodes of depth d, d − 1, d − 2 and d − 3. Rule B relates d and d − 1, Rule A relates d − 1 and d − 2, Rule C relates d − 2 and d − 3.]
Compacting V(aatttatttatta$)
Extended vector:
Root: (0, 1) − (2, 1) − (13, 1)
B0: 1|13|(2, 2) − (13, 1)
B2: 1|2|(5, 1)
B3: 3|4|(12, 2), 2|4|(5, 1)
B5: 3|2|(13, 1), 2|2|(13, 1)
B7: 7|6|(12, 2), 6|6|(12, 2), 5|6|(12, 2), 4|6|(12, 2)
=⇒
Compact vector:
Root: (0, 1) − (2, 1) − (13, 1)
B0: 1|13|(2, 2) − (13, 1)
B2: 1|2|(5, 1)
B3: 3|4|(12, 2), 2|4|(5, 1)
B5: 3|2|(13, 1) (group of 2 nodes)
B7: 7|6|(12, 2) (group of 4 nodes)
y --[Monostori, O(n)]--> Extended vector --[Monostori, O(n)]--> Compact vector
On-line construction of a compact vector
Proposition
When an edge is added to the node w of depth d in a box B_p, this edge will also be added to all the nodes of B_p that have depth smaller than d and belong to the group of nodes of w.
[Figure: illustration of the proposition on y, showing the nodes v and w, the letter a, and the positions p + 1, p′ + 1, j and i.]
On-line construction of a compact vector
Skip k − 1 extensions, where k is the number of nodes in the group to which the edge is added.
Computing maximal repeats
Definition
A maximal repeat in a string is a substring u such that there exist at least 2 occurrences, a1 u b1 and a2 u b2, with a1 ≠ a2, b1 ≠ b2 and a1, a2, b1, b2 ∈ A.
Example
y = aatttatttatta$
tta is a maximal repeat, with occurrences ending at positions 5 and 12.
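The definition can be checked directly with a brute-force Python sketch, which is useful as a reference on short strings. It is not the linear method of the talk. As an assumption beyond the slide, the two ends of the string are treated as special context letters that differ from every other letter, and the terminator $ counts as a context letter. The function name `maximal_repeats` is hypothetical.

```python
def maximal_repeats(y):
    """Return the maximal repeats of y by testing the definition on every substring."""
    n = len(y)
    contexts = {}  # substring u -> list of (left letter, right letter) pairs
    for i in range(n):
        for j in range(i + 1, n + 1):
            left = y[i - 1] if i > 0 else "<start>"   # assumed boundary marker
            right = y[j] if j < n else "<end>"        # assumed boundary marker
            contexts.setdefault(y[i:j], []).append((left, right))
    repeats = set()
    for u, pairs in contexts.items():
        # u is maximal if two occurrences differ on the left and on the right.
        if any(a1 != a2 and b1 != b2
               for k, (a1, b1) in enumerate(pairs)
               for (a2, b2) in pairs[k + 1:]):
            repeats.add(u)
    return repeats


print(sorted(maximal_repeats("aatttatttatta$"), key=len))
```

For y = aatttatttatta$, tta qualifies because its occurrence ending at position 5 is surrounded by t and t, while the one ending at position 12 is surrounded by a and $.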
Applying to suffix vectors
Proposition
The deepest node of each group of nodes represents a maximal repeat.
Compact suffix vector of aatttatttatta$:
Root: (0, 1) − (2, 1) − (13, 1)
Position: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
y: a a t t t a t t t a t t a $
B0: 1|13|(2, 2) − (13, 1)
B2: 1|2|(5, 1)
B3: 3|4|(12, 2), 2|4|(5, 1)
B5: 3|2|(13, 1) (group of 2 nodes)
B7: 7|6|(12, 2) (group of 4 nodes)
Example
Boxes 0, 2, 5 and 7 are reduced: a, t, tta, atttatt are maximal repeats.
Box B3 is extended, the 2 lines have different edges: att, tt are maximal repeats.
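The example can be reproduced with a short Python sketch that reads the maximal repeats off the boxes. It is a sketch under assumptions, not the authors’ implementation: each box is given as a list of (depth, edge list) lines, the natural edge is left out because it is not needed here, and a node of depth d in the box at position p is taken to represent y[p − d + 1..p]. The function name `maximal_repeats_from_vector` and the dictionary layout are hypothetical.

```python
def maximal_repeats_from_vector(y, boxes):
    """Report the deepest node of each group of nodes, box by box.

    boxes: dict mapping a box position p to its lines (depth, edges).
    """
    repeats = []
    for position in sorted(boxes):
        previous_edges = None
        # Scan the lines from the deepest node to the shallowest one.
        for depth, edges in sorted(boxes[position], reverse=True):
            if edges != previous_edges:
                # First (deepest) line of a new group of nodes.
                repeats.append(y[position - depth + 1:position + 1])
            previous_edges = edges
    return repeats


y = "aatttatttatta$"
# Lines of the suffix vector of y, as read from the slides.
boxes = {
    0: [(1, ((2, 2), (13, 1)))],
    2: [(1, ((5, 1),))],
    3: [(3, ((12, 2),)), (2, ((5, 1),))],
    5: [(3, ((13, 1),)), (2, ((13, 1),))],
    7: [(d, ((12, 2),)) for d in (7, 6, 5, 4)],
}
print(maximal_repeats_from_vector(y, boxes))
```

Each box is scanned once, and every line is compared only with the previous one. In a compact vector, where each group is already stored as its deepest line, this amounts to listing the stored lines.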
Conclusion
More economical construction of the compact suffix vector.
Linear method to compute maximal repeats with a compact suffix vector.