Spell Checking: Theory and Implementation
[email protected]
Ugo Jardonnet
Table of Contents
1 Edit distance
    Introduction
    Algorithm
    Implementation
2 Fuzzy Search
    Trie
    Search
3 Optimization
    Patricia Trie
    Multiple Tries
4 Conclusion
Edit distance
Introduction
Edit distance
Measuring the amount of difference between two sequences.
Levenshtein Distance
    Insertion: writing → writting
    Deletion: learn → larn
    Substitution: kitten → litten
Damerau-Levenshtein
    Transposition: levenshtein → levensthein
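The transposition case can be handled by adding one more term to the Levenshtein recurrence. The sketch below is not taken from the slides: it assumes the restricted variant usually called optimal string alignment, and the name `osa_distance` is made up for illustration.

```python
def osa_distance(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # a deletion
                d[i][j - 1] + 1,         # an insertion
                d[i - 1][j - 1] + cost,  # a match or a substitution
            )
            # a transposition of two adjacent characters
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


# The four examples from the slide are each one edit away.
for a, b in [("writing", "writting"), ("learn", "larn"),
             ("kitten", "litten"), ("levenshtein", "levensthein")]:
    print(a, "->", b, osa_distance(a, b))
```

Without the transposition term, the last pair would cost 2 (two substitutions) instead of 1.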
Edit distance
Algorithm
Algorithm
Levenshtein distance between "Criteo" and "Citeo".

    if s1[i-1] == s2[j-1]:
        d[i, j] = d[i-1, j-1]      # keep preceding distance
    else:
        d[i, j] = min(
            d[i-1, j] + 1,         # a deletion
            d[i, j-1] + 1,         # an insertion
            d[i-1, j-1] + 1        # a substitution
        )

         ''  c  i  t  e  o
    ''    0  1  2  3  4  5
    c     1  0  1  2  3  4
    r     2  1  1  2  3  4
    i     3  2  1  2  3  4
    t     4  3  2  1  2  3
    e     5  4  3  2  1  2
    o     6  5  4  3  2  1
Edit distance
Implementation
Python Implementation

    import numpy as np

    def levenshtein(s1, s2):
        "Calculates the Levenshtein distance between s1 and s2."
        m, n = len(s1), len(s2)
        d = np.zeros((m+1, n+1), dtype=int)
        for i in range(0, m+1):
            d[i, 0] = i  # the distance of s1[:i] to an empty second string
        for j in range(0, n+1):
            d[0, j] = j  # the distance of s2[:j] to an empty first string
        for j in range(1, n+1):
            for i in range(1, m+1):
                if s1[i-1] == s2[j-1]:
                    d[i, j] = d[i-1, j-1]      # keep preceding distance
                else:
                    d[i, j] = min(
                        d[i-1, j] + 1,         # a deletion
                        d[i, j-1] + 1,         # an insertion
                        d[i-1, j-1] + 1        # a substitution
                    )
        return int(d[m, n])
Wiser Python Implementation
We only need the previous row to compute the current one!

         ''  c
    ''    0  1
    c     1  0
    r     2  1
    i     3  ?
    t     4
    e     5
    o     6

    def levenshtein(s1, s2):
        "Calculates the Levenshtein distance between s1 and s2."
        n, m = len(s1), len(s2)
        current = list(range(n+1))
        for i in range(1, m+1):
            # start a new row; its first cell is the distance to the empty prefix
            previous, current = current, [i] + [0]*n
            for j in range(1, n+1):
                add, delete = previous[j] + 1, current[j-1] + 1
                change = previous[j-1]
                if s1[j-1] != s2[i-1]:
                    change = change + 1
                current[j] = min(add, delete, change)
        return current[n]
Fuzzy Search
Trie
Prefix tree
Dictionary: war, when, who

    .
    `-- w
        |-- a -- r
        `-- h
            |-- o
            `-- e -- n
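A prefix tree like the one above can be built in a few lines. This is a minimal sketch rather than the slides' own code; the names `TrieNode`, `build_trie` and `contains` are hypothetical.

```python
class TrieNode:
    """One node of the prefix tree."""

    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.word = None     # the full word, if one ends at this node


def build_trie(words):
    """Insert every word, sharing the nodes of common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def contains(root, word):
    """Exact lookup: follow one edge per character."""
    node = root
    for char in word:
        node = node.children.get(char)
        if node is None:
            return False
    return node.word is not None


trie = build_trie(["war", "when", "who"])
print(sorted(trie.children))                        # ['w']
print(sorted(trie.children["w"].children))          # ['a', 'h']
print(contains(trie, "who"), contains(trie, "wh"))  # True False
```

Storing the word on its final node distinguishes a complete entry ("who") from a mere prefix ("wh").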
Fuzzy Search
Search
Approximate Search
Looking for ’ ’ .
[Figure: the dictionary trie, with each node annotated by its row of the edit-distance table for the query.]
Complexity: O(query length × number of nodes).
In practice, many nodes are pruned when the maximum edit distance is low: once every entry of a node's row exceeds the maximum, no word below that node can match.
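The search can be sketched by combining the row-by-row Levenshtein computation with a depth-first walk of the trie. This is an illustration under assumptions, not the slides' implementation: the names `TrieNode`, `build_trie`, `fuzzy_search` and `_visit` are hypothetical, and the query "whe" is chosen only for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.word = None     # the full word, if one ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def fuzzy_search(root, query, max_dist):
    """Return sorted (distance, word) pairs with distance <= max_dist."""
    results = []
    first_row = list(range(len(query) + 1))  # row of the empty prefix
    if root.word is not None and first_row[-1] <= max_dist:
        results.append((first_row[-1], root.word))
    for char, child in root.children.items():
        _visit(child, char, query, first_row, max_dist, results)
    return sorted(results)


def _visit(node, char, query, previous, max_dist, results):
    # Compute this node's row from its parent's row only.
    current = [previous[0] + 1]
    for j in range(1, len(query) + 1):
        current.append(min(
            current[j - 1] + 1,                         # skip a query character
            previous[j] + 1,                            # skip a trie character
            previous[j - 1] + (query[j - 1] != char),   # match or substitution
        ))
    if node.word is not None and current[-1] <= max_dist:
        results.append((current[-1], node.word))
    # Prune: the row minimum never decreases further down the tree.
    if min(current) <= max_dist:
        for next_char, child in node.children.items():
            _visit(child, next_char, query, current, max_dist, results)


trie = build_trie(["war", "when", "who"])
print(fuzzy_search(trie, "whe", 1))  # [(1, 'when'), (1, 'who')]
```

Each node costs one row of length (query length + 1), which gives the stated bound when nothing is pruned.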
Optimization
Patricia Trie
Patricia Trie
Compressed version of a trie: chains of nodes with a single child are merged into one node labelled with a string.
[Figure: a Patricia trie with the labels w, ar, h, en, o and ever.]
Complex implementation.
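One way to picture the compression is to build a plain trie first and then merge its single-child chains. The sketch below does only that; it is not a full Patricia trie with insertion and search, and the names `TrieNode`, `build_trie`, `compress` and `labels` are hypothetical. The dictionary is the one from the prefix tree slide.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps a label (one or more characters) to a child
        self.word = None     # the full word, if one ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def compress(node):
    """Merge chains of single-child nodes that do not end a word."""
    compressed = {}
    for label, child in node.children.items():
        # Walk down while the child is a pure pass-through node.
        while len(child.children) == 1 and child.word is None:
            next_label, next_child = next(iter(child.children.items()))
            label += next_label
            child = next_child
        compress(child)
        compressed[label] = child
    node.children = compressed
    return node


def labels(node):
    """Nested view of the labels, for display."""
    return {label: labels(child)
            for label, child in sorted(node.children.items())}


trie = compress(build_trie(["war", "when", "who"]))
print(labels(trie))  # {'w': {'ar': {}, 'h': {'en': {}, 'o': {}}}}
```

A node that ends a word is never merged away, so "when" would keep its own node if a longer word continued below it.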
Optimization
Multiple Tries
Multiple Tries
Easily parallelisable:
[Figure: several separate tries.]
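The slide does not say how the dictionary is split, so the sketch below assumes a simple round-robin split into shards, one trie per shard, each searched independently. All names (`build_shards`, `parallel_search`, and so on) are hypothetical. A thread pool keeps the example short; in CPython a process pool would be needed for an actual speed-up on this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


class TrieNode:
    def __init__(self):
        self.children = {}   # maps a character to the child TrieNode
        self.word = None     # the full word, if one ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for char in word:
            node = node.children.setdefault(char, TrieNode())
        node.word = word
    return root


def search(node, query, max_dist, row=None):
    """Return (distance, word) pairs below `node` within max_dist."""
    if row is None:
        row = list(range(len(query) + 1))  # row of the empty prefix
    found = []
    if node.word is not None and row[-1] <= max_dist:
        found.append((row[-1], node.word))
    if min(row) > max_dist:
        return found  # prune this subtree
    for char, child in node.children.items():
        next_row = [row[0] + 1]
        for j in range(1, len(query) + 1):
            next_row.append(min(next_row[j - 1] + 1,
                                row[j] + 1,
                                row[j - 1] + (query[j - 1] != char)))
        found.extend(search(child, query, max_dist, next_row))
    return found


def build_shards(words, n_shards):
    """Assumed split: round-robin over the word list, one trie per shard."""
    return [build_trie(words[i::n_shards]) for i in range(n_shards)]


def parallel_search(shards, query, max_dist):
    """Search every shard concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = pool.map(lambda trie: search(trie, query, max_dist), shards)
        return sorted(pair for part in parts for pair in part)


shards = build_shards(["war", "when", "who", "where", "what"], 2)
print(parallel_search(shards, "whe", 1))  # [(1, 'when'), (1, 'who')]
```

Because the shards share no state, the merged result is the same as searching a single trie built from the whole dictionary.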
Conclusion
Conclusion
Spell checking is part of keyword normalization.
Normalized keywords go to Hadoop and user cookies.