Spell Checking - Ugo Jardonnet

Spell Checking: Theory and Implementation


Ugo Jardonnet


Table of Contents

1  Edit distance
     Introduction
     Algorithm
     Implementation
2  Fuzzy Search
     Trie
     Search
3  Optimization
     Patricia Trie
     Multiple Tries
4  Conclusion


Edit distance

Introduction


Measuring the amount of difference between two sequences.

Levenshtein distance
    Insertion: writing → writting
    Deletion: learn → larn
    Substitution: kitten → litten

Damerau-Levenshtein
    Transposition: levenshtein → levensthein
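The four operations listed here can be checked with a small sketch. It assumes the restricted Damerau-Levenshtein variant (often called optimal string alignment), in which a transposition swaps two adjacent characters and the swapped pair is not edited again. The function name `osa_distance` is chosen for this illustration only.

```python
def osa_distance(s1, s2):
    """Edit distance with insertion, deletion, substitution and adjacent transposition."""
    m, n = len(s1), len(s2)
    # d[i][j] = distance between the first i chars of s1 and the first j chars of s2
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]


# Each pair from the slide is one edit apart.
for a, b in [("writing", "writting"), ("learn", "larn"),
             ("kitten", "litten"), ("levenshtein", "levensthein")]:
    print(a, "->", b, osa_distance(a, b))
```

Without the transposition rule, the last pair would cost two edits (two substitutions) instead of one.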


Edit distance

Algorithm


Levenshtein distance between "Criteo" and "Citeo".

    if s1[i-1] == s2[j-1]:
        d[i, j] = d[i-1, j-1]          # keep preceding distance
    else:
        d[i, j] = min(
            d[i-1, j] + 1,             # a deletion
            d[i, j-1] + 1,             # an insertion
            d[i-1, j-1] + 1            # a substitution
        )

         ''   c   i   t   e   o
    ''    0   1   2   3   4   5
    c     1   0   1   2   3   4
    r     2   1   1   2   3   4
    i     3   2   1   2   3   4
    t     4   3   2   1   2   3
    e     5   4   3   2   1   2
    o     6   5   4   3   2   1

Edit distance

Implementation


Python Implementation

    import numpy as np

    def levenshtein(s1, s2):
        "Calculates the Levenshtein distance between s1 and s2."
        m, n = len(s1), len(s2)
        d = np.zeros((m+1, n+1), dtype=int)  # integer table; np.zeros defaults to float

        for i in range(0, m+1):
            d[i, 0] = i  # the distance of s1 to an empty second string
        for j in range(0, n+1):
            d[0, j] = j  # the distance of s2 to an empty first string

        for j in range(1, n+1):
            for i in range(1, m+1):
                if s1[i-1] == s2[j-1]:
                    d[i, j] = d[i-1, j-1]      # keep preceding distance
                else:
                    d[i, j] = min(
                        d[i-1, j] + 1,         # a deletion
                        d[i, j-1] + 1,         # an insertion
                        d[i-1, j-1] + 1        # a substitution
                    )
        return int(d[m, n])


Wiser Python Implementation

We need only the last row!

         ''   c
    ''    0   1
    c     1   0
    r     2   1
    i     3   ?
    t     4
    e     5
    o     6

    def levenshtein(s1, s2):
        "Calculates the Levenshtein distance between s1 and s2."
        n, m = len(s1), len(s2)
        current = list(range(n+1))  # row for the empty prefix of s2
        for i in range(1, m+1):
            previous, current = current, [i] + [0]*n
            for j in range(1, n+1):
                add, delete = previous[j] + 1, current[j-1] + 1
                change = previous[j-1]
                if s1[j-1] != s2[i-1]:
                    change = change + 1
                current[j] = min(add, delete, change)
        return current[n]


Fuzzy Search

Trie


Prefix tree

Dictionary: war, when, who

    .
    +-- w
        +-- a -- r
        +-- h
            +-- o
            +-- e -- n
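A prefix tree for this dictionary can be sketched with nested dictionaries. The end-of-word marker `END` and the helper names `insert` and `contains` are choices made for this illustration, not taken from the slides; the sketch also assumes no dictionary word contains the marker character.

```python
END = "$"  # assumed marker key: present on nodes that end a dictionary word


def insert(root, word):
    """Add word to the trie rooted at root (a dict mapping char -> child dict)."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True


def contains(root, word):
    """Exact lookup: follow one edge per character."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


trie = {}
for w in ["war", "when", "who"]:
    insert(trie, w)

print(sorted(trie))         # ['w']: all three words share the first edge
print(sorted(trie["w"]))    # ['a', 'h']: 'a' leads to war, 'h' to when and who
print(contains(trie, "who"), contains(trie, "wh"))  # True False
```

Shared prefixes are stored once, which is what lets an approximate search reuse work across all words that start the same way.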


Fuzzy Search

Search


Approximate Search

[Figure: the trie for war, when and who, with each node annotated by a row of edit distances to the query; the query string and the exact layout did not survive extraction.]


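Complexity: O(query length × number of nodes).

In practice, many nodes are pruned when the maximum edit distance is low: once every entry in a node's row exceeds the maximum, no word below that node can match.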
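The search and its pruning can be sketched as follows. Each trie node gets one Levenshtein row, computed from its parent's row, so the cost is one row of length (query length + 1) per visited node. This is a minimal sketch under assumptions: the nested-dict trie, the `END` marker, the helper names and the example query "whi" are all illustrative and not taken from the slides.

```python
END = "$"  # assumed end-of-word marker in a nested-dict trie


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word  # store the full word at its final node
    return root


def fuzzy_search(root, query, max_dist):
    """Return (word, distance) pairs with Levenshtein distance <= max_dist."""
    results = []
    first_row = list(range(len(query) + 1))  # distances from the empty prefix

    def visit(node, ch, prev_row):
        # One new Levenshtein row per trie node: prefix so far vs. query
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            insert_cost = row[j - 1] + 1
            delete_cost = prev_row[j] + 1
            change_cost = prev_row[j - 1] + (query[j - 1] != ch)
            row.append(min(insert_cost, delete_cost, change_cost))
        if END in node and row[-1] <= max_dist:
            results.append((node[END], row[-1]))
        # Prune: the minimum of a row never decreases further down the trie
        if min(row) <= max_dist:
            for next_ch, child in node.items():
                if next_ch != END:
                    visit(child, next_ch, row)

    for ch, child in root.items():
        if ch != END:
            visit(child, ch, first_row)
    return sorted(results)


trie = build_trie(["war", "when", "who"])
print(fuzzy_search(trie, "whi", 1))  # [('who', 1)]
print(fuzzy_search(trie, "whi", 2))  # [('war', 2), ('when', 2), ('who', 1)]
```

With a maximum distance of 1, the subtrees under "wa" and "whe" stop after one more node, which is the trimming effect described above.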


Optimization

Patricia Trie


Compressed version of a trie.

[Figure: a compressed trie with edge labels w, ar, h, o, en and ever.]

Complex implementation.
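The compression idea can be illustrated by merging every chain of single-child nodes of a plain trie into one edge labelled with a substring. This is only a sketch under assumptions: it uses a nested-dict trie with an assumed marker `END`, and it compresses an already built trie rather than inserting into a Patricia trie directly, which is where most of the implementation complexity lies. The word list is the three-word dictionary from the prefix tree slide.

```python
END = "$"  # assumed end-of-word marker


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def compress(node):
    """Return a copy of the trie where non-branching chains become single edges."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[END] = True
            continue
        # Extend the edge label while the child has exactly one outgoing edge
        # and does not itself end a word.
        while len(child) == 1 and END not in child:
            next_label, next_child = next(iter(child.items()))
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


compressed = compress(build_trie(["war", "when", "who"]))
print(compressed)
# {'w': {'ar': {'$': True}, 'h': {'en': {'$': True}, 'o': {'$': True}}}}
```

The compressed form has fewer nodes to visit, at the price of edges that carry strings instead of single characters.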


Optimization

Multiple Tries


Easily parallelisable.

[Figure: multiple tries; the drawing did not survive extraction.]
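One possible reading of this slide is that the dictionary is split across several independent tries and the same query is run against each of them at once. The sketch below assumes a round-robin split and uses `concurrent.futures.ThreadPoolExecutor` purely to show the structure; how the dictionary is actually partitioned is not stated in the slides, and in CPython a process pool would be needed for real CPU parallelism. All helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

END = "$"  # assumed end-of-word marker


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def search(root, query, max_dist):
    """Words in one trie within max_dist of query (one Levenshtein row per node)."""
    found = []
    stack = [(root, list(range(len(query) + 1)))]
    while stack:
        node, row = stack.pop()
        if END in node and row[-1] <= max_dist:
            found.append((node[END], row[-1]))
        if min(row) > max_dist:
            continue  # no word below this node can be close enough
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(row)):
                new_row.append(min(new_row[j - 1] + 1, row[j] + 1,
                                   row[j - 1] + (query[j - 1] != ch)))
            stack.append((child, new_row))
    return found


def parallel_search(tries, query, max_dist):
    """Run the same query on every trie and merge the results."""
    with ThreadPoolExecutor(max_workers=len(tries)) as pool:
        parts = pool.map(lambda t: search(t, query, max_dist), tries)
        return sorted(hit for part in parts for hit in part)


words = ["war", "when", "who"]
shards = 2  # assumed number of tries
tries = [build_trie(words[k::shards]) for k in range(shards)]
print(parallel_search(tries, "whi", 2))  # [('war', 2), ('when', 2), ('who', 1)]
```

Because the tries share no state, each search is independent and the results only need to be merged at the end.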

Conclusion

Spell checking is part of keyword normalization. Normalized keywords go to Hadoop and to user cookies.
