Using top n Recognition Candidates to Categorize On-line Handwritten Documents
Sebastián Peña Saldarriaga (†), Emmanuel Morin (†) and Christian Viard-Gaudin (‡)
{sebastian.pena-saldarriaga, emmanuel.morin, christian.viard-gaudin}@univ-nantes.fr
(†) LINA Université de Nantes, (‡) IRCCyN École Polytechnique de l'Université de Nantes
1 Introduction
Archival & retrieval of on-line handwriting
Particular interest for text categorization (TC)
TC attempts to derive information from text
Recognition is a necessary effort
[Figure: noisy text categorization. Labels: top 1 recognition, recognition errors, SVM]
2 Intuitive Idea
Use top n (n > 1) recognition candidates
Greater probability of having the correct word

n                   1        5        10       15
Word level rec.     22.08%   15.73%   13.41%   12.04%
Char. level rec.    52.48%   37.28%   33.09%   31.02%
Recognition rates for different n

However, the text is flooded with false occurrences of words. We then redefine the standard tf × idf weighting scheme.

3 Weighting top n candidates
The weight w of a term i is given by
\[ w_i = \frac{ctf(i) \times idf(i)}{\sqrt{\sum_{j=1}^{M} \left( ctf(j) \times idf(j) \right)^2}} \]
& the candidate-term frequency (ctf) by
\[ ctf(i) = \sum_{n=1}^{N} p_n(i) \]
The sum of the probabilities of i in the N candidate lists in which i occurs

4 Experiments
Comparing kNN & SVM
With word & char. level recognition
With 1 or n recognition candidates
On-line handwritten data
Reuters-21578 corpus
2,000 samples
10 categories

5 Results
[Figure: accuracy (axis ticks at 80%, 85%, 90%) versus n (1 to 10) for svm/word, knn/word, svm/char and knn/char]
Char. level rec.: accuracy improves as n increases
Word level rec.: accuracy decreases ∀ n > 1, with both algorithms

6 Conclusion
What has been accomplished?
Redefinition of the tf × idf weighting
Based on probabilities of recognition candidates
The good
A simple idea that yields interesting results on heavily degraded texts with two different classifiers.
The bad
However, it is ineffective with word level recognition. Needs further experimental validation.
What's next?
Thresholding/rejection strategies on candidates
Track & type recognition errors

Acknowledgments
This work is funded by La Région Pays de la Loire under the MILES Project and by The French National Research Agency grant number ANR-06-TLOG-009.
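As an illustration of the ctf × idf weighting of Section 3, the following is a minimal Python sketch, not the authors' implementation. It assumes each handwritten word yields a list of (candidate, probability) pairs, that M is the number of distinct candidate terms in a document, and that idf is the usual log(D / df) with df counting the documents whose candidate lists contain the term. The function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def candidate_term_frequency(candidate_lists):
    """ctf(i): sum of the probabilities of term i over the candidate
    lists in which it occurs (one list per handwritten word)."""
    totals = defaultdict(float)
    for candidates in candidate_lists:
        for term, prob in candidates:
            totals[term] += prob
    return dict(totals)


def inverse_document_frequency(ctf_per_document):
    """Assumed idf: log(D / df(i)), where df(i) is the number of
    documents whose candidate lists contain term i."""
    n_docs = len(ctf_per_document)
    df = defaultdict(int)
    for doc_ctf in ctf_per_document:
        for term in doc_ctf:
            df[term] += 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def term_weights(doc_ctf, idf):
    """w_i = ctf(i) * idf(i), divided by the Euclidean norm of these
    products over all terms of the document."""
    raw = {term: freq * idf[term] for term, freq in doc_ctf.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: value / norm for term, value in raw.items()}


# Hypothetical toy data: two documents, top 2 candidates per word.
doc_a = [
    [("oil", 0.7), ("ail", 0.3)],
    [("price", 0.6), ("prize", 0.4)],
    [("oil", 0.5), ("old", 0.5)],
]
doc_b = [
    [("wheat", 0.8), ("what", 0.2)],
    [("price", 0.9), ("pride", 0.1)],
]

ctf_a = candidate_term_frequency(doc_a)
ctf_b = candidate_term_frequency(doc_b)
idf = inverse_document_frequency([ctf_a, ctf_b])
w_a = term_weights(ctf_a, idf)

for term, weight in sorted(w_a.items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy example "oil" appears in two candidate lists of the first document, so its ctf is the sum of both probabilities, while "price" occurs in both documents and therefore receives a zero weight under the assumed idf. False candidates such as "ail" still enter the vector, but with a weight scaled by their recognition probability.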