Using top n Recognition Candidates to Categorize On-line Handwritten Documents

Sebastián Peña Saldarriaga (†), Emmanuel Morin (†) and Christian Viard-Gaudin (‡)
{sebastian.pena-saldarriaga, emmanuel.morin, christian.viard-gaudin}@univ-nantes.fr
(†) LINA, Université de Nantes   (‡) IRCCyN, École Polytechnique de l'Université de Nantes

1 Introduction

Archival & retrieval of on-line handwriting, with a particular interest for text categorization (TC). TC attempts to derive information from text, so recognition is a necessary effort.

Noisy text categorization pipeline: handwriting → top 1 recognition → Vector Space Model → standard methods (SVM, kNN). Recognition errors → information loss with top 1 recognition.
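
As an illustration of this baseline, a minimal sketch assuming scikit-learn is available; the recognized texts and category labels are invented placeholders for the top 1 recognition output:

```python
# Baseline: categorize noisy top-1 recognition output with a Vector Space Model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

recognized_texts = ["interest rates rise again", "crude oil exports slow down"]  # invented examples
labels = ["interest", "crude"]                                                   # invented categories

vectorizer = TfidfVectorizer()                        # Vector Space Model with tf x idf weights
X = vectorizer.fit_transform(recognized_texts)

svm = LinearSVC().fit(X, labels)                      # standard classifiers
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)

query = vectorizer.transform(["oil exports rise"])
print(svm.predict(query), knn.predict(query))
```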

2 Intuitive Idea

Use the top n (n > 1) recognition candidates: greater probability of having the correct word as n increases.

    n                   1        5        10       15
    Word level rec.     22.08%   15.73%   13.41%   12.04%
    Char. level rec.    52.48%   37.28%   33.09%   31.02%
    Recognition rates for different n

However, the text is flooded with false occurrences of words. We then redefine the standard tf × idf weighting scheme.
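
A small invented example of this flooding effect: keeping the top 3 candidates for each word position makes the correct word more likely to be present, but also injects many false occurrences into the text (the candidate lists below are purely illustrative):

```python
# Hypothetical top-3 candidate lists (word, probability) for three handwritten words.
candidate_lists = [
    [("interest", 0.61), ("interests", 0.25), ("invest", 0.14)],
    [("rates", 0.48), ("rats", 0.30), ("ratio", 0.22)],
    [("rise", 0.55), ("rose", 0.27), ("vise", 0.18)],
]

top1_words = [cands[0][0] for cands in candidate_lists]
topn_words = [word for cands in candidate_lists for word, _ in cands]

print(top1_words)  # ['interest', 'rates', 'rise']
print(topn_words)  # 9 words in total, 6 of them false occurrences
```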

3 Weighting top n candidates

Redefinition of the tf × idf weighting, based on the probabilities of the recognition candidates.

We define the candidate-term frequency (ctf) of a term i as the sum of the probabilities of i over the N candidate lists in which i occurs:

ctf(i) = \sum_{n=1}^{N} p_n(i)

The weight w_i of a term i is then given by

w_i = \frac{ctf(i) \times idf(i)}{\sqrt{\sum_{j=1}^{M} \left( ctf(j) \times idf(j) \right)^2}}
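
A minimal sketch of this weighting for one document, assuming the candidate lists are given as (word, probability) pairs and that the idf values have been precomputed; the function and variable names are illustrative:

```python
import math
from collections import defaultdict

def ctf_idf_weights(candidate_lists, idf):
    """Normalized ctf x idf weights for one document.

    candidate_lists: one top-n candidate list per word position,
                     each a list of (word, probability) pairs.
    idf: dict mapping a word to its precomputed inverse document frequency.
    """
    # ctf(i): sum of the probabilities of term i over the candidate lists in which it occurs
    ctf = defaultdict(float)
    for cands in candidate_lists:
        for word, prob in cands:
            ctf[word] += prob

    # w_i = ctf(i) * idf(i) / sqrt(sum_j (ctf(j) * idf(j))^2)
    raw = {word: ctf[word] * idf.get(word, 0.0) for word in ctf}
    norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
    return {word: v / norm for word, v in raw.items()}

# Example with invented candidate lists and made-up idf values
idf = {"interest": 1.2, "interests": 2.0, "invest": 1.8,
       "rates": 1.1, "rats": 2.5, "ratio": 1.9}
lists = [[("interest", 0.61), ("interests", 0.25), ("invest", 0.14)],
         [("rates", 0.48), ("rats", 0.30), ("ratio", 0.22)]]
print(ctf_idf_weights(lists, idf))
```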

4 Data & Experiments

Data: on-line handwritten data, Reuters-21578 corpus, 2,000 samples, 10 categories.

Experiments: comparing kNN & SVM, with word & char. level recognition, with 1 or n recognition candidates.

5 Results

[Figure: classification rate (80% to 90%) as a function of n (1 to 15), for svm/word, kppv/word, svm/char and kppv/char]

The good: with char. level recognition, accuracy improves ∀ n > 1, with both algorithms.
The bad: with word level recognition, accuracy decreases.

6 Conclusion

What has been accomplished? A simple idea that yields interesting results on heavily degraded texts with two different classifiers. However, it is ineffective with word level recognition and needs further experimental validation.

What's next? Thresholding/rejection strategies on candidates. Track & type recognition errors.

Acknowledgments

This work is funded by La Région Pays de la Loire under the MILES Project and by The French National Research Agency, grant number ANR-06-TLOG-009.