in 11th Conference of the International Graphonomics Society, IGS'2003, Scottsdale, Arizona, pp. 274-277, 2003.

Grapheme Based Writer Verification

A. BENSEFIA, T. PAQUET, L. HEUTTE
Laboratoire PSI – FRE CNRS 2645, Université de Rouen
F-76821 MONT-SAINT-AIGNAN Cedex, France
[email protected]

Abstract: In this communication we propose an approach to the writer verification task. The difficulty of this task lies, first, in deciding between the two hypotheses which model the problem: "Do the two writings come from the same writer?" or "Do the two writings come from different writers?" and, second, in evaluating the error risk associated with this decision. We use a mutual information based hypothesis test to evaluate these errors, taking into account the distributions of the two hypotheses (within-writer distribution and between-writer distribution) over a sample set of 88 writers. The features used for this writer verification problem are graphemes produced by a segmentation module. This verification process is intended to confirm or reject the writer candidates proposed by a writer identification system.

Keywords: Writer verification, hypothesis test, mutual information, graphemes.

1. Introduction

Writer verification is the task of assessing the authorship of a handwritten sample. Most of the time this task is carried out by human experts and is prone to many subjective decisions. Recent studies have investigated the possibility of building a computer aided, unbiased analysis of handwritten samples in order to provide scientific evidence for the writer verification task, Cha et al. (2000).

Writer verification has been carried out using different experimental processes in the literature. On the one hand, many studies have investigated the writer identification task, e.g. Said et al. (2000), Marti et al. (2001). It can be defined as the problem of assigning to an unknown handwritten sample its most likely writer among a known finite set of possible writers. Most of the studies on this subject are based on the detection of specific features (either structural features or features from texture analysis, extracted from large text blocks) and the use of a particular metric to provide an ordered list of possible writers. These approaches are well suited to querying handwritten documents in large databases and provide very helpful tools in this case. However, they are not dedicated to the precise assessment of the authorship of a handwritten sample, i.e. the writer verification task. Writer verification, on the other hand, has not received as much attention as writer identification. This is mainly because writer verification involves a precise local decision process which is generally text dependent. Indeed, writer verification generally comes down to comparing the shapes of similar characters in the documents under study. Therefore, the full automation of this process appears unrealistic as it would imply perfect recognition of the handwritten samples. Nevertheless, the writer verification task can bring significant help to the expert, provided that the local shapes under assessment can be labelled by the user. Performance evaluation of the writer verification task therefore requires a dedicated database, annotated both at the word and at the character level, which is tedious to build.

In this communication, we are interested in providing automatic means of assessing the authorship of a handwritten sample following a first stage of writer identification in a database. In other words, we want to validate the result of the identification process using a writer verification approach on text blocks. Our writer identification process operates on text blocks and uses graphemes as local features, together with a fast information retrieval technique, to provide a list of possible writers, Bensefia et al. (2002)(2003). In this communication we show that a mutual information based criterion, using graphemes as features, provides a suitable statistical test for assessing the hypothesis that two unknown documents were written by the same writer.

2. Mutual Information Criterion

We assume that two input handwritten documents D1 and D2 have been written by writer S1 and writer S2 respectively. S denotes the set of these two writers: S = {S1, S2}. We also assume that a pre-processing stage, including line localisation and connected component segmentation, has allowed the set of graphemes in these two documents to be extracted (cf. figure 1). Moreover, a grapheme feature set G is extracted thanks to a particular sequential clustering technique: G = {g1, g2, g3, ..., gN}. Some of these features may occur in both documents while others may occur in one single document only.

Figure 1. Graphemes extracted from samples of the handwritten word "man"
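As an illustration of how the feature set G could be built, the following sketch implements a simple sequential (leader-style) clustering of grapheme descriptors. The descriptor format, the Euclidean distance and the threshold value are assumptions made for this example only; the paper states only that a particular sequential clustering technique is used.

```python
# Hedged sketch of a sequential (leader-style) clustering of grapheme feature
# vectors to build the feature set G = {g1, ..., gN}. Distance measure and
# threshold are illustrative assumptions, not the authors' actual choices.
import math

def sequential_clustering(descriptors, threshold=0.5):
    """Assign each grapheme descriptor to the first cluster whose
    representative lies within `threshold`, or open a new cluster."""
    representatives = []   # one representative descriptor per cluster g_i
    labels = []
    for x in descriptors:
        best, best_dist = None, float("inf")
        for i, r in enumerate(representatives):
            d = math.dist(x, r)
            if d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= threshold:
            labels.append(best)
        else:
            representatives.append(x)
            labels.append(len(representatives) - 1)
    return labels, representatives

# Example usage with 2-D descriptors (purely illustrative):
labels, reps = sequential_clustering([(0.1, 0.2), (0.12, 0.22), (0.9, 0.8)], threshold=0.3)
print(labels)  # [0, 0, 1]: two clusters found
```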

Mutual information will serve as a measure of the statistical dependence between S and G. Low values of mutual information imply independence between the two random variables, while higher values imply a dependence between them. In the case of independence between G and S, we expect the feature set to be evenly distributed over the two writers, which would imply the identity of S1 and S2. On the contrary, we expect this criterion to detect two different writers, who would be responsible for a significant dependence between S and G. Mutual information between G and S is defined by the following expression:

IM(G,S) = H(G) - H(G|S)    [1]

where H(G) is the Shannon entropy (Shannon, 1948), defined by:

H(G) = -\sum_{i=1}^{card(G)} P(g_i) \log_2 P(g_i)    [2]

and H(G|S) is the conditional entropy, defined by:

H(G|S) = \sum_{j=1}^{card(S)} P(S_j) H(G|S=S_j) = -\sum_{i=1}^{card(G)} \sum_{j=1}^{card(S)} P(S_j) P(g_i|S_j) \log_2 P(g_i|S_j)    [3]
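To make the criterion concrete, the following minimal sketch (not the authors' implementation) shows how IM(G,S) can be estimated from the grapheme cluster labels observed in the two documents; the function name and input format are illustrative assumptions.

```python
# Minimal sketch: estimating IM(G,S) = H(G) - H(G|S) from the grapheme
# cluster labels of two documents, one per writer hypothesis S1 and S2.
import math
from collections import Counter

def mutual_information(graphemes_d1, graphemes_d2):
    """Estimate IM(G,S) from grapheme occurrence counts in D1 and D2."""
    counts = [Counter(graphemes_d1), Counter(graphemes_d2)]
    n_total = sum(sum(c.values()) for c in counts)
    feature_set = set(counts[0]) | set(counts[1])

    # H(G): entropy of the pooled grapheme distribution, equation [2]
    h_g = 0.0
    for g in feature_set:
        p_g = (counts[0][g] + counts[1][g]) / n_total
        h_g -= p_g * math.log2(p_g)

    # H(G|S): conditional entropy, equation [3], weighting each writer by P(S_j)
    h_g_s = 0.0
    for c in counts:
        n_j = sum(c.values())
        p_sj = n_j / n_total
        for g, n_gj in c.items():
            p_g_sj = n_gj / n_j
            h_g_s -= p_sj * p_g_sj * math.log2(p_g_sj)

    return h_g - h_g_s

# Example: grapheme cluster labels extracted from the two documents
d1 = ["g1", "g2", "g1", "g3", "g2", "g1"]
d2 = ["g1", "g2", "g2", "g4", "g2", "g1"]
print(mutual_information(d1, d2))  # low value favours the same-writer hypothesis
```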

To assess the interest of this criterion, an experiment was carried out on a sample of 88 different writers who were asked to copy a text. Each sample was split into two documents (first and second halves). Figure 2 gives the distribution of the mutual information criterion in the two following cases: figure 2.a gives the distribution of the criterion in the case where the two writers are identical, while figure 2.b gives the distribution in the case where the two writers are different. From the observation of these two distributions, it seems clear that mutual information can provide a quantitative criterion for the writer verification task.

Figure 2. Within-writer distribution (a) and between-writer distribution (b) of the mutual information criterion

3. Hypothesis testing

Using this mutual information criterion, we are now interested in assessing which of the two possible hypotheses is the correct one:

H0: S1 = S2 and H1: S1 ≠ S2    [4]

This can be accomplished using classical hypothesis testing, Saporta (1990). H0 will serve as the null hypothesis, or default hypothesis. Each of the two possible decisions is associated with a probability of correct decision and a probability of false decision, or error probability. The probability of error on the null hypothesis is the first order error, denoted α, while the probability of error on H1 is the second order error, denoted β. Table 1 summarises the possible situations.

                        Truth
Decision        H0 is true                  H0 is false
accept H0       Correct decision (1-α)      Second order error (β)
accept H1       First order error (α)       Power of the test (1-β)

Table 1. The two possible decisions and their associated probabilities

3.1. Statistical test

The first and second order errors must be evaluated in order to decide between the two possible hypotheses. The mutual information criterion statistic will be used for this purpose. Assuming a normal distribution of the mutual information criterion under each of the two hypotheses allows the two kinds of error to be determined (cf. figure 3).

Figure 3. Normal distribution models for H0 (a) and H1 (b)

3.2. Rejection and acceptance regions

The distribution of the test statistic is divided into two regions: the acceptance region and the rejection region. The rejection region of H0, denoted W0, is defined according to the first order error α, Chen (2003). The limit of this region also defines the rejection region of H1, denoted W1, which in turn defines the second order error β. This gives the following quantities:

P(W0 | H0) = α and P(W1 | H1) = β    [5]

In a similar manner, the acceptance region of H0 is the complement of W0, denoted W̄0, while the acceptance region of H1 is the complement of W1, denoted W̄1 (cf. figure 4). We have:

P(W̄0 | H0) = 1 - α and P(W̄1 | H1) = 1 - β    [6]

Figure 4. Rejection regions

3.3. Decision rules

We are now able to determine, for a given first order error, the power of the mutual information statistical test defined above. The experimental distribution of the criterion leads to the following errors:
α = 0.02 yields 1 - β = 0.9744
α = 0.05 yields 1 - β = 0.9641
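The following sketch illustrates how such a decision rule can be derived once normal models have been fitted to the within-writer and between-writer distributions of the criterion (cf. figure 3). The means and standard deviations used below are illustrative placeholders, not the values estimated on the 88-writer sample set.

```python
# Minimal sketch of the decision rule of section 3.3, assuming the mutual
# information criterion is normally distributed under each hypothesis.
# The distribution parameters are ILLUSTRATIVE assumptions only.
from statistics import NormalDist

h0 = NormalDist(mu=0.10, sigma=0.03)   # within-writer (H0) model, assumed
h1 = NormalDist(mu=0.30, sigma=0.05)   # between-writer (H1) model, assumed

def power_of_test(alpha):
    """Return (threshold, 1 - beta) for a given first order error alpha."""
    # Reject H0 when the criterion exceeds the threshold: high mutual
    # information indicates dependence between G and S, i.e. two writers.
    threshold = h0.inv_cdf(1.0 - alpha)
    beta = h1.cdf(threshold)           # second order error: H1 true but H0 accepted
    return threshold, 1.0 - beta

for alpha in (0.02, 0.05):
    t, power = power_of_test(alpha)
    print(f"alpha={alpha:.2f}  threshold={t:.3f}  power={power:.4f}")
```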

4. Conclusions

We have proposed a method for assessing whether or not two handwritten samples were written by the same writer. Using hypothesis testing, we have shown that the mutual information criterion provides very high confidence values both for acceptance of the null hypothesis and for acceptance of the alternative hypothesis. This criterion can serve as a rejection criterion for our writer identification process, thus enabling the system to reject handwritten samples from writers unknown to the system.

5. Bibliography

Bensefia, A., Nosary, A., Paquet, T., & Heutte, L. (2002). Writer Identification by Writer's Invariants. Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, IWFHR'02, pp. 274-279.
Bensefia, A., Paquet, T., & Heutte, L. (2003). Writer Identification Using Retrieval Information Paradigm. Proceedings of the 7th International Conference on Document Analysis and Recognition, ICDAR'03.
Cha, S.H., & Srihari, S. (2000). Multiple Feature Integration for Writer Verification. Proceedings of the 7th International Workshop on Frontiers in Handwriting Recognition, IWFHR VII, pp. 333-342.
Chen, K. (2003). Towards better making decision in speaker verification. Pattern Recognition, 36, 329-346.
Marti, U.V., Messerli, R., & Bunke, H. (2001). Writer Identification Using Text Line Based Features. Proceedings of the 6th International Conference on Document Analysis and Recognition, ICDAR'01, pp. 101-105.
Said, H.E.S., Tan, T.N., & Baker, K.D. (2000). Personal Identification Based on Handwriting. Pattern Recognition, 33, 149-160.
Saporta, G. (1990). Probabilités, analyse des données et statistique. Éditions Technip, pp. 317-330.
Shannon, C. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27, 379-423.