Human Language Technology: Applications to Information Access

Lesson 12: Synthesis & Conclusions December 22, 2016 EPFL Doctoral Course EE-724 Andrei Popescu-Belis, Idiap Research Institute

Reflections on HLT research
• The software lifecycle in general – why is HLT so particular?
• Classification of HLT by input/output
• Abstract model of HLT experiments
• Evaluation metrics
  – inter-annotator agreement: kappa
  – constraints on distance-based metrics
• Looking back at the HLT lessons: a conclusion
2

Software development lifecycle
(with reference to ISO standards)

Software quality: ISO definitions
• Quality = ability to satisfy users’ stated or implied needs
  – can be decomposed into six quality characteristics:
    • functionality • reliability • user-friendliness
    • efficiency • maintainability • portability
• Each quality characteristic can be decomposed into sub-characteristics, then attributes (which depend on the domain)
  – quality attributes can be measured by metrics
  – quality attributes can be internal or external
    • external attributes are measured by running the system
    • internal attributes are measured by looking inside the system
• Quality in use = quality measured in a given context of use
  – four characteristics, also decomposed into attributes:
    • effectiveness • productivity • safety • user-satisfaction
4

Software development lifecycle
[Diagram: user requirements (user needs, external quality requirements, internal quality requirements) flow into Design and then Implementation; each level has a matching evaluation stage: evaluation in use, evaluation of external qualities, evaluation of internal qualities.]
5

Why HLT software is special: verification vs. evaluation
• Traditional software engineering: specification → implementation → verification & validation
  – specifications must be precise, complete and verifiable
  – verification: does the result conform to the specifications?
• Language engineering (HLT)
  – the problems to be solved are difficult to specify formally
  – the required behavior (e.g. the output) is specified by reference to human linguistic and cognitive capacities
    • still, some aspects (e.g. format) can be formally specified
  – hence, we evaluate such systems rather than verify them
    • how well do they satisfy the non-formalizable specifications?
• Example of MT: we can verify that the output contains no foreign words, but we must evaluate whether it is a proper translation
6

ISO/IEC standards for software evaluation
• ISO/IEC 9126 on quality models (sets of quality characteristics)
  – first version: 1991
  – second series: 9126-1 (2001): quality models; 9126-2 to 9126-4 (2003): metrics (internal, external, in use)
• ISO/IEC 14598 on the evaluation process
  – 14598-1 (1999): overview; 14598-2 (2000): management
  – 14598-3 (2000): process for developers; 14598-4 (1999): for buyers; 14598-5 (1998): for evaluators
  – 14598-6 (2001): documenting the evaluation process
• SQuaRE (Software product Quality Requirements and Evaluation), five series
  – 2500n (n = 0, 1): overview of notions and processes
  – 25010: quality models
  – 2502n (n = 0–5): quality metrics and their documentation
  – 25030: specification of requirements
  – 2504n (n = 0–3): evaluation process (new version of 14598)
7

View of HLT research and development process
Abstract model of HLT experiments

A taxonomy of HLT research topics
LANGUAGE ANALYSIS: speech recognition, handwriting recognition, text alignment, sentence segmentation, word segmentation / tokenization, part-of-speech tagging, word sense disambiguation, word sense learning, collocation detection, term recognition / extraction, parsing, grammatical inference, semantic analysis, anaphora / coreference resolution, discourse parsing, subjectivity / intention recognition, authorship attribution, information retrieval, automatic indexing, information extraction, text mining
LANGUAGE GENERATION: text generation, report generation from databases, generation of multimedia presentations, speech synthesis
ANALYSIS AND GENERATION: machine translation, question answering, automatic summarization
INTERACTIVE (ANALYSIS + GENERATION): spoken/written dialogue systems, multimodal dialogue systems, computer-assisted language learning, writing assistants, database interfaces
9

Design process for non-interactive HLT
1. Define the objective (purpose) of the study
   – which linguistic phenomenon is targeted
2. Characterize human competence at it: first conceptually, then by creating reference data
   – measure inter-annotator agreement
   – adjudicate annotations to produce a gold standard
   – data to be used for training and testing
3. Specify the software that will simulate human competence
4. Specify the evaluation metrics
5. Obtain results, evaluate and analyze them
10

Research process for non-interactive HLT
[Diagram: the targeted phenomenon and annotation guidelines drive human annotation (“best recognition”), whose consistency is checked (≠?) via inter-annotator agreement; the adjudicated reference annotation is split into training and test data; features for recognition and training yield an automatic recognizer; the operational system’s output is compared (≈?) against the test reference to produce evaluation scores.]
11

Evaluation of inter-annotator agreement

The Kappa measure () • Measure agreement between annotators (or raters, coders) on classification tasks with nominal values • Origins of kappa – medicine, psychology, behavioral sciences: diagnoses • Scott’s pi (1955), Cohen’s alpha (1960), Fleiss’ kappa (1971) • Landis and Koch (1977) , Siegel and Castellan (1988)

– social sciences: content analysis • Krippendorff (1980, 2004)

• Application to HLT – text data annotation • Carletta (1996), discussion papers in Comp. Ling. > 2000 13

Motivation
• Measuring agreement by accuracy (% of instances on which the annotators agree) is not sufficient
  – it must be corrected for agreement by chance
  – e.g., if two annotators classify instances into N classes uniformly at random, they reach 1/N agreement on average just by chance
• Example: two annotators
  – classify meetings as constructive / destructive / neutral
  – observed frequencies are around 15% C, 15% D, 70% N
  – the two annotators agree on 70% of the samples… are we happy?
    • not really: if they answered randomly with the above frequencies, we would expect 53.5% agreement on average (as computed below)
14
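The 53.5% figure follows directly from the category frequencies: if both annotators choose independently with probabilities 0.15, 0.15 and 0.70, the chance of agreeing on any given sample is

  P(E) = 0.15^2 + 0.15^2 + 0.70^2 = 0.0225 + 0.0225 + 0.49 = 0.535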

Definitions
• Proportion of agreement above chance
  – P(A) = observed agreement (as a percentage of the total number of classified instances)
  – P(E) = agreement due to chance (“expected”)
• Accuracy corrected for chance:
  \kappa = \frac{P(A) - P(E)}{1 - P(E)}
  – maximum: κ = 1, perfect agreement
  – minimum: κ = −1, total disagreement
  – κ = 0: independence / no correlation
15

How do we estimate the probability of agreeing by chance?
• From a limited number of annotation samples
• Based on the proportions of each category used by each annotator
  – two main options → two versions of κ
    • specific proportions for each annotator (Cohen 1960)
    • same proportions for all annotators (all the others…)
16

Contingency table (confusion matrix)
• The a priori probability for Coder A to …
  … answer ‘Cat1’ is (a+c) / total
  … answer ‘Cat2’ is (b+d) / total
  (and similarly for Coder B)
• Hence
  P(E) = \frac{(a+c)(a+b) + (b+d)(c+d)}{(a+b+c+d)^2}
  and
  P(A) = \frac{a+d}{a+b+c+d}

                    Coder A
                 Cat1   Cat2
Coder B  Cat1     a      b      a+b
         Cat2     c      d      c+d
         total   a+c    b+d
17
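To make the computation concrete, here is a minimal Python sketch, with illustrative names of my choosing (not from the course material), that derives Cohen’s κ from a k×k contingency table laid out as above (rows = Coder B, columns = Coder A):

    # Minimal sketch: Cohen's kappa from a square contingency table.
    def cohen_kappa(table):
        total = sum(sum(row) for row in table)
        k = len(table)
        # P(A): proportion of items on the diagonal (both coders agree)
        p_a = sum(table[i][i] for i in range(k)) / total
        # P(E): sum over categories of the product of the two coders' marginals
        row_marg = [sum(row) / total for row in table]             # Coder B
        col_marg = [sum(table[i][j] for i in range(k)) / total     # Coder A
                    for j in range(k)]
        p_e = sum(r * c for r, c in zip(row_marg, col_marg))
        return (p_a - p_e) / (1 - p_e)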

A different approach: agreement matrix
• The a priori probability of each category is estimated from all the coders’ data:
  P(E) = \sum_{j=1}^{k} p_j^2
  where
  p_j = \frac{1}{N} \sum_{i=1}^{N} \frac{n_{ij}}{n}
  is the probability of each category
• The matrix gives the number of assignments n_ij of item i to category j (each row sums to n, the number of codings per item):

              Cat.1  Cat.2  …  Cat.j  …  Cat.k
  Item 1        0      0    …         …    n
  Item 2        0      3    …         …    1
  …
  Item i        0      0    …  n_ij   …    1
  …
  Item N        …      …    …   …     …    …
  “Totals”     p_1    p_2   …  p_j    …   p_k
18

P(A)
• The proportion of observed agreement P(A) is computed using P̄_i, the average proportion of agreement for each item i (over all k categories, with n annotators):
  \bar{P}_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)
• So
  P(A) = \frac{1}{N} \sum_{i=1}^{N} \bar{P}_i
  and again
  \kappa = \frac{P(A) - P(E)}{1 - P(E)}
19

Differences between the two versions
• Generally small, especially when κ is high
• Despite the apparently different formulae, P(A) is the same; there is only a small difference in P(E):
• First case:
  P(E) = \sum_{j=1}^{k} \left( \frac{1}{N} \sum_{i=1}^{N} \frac{n_{ij}}{n} \right)^2
• Second case:
  P(E) = \sum_{j=1}^{k} \frac{1}{N} \sum_{i=1}^{N} \frac{n_{ij}}{n} \cdot \frac{\sum_{i=1}^{N} n_{ij} - 1}{Nn - 1}
20

Example: annotating meetings with ‘Constructive’ / ‘Destructive’ / ‘Neutral’
Two annotators over 500 samples: ~16% C, 14% D, 70% N

Agreement matrix (number of codings, n = 2):
             C     D     N
  Mtng 1     0     2     0
  …
  Mtng i     1     0     1
  …
  Mtng N     0     0     2
  Totals   165   140   695

Contingency table (rows = Coder B, columns = Coder A):
            C     D     N   total
  C        50     5    25     80
  D        10    40    15     65
  N        25    30   300    355
  total    85    75   340    500

Results: P(A) = 0.78000
  – pooled proportions: P(E) = 0.52985, κ = 0.53206
  – annotator-specific proportions: P(E) = 0.52950, κ = 0.53241
21
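Assuming the two hypothetical helpers sketched earlier (cohen_kappa and fleiss_kappa), the slide’s figures can be reproduced as follows:

    # Contingency table from the slide (rows = Coder B, columns = Coder A): C, D, N
    table = [[50, 5, 25],
             [10, 40, 15],
             [25, 30, 300]]
    print(round(cohen_kappa(table), 5))    # 0.53241 (annotator-specific proportions)

    # Expand the same data into a 500 x 3 agreement matrix (n = 2 codings per item)
    items = []
    for b, row in enumerate(table):
        for a, count in enumerate(row):
            for _ in range(count):
                coding = [0, 0, 0]
                coding[b] += 1
                coding[a] += 1
                items.append(coding)
    print(round(fleiss_kappa(items), 5))   # 0.53206 (pooled proportions)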

Meaning of kappa values
• κ = 1 → identical annotations | κ = 0 → independence
• Subjective scales of agreement: Landis and Koch (1977)

Conclusion

I. The quantity barrier
• A great deal of knowledge and information is locked in text-based documents, which, thanks to the Internet, are becoming ever more numerous
• The dream: make this information as accessible as the knowledge stored in your own brain
• Lessons 1–4: text classification, IR, relevance feedback, learning to rank, recommender systems, just-in-time retrieval
32

II. The cross-lingual barrier
• Many languages are used on the Web, but users still prefer their mother tongue
• The dream: software that fully translates text
• Lessons 5–8: types of MT systems, language models, translation models, decoding, evaluation, Moses, BLEU
33

III. The subjectivity barrier
• IT supports human interactions (written or spoken, synchronous or not), but key information is “beyond words”, encoded at the discourse and interaction levels
• The dream: decode interaction patterns to infer new knowledge, including subjective opinions
• Lessons 9–12: sentiment analysis, dialogue acts, meeting browsers, task-based evaluation through QA
34

The future looks interesting
• Quantity → finding without searching
• Cross-lingual → a language-transparent Web
• Subjectivity → social media and networks
With or without deep neural networks? With AI!
35