Human Language Technology: Applications to Information Access
Lesson 12: Synthesis & Conclusions
December 22, 2016
EPFL Doctoral Course EE-724
Andrei Popescu-Belis, Idiap Research Institute
Reflections on HLT research
• The software lifecycle in general – why is HLT so particular?
• Classification of HLT on input/output
• Abstract model of HLT experiments
• Evaluation metrics
  – inter-annotator agreement: kappa
  – constraints on distance-based metrics
• Looking back at the HLT lessons: a conclusion
Software development lifecycle
With reference to ISO standards
Software quality: ISO definitions
• Quality = ability to satisfy users' stated or implied needs
  – can be decomposed into six quality characteristics:
    • functionality • reliability • usability (user-friendliness)
    • efficiency • maintainability • portability
• Each quality characteristic can be decomposed into sub-characteristics, then into attributes (which depend on the domain)
  – quality attributes can be measured by metrics
  – quality attributes can be internal or external
    • external attributes are measured by running the system
    • internal attributes are measured by inspecting the system itself
• Quality in use = quality measured in a given context of use
  – four characteristics, also decomposed into attributes:
    • effectiveness • productivity • safety • user satisfaction
Software development lifecycle (diagram)
• User requirements: user needs → external quality requirements → internal quality requirements
• Design → Implementation
• Evaluation: evaluation in use | evaluation of external qualities | evaluation of internal qualities
Why HLT software is special: verification vs. evaluation
• Traditional software engineering: specification → implementation → verification & validation
  – specifications must be precise, complete, and verifiable
  – verification: does the result conform to the specifications?
• Language engineering (HLT)
  – the problems to be solved are difficult to specify formally
  – the required behavior (e.g. the output) is specified by reference to human linguistic and cognitive capacities
    • still, some aspects (e.g. format) can be formally specified
  – hence, we evaluate such systems rather than verify them
    • i.e., how well they satisfy the non-formalizable specifications
• Example of MT: we can verify that the output contains no foreign words, but we must evaluate whether the output is a proper translation
ISO/IEC standards for software evaluation
• ISO/IEC 9126 on quality models (sets of quality characteristics)
  – first version: 1991
  – second series: 9126-1 (2001): quality models; 9126-2 to 9126-4 (2003): metrics (internal, external, in use)
• ISO/IEC 14598 on the evaluation process
  – 14598-1 (1999): overview; 14598-2 (2000): planning and management
  – 14598-3 (2000): process for developers; 14598-4 (1999): process for acquirers; 14598-5 (1998): process for evaluators
  – 14598-6 (2001): documenting the evaluation process
• SQuaRE, Software product Quality Requirements and Evaluation (ISO/IEC 25000 series), five divisions:
  – 2500n (n = 0, 1): overview of notions and processes
  – 25010: quality models
  – 2502n (n = 0–5): quality metrics and their documentation
  – 25030: specification of quality requirements
  – 2504n (n = 0–3): evaluation process (new version of 14598)
View of HLT research and development process
Abstract model of HLT experiments
A taxonomy of HLT research topics
• LANGUAGE ANALYSIS: speech recognition, handwriting recognition, text alignment, sentence segmentation, word segmentation / tokenization, part-of-speech tagging, word sense disambiguation, word sense learning, collocation detection, term recognition / extraction, parsing, grammatical inference, semantic analysis, anaphora / coreference resolution, discourse parsing, subjectivity / intention recognition, authorship attribution, information retrieval, automatic indexing, information extraction, text mining
• LANGUAGE GENERATION: text generation, report generation from databases, generation of multimedia presentations, speech synthesis
• ANALYSIS AND GENERATION: machine translation, question answering, automatic summarization
• INTERACTIVE (ANALYSIS + GENERATION): spoken/written dialogue systems, multimodal dialogue systems, computer-assisted language learning, writing assistants, database interfaces
Design process for non-interactive HLT
1. Define the objective (purpose) of the study
   – which linguistic phenomenon is targeted
2. Characterize human competence at the task: first conceptually, then by creating reference data
   – measure inter-annotator agreement
   – adjudicate annotations to produce a gold standard
   – split the data into training and test sets
3. Specify the software that will simulate human competence
4. Specify the evaluation metrics
5. Obtain results, then evaluate and analyze them
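Steps 4–5 above can be sketched in a few lines: given a gold-standard test set and the system's outputs, compute a simple accuracy score. This is a minimal illustration in Python; the labels and data are hypothetical.

```python
# Minimal sketch of steps 4-5: evaluate system output against a gold standard.
# The labels below are hypothetical, for illustration only.

def accuracy(gold, predicted):
    """Proportion of test instances on which the system matches the gold standard."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

gold      = ["N", "C", "N", "D", "N", "C", "N", "N"]
predicted = ["N", "C", "N", "N", "N", "D", "N", "N"]

score = accuracy(gold, predicted)
print(f"accuracy = {score:.3f}")  # 6 of 8 instances match: 0.750
```

In practice the metric is task-specific (see the following slides), but the pattern — compare system output to a human-produced reference — is the same.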
Research process for non-interactive HLT (diagram)
• The targeted phenomenon is captured in annotation guidelines, which drive human annotation (the "best recognition" of the phenomenon); inter-annotator agreement is checked (≠?) before adjudication into a reference annotation.
• The reference annotation is split into training and test data; features for recognition feed the design of an automatic recognizer, which is trained to produce the operational system.
• The operational system's output on the test data is compared (≈?) with the reference to produce evaluation scores.
Evaluation of inter-annotator agreement
The Kappa measure (κ)
• Measures agreement between annotators (also called raters or coders) on classification tasks with nominal values
• Origins of kappa
  – medicine, psychology, behavioral sciences: agreement on diagnoses
    • Scott's pi (1955), Cohen's kappa (1960), Fleiss' kappa (1971)
    • Landis and Koch (1977), Siegel and Castellan (1988)
  – social sciences: content analysis
    • Krippendorff (1980, 2004)
• Application to HLT – text data annotation
  – Carletta (1996), discussion papers in Computational Linguistics after 2000
Motivation
• Measuring agreement by accuracy (the percentage of instances on which annotators agree) is not sufficient
  – it must be corrected for agreement by chance
  – e.g., if two annotators classify instances into N equiprobable classes at random, they reach 1/N agreement by chance
• Example: two annotators
  – classify meetings as constructive / destructive / neutral
  – observed frequencies are around 15% C, 15% D, 70% N
  – the two annotators agree on 70% of the samples… are we happy?
    • not really: if they answered randomly with the above frequencies, we would expect 53.5% agreement on average
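The 53.5% figure can be checked directly: if both annotators draw labels independently with the same category frequencies, the probability of agreeing on one instance is the sum of the squared frequencies. A quick check in Python:

```python
# Chance agreement for two annotators who label independently
# with the same category frequencies (15% C, 15% D, 70% N).
freqs = {"C": 0.15, "D": 0.15, "N": 0.70}

# They agree on a random instance iff both happen to pick the same category:
# P(agree by chance) = sum over categories of p_cat * p_cat.
chance_agreement = sum(p * p for p in freqs.values())
print(f"expected chance agreement = {chance_agreement:.1%}")  # 53.5%
```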
Definitions
• Proportion of agreement above chance
  – P(A) = observed agreement (as a proportion of the total number of classified instances)
  – P(E) = agreement expected by chance
• Accuracy corrected for chance:
    κ = (P(A) − P(E)) / (1 − P(E))
  – maximum: κ = 1, perfect agreement
  – minimum: κ = −1, total disagreement
  – κ = 0, independence / no correlation
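The correction is a one-line function; plugging in the motivation example (70% observed agreement, 53.5% chance agreement) shows how a seemingly high raw agreement shrinks to a moderate κ. A minimal sketch in Python:

```python
def kappa(p_a, p_e):
    """Accuracy corrected for chance: kappa = (P(A) - P(E)) / (1 - P(E))."""
    return (p_a - p_e) / (1.0 - p_e)

# Motivation example: 70% observed agreement, 53.5% expected by chance.
k = kappa(0.70, 0.535)
print(f"kappa = {k:.3f}")  # ~0.355: far less impressive than the raw 70%
```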
How do we estimate the probability of agreeing by chance?
• From a limited number of annotation samples
• Based on the proportions of each category used by each annotator
  – two main options → two versions of κ
    • specific proportions for each annotator (Cohen 1960)
    • same proportions for all annotators (all the others…)
Contingency table (confusion matrix)

                    Coder A
                    Cat1    Cat2
  Coder B   Cat1     a       b      a+b
            Cat2     c       d      c+d
                    a+c     b+d    total

• The a priori probability for Coder A to…
  … answer 'Cat1' is (a+c) / total
  … answer 'Cat2' is (b+d) / total
  (and similarly for Coder B)
• Hence
    P(A) = (a + d) / (a + b + c + d)
  and
    P(E) = (a+c)(a+b) / (a+b+c+d)² + (b+d)(c+d) / (a+b+c+d)²
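The formulas above generalize to any number of categories: P(A) sums the diagonal of the table, and P(E) sums the products of the two coders' marginal proportions. A sketch in Python (the 2×2 counts are hypothetical, chosen only to exercise the formula):

```python
def cohen_kappa(table):
    """Cohen's kappa from a square contingency table.
    Rows = Coder B's categories, columns = Coder A's categories."""
    total = sum(sum(row) for row in table)
    # P(A): proportion of instances on the diagonal (both coders agree).
    p_a = sum(table[i][i] for i in range(len(table))) / total
    # Marginals: row sums for Coder B, column sums for Coder A.
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # P(E): chance agreement from each coder's own category proportions.
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (p_a - p_e) / (1.0 - p_e)

# Hypothetical 2x2 table: a=20, b=5, c=10, d=15 (total = 50).
table = [[20, 5],
         [10, 15]]
print(cohen_kappa(table))  # P(A)=0.7, P(E)=0.5, kappa=0.4
```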
A different approach: agreement matrix
• The a priori probability of each category is estimated from all n coders' data as
    P(E) = Σ_{j=1..k} p_j²
  where
    p_j = (1/N) Σ_{i=1..N} (n_ij / n)
  is the probability of category j
  (n_ij = number of assignments of item i to category j; each item receives n assignments in total)

Agreement matrix (number of assignments, total = n per item):

             Cat.1   Cat.2   …   Cat.j   …   Cat.k
  Item 1       0       0     …     …     …     n
  Item 2       0       3     …     …     …     1
  …
  Item i       0       0     …    n_ij   …     1
  …
  Item N       …       …     …     …     …     …
  "Totals"    p_1     p_2    …    p_j    …    p_k
P(A)
• The proportion of observed agreement P(A) is computed using π_i, the average proportion of pairwise agreement on each item i (over all k categories, with n annotators):
    π_i = (1 / (n(n−1))) Σ_{j=1..k} n_ij (n_ij − 1)
• So
    P(A) = (1/N) Σ_{i=1..N} π_i
  and again
    κ = (P(A) − P(E)) / (1 − P(E))
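The two formulas combine into a short function over the agreement matrix, in the style of Fleiss' kappa. A sketch in Python; the toy matrix is hypothetical (N = 4 items, k = 3 categories, n = 2 coders per item):

```python
def fleiss_kappa(matrix):
    """Kappa from an agreement matrix: matrix[i][j] = number of coders
    who assigned item i to category j (each row sums to n)."""
    n_items = len(matrix)           # N
    n_coders = sum(matrix[0])       # n (assignments per item)
    # p_j: pooled proportion of all assignments going to category j.
    p = [sum(row[j] for row in matrix) / (n_items * n_coders)
         for j in range(len(matrix[0]))]
    p_e = sum(pj ** 2 for pj in p)
    # pi_i: proportion of agreeing coder pairs on item i.
    pi = [sum(nij * (nij - 1) for nij in row) / (n_coders * (n_coders - 1))
          for row in matrix]
    p_a = sum(pi) / n_items
    return (p_a - p_e) / (1.0 - p_e)

# Hypothetical agreement matrix: 4 items, 3 categories, n = 2 coders.
matrix = [[2, 0, 0],
          [0, 2, 0],
          [1, 1, 0],
          [0, 0, 2]]
print(fleiss_kappa(matrix))  # P(A)=0.75, P(E)=0.34375, kappa=13/21~0.619
```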
Differences between the two versions
• Generally small, especially when κ is high
• Despite the apparently different formulae, P(A) is the same; the small difference lies in P(E):
  – First case (specific proportions for each annotator; here for two coders A and B):
      P(E) = Σ_{j=1..k} p_{A,j} · p_{B,j}
    where p_{A,j} and p_{B,j} are the proportions of category j in each coder's own annotations
  – Second case (same proportions for all annotators):
      P(E) = Σ_{j=1..k} p_j²   with   p_j = (1/N) Σ_{i=1..N} (n_ij / n)
Example: annotating meetings with 'Constructive' / 'Destructive' / 'Neutral'
Two annotators over 500 samples: ~16% C, 14% D, 70% N

Number of codings (agreement matrix, n = 2):

              C     D     N
  Mtng 1      0     2     0
  … i …       1     0     1
  Mtng N      0     0     2
  Totals    165   140   695

Contingency table (Coder A in columns, Coder B in rows):

              C     D     N   | total
        C    50     5    25   |   80
        D    10    40    15   |   65
        N    25    30   300   |  355
  total      85    75   340   |  500

Results:
  P(A) = 0.78000, P(E) = 0.52985, κ = 0.53206 (same proportions for all annotators)
  vs.
  P(A) = 0.78000, P(E) = 0.52950, κ = 0.53241 (specific proportions per annotator, Cohen)
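Both results can be reproduced from the contingency table alone; the pooled version uses p_j = (row_j + col_j) / (2 · total) for each category. A check in Python:

```python
# Reproduce the example's two kappa values from the 3x3 contingency table
# (rows = Coder B: C, D, N; columns = Coder A: C, D, N).
table = [[50,  5,  25],
         [10, 40,  15],
         [25, 30, 300]]

total = sum(sum(row) for row in table)                      # 500 samples
p_a = sum(table[i][i] for i in range(3)) / total            # observed agreement

row_sums = [sum(row) for row in table]                      # Coder B: 80, 65, 355
col_sums = [sum(col) for col in zip(*table)]                # Coder A: 85, 75, 340

# Version 1: same proportions for all annotators (pooled marginals).
p_e_pooled = sum(((r + c) / (2 * total)) ** 2
                 for r, c in zip(row_sums, col_sums))
kappa_pooled = (p_a - p_e_pooled) / (1 - p_e_pooled)

# Version 2: specific proportions for each annotator (Cohen 1960).
p_e_cohen = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
kappa_cohen = (p_a - p_e_cohen) / (1 - p_e_cohen)

print(round(kappa_pooled, 5), round(kappa_cohen, 5))  # 0.53206 0.53241
```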
Meaning of kappa values
• κ = 1: identical annotations | κ = 0: independence
• Subjective scales of agreement exist, e.g. Landis and Koch (1977):
  < 0 poor | 0–0.20 slight | 0.21–0.40 fair | 0.41–0.60 moderate | 0.61–0.80 substantial | 0.81–1.00 almost perfect
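Landis and Koch (1977) proposed verbal benchmarks for κ values; the cut-offs are conventions, not theory, so any such label should be taken with caution. A small sketch in Python using their scale:

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis & Koch (1977) verbal scale.
    The cut-offs are subjective conventions, not theoretically derived."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.53))  # the meeting-annotation example: "moderate"
```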
Conclusion
I. The quantity barrier
• A lot of knowledge and information is enclosed in text-based documents, which, thanks to the Internet, are becoming ever more numerous
• The dream: make this information just as accessible as your own knowledge stored in your brain
• Lessons 1–4: text classification, IR, relevance feedback, learning to rank, recommender systems, just-in-time retrieval
II. The cross-lingual barrier
• Many languages are used on the Web, but users still prefer their mother tongue
• The dream: software that fully translates text
• Lessons 5–8: types of MT systems, language models, translation models, decoding, evaluation, Moses, BLEU
III. The subjectivity barrier
• IT supports human interactions (written or spoken, synchronous or not), but key information is "beyond words", enclosed at the discourse or interaction level
• The dream: decode interaction patterns to infer new knowledge, including subjective opinions
• Lessons 9–12: sentiment analysis, dialogue acts, meeting browsers, task-based evaluation through QA
The future looks interesting
• Quantity → finding without searching
• Cross-lingual → a language-transparent Web
• Subjectivity → social media and networks
With or without deep neural networks? With AI!