Scaling Conditional Random Fields using Error-Correcting Codes
Trevor Cohn, Department of Computer Science, University of Sheffield
March 2010
Joint work with Andrew Smith and Miles Osborne (Edinburgh)
Motivation
Conditional random fields are state-of-the-art models
- sequential analogue of the maxent classifier
- discriminative probabilistic model of a joint labelling
- allow for large and rich feature sets
Some caveats
- don’t scale well, especially to large label sets
- hard to regularise effectively
I’ll be presenting error-correcting output coding for CRFs
- decomposes the multiclass task into many simpler binary tasks
- improves training time
- can better regularise the model
Cohn (Sheffield)
Scaling CRFs using ECCs
March 2010
2 / 47
Outline
1 Introduction
2 Error Correcting Codes
3 Error Correcting Output Coding for Classification
4 Sequential ECOC
5 Experiments
6 Analysis
CRFs in Natural Language Processing
Model a distribution over the structure given text input
- mostly time-series data: given tokens, predict a label for each token
- e.g., part-of-speech tagging, named-entity recognition
- also more complex structured prediction, e.g., network labelling, parsing, translation
Why a CRF?
- Allows for rich and varied features which access any aspect of the observations
- Simple log-linear formulation with a convex training objective
- Global inference avoids label/observation bias

But . . .
- Complexity of inference can be a problem
  - training requires label marginals
  - decoding requires the Viterbi label sequence
  - produced by the sum/max-product algorithm in O(L^C) time
- A further problem is overfitting of the training sample
Conditional Random Fields

[Chain-structured graphical model: label nodes Y1, Y2, ..., Yn connected in a chain, each linked to an observation node X1, X2, ..., Xn]
p(y|x) = (1 / Z(x)) exp Σ_{n=1}^{N+1} Σ_k λ_k f_k(n, y_{n−1}, y_n, x)

Trained to maximise L(D) = log Π_{(x,y)∈D} p(y|x)
- convex optimisation problem
- decoding solves y* = arg max_y p(y|x)
Communicating over a Noisy Channel

[Diagram: sender → communication channel → receiver]
sender: "one small step for a man"; receiver: "one small step for man"
Encoding and Decoding

The solution is to add redundancy to the transmitted message
- sender adds extra information to the message
- receiver uses the known redundancy to infer both the original message and the noise
An error correcting code describes how to add redundancy and how to recover the original message.
Conflicting aims:
- more redundancy = better able to correct errors
- less redundancy = better throughput
Example Code: Repetition Code

Imagine we repeat each bit thrice
- 010 becomes 000 111 000
- some random noise corrupts the signal, flipping bits 2, 7 and 8: noise 010 000 110
- receiver gets 010 111 110
- process each triple, deducing the most probable source bit
  - minimise the Hamming distance to 000 or 111
  - equivalent to maximising probability if bit errors are independent with probability f < 0.5
- recovered message is 011, with inferred noise signal 010 000 001
- can correct single-bit errors in each block, with rate 1/3
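The triple-repetition example above can be sketched in code (a minimal illustration, not from the original slides; `encode` and `decode` are our names):

```python
def encode(bits, n=3):
    """Repeat each source bit n times (rate 1/n)."""
    return [b for b in bits for _ in range(n)]

def decode(received, n=3):
    """Recover each source bit by majority vote over its block, which
    minimises the Hamming distance to 000... or 111..."""
    out = []
    for i in range(0, len(received), n):
        block = received[i:i + n]
        out.append(1 if sum(block) > n // 2 else 0)
    return out

sent = encode([0, 1, 0])                  # 000 111 000
corrupted = [0, 1, 0, 1, 1, 1, 1, 1, 0]   # bits 2, 7 and 8 flipped
decode(corrupted)                          # -> [0, 1, 1], as on the slide
```

Note the decoder corrects the single flip in block 1 but is fooled by the two flips in block 3, exactly as in the slide's worked example.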
Example Code: Hamming Code

Take blocks of message bits and encode them with parity bits
- transmit 4 bits at a time: s1, s2, s3, s4
- then transmit 3 parity bits:
  t5 = (s1 + s2 + s3) mod 2
  t6 = (s2 + s3 + s4) mod 2
  t7 = (s1 + s3 + s4) mod 2

[Venn diagram: parity circles t5, t6, t7 overlapping the message bits s1–s4]

- message 0100 becomes 0100 110
- corrupted in bit 2, so receiver gets 0000 110
- parity checks t5 and t7 are odd: an error occurred
Decoding the Hamming Code

Enumerate the code words (message + parity strings) and find the one closest to the received message, 0000 110.

message    parity
0 0 0 0    0 0 0
0 0 0 1    0 1 1
0 0 1 0    1 1 1
...
0 1 0 0    1 1 0
...
1 1 1 1    1 1 1

All code words differ in at least three bits → can correct a single error
Transmission rate of 4/7
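The (7,4) Hamming code just described can be sketched as follows; the brute-force nearest-code-word decoder mirrors the enumeration on the slide (function names are ours):

```python
from itertools import product

def encode(s):
    """Append the three parity bits from the slide to a 4-bit message."""
    s1, s2, s3, s4 = s
    return [s1, s2, s3, s4,
            (s1 + s2 + s3) % 2,
            (s2 + s3 + s4) % 2,
            (s1 + s3 + s4) % 2]

def decode(r):
    """Enumerate all 16 code words; return the message of the closest one."""
    best = min((encode(list(m)) for m in product([0, 1], repeat=4)),
               key=lambda c: sum(a != b for a, b in zip(c, r)))
    return best[:4]

encode([0, 1, 0, 0])              # -> [0, 1, 0, 0, 1, 1, 0], i.e. 0100 110
decode([0, 0, 0, 0, 1, 1, 0])     # bit 2 corrupted; recovers [0, 1, 0, 0]
```

Since all code words differ in at least three bits, exactly one code word lies within distance one of any single-error corruption.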
Desiderata of the Code

Ability to detect and correct many errors
- a high minimum Hamming distance H_min between all code words
- error correcting capacity = ⌊H_min / 2⌋
Minimal overhead reducing transmission rate
- use as few parity bits as possible
Efficient encoding and decoding algorithms
Error-Correcting Output Coding

Replace the noisy channel with the learning framework:

[Diagram: sender (transmitting the correct label) → classification framework (training instances, input features, learning algorithm) → receiver (predicted label)]
[Dietterich, Kong, Bakiri, 1995]
Error-Correcting Output Coding

Training
- use an ECC to encode each label as a bit string
- build a classifier for each bit position, to distinguish between labels for which the bit is 0 vs 1

Testing
- all classifiers make predictions about their bit
- find the label with the closest code word under the Hamming distance
- a form of majority vote
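The testing step above can be sketched as a few lines of code, using the small sports/money/arts/travel code from the example that follows (the classifier training itself is elided):

```python
# 3-bit code words for the four labels, as in the worked example.
CODE = {
    "sports": (1, 1, 1),
    "money":  (0, 1, 0),
    "arts":   (1, 0, 0),
    "travel": (0, 0, 1),
}

def hamming(a, b):
    """Number of positions where the two bit vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(bits):
    """Pick the label whose code word is closest to the classifiers' bits."""
    return min(CODE, key=lambda label: hamming(CODE[label], bits))

ecoc_decode((1, 1, 1))  # -> "sports"
ecoc_decode((0, 0, 1))  # -> "travel"
```

When the predicted bits match no code word exactly (e.g. all zeros here), the nearest-code-word rule still returns some label, but ties are possible; how ties are broken is an implementation choice.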
Example

Label     Code word
sports    1 1 1
money     0 1 0
arts      1 0 0
travel    0 0 1
At training time . . .
- classifier 1 distinguishes {sports, arts} from {travel, money}
- classifier 2 distinguishes {sports, money} from {arts, travel}
- classifier 3 distinguishes {sports, travel} from {money, arts}
At test time . . .
- if all classifiers predict 1, the label is sports
- if they predict 010, the label is money
- but for all 0s there is no exact match: the label could be money, arts or travel
Why use ECOCs?

Allows binary classifiers to be used on multi-class problems
- traditionally used with binary classifiers (e.g., SVMs)
Robust to errors by individual classifiers
- provided the error distributions have low correlation
Reduces both the bias and variance
Often improves performance over multi-class methods
And in our case, can also reduce the complexity of training
Choosing the Right Code

Some similarities to vanilla error-correcting codes
- robustness depends on the code’s error-correcting capacity
- choose a code with good row separation
Column separation is also important
- otherwise the binary classifiers model similar concepts
- consequently their errors become more correlated
- imposes a limit on the size of the code of 2^(L−1) − 1 columns
Smaller codes with few columns are faster to use
- fewer binary models to train, and to vote at test time
How can we apply ECOC to sequence models?

We use a very simple technique:
- encode each label independently as a bit string using a block code
- train a classifier to predict binary sequences for a given bit
- each classifier models the label sequence, subject to various parameters being tied
For testing:
- each classifier predicts a binary sequence
- these are then merged using the code to find the best label sequence
Decoding the Label Sequence

We compare three different means of decoding:
- standalone: using Viterbi paths from each CRF, with independent label resolution at each position
- marginals: using marginals at each position from each CRF, with independent label resolution
- product: find the Viterbi path under a product model of the component CRFs
Standalone Decoding

Find the Viterbi path for each binary CRF
- yields a sequence of {0, 1}s
Create a vector of predictions at each time
Choose the label with the closest code word at each time, using the Hamming distance

           time →
model 1:   1 1 1 1 0 0
model 2:   0 1 0 0 0 1
model 3:   0 0 1 1 1 1
...
Marginals Decoding

Find the marginal probabilities at each time for each binary CRF
- yields a sequence of real numbers in [0, 1]
Create a vector of predictions at each time
Find the label with the closest code word using the L1 distance

           time →
model 1:   0.9 0.9 0.6 0.2 0.1 0.4
model 2:   0.2 0.8 0.5 0.6 0.8 0.1
model 3:   0.1 0.1 0.5 0.4 0.8 0.7
...
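A minimal sketch of marginals decoding, reusing the small illustrative 3-bit code from the classification example earlier; `l1_decode` and `decode_sequence` are our names, and the marginal values below are made up:

```python
CODE = {"sports": (1, 1, 1), "money": (0, 1, 0),
        "arts": (1, 0, 0), "travel": (0, 0, 1)}  # illustrative code

def l1_decode(marginals):
    """Label whose code word minimises the L1 distance to the marginals."""
    return min(CODE, key=lambda lab: sum(abs(c - m)
                                         for c, m in zip(CODE[lab], marginals)))

def decode_sequence(per_model_marginals):
    """per_model_marginals[j][t] = p(bit j = 1) at time t; decode each time
    step independently."""
    T = len(per_model_marginals[0])
    return [l1_decode([m[t] for m in per_model_marginals]) for t in range(T)]

marg = [[0.9, 0.1],   # model 1's marginals over two time steps
        [0.8, 0.2],   # model 2
        [0.9, 0.1]]   # model 3
decode_sequence(marg)  # -> ["sports", "money"]
```

Unlike standalone decoding, this uses the models' confidence: a bit predicted at 0.51 counts far less against a code word than one predicted at 0.99.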
Product Decoding

Assume that the joint probability of a label sequence decomposes into the k independent binary sequences. As such, the probability of a labelling is given by:

p(y|x) = (1 / Z_P(x)) Π_j p_j(b_j(y)|x)

- Z_P(x) is a normalising function
- p_j is the j-th binary CRF
- b_j relabels a vector of labels into a binary vector using column j of the coding matrix

This product model is itself a CRF – use standard Viterbi decoding
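Since the product of CRFs is itself a CRF with summed log-potentials, product decoding reduces to one standard Viterbi pass. A sketch under the assumption that `emit[t][y]` and `trans[y][y2]` already hold the sums of the binary CRFs' log-potentials (the interface is hypothetical):

```python
def viterbi(labels, emit, trans):
    """Find arg max_y sum_t emit[t][y_t] + trans[y_{t-1}][y_t], where the
    scores are the summed log-potentials of the component binary CRFs."""
    delta = {y: emit[0][y] for y in labels}
    backpointers = []
    for t in range(1, len(emit)):
        new_delta, bp = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: delta[p] + trans[p][y])
            new_delta[y] = delta[prev] + trans[prev][y] + emit[t][y]
            bp[y] = prev
        backpointers.append(bp)
        delta = new_delta
    y = max(labels, key=lambda l: delta[l])
    path = [y]
    for bp in reversed(backpointers):
        y = bp[y]
        path.append(y)
    return path[::-1]

# Toy two-label, two-step example with made-up log-scores.
labels = ["A", "B"]
emit = [{"A": 0.0, "B": -1.0}, {"A": -1.0, "B": 0.0}]
trans = {"A": {"A": 0.0, "B": -0.5}, "B": {"A": -0.5, "B": 0.0}}
viterbi(labels, emit, trans)  # -> ["A", "B"]
```

Note the normaliser Z_P(x) can be ignored here: it is constant in y, so it does not affect the arg max.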
Logarithmic Opinion Pools

The product model is a Logarithmic Opinion Pool [Heskes 1998]. Can express the MLE objective of the ensemble:

KL(p̃ ‖ p_LOP) = Σ_j KL(p̃ ‖ p_j) − Σ_j KL(p_LOP ‖ p_j)
                 └────── E ──────┘  └──────── A ────────┘

Balance between the two terms
- individually accurate constituent models: E → 0
- considerable diversity relative to the ensemble: A → ∞
Means to optimise this objective
- decompose the label space or training set
- diverse feature representations [Smith 2007]
Experimental Setup

Two tasks, one easy and one hard:

Named entity recognition
- 8 labels denoting location, person, organisation and miscellaneous entity types, using the IOB tagging format
- CoNLL 2003 data set, approx. 200k training tokens

Part-of-speech tagging
- 45 labels
- Penn Treebank WSJ, 1 million training tokens

Example (token / NER tag / POS tag):
  EU/I-ORG/NN rejects/O/VBZ German/I-MISC/JJ call/O/NN to/O/TO
  boycott/O/VB British/I-MISC/JJ lamb/O/NN ./O/.
Exhaustive Code

Exhaustive contains every unique column
- there are 2^(L−1) − 1 columns
- any other columns are trivial or would replicate existing columns

Label    Code word
LOC      1 0 1 0 1 0 1
MISC     0 1 1 0 0 1 1
ORG      0 0 0 1 1 1 1
O        0 0 0 0 0 0 0
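One way to generate the exhaustive code is to enumerate column indices in binary; this is a sketch in which one label (here the last) takes the all-zero row, and the ordering of columns is one valid choice among many:

```python
def exhaustive_code(labels):
    """Exhaustive ECOC matrix: 2**(L-1) - 1 columns, one per non-trivial
    binary split of the label set (up to complementation)."""
    L = len(labels)
    n_cols = 2 ** (L - 1) - 1
    code = {}
    for i, lab in enumerate(labels[:-1]):
        # Bit i of the column index j gives label i's bit in column j.
        code[lab] = [(j >> i) & 1 for j in range(1, n_cols + 1)]
    code[labels[-1]] = [0] * n_cols   # last label: all-zero row
    return code

code = exhaustive_code(["LOC", "MISC", "ORG", "O"])
code["LOC"]  # -> [1, 0, 1, 0, 1, 0, 1]
```

Because the last row is all zeros, no column is the complement of another, so every distinct binary partition of the labels appears exactly once.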
One-vs-Rest Code

Each model learns to distinguish a single tag
- simplest code
- no error-correcting capacity
- good independence between models

Label    Code word
LOC      1 0 0 0
MISC     0 1 0 0
ORG      0 0 1 0
O        0 0 0 1
Random Code

Random
- a random subset of the exhaustive code’s columns
- desirable properties in the limit (Berger, 1999)

[Code table: a random subset of the exhaustive code’s columns]
Algebraic Codes

Large literature on block codes for communication over a noisy channel, e.g.:
- Hamming codes
- BCH (Bose, Ray-Chaudhuri, Hocquenghem) codes
Designed to:
- maximise error-correcting capacity
- maximise transmission rate
- approaching the optimal tradeoff
However:
- column separation is not pertinent
- often inflexible with respect to message length and error-correcting capacity
NER Performance

[Bar chart: NER F1 (axis 82-90) for multiclass, one-vs-rest, hamming 7 and exhaustive codes, each under standalone, marginals and product decoding]
Training Time

Code          Training time (secs)
multiclass    5516
one-vs-rest   1388
hamming 7     1925
exhaustive    32827
POS Tagging Accuracy

[Bar chart: accuracy (axis 95-97.25) for multiclass, one-vs-rest, H(10,6), BCH(31,6) and random 200 codes, each under standalone, marginals and product decoding]
POS Tagging Training Time

Code          Training time (secs)
multiclass    2305160
one-vs-rest   21429
H(10,6)       22162
BCH(31,6)     72081
random 200    437834
ECOC has a Regularising Effect

[Bar chart: NER F1 (axis 87-90) for multiclass, one-vs-rest and exhaustive codes under standalone, marginals and product decoding, comparing MLE and MAP training]
Choosing a Good Code

[Scatter plot: product F1 score (89.2-90.0) against training time (1k-32k s, log scale). Points include OvR, H(6,3), H+(7,3), BCH(11,3), Pairs, Exhaustive, OvR+BI, OvR+EO, OvR+BI+EO and BI+EO+O]
How to Minimize Error Correlation?

Hamming distance is a fairly blunt metric.

[Scatter plot: error correlation (0.0-1.0) against column separation in bits (1-4)]
How to Minimize Error Correlation?

A better measure is the number of shared decision boundaries.

[Scatter plot: error correlation (0.0-1.0) against shared decision boundaries (1-12)]
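Under one natural reading of this measure (our interpretation, not spelled out on the slide), two code columns share a decision boundary for a label pair when both columns put that pair on opposite sides of their binary split. A sketch:

```python
from itertools import combinations

def shared_boundaries(col1, col2):
    """Count label pairs (a, b) that BOTH columns separate, i.e. pairs for
    which the two binary classifiers must learn the same distinction.
    col1, col2: dicts mapping label -> bit."""
    return sum(1 for a, b in combinations(col1, 2)
               if col1[a] != col1[b] and col2[a] != col2[b])

c1 = {"LOC": 1, "MISC": 0, "ORG": 0, "O": 0}   # LOC vs rest
c2 = {"LOC": 1, "MISC": 1, "ORG": 0, "O": 0}   # {LOC, MISC} vs rest
shared_boundaries(c1, c2)  # -> 2 (both separate LOC from ORG, and LOC from O)
```

The more boundaries two columns share, the more similar the concepts their classifiers model, and hence the more correlated we would expect their errors to be.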
Designing a Good Code

Label distribution often highly skewed, e.g., NER

[Bar chart: label frequencies (0-1) for O, I-PER, I-ORG, I-LOC, I-MISC, B-PER, B-ORG, B-LOC, B-MISC; O dominates]

Critical to predict common labels correctly
Less important for rare labels
- formalise as a bound on the probability of misclassifying a label
- greedily construct the code to optimise the bound
Greedy Code Outperforms Random

[Line plot: F1 score (83-89) against code length (10-50) for random codes, the minimum loss bound code, and the exhaustive code]
Training a Non-Uniform Ensemble

Weight each binary model differently:

p_LOP(y|x) ∝ Π_j p_j(b_j(y)|x)^{α_j}    s.t. Σ_j α_j = 1 ∧ α_j ≥ 0 ∀j
           = exp Σ_j α_j Σ_k λ_{j,k} F_{j,k}(b_j(y), x)

- optimise for {α_j} given trained {p_j}
- use a soft-max transformation to enforce the constraints
- the objective has a similar form of gradient to a standard CRF
Non-Uniform Ensemble Results

Learned weights for simplified POS tagging:

model:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
J       1  0  1  0  1  0  1  0  1  0  1  0  1  0  1
N       0  1  1  0  0  1  1  0  0  1  1  0  0  1  1
O       0  0  0  1  1  1  1  0  0  0  0  1  1  1  1
R       0  0  0  0  0  0  0  1  1  1  1  1  1  1  1
V       0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
α       0  0 .1  0 .1 .3 0+  0 0+ .1 .1 .1 .2 0+  0

- doesn’t use single-bit columns
- majority of weight on models which discriminate between common tags: O > N > V, J > R
Conclusions

ECOC can reduce the training requirements of CRFs
- maintains generalisation performance
- makes modelling tasks with large label sets feasible

Choice of a good code is crucial
- high error-correcting capacity and low error correlation
- potential for further development of better codes, e.g., better block codes, convolutional codes

Link to the ensemble literature
- ensembles can dramatically improve accuracy c.f. single classifiers
- plethora of avenues for building good ensembles
- applications to other structured prediction models, and to classifiers