Scaling Conditional Random Fields using Error-Correcting Codes

Trevor Cohn, Department of Computer Science, University of Sheffield

March 2010

Joint work with Andrew Smith and Miles Osborne (Edinburgh)

Motivation

Conditional random fields are state-of-the-art models:
- sequential analogue of the maxent classifier
- discriminative probabilistic model of a joint labelling
- allow for large and rich feature sets

Some caveats:
- don't scale well, especially to large label sets
- hard to regularise effectively

I'll be presenting error-correcting output coding for CRFs:
- decomposes the multiclass task into many simpler binary tasks
- improves training time
- can better regularise the model


Outline

1. Introduction
2. Error Correcting Codes
3. Error Correcting Output Coding for Classification
4. Sequential ECOC
5. Experiments
6. Analysis


[Outline slide repeated, highlighting the current section: 1. Introduction]

CRFs in natural language processing
Model a distribution over structure given text input:
- mostly time-series data: given the tokens, predict a label for each token
- e.g., part-of-speech tagging, named-entity recognition
- also more complex structured prediction, e.g., network labelling, parsing, translation


Why a CRF?
- Allows for rich and varied features which access any aspect of the observations
- Simple log-linear formulation with a convex training objective
- Global inference avoids label/observation bias

But . . .
- Complexity of inference can be a problem: training requires label marginals, decoding requires the Viterbi label sequence, both produced by the sum/max-product algorithm in O(L^C) time (with C the clique size)
- A further problem is overfitting of the training sample


Conditional Random Fields
[Figure: linear-chain CRF with label nodes Y_1, Y_2, ..., Y_n and observation nodes X_1, X_2, ..., X_n]

$$p(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp \sum_{n=1}^{N+1} \sum_k \lambda_k f_k(n, y_{n-1}, y_n, \mathbf{x})$$

- Trained to maximise $\mathcal{L}(D) = \log \prod_{(\mathbf{x},\mathbf{y}) \in D} p(\mathbf{y}|\mathbf{x})$
- Convex optimisation problem
- Decoding solves $\mathbf{y}^* = \arg\max_{\mathbf{y}} p(\mathbf{y}|\mathbf{x})$


[Outline slide repeated, highlighting the current section: 2. Error Correcting Codes]

Communicating over a Noisy Channel

[Diagram: a sender transmits "one small step for a man" over a noisy communication channel; the receiver hears "one small step for man".]


Encoding and Decoding

The solution is to add redundancy to the transmitted message:
- the sender adds extra information to the message
- the receiver uses the known redundancy to infer both the original message and the noise
An error-correcting code describes how to add redundancy and how to recover the original message.
Conflicting aims:
- more redundancy = better able to correct errors
- less redundancy = better throughput


Example Code: Repetition Code

Imagine we repeat each bit thrice:
- 010 becomes 000 111 000
- some random noise corrupts the signal, flipping bits 2, 7 and 8: noise is 010 000 110
- the receiver gets 010 111 110
- process each triple, deducing the most probable source bit
- minimise the Hamming distance to 000 or 111
- equivalent to maximising probability if bit errors are independent with probability f < 0.5
- recovered message is 011 and inferred noise signal is 010 000 001
- can correct single-bit errors in each block, with rate 1/3
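A minimal sketch of the repetition-code decoder just described (majority vote within each block, which is the Hamming-distance rule above); the function names are illustrative, not from the talk:

```python
def encode_repetition(bits, n=3):
    # Repeat each source bit n times.
    return [b for b in bits for _ in range(n)]

def decode_repetition(received, n=3):
    # Majority vote within each block of n bits, which minimises the
    # Hamming distance to the all-0s / all-1s codewords.
    decoded = []
    for i in range(0, len(received), n):
        block = received[i:i + n]
        decoded.append(1 if sum(block) > n // 2 else 0)
    return decoded

# The example from the slide: message 010, noise flips bits 2, 7 and 8.
sent = encode_repetition([0, 1, 0])    # 000 111 000
received = [0, 1, 0, 1, 1, 1, 1, 1, 0] # 010 111 110
print(decode_repetition(received))      # [0, 1, 1]: last bit wrong (two errors in one block)
```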

Example Code: Hamming Code

Take blocks of message bits and encode them with parity bits:
- transmit 4 message bits at a time, s1, s2, s3, s4
- then transmit 3 parity bits:
    t5 = (s1 + s2 + s3) mod 2
    t6 = (s2 + s3 + s4) mod 2
    t7 = (s1 + s3 + s4) mod 2
[Figure: Venn diagram with three parity circles t5, t6, t7 covering the source bits s1, s2, s3, s4]
- message 0100 becomes 0100 110
- corrupted in bit 2, so the receiver gets 0000 110
- parity checks t5 and t6 are now odd: an error occurred, and s2 is the only bit covered by t5 and t6 but not t7, so it is the one to flip back
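A small sketch of the (7,4) encoding and parity re-checks described above (illustrative helper names, not from the talk):

```python
def hamming_encode(s):
    # s = [s1, s2, s3, s4]; parity bits as defined on the slide.
    s1, s2, s3, s4 = s
    t5 = (s1 + s2 + s3) % 2
    t6 = (s2 + s3 + s4) % 2
    t7 = (s1 + s3 + s4) % 2
    return [s1, s2, s3, s4, t5, t6, t7]

def hamming_syndrome(r):
    # Re-check each parity equation on the received word r.
    s1, s2, s3, s4, t5, t6, t7 = r
    return ((s1 + s2 + s3 + t5) % 2,
            (s2 + s3 + s4 + t6) % 2,
            (s1 + s3 + s4 + t7) % 2)

sent = hamming_encode([0, 1, 0, 0])   # [0, 1, 0, 0, 1, 1, 0]
received = sent[:]
received[1] ^= 1                       # corrupt bit 2 (s2)
print(hamming_syndrome(received))      # (1, 1, 0): checks t5 and t6 fail, pointing at s2
```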


Decoding the Hamming Code
Enumerate the code words (message + parity strings) and find the one closest to the received word, 0000 110.
[Table: the 16 code words; the nearest to 0000 110 is 0100 110, i.e. the message 0100 is recovered.]
All code words differ in at least three bits → can correct a single error.
Transmission rate of 4/7.
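Decoding can be sketched as a brute-force search for the nearest code word, reusing hamming_encode from the sketch above; again the names are illustrative:

```python
from itertools import product

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

# Enumerate all 16 code words of the (7,4) code defined on the previous slide.
codewords = [tuple(hamming_encode(list(s))) for s in product([0, 1], repeat=4)]

def decode_nearest(received):
    # Pick the code word with minimal Hamming distance to the received word.
    return min(codewords, key=lambda c: hamming_distance(c, received))

print(decode_nearest((0, 0, 0, 0, 1, 1, 0)))  # (0, 1, 0, 0, 1, 1, 0): message 0100 recovered
```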


Desiderata of the Code

Ability to detect and correct many errors:
- a high Hamming distance between all code words
- error-correcting capacity = $\lfloor H_{\min} / 2 \rfloor$
Minimal overhead reducing the transmission rate:
- use as few parity bits as possible
Efficient encoding and decoding algorithms


[Outline slide repeated, highlighting the current section: 3. Error Correcting Output Coding for Classification]


Error-Correcting Output Coding
Replace the noisy channel with the learning framework:
[Diagram: the sender transmits the correct label; the 'channel' is the classification framework (training instances, input features, learning algorithm); the receiver observes the predicted label.]
[Dietterich, Kong, Bakiri, 1995]


Error-Correcting Output Coding

Training
- Use an ECC to encode each label as a bit string
- Build a classifier for each bit position, to distinguish between the labels for which the bit is 0 and those for which it is 1

Testing
- All classifiers make predictions about their bit
- Find the label with the closest code word under the Hamming distance
- A form of majority vote
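A minimal sketch of this train/test recipe for plain (non-sequential) classification, assuming a scikit-learn-style binary learner; the function names and the choice of LogisticRegression are mine, not from the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_ecoc(X, y, code):
    # code: dict label -> tuple of bits; one binary classifier per bit position.
    n_bits = len(next(iter(code.values())))
    models = []
    for j in range(n_bits):
        targets = np.array([code[label][j] for label in y])
        models.append(LogisticRegression().fit(X, targets))
    return models

def decode_ecoc(X, models, code):
    # Each classifier votes on its bit; pick the label with the nearest code word.
    bits = np.column_stack([m.predict(X) for m in models])
    labels, words = zip(*code.items())
    words = np.array(words)
    dists = np.abs(bits[:, None, :] - words[None, :, :]).sum(axis=2)  # Hamming distance
    return [labels[i] for i in dists.argmin(axis=1)]
```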


Example
Four document classes: sports, travel, arts and money.

  Label    Code word
  sports   1 1 1
  money    0 1 0
  arts     1 0 0
  travel   0 0 1

At training time . . .
- classifier 1 distinguishes {sports, arts} from {travel, money}
- classifier 2 distinguishes {sports, money} from {arts, travel}
- classifier 3 distinguishes {sports, travel} from {money, arts}


Example
Using the same code words as above, at test time . . .
- if all classifiers predict 1, the label is sports
- if they predict 010, the label is money
- but for all 0s there is no exact match: the label could be money, arts or travel (each at Hamming distance 1)
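A tiny hypothetical snippet making the tie concrete, using the code matrix from this example:

```python
code = {"sports": (1, 1, 1), "money": (0, 1, 0),
        "arts": (1, 0, 0), "travel": (0, 0, 1)}

def nearest_labels(bits):
    # All labels whose code words are at minimal Hamming distance from the predictions.
    dists = {label: sum(b != c for b, c in zip(bits, word))
             for label, word in code.items()}
    best = min(dists.values())
    return [label for label, d in dists.items() if d == best]

print(nearest_labels((1, 1, 1)))  # ['sports']
print(nearest_labels((0, 1, 0)))  # ['money']
print(nearest_labels((0, 0, 0)))  # ['money', 'arts', 'travel']: a three-way tie
```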


Why use ECOCs?

- Allows binary classifiers to be used on multi-class problems: traditionally used with binary classifiers (e.g., SVMs)
- Robust to errors by individual classifiers, provided the error distributions have low correlation
- Reduces both the bias and the variance: often improves performance over multi-class methods
- And in our case, can also reduce the complexity of training


Choosing the Right Code

Some similarities to vanilla error-correcting codes:
- robustness depends on the code's error-correcting capacity, so choose a code with good row separation
Column separation is also important:
- otherwise the binary classifiers model similar concepts, and consequently their errors become more correlated
- imposes a limit on the size of the code of 2^(L-1) - 1 columns
Smaller codes with few columns are faster to use:
- fewer binary models to train, and fewer to vote at test time
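One way to check these two properties is to compute the minimum row and column separation of a coding matrix; a sketch, assuming (as is usual for ECOC, though not stated here) that a column and its complement define the same binary task:

```python
import numpy as np
from itertools import combinations

def min_separation(code_matrix):
    # Minimum pairwise Hamming distance between rows (labels) and between
    # columns (binary problems). Columns are compared up to complementation,
    # since a column and its complement split the labels identically.
    M = np.asarray(code_matrix)
    row_sep = min(np.sum(a != b) for a, b in combinations(M, 2))
    col_sep = min(min(np.sum(a != b), np.sum(a == b))
                  for a, b in combinations(M.T, 2))
    return row_sep, col_sep
```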


[Outline slide repeated, highlighting the current section: 4. Sequential ECOC]

How can we apply ECOC to sequence models?

We use a very simple technique. For training:
- encode each label independently as a bit string using a block code
- train a classifier to predict the binary sequence for a given bit
- each classifier models the label sequence, subject to various parameters being tied
For testing:
- each classifier predicts a binary sequence
- these are then merged using the code to find the best label sequence


Decoding the Label Sequence

We compare three different means of decoding:
- standalone: use Viterbi paths from each CRF, with independent label resolution at each position
- marginals: use marginals at each position from each CRF, with independent label resolution
- product: find the Viterbi path under a product model of the component CRFs


Standalone Decoding
- Find the Viterbi path for each binary CRF: yields a sequence of {0, 1}s
- Create a vector of predictions at each time step
- Choose the label with the closest code word at each time step, using the Hamming distance

             time ->
  model 1:    1  1  1  1  0  0
  model 2:    0  1  0  0  0  1
  model 3:    0  0  1  1  1  1
  ...
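A sketch of standalone decoding, assuming each binary CRF has already produced its Viterbi bit sequence; the array names are illustrative:

```python
import numpy as np

def standalone_decode(viterbi_bits, codewords, labels):
    # viterbi_bits: (n_models, n_times) array of 0/1 Viterbi predictions,
    # codewords: (n_labels, n_models) coding matrix, labels: list of label names.
    dists = np.abs(codewords[:, :, None] - viterbi_bits[None, :, :]).sum(axis=1)
    return [labels[i] for i in dists.argmin(axis=0)]  # nearest code word per time step
```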

Marginals Decoding
- Find the marginal probabilities at each time step for each binary CRF: yields a sequence of real numbers in [0, 1]
- Create a vector of predictions at each time step
- Find the label with the closest code word using the L1 distance

             time ->
  model 1:    0.9  0.9  0.6  0.2  0.1  0.4
  model 2:    0.2  0.8  0.5  0.6  0.8  0.1
  model 3:    0.1  0.1  0.5  0.4  0.8  0.7
  ...
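The marginals variant is the same decode with soft values and the L1 distance; an illustrative sketch, assuming each marginal is the probability of the 1 bit at that position:

```python
import numpy as np

def marginals_decode(marginals, codewords, labels):
    # marginals: (n_models, n_times) array with P(bit = 1) at each position.
    dists = np.abs(codewords[:, :, None] - marginals[None, :, :]).sum(axis=1)  # L1 distance
    return [labels[i] for i in dists.argmin(axis=0)]
```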

Product Decoding
Assume that the joint probability of a label sequence decomposes into the k independent binary sequences. The probability of a labelling is then given by:

$$p(\mathbf{y}|\mathbf{x}) = \frac{1}{Z_P(\mathbf{x})} \prod_j p_j(b_j(\mathbf{y})|\mathbf{x})$$

- $Z_P(\mathbf{x})$ is a normalising function
- $p_j$ is the j-th binary CRF
- $b_j$ relabels a vector of labels into a binary vector using column j of the coding matrix

This product model is itself a CRF, so standard Viterbi decoding applies.
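A sketch of product decoding under the stated assumption: map each binary CRF's log-potentials onto the original label set through its column of the code, sum them, and run standard Viterbi. For brevity this assumes position-independent transition scores for each binary model; all names are illustrative:

```python
import numpy as np

def product_viterbi(unaries, transitions, code):
    # unaries[j]: (T, 2) per-position log-potentials of the j-th binary CRF,
    # transitions[j]: (2, 2) its transition log-potentials,
    # code: (n_labels, n_models) 0/1 coding matrix (column j gives b_j).
    code = np.asarray(code)
    n_labels, n_models = code.shape
    T = unaries[0].shape[0]
    # Map every binary score onto the original label set and sum over models.
    unary = sum(unaries[j][:, code[:, j]] for j in range(n_models))          # (T, n_labels)
    trans = sum(transitions[j][np.ix_(code[:, j], code[:, j])]
                for j in range(n_models))                                     # (n_labels, n_labels)
    # Standard Viterbi over the combined product-model scores.
    delta, back = unary[0], []
    for t in range(1, T):
        scores = delta[:, None] + trans + unary[t]
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```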


Logarithmic Opinion Pools
The product model is a Logarithmic Opinion Pool [Heskes 1998]. We can express the MLE objective of the ensemble as:

$$\mathrm{KL}(\tilde{p} \,\|\, p_{LOP}) = \underbrace{\sum_j \mathrm{KL}(\tilde{p} \,\|\, p_j)}_{E} - \underbrace{\sum_j \mathrm{KL}(p_{LOP} \,\|\, p_j)}_{A}$$

Balance between the two terms:
- individually accurate constituent models, E → 0
- considerable diversity relative to the ensemble, A → ∞
Means to optimise this objective:
- decompose the label space or training set
- diverse feature representations [Smith 2007]


[Outline slide repeated, highlighting the current section: 5. Experiments]

Experimental Setup
Two tasks, one easy and one hard:

Named entity recognition
- 8 labels denoting location, person, organisation and miscellaneous entity types, using the IOB tagging format
- CoNLL 2003 data set, approx. 200k training tokens

Part-of-speech tagging
- 45 labels
- Penn Treebank WSJ, 1 million training tokens

Example (NER tags, POS tags, tokens):
  I-ORG  O        I-MISC  O     O   O        I-MISC   O     O
  NN     VBZ      JJ      NN    TO  VB       JJ       NN    .
  EU     rejects  German  call  to  boycott  British  lamb  .

Exhaustive Code
The exhaustive code contains every unique column:
- there are 2^(L-1) - 1 columns
- any other columns are trivial or would replicate existing columns

  Label   Code word
  LOC     1 0 1 0 1 0 1
  MISC    0 1 1 0 0 1 1
  ORG     0 0 0 1 1 1 1
  O       0 0 0 0 0 0 0
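One common way to enumerate such a code, fixing one reference label to the all-zeros word (an assumed construction; the talk does not spell it out):

```python
from itertools import product

def exhaustive_code(labels):
    # One column per non-trivial binary partition of the label set, identifying a
    # partition with its complement by pinning the last label's bit to 0.
    # This yields 2^(L-1) - 1 columns for L labels.
    L = len(labels)
    columns = [bits + (0,) for bits in product((0, 1), repeat=L - 1) if any(bits)]
    # Transpose: one code word (row) per label.
    return {lab: tuple(col[i] for col in columns) for i, lab in enumerate(labels)}

code = exhaustive_code(["LOC", "MISC", "ORG", "O"])
print(len(next(iter(code.values()))))  # 7 columns
print(code["O"])                        # all zeros: (0, 0, 0, 0, 0, 0, 0)
```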

One-vs-Rest Code
Each model learns to distinguish a single tag:
- simplest code
- no error-correcting capacity
- good independence between models
[Same code matrix as on the exhaustive-code slide, with the one-vs-rest (single-label) columns highlighted.]

Random Code
A random subset of the exhaustive code:
- desirable properties in the limit (Berger, 1999)
[Same code matrix as on the exhaustive-code slide, with a random subset of its columns highlighted.]

Algebraic Codes
There is a large literature on block codes for communication over a noisy channel, e.g.:
- Hamming codes
- BCH (Bose, Ray-Chaudhuri, Hocquenghem) codes
Designed to:
- maximise error-correcting capacity
- maximise transmission rate
- approach the optimal tradeoff
However:
- column separation is not pertinent to these codes
- they are often inflexible with respect to message length and error-correcting capacity


NER Performance
[Bar chart: NER F1 for the multiclass CRF and for the one-vs-rest, Hamming(7) and exhaustive codes, each under standalone, marginals and product decoding; the F1 axis runs from 82 to 90.]

Training Time
NER training time (secs):
  multiclass      5516
  one-vs-rest     1388
  Hamming(7)      1925
  exhaustive     32827

POS Tagging Accuracy
[Bar chart: accuracy for the multiclass CRF and for the one-vs-rest, H(10,6), BCH(31,6) and random-200 codes, each under standalone, marginals and product decoding; the accuracy axis runs from 95 to 97.25.]

POS Tagging Training Time
Training time (secs):
  multiclass    2305160
  one-vs-rest     21429
  H(10,6)         22162
  BCH(31,6)       72081
  random 200     437834

[Outline slide repeated, highlighting the current section: 6. Analysis]

ECOC has Regularising Effect
[Bar chart: NER F1 for the multiclass, one-vs-rest and exhaustive models under standalone, marginals and product decoding, comparing MLE and MAP training; the F1 axis runs from 87 to 90.]

Choosing a Good Code
[Scatter plot: product-decoding F1 score (roughly 89.2 to 90.0) against training time (roughly 1k to 32k seconds) for a range of codes: OvR, OvR+BI, OvR+EO, OvR+BI+EO, BI+EO+O, H(6,3), H+(7,3), BCH(11,3), Pairs and Exhaustive.]

How to Minimize Error Correlation?
Hamming distance is a fairly blunt metric.
[Box plot: error correlation between pairs of binary models against their column separation in bits (1 to 4).]

How to Minimize Error Correlation?
A better measure is the number of shared decision boundaries.
[Box plot: error correlation between pairs of binary models against their number of shared decision boundaries (1 to 12).]

Designing a Good Code
The label distribution is often highly skewed, e.g., for NER:
[Bar chart: relative frequencies of the NER labels O, I-PER, I-ORG, I-LOC, I-MISC, B-PER, B-ORG, B-LOC, B-MISC, on a 0 to 1 axis.]
- Critical to predict common labels correctly; less important for rare labels
- Formalise as a bound on the probability of misclassifying a label
- Greedily construct the code to maximise the bound


Greedy Code Outperforms Random
[Line plot: F1 score (83 to 89) against code length (10 to 50) for the random code, the minimum-loss-bound (greedy) code, and the exhaustive code.]

Training Non-Uniform Ensemble
Weight each binary model differently:

$$p_{LOP}(\mathbf{y}|\mathbf{x}) \propto \prod_j p_j(b_j(\mathbf{y})|\mathbf{x})^{\alpha_j} \quad \text{s.t.} \quad \sum_j \alpha_j = 1 \;\wedge\; \alpha_j \ge 0 \;\; \forall j$$
$$= \exp \sum_j \alpha_j \sum_k \lambda_{j,k} F_{j,k}(b_j(\mathbf{y}), \mathbf{x})$$

- optimise for {α_j} given the trained {p_j}
- use a soft-max transformation to enforce the constraints
- the objective has a similar form of gradient to a standard CRF
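A sketch of the soft-max reparameterisation that enforces the simplex constraints on the α's; the ensemble objective itself is not reproduced, only the generic transform and chain rule:

```python
import numpy as np

def alphas(theta):
    # Soft-max reparameterisation: any real vector theta maps to weights
    # alpha_j >= 0 with sum_j alpha_j = 1, so the constraints hold by construction.
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_theta(theta, grad_alpha):
    # Chain rule through the soft-max: d alpha_i / d theta_j = alpha_i (delta_ij - alpha_j),
    # converting a gradient w.r.t. alpha into a gradient w.r.t. the unconstrained theta.
    a = alphas(theta)
    return a * (grad_alpha - np.dot(grad_alpha, a))
```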


Non-Uniform Ensemble Results
Learned weights for simplified POS tagging (label set {J, N, O, R, V}):

  column   0  1  2   3  4   5   6   7  8   9   10  11  12  13  14
  J        1  0  1   0  1   0   1   0  1   0   1   0   1   0   1
  N        0  1  1   0  0   1   1   0  0   1   1   0   0   1   1
  O        0  0  0   1  1   1   1   0  0   0   0   1   1   1   1
  R        0  0  0   0  0   0   0   1  1   1   1   1   1   1   1
  V        0  0  0   0  0   0   0   0  0   0   0   0   0   0   0
  α        0  0  .1  0  .1  .3  0+  0  0+  .1  .1  .1  .2  0+  0

- doesn't use the single-bit columns
- the majority of the weight is on models which discriminate between the common tags: O > N > V, J > R


Conclusions
ECOC can reduce the training requirements of CRFs:
- maintains generalisation performance
- makes modelling tasks with large label sets feasible

The choice of a good code is crucial:
- high error-correcting capacity and low error correlation
- potential for further development of better codes, e.g. better block codes, convolutional codes

Link to the ensemble literature:
- ensembles can dramatically improve accuracy c.f. single classifiers
- a plethora of avenues for building good ensembles
- applications to other structured prediction models and to classifiers
