Machine Translation – Summer Semester 2018
Raphael Rubino

Multilingual Language Technology, DFKI Saarbrücken

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
3. Encoder – Decoder
4. Encoder – Decoder with Attention

Word Embeddings

- How do we represent words?
- Hand-crafted features: number of characters, uppercase, etc.
- One-hot vector: a sparse representation
- Self-learned vector representation with real-valued components: a dense representation

Word Embeddings – Starting point

- Starting point: one-hot encoding

  house . . .
  dog = (0, 0, 0, 0, 1, 0, . . . , 0, 0, 0)
  cat = (0, 0, 1, 0, 0, 0, . . . , 0, 0, 0)
  eat = (0, 0, 0, 0, 0, 0, . . . , 0, 1, 0)
  jump . . .

- For each word, one vector: all components are 0 except one (dimension)
- Size of each vector: the vocabulary size

Word Embeddings – Parameters Matrix

- Embedding vectors together form a shared matrix
- Shape of the embedding matrix:
  - height = vocabulary size
  - width = a hyperparameter, usually ∈ [100; 1000]
- Embeddings vs. one-hot vectors: dimensionality reduction
- The embedding matrix is shared between words
- The embedding matrix is usually considered as a layer of the NN

Word Embeddings – Parameters Matrix

From One-hot to Embedding

- We start from one-hot vectors and want embeddings
- From the one-hot vector oh_i, the corresponding embedding x_i is obtained as: C(oh_i) = x_i = C · oh_i^T (see the sketch below)
- This way, x_i, and thus C, are learnable parameters of the model
- Remember: y_k = Σ_j sigmoid((Σ_i x_i · w_{ij}) + b_j) · u_{kj}
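As a rough illustration, here is a minimal NumPy sketch of the lookup C(oh_i) = C · oh_i^T, assuming a toy vocabulary of 10 words and an embedding width of 4 (both made-up values); with the matrix stored as height = vocabulary, width = embedding dimension, the product simply selects one row of C:

```python
import numpy as np

vocab_size, emb_dim = 10, 4                  # hypothetical toy sizes
# embedding matrix C: height = vocabulary size, width = embedding dimension
C = np.random.normal(0.0, 0.01, size=(vocab_size, emb_dim))

# one-hot vector for the word with index 2
oh = np.zeros(vocab_size)
oh[2] = 1.0

# lookup as a matrix product: the one-hot vector picks out one row of C
x = oh @ C
assert np.allclose(x, C[2])                  # equivalent to simply indexing row 2
```

In practice the multiplication is never carried out explicitly; frameworks implement the embedding layer as a table lookup, which computes the same thing.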

Word Embeddings – Initialization

- The embedding matrix C has to be initialized... with what?
- Usually with small random numbers drawn from a zero-mean normal distribution
- There are also pre-trained embeddings available online!

NN LM – Forward Pass

1. One-hot to embedding: x_k = C(oh_k) = C · oh_k^T, with k ∈ [0; n]
2. Embedding to hidden layer: h = σ(x_k · W)
3. Hidden to output layer: y = h · U
4. Normalize the output, for the LM: p_i = softmax(y_i) = e^{y_i} / Σ_k e^{y_k}, with i ∈ [0; n]

→ This gives us a probability for each target word given the input word(s) (see the sketch below)
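A compact NumPy sketch of the four steps above for a single input word; the layer sizes are made up and the bias terms are omitted, as on this slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())                  # shift for numerical stability
    return e / e.sum()

vocab_size, emb_dim, hidden_dim = 10, 4, 8   # hypothetical toy sizes
C = np.random.normal(0, 0.01, (vocab_size, emb_dim))
W = np.random.normal(0, 0.01, (emb_dim, hidden_dim))
U = np.random.normal(0, 0.01, (hidden_dim, vocab_size))

k = 3                                        # index of the input word
x = C[k]                                     # 1. one-hot to embedding
h = sigmoid(x @ W)                           # 2. embedding to hidden layer
y = h @ U                                    # 3. hidden to output layer
p = softmax(y)                               # 4. one probability per target word
```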

NN LM – Bengio et al. 2003

- Our previous output calculation: y = h · U, with h = σ(x_k · W)
- Introduction of a direct connection to the input: y = h · U + x_k · W2, with h = σ(x_k · W1)

→ The direct connection is also known as a residual, skip, or highway connection (see the sketch below)
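A sketch of the same forward pass with the direct connection added; W2 is the extra input-to-output matrix, and its shape (embedding dimension × vocabulary size) is inferred here rather than given on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, emb_dim, hidden_dim = 10, 4, 8   # hypothetical toy sizes
x  = np.random.normal(0, 0.01, emb_dim)      # embedding of the input word
W1 = np.random.normal(0, 0.01, (emb_dim, hidden_dim))
W2 = np.random.normal(0, 0.01, (emb_dim, vocab_size))
U  = np.random.normal(0, 0.01, (hidden_dim, vocab_size))

h = sigmoid(x @ W1)
y = h @ U + x @ W2                           # direct (skip) connection from the input
```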

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
3. Encoder – Decoder
4. Encoder – Decoder with Attention

Recurrent Neural Network

- Machine translation requires modelling full sentences, not only n-grams
- PB-SMT splits sentences into phrases; decoding (translating) involves re-combining phrases
- We could imagine a way to model phrases with feed-forward neural networks... but:
  - input and output lengths are fixed
  - context is limited, so long-distance dependencies are lost
  - local information (position) needs to be encoded somehow
  - etc.

Recurrent Neural Network – LM

- An RNN LM aims at predicting a word given a context of variable length
- We now talk about timesteps: every token in the input marks one timestep t
- The trick is to use the previous hidden state h_{t-1} in addition to the embedding of the word w_t: y_t = h_t · U, with h_t = σ(x_t · W^x + h_{t-1} · W^h) (see the sketch below)

Recurrent Neural Network – LM

- For the first timestep, we input a beginning-of-sequence token in addition to C(w_1)
- For the last timestep, the network outputs p(w_4 | w_3, w_2, w_1)
- The network encodes large contexts and local positions

Recurrent Neural Network – Training

- The forward pass is the same as in a feed-forward network (plus the previous hidden vector)
- The backward pass involves back-propagation through time
- The error is back-propagated over a few timesteps
- It is possible to back-propagate over the entire input sequence by unfolding the network

Recurrent Neural Network

- With an RNN we can encode a variable-length input sequence
- An input sequence S = w_1, w_2, . . . , w_n is encoded in the final hidden state h_n (see the sketch below)
- This architecture is called an encoder
- Other types of NNs are also suitable as encoders
- For NMT, the most popular encoders are variants of RNNs
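A sketch of using the RNN step above as an encoder: the loop runs over the whole input sequence and only the final hidden state h_n is kept. All sizes and token indices are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

vocab_size, emb_dim, hidden_dim = 10, 4, 8   # hypothetical toy sizes
C  = np.random.normal(0, 0.01, (vocab_size, emb_dim))
Wx = np.random.normal(0, 0.01, (emb_dim, hidden_dim))
Wh = np.random.normal(0, 0.01, (hidden_dim, hidden_dim))

def encode(word_ids):
    """Run the RNN over the sequence; the final hidden state h_n
    is the fixed-size representation of the whole sentence."""
    h = np.zeros(hidden_dim)
    for k in word_ids:
        h = sigmoid(C[k] @ Wx + h @ Wh)
    return h

h_n = encode([1, 5, 2, 7])                   # hypothetical token indices
```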

Gated Hidden Units

- Some tokens in the input sequence are more important than others for a given task
- This is especially true in natural language processing: dependencies between nouns and verbs, adjectives, etc.
- Vanilla RNNs cannot choose what is important or not at each timestep
- We need a mechanism that decides what to encode or not at each timestep

Long Short-Term Memory

- One type of gated hidden unit is called the LSTM
- It introduces three gates: input, forget, and output

Long Short-Term Memory

- The input gate parameter controls how much the new input changes the memory state
- The forget gate parameter regulates how much of the memory state is kept
- The output gate parameter weights the memory state passed on to the next layer

LSTM – Formally

- For each gate a ∈ {input, forget, output}: gate_a = σ(W^{xa} · x_t + W^{ha} · h_{t-1} + W^{ma} · memory_{t-1})
- input_t = σ(W^x · x_t + W^h · h_{t-1})
- memory_t = gate_{input} × input_t + gate_{forget} × memory_{t-1}
- output_t = gate_{output} × memory_t
- h_t = σ(output_t) (see the sketch below)
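A sketch of one LSTM step that follows the slide's formulation literally (gates that also look at the previous memory state, and a sigmoid where the standard LSTM uses tanh); all dimensions and weight names are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

emb_dim, hidden_dim = 4, 8                   # hypothetical toy sizes
rng = np.random.default_rng(0)
def mat(rows, cols):
    return rng.normal(0, 0.01, (rows, cols))

# per gate: one matrix each for the input x_t, the hidden state h_{t-1}
# and the previous memory state (W^{xa}, W^{ha}, W^{ma} on the slide)
gates = ("input", "forget", "output")
W = {a: (mat(emb_dim, hidden_dim), mat(hidden_dim, hidden_dim), mat(hidden_dim, hidden_dim))
     for a in gates}
Wx, Wh = mat(emb_dim, hidden_dim), mat(hidden_dim, hidden_dim)

def lstm_step(x_t, h_prev, memory_prev):
    gate = {a: sigmoid(x_t @ W[a][0] + h_prev @ W[a][1] + memory_prev @ W[a][2])
            for a in gates}
    input_t  = sigmoid(x_t @ Wx + h_prev @ Wh)                    # candidate input
    memory_t = gate["input"] * input_t + gate["forget"] * memory_prev
    output_t = gate["output"] * memory_t
    h_t = sigmoid(output_t)
    return h_t, memory_t

x_t = rng.normal(0, 0.01, emb_dim)           # embedding of the word at timestep t
h_t, memory_t = lstm_step(x_t, np.zeros(hidden_dim), np.zeros(hidden_dim))
```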

Gated Recurrent Units

- Another popular gated hidden unit is called the GRU
- It is simpler (fewer parameters) than the LSTM
- There is no clear performance difference between GRUs and LSTMs

GRU – Formally

- update_t = σ(W_{update} · input_t + U_{update} · state_{t-1} + bias_{update})
- reset_t = σ(W_{reset} · input_t + U_{reset} · state_{t-1} + bias_{reset})
- combination_t = σ(W · input_t + U · (reset_t · state_{t-1}))
- state_t = (1 − update_t) · state_{t-1} + (update_t · combination_t) + bias (see the sketch below)
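And a matching sketch of one GRU step, again following the slide's notation (σ is used for the combination, where the standard GRU uses tanh); all sizes are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

emb_dim, hidden_dim = 4, 8                   # hypothetical toy sizes
rng = np.random.default_rng(0)

W_update, W_reset, W = (rng.normal(0, 0.01, (emb_dim, hidden_dim)) for _ in range(3))
U_update, U_reset, U = (rng.normal(0, 0.01, (hidden_dim, hidden_dim)) for _ in range(3))
b_update, b_reset, b = (np.zeros(hidden_dim) for _ in range(3))

def gru_step(input_t, state_prev):
    update = sigmoid(input_t @ W_update + state_prev @ U_update + b_update)
    reset  = sigmoid(input_t @ W_reset  + state_prev @ U_reset  + b_reset)
    combination = sigmoid(input_t @ W + (reset * state_prev) @ U)
    return (1 - update) * state_prev + update * combination + b

state_t = gru_step(rng.normal(0, 0.01, emb_dim), np.zeros(hidden_dim))
```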

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
3. Encoder – Decoder
4. Encoder – Decoder with Attention

Encoder – Decoder

- We have seen how to encode an input sequence
- Remember what was published in 1997...
- Resulting in today's popular architecture for NMT

Encoder Details

Decoder Details

Encoder – Decoder Limits

- Raymond Mooney is a professor of computer science at the University of Texas at Austin, with more than 29k citations on Google Scholar.
- RNNs with LSTM or GRU cells fail at modelling long input sequences
- Compared with PB-SMT...

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
3. Encoder – Decoder
4. Encoder – Decoder with Attention