Machine Translation - Summer Semester 2018 - Raphael Rubino

Machine Translation Summer Semester 2018

Multilingual Language Technology DFKI Saarbrücken

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances


Word Embeddings

▻ How to represent words?
  ▻ Hand-crafted features: number of characters, uppercase, etc.
  ▻ One-hot vector: sparse representation
  ▻ Self-learned vector representation with real-numbered values: dense representation

Word Embeddings – Starting point

▻ Starting point: one-hot encoding
  ▻ house = ...
  ▻ dog = (0, 0, 0, 0, 1, 0, ..., 0, 0, 0)
  ▻ cat = (0, 0, 1, 0, 0, 0, ..., 0, 0, 0)
  ▻ eat = (0, 0, 0, 0, 0, 0, ..., 0, 1, 0)
  ▻ jump = ...
▻ For each word, one vector, all 0 except one component (dimension)
▻ Size of vector: vocabulary size
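A minimal sketch of one-hot encoding, assuming a toy 5-word vocabulary (the words and indices are made up for illustration):

```python
import numpy as np

# Toy vocabulary: in practice this would be the full training vocabulary.
vocab = ["house", "dog", "cat", "eat", "jump"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word, vocab_size=len(vocab)):
    """Return a sparse 0/1 vector with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))  # [0. 0. 1. 0. 0.]
```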

Word Embeddings – Parameters Matrix

▻ Embedding vectors compose a shared matrix
▻ Definition of the embedding matrix shape:
  ▻ height = vocabulary size
  ▻ width = hyperparameter, usually ∈ [100; 1000]
▻ Embeddings vs one-hot: dimensionality reduction
▻ Embedding matrix shared between words
▻ Embedding matrix usually considered as a layer of the NN
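A minimal sketch of the shape definition, assuming a toy vocabulary of 5 words and an embedding width of 4 (real systems use the full vocabulary and a width in [100; 1000]):

```python
import numpy as np

vocab_size = 5       # height: one row per word in the vocabulary
embedding_dim = 4    # width: hyperparameter, usually in [100; 1000]

# One shared parameter matrix; each row is the embedding of one word.
C = np.zeros((vocab_size, embedding_dim))
print(C.shape)  # (5, 4)
```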

Word Embeddings – Parameters Matrix

From One-hot to Embedding

▻ We start from one-hot vectors, we want embeddings
▻ From the one-hot vector oh_i, to get the corresponding embedding x_i: C(oh_i) = x_i = C·oh_i^T
▻ This way, x_i and thus C are learnable parameters of the model
▻ Remember: y_k = Σ_j sigmoid((Σ_i x_i·w_ij) + b_j)·u_kj
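Continuing the toy example, a minimal numpy sketch showing that multiplying the one-hot vector with C is just a row lookup (sizes are illustrative):

```python
import numpy as np

vocab_size, embedding_dim = 5, 4
C = np.random.normal(0.0, 0.1, size=(vocab_size, embedding_dim))

# One-hot vector for the word at index 2 ("cat" in the toy vocabulary).
oh = np.zeros(vocab_size)
oh[2] = 1.0

# Multiplying by a one-hot vector just selects one row of C:
x = oh @ C                   # shape (embedding_dim,)
assert np.allclose(x, C[2])  # identical to a direct row lookup
```

This is why implementations store C as a lookup table instead of performing an actual matrix multiplication.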

Word Embeddings – Initialization

▻ The embedding matrix C has to be initialized... with what?
▻ Usually random small numbers following a normal distribution with zero mean
▻ There are pre-trained embeddings available online!
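A minimal sketch of the usual initialization, with the same toy sizes as above:

```python
import numpy as np

vocab_size, embedding_dim = 5, 4  # toy sizes, as before

# Small random values drawn from a zero-mean normal distribution.
C = np.random.normal(loc=0.0, scale=0.1, size=(vocab_size, embedding_dim))
```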

NN LM – Forward Pass

1. One-hot to embedding: x_k = C(oh_k) = C·oh_k, with k ∈ [0; n]
2. Embedding to hidden layer: h = σ(x_k·W)
3. Hidden to output layer: y = h·U
4. Normalize output, for LM: p_i = softmax(y_i) = exp(y_i) / Σ_k exp(y_k), with i ∈ [0; n]

→ This gives us a probability for each target word given the input word(s)
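A minimal numpy sketch of this forward pass with toy sizes; the parameters C, W and U are randomly initialized here purely for illustration:

```python
import numpy as np

vocab_size, embedding_dim, hidden_dim = 5, 4, 8

# Parameters (randomly initialized for illustration).
C = np.random.normal(0.0, 0.1, size=(vocab_size, embedding_dim))  # embeddings
W = np.random.normal(0.0, 0.1, size=(embedding_dim, hidden_dim))  # input -> hidden
U = np.random.normal(0.0, 0.1, size=(hidden_dim, vocab_size))     # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())           # subtract max for numerical stability
    return e / e.sum()

def forward(word_index):
    oh = np.zeros(vocab_size); oh[word_index] = 1.0
    x = oh @ C                        # one-hot to embedding
    h = sigmoid(x @ W)                # embedding to hidden layer
    y = h @ U                         # hidden to output layer
    return softmax(y)                 # probability for each target word

print(forward(2))                     # distribution over the 5-word vocabulary
```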

NN LM – Bengio et al. 2003

▻ Our previous output calculation: y = h·U, with h = σ(x_k·W)
▻ Introduction of a direct connection with the input: y = h·U + x_k·W_2, with h = σ(x_k·W_1)

→ Direct connection aka residual aka skip aka highway connection
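A sketch of the direct connection on top of the previous forward pass (toy sizes; W1, W2 and U are random placeholders):

```python
import numpy as np

embedding_dim, hidden_dim, vocab_size = 4, 8, 5
W1 = np.random.normal(0.0, 0.1, size=(embedding_dim, hidden_dim))
W2 = np.random.normal(0.0, 0.1, size=(embedding_dim, vocab_size))
U  = np.random.normal(0.0, 0.1, size=(hidden_dim, vocab_size))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_with_direct_connection(x):
    h = sigmoid(x @ W1)
    return h @ U + x @ W2   # hidden path plus the direct (skip) connection

y = forward_with_direct_connection(np.random.normal(size=embedding_dim))
```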

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances


Recurrent Neural Network (RNN)

▻ Machine translation requires modelling full sentences, not only n-grams
▻ PB-SMT splits sentences into phrases; decoding (translating) involves re-combining phrases
▻ We could imagine a way to model phrases with feed-forward neural networks... but
  ▻ input and output lengths are fixed
  ▻ context is limited, long-distance dependencies are lost
  ▻ local information (position) needs to be encoded somehow
  ▻ etc.

RNN – Motivation

▻ Encodes large contexts
▻ Produces a fixed-length vector for a whole sequence
▻ Handles variable lengths of input and output
▻ Allows an elegant end-to-end sequence-to-sequence model

Figure: Christopher Olah

RNN – Architecture

Figure: the German input sentence "Es nervt, wenn Landkarten nicht aktuell sind." passed through the embedding matrix and a recurrent NN, producing an encoded sentence representation.

RNN – Hidden Cell

Figure: Christopher Olah

RNN – Hidden Cell

Figure: Christopher Olah

▻ From one-hot encoding to embedding: x_t = E·w_t

RNN – Hidden Cell

Figure: Christopher Olah

▻ From one-hot encoding to embedding: x_t = E·w_t
▻ From embedding to hidden representation: h_t = σ(x_t·W^x + h_{t-1}·W^h + b), with σ a non-linear function (activation), here the hyperbolic tangent

RNN – Hidden Cell

▻ One-liner to compute hidden state h at time-step t: h_t = σ((E·w_t)·W^x + h_{t-1}·W^h + b)

Figure: Christopher Olah

RNN – Hidden Cell

▻ One-liner to compute hidden state h at time-step t: h_t = σ((E·w_t)·W^x + h_{t-1}·W^h + b)
▻ Parameters of the model:
  ▻ Embedding matrix E
  ▻ Weight matrix from input to hidden W^x
  ▻ Weight matrix for recurrence W^h
  ▻ Bias vector from input to hidden b

Figure: Christopher Olah
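A minimal numpy sketch of this recurrent step, with toy sizes, random parameters, and tanh as the activation named above; running it over a sequence also yields the encoder of the following slides:

```python
import numpy as np

vocab_size, embedding_dim, hidden_dim = 5, 4, 8

# Parameters of the model (random values for illustration).
E   = np.random.normal(0.0, 0.1, size=(vocab_size, embedding_dim))  # embedding matrix
W_x = np.random.normal(0.0, 0.1, size=(embedding_dim, hidden_dim))  # input -> hidden
W_h = np.random.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))     # recurrence
b   = np.zeros(hidden_dim)                                          # bias

def rnn_step(word_index, h_prev):
    """Compute h_t = tanh((E.w_t).W_x + h_{t-1}.W_h + b)."""
    x_t = E[word_index]                    # one-hot lookup, i.e. E . w_t
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

h = np.zeros(hidden_dim)
for token in [0, 2, 3]:                    # a toy input sequence
    h = rnn_step(token, h)                 # the final h encodes the whole sequence
```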

RNN – Encoder

▻ From one-hot to continuous space representation (1)
▻ The final hidden state (full red circle) encodes the whole input sequence

Figure: Kyunghyun Cho

RNN – Sentence Representation

Figure: Kyunghyun Cho

RNN – Decoder

▻ Encoder final hidden state h_n fed as input to the decoder (noted 1)
▻ The decoder hidden state z_i is obtained following: z_i = σ(h_n·W^hd + u_{i-1}·W^u + z_{i-1}·W^z)

Figure: Kyunghyun Cho

RNN – Decoder

▻ For each decoder hidden state z_i, we want to generate a target token
▻ We need a target vocabulary and a target embedding matrix E
▻ Each word w_k in the target vocabulary is compared to z_i: e(w_k) = w_k·z_i + b

RNN – Decoder

▻ Let's normalise these scores e: p(w_i = k | w_1, ..., w_{i-1}, h_n) = exp(e(w_k)) / Σ_j exp(e(w_j))
▻ This normalisation is called softmax
▻ Gives us a probability for each target word
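Putting the last three decoder slides together as a numpy sketch (toy sizes; E_t, W_hd, W_u, W_z are random placeholders, and σ is taken as tanh):

```python
import numpy as np

target_vocab = 6
dim = 8   # hidden and target-embedding size kept equal so w_k . z_i is well defined

# Decoder parameters (random placeholders for illustration).
E_t  = np.random.normal(0.0, 0.1, size=(target_vocab, dim))  # target embedding matrix
W_hd = np.random.normal(0.0, 0.1, size=(dim, dim))           # encoder state -> decoder
W_u  = np.random.normal(0.0, 0.1, size=(dim, dim))           # previous output -> decoder
W_z  = np.random.normal(0.0, 0.1, size=(dim, dim))           # decoder recurrence
b    = np.zeros(target_vocab)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(h_n, u_prev, z_prev):
    """One decoder time-step: z_i = tanh(h_n.W_hd + u_{i-1}.W_u + z_{i-1}.W_z)."""
    z_i = np.tanh(h_n @ W_hd + u_prev @ W_u + z_prev @ W_z)
    scores = E_t @ z_i + b           # e(w_k) = w_k . z_i + b for every target word
    return z_i, softmax(scores)      # softmax gives a probability for each target word

h_n = np.random.normal(size=dim)     # stand-in for the encoder's final hidden state
z, p = decoder_step(h_n, np.zeros(dim), np.zeros(dim))
```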

RNN – Encoder Decoder

Figure: Kyunghyun Cho

RNN – Generating Target Sequence

Figure: Kyunghyun Cho

▻ For each decoder time-step:
  1. Produce a list of translation candidates
  2. Calculate the current sequence probability based on previous time-steps
  3. Prune candidates that fall outside the beam (sketched below)
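A toy sketch of that loop as beam search; next_word_distribution is a hypothetical stand-in for a decoder step like the one above, and end-of-sentence handling is left out for brevity:

```python
import numpy as np

def beam_search(next_word_distribution, vocab_size, beam_size=3, max_len=10):
    """Keep the `beam_size` most probable partial sequences at each time-step."""
    beam = [([], 0.0)]                                   # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beam:
            p = next_word_distribution(seq)              # distribution over the vocabulary
            for w in range(vocab_size):
                candidates.append((seq + [w], logp + np.log(p[w])))
        # Prune: keep only the candidates inside the beam.
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beam

# Toy usage with a uniform (hypothetical) next-word distribution.
hyps = beam_search(lambda seq: np.full(5, 1.0 / 5), vocab_size=5)
```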

RNN – Limits

▻ Some tokens in the input or output sequences are more important than others
▻ Especially in natural language processing: dependencies between nouns and verbs, adjectives, etc.
▻ Vanilla RNNs cannot choose what is important or not for each time-step
▻ We need a mechanism to decide what to encode or not for each time-step

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances

RNN – Hidden Cell

Figure: Christopher Olah

▻ From embedding to hidden representation: h_t = σ(x_t·W^x + h_{t-1}·W^h + b), with σ a non-linear function (activation), here the hyperbolic tangent

RNN – LSTM Cell

Figure: Christopher Olah

RNN – LSTM Cell

Figure: Christopher Olah

▻ Four NN layers are added inside each hidden cell

RNN – LSTM Cell

Figure: Christopher Olah

▻ Four NN layers are added inside each hidden cell
▻ These layers (or gates) regulate the information flow

RNN – LSTM Cell

Figure: Christopher Olah

▻ Four NN layers are added inside each hidden cell
▻ These layers (or gates) regulate the information flow
▻ A new context vector C_t is calculated for each time-step

RNN – LSTM Input Gate

Figure: Christopher Olah

RNN – LSTM Forget Gate

Figure: Christopher Olah

RNN – LSTM Context Vector

Figure: Christopher Olah

RNN – LSTM Output Gate

Figure: Christopher Olah
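A sketch of the standard LSTM equations these figures illustrate; the exact parametrisation varies between implementations, and the sizes and parameter matrices below are toy placeholders:

```python
import numpy as np

input_dim, hidden_dim = 4, 8

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate / layer (random placeholders).
def params():
    return (np.random.normal(0.0, 0.1, size=(input_dim + hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

(W_f, b_f), (W_i, b_i), (W_c, b_c), (W_o, b_o) = params(), params(), params(), params()

def lstm_step(x_t, h_prev, C_prev):
    """One LSTM time-step: forget gate, input gate, new context vector, output gate."""
    z = np.concatenate([x_t, h_prev])
    f_t = sigmoid(z @ W_f + b_f)              # forget gate: what to drop from C_{t-1}
    i_t = sigmoid(z @ W_i + b_i)              # input gate: what to add to the context
    C_tilde = np.tanh(z @ W_c + b_c)          # candidate context values
    C_t = f_t * C_prev + i_t * C_tilde        # new context vector
    o_t = sigmoid(z @ W_o + b_o)              # output gate: what to expose as h_t
    h_t = o_t * np.tanh(C_t)
    return h_t, C_t

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, C = lstm_step(np.random.normal(size=input_dim), h, C)
```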

RNN – Gated Cell Variants

▻ Several variants for gated hidden cells
▻ Adding connections between the gates and the context vector (peephole connections)
▻ Combining the input and forget gates into a single update gate (GRU)

→ Comparison of some variants in "LSTM: A Search Space Odyssey", Greff et al. 2015
→ Evaluation of (thousands of) RNN architectures in "An Empirical Exploration of Recurrent Network Architectures", Jozefowicz et al. 2015

RNN – Performance

Figure: Kyunghyun Cho

RNN – Performance

Figure: Kyunghyun Cho

RNN – Limits

▻ RNNs fail at modelling long sequences
▻ They apparently forget early time-steps
▻ Experiments show that reversing the input sequence helps a bit
▻ Compressing an entire input sequence into a fixed-size vector → loss of information
▻ Giving a single vector to the decoder limits its access to the input data

RNN – Tricks and Variants

▻ RNNs fail at modelling long sequences and forget early time-steps → bi-directional RNNs (see the sketch below)
▻ Variants in the types of NN used as encoder or decoder (convolution, skip connections, etc.)
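A minimal sketch of the bi-directional idea: one RNN reads the input left-to-right, another right-to-left, and their hidden states are concatenated position by position (toy sizes, random parameters):

```python
import numpy as np

hidden_dim, input_dim, seq_len = 8, 4, 5
sequence = [np.random.normal(size=input_dim) for _ in range(seq_len)]

# Two independent parameter sets: one RNN per reading direction.
def make_step():
    W_x = np.random.normal(0.0, 0.1, size=(input_dim, hidden_dim))
    W_h = np.random.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
    return lambda x, h: np.tanh(x @ W_x + h @ W_h)

fwd_step, bwd_step = make_step(), make_step()

def run(step, seq):
    h, states = np.zeros(hidden_dim), []
    for x in seq:
        h = step(x, h)
        states.append(h)
    return states

fwd = run(fwd_step, sequence)
bwd = run(bwd_step, sequence[::-1])[::-1]

# Each position gets a forward and a backward state, concatenated.
annotations = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```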

RNN – Limits

▻ Compressing an entire input sequence into a fixed-size vector → loss of information

→ This problem remains

RNN – Limits

▻ Compressing an entire input sequence into a fixed-size vector → loss of information

→ This problem remains

▻ We need to let the decoder access the input sequence

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances

Encoder – Decoder with Attention

Figure: Christopher Olah

Encoder – Decoder with Attention

Figure: Dzmitry Bahdanau

Encoder – Decoder with Attention

Figure: Dzmitry Bahdanau

▻ Alignment scores calculated for each pair of source–target tokens: e_ij = a(s_{i-1}, h_j) = v_a·σ(W_a·s_{i-1} + U_a·h_j)
▻ α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)
▻ c_i = Σ_{j=1}^{n} α_ij·h_j
▻ where i and j are decoder and encoder time-steps respectively
▻ The context vector c_i is used to produce the target token y_i
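A numpy sketch of these three equations with toy sizes; W_a, U_a and v_a are random placeholders and σ is taken as tanh here:

```python
import numpy as np

hidden_dim, attn_dim, n = 8, 6, 5           # n = number of encoder time-steps

# Attention parameters (random placeholders for illustration).
W_a = np.random.normal(0.0, 0.1, size=(hidden_dim, attn_dim))
U_a = np.random.normal(0.0, 0.1, size=(hidden_dim, attn_dim))
v_a = np.random.normal(0.0, 0.1, size=attn_dim)

def attention(s_prev, encoder_states):
    """Return the context vector c_i and the attention weights alpha_ij."""
    # Alignment score e_ij for every encoder state h_j.
    e = np.array([v_a @ np.tanh(s_prev @ W_a + h_j @ U_a) for h_j in encoder_states])
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()             # softmax over the encoder time-steps
    c = sum(a * h for a, h in zip(alpha, encoder_states))   # weighted average
    return c, alpha

encoder_states = [np.random.normal(size=hidden_dim) for _ in range(n)]
s_prev = np.random.normal(size=hidden_dim)  # previous decoder hidden state
c_i, alpha_i = attention(s_prev, encoder_states)
```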

Encoder – Decoder with Attention

Figure: Dzmitry Bahdanau

▻ Intuition: we want to focus on what is important in the source to produce the target
▻ The context vector c_i for each decoder time-step encodes the whole input sequence with a specific focus
▻ This focus is obtained with a weighted average of the source hidden states
▻ Similar to a query–value operation:
  1. The decoder generates the query (previous decoder time-step)
  2. The query is compared to all possible values (source hidden states)
  3. The most relevant values are emphasized in the context vector

Encoder – Decoder with Attention

Figure: Kyunghyun Cho

Visualizing Attention

Figure: Dzmitry Bahdanau

Visualizing Attention

Figure: Kyunghyun Cho

Visualizing Attention

Figure: Kyunghyun Cho


Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances

Performances – Attention

Performances – Evolution

Performances – WMT17

Performances – Fine Grained

Performances – Google Results

Outline

1. A Few Words on Embeddings
2. Recurrent Neural Network
   ▻ Motivation & Architecture
   ▻ Gating in Hidden Units
   ▻ Attention Model
3. Performances and Recent Advances
   ▻ NMT Performances
   ▻ Recent NMT Advances

Data Level

▻ Vocabulary size is an issue for NMT (embedding matrix, softmax computation, etc.)
▻ One solution: keep the n most frequent words
▻ A better solution: sub-word units, between character-level and word-level tokens (byte pair encoding, sketched below)
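A toy sketch of the byte pair encoding idea (repeatedly merge the most frequent adjacent symbol pair); the corpus and the number of merges below are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if tuple(symbols[i:i + 2]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                     # the number of merges is a hyperparameter
    words = merge_pair(words, most_frequent_pair(words))
print(words)   # frequent character sequences are now single sub-word units
```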

Data Level

▻ The amount of data required to train NMT systems is large (dozens of millions of sentence pairs)
▻ Lack of domain- and genre-specific resources...
▻ One solution: mixing language pairs and domains
▻ A better solution: adding specific tokens to specify the language, domain, speaker, etc.
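A minimal sketch of that token trick; the tag format below is made up for illustration:

```python
# Prepend made-up tags so one model can serve several languages/domains.
def tag_source(sentence, language="de", domain="news"):
    return "<2%s> <%s> %s" % (language, domain, sentence)

print(tag_source("maps should be up to date"))  # "<2de> <news> maps should be up to date"
```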

Data Level

▻ For specific genres and domains, it would be nice to force the decoder into producing particular output
▻ It is feasible using domain- or genre-specific target data
▻ Monolingual corpora for specific domains and genres are available
▻ Idea (sketched in code below):
  1. Take a domain-specific monolingual corpus in the target language
  2. Take a translation system for the pair target–source
  3. Translate the monolingual data
  4. Use the produced automatic translations as source and the monolingual specific data as target
  5. You get additional training data!
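A sketch of that recipe; translate_target_to_source is a hypothetical stand-in for an existing target–source MT system and the corpus is placeholder data:

```python
def translate_target_to_source(sentence):
    """Hypothetical stand-in for an existing target->source MT system."""
    return "<machine-translated source for: %s>" % sentence

# Domain-specific monolingual corpus in the target language (placeholder data).
monolingual_target = ["domain specific sentence 1", "domain specific sentence 2"]

# Back-translation: synthetic source side, genuine in-domain target side.
synthetic_pairs = [(translate_target_to_source(t), t) for t in monolingual_target]
# synthetic_pairs can now be added to the parallel training data.
```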

Neural Network Architecture

▻ RNNs require a lot of computing power to train
▻ The attention model appears to help improve performance
▻ Recent works focus on attention-only architectures
▻ In addition to the encoder–decoder attention, use of self-attention
▻ In addition to a single context vector, computation of multi-head attention
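A minimal numpy sketch of self-attention in the scaled dot-product form used by attention-only architectures; the projection matrices and sizes are placeholders, and multi-head attention repeats this with several independent projections and concatenates the results:

```python
import numpy as np

seq_len, model_dim, head_dim = 5, 8, 4

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Token representations and projection matrices (random placeholders).
X   = np.random.normal(size=(seq_len, model_dim))
W_q = np.random.normal(0.0, 0.1, size=(model_dim, head_dim))
W_k = np.random.normal(0.0, 0.1, size=(model_dim, head_dim))
W_v = np.random.normal(0.0, 0.1, size=(model_dim, head_dim))

def self_attention(X):
    """Every position attends to every other position of the same sequence."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(head_dim)    # one score per pair of positions
    weights = softmax_rows(scores)          # attention distribution per position
    return weights @ V                      # one context vector per position

context = self_attention(X)                 # shape (seq_len, head_dim)
```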