Machine Translation Summer Semester 2018
Multilingual Language Technology DFKI Saarbrücken
Outline
1. A Few Words on Embeddings
2. Recurrent Neural Network
   - Motivation & Architecture
   - Gating in Hidden Units
   - Attention Model
3. Performances and Recent Advances
   - NMT Performances
   - Recent NMT Advances
Word Embeddings
- How to represent words?
  - Hand-crafted features: nb characters, uppercase, etc.
  - One-hot vector: sparse representation
  - Self-learned vector representation with real-numbered values: dense representation
Word Embeddings – Starting point
- Starting point: one-hot encoding
  house
  ...
  dog  = (0, 0, 0, 0, 1, 0, ..., 0, 0, 0)
  cat  = (0, 0, 1, 0, 0, 0, ..., 0, 0, 0)
  eat  = (0, 0, 0, 0, 0, 0, ..., 0, 1, 0)
  jump
  ...
- For each word, one vector, all components 0 except one (dimension)
- Size of vector: vocabulary size
Word Embeddings – Parameters Matrix
- Embedding vectors compose a shared matrix
- Definition of the embedding matrix shape:
  - height = vocabulary size
  - width = hyperparameter, usually ∈ [100; 1000]
- Embeddings vs one-hot: dimensionality reduction
- Embedding matrix shared between words
- Embedding matrix usually considered as a layer of the NN
Word Embeddings – Parameters Matrix
From One-hot to Embedding
- We start from one-hot vectors, we want embeddings
- From the one-hot vector oh_i, we get the corresponding embedding x_i: C(oh_i) = x_i = C · oh_i^T
- This way, x_i and thus C are learnable parameters of the model
- Remember: y_k = Σ_j sigmoid((Σ_i x_i w_ij) + b_j) u_kj
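The lookup C(oh_i) = C · oh_i^T can be sketched in a few lines of NumPy. All sizes here are hypothetical (a 10-word vocabulary, embedding width 4), and C is stored with height = vocabulary, width = embedding size, so the lookup reads as oh · C:

```python
import numpy as np

# Hypothetical sizes: vocabulary of 10 words, embedding width 4.
vocab_size, emb_dim = 10, 4
rng = np.random.default_rng(0)
C = rng.normal(0.0, 0.1, size=(vocab_size, emb_dim))  # embedding matrix

# One-hot vector for word index 2 ("cat" in the earlier example).
oh = np.zeros(vocab_size)
oh[2] = 1.0

# Multiplying the one-hot vector by C selects exactly one row of the
# matrix: the embedding lookup is just row indexing.
x = oh @ C
assert np.allclose(x, C[2])
```

This also shows why the lookup is cheap in practice: no actual matrix product is needed, only an index into C.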
Word Embeddings – Initialization
- The embedding matrix C has to be initialized... with what?
- Usually with small random numbers drawn from a normal distribution with zero mean
- There are pre-trained embeddings available online!
NN LM – Forward Pass
1. One-hot to embedding: x_k = C(oh_k) = C · oh_k, with k ∈ [0; n]
2. Embedding to hidden layer: h = σ(x_k · W)
3. Hidden to output layer: y = h · U
4. Normalize output, for LM: p_i = softmax(y_i) = e^{y_i} / Σ_k e^{y_k}, with i ∈ [0; n]
→ This gives us a probability for each target word given the input word(s)
NN LM – Bengio et al. 2003
- Our previous output calculation: y = h · U with h = σ(x_k · W)
- Introduction of a direct connection to the input: y = h · U + x_k · W2 with h = σ(x_k · W1)
→ Direct connection aka residual aka skip aka highway connection
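The direct connection changes only the output computation. A minimal sketch with hypothetical sizes, where W2 carries the input straight to the output alongside the hidden path:

```python
import numpy as np

# Hypothetical sizes: vocabulary 10, embedding 4, hidden 8.
V, d, H = 10, 4, 8
rng = np.random.default_rng(2)
x = rng.normal(size=d)                   # an input embedding x_k
W1 = rng.normal(0.0, 0.1, size=(d, H))   # input -> hidden
U = rng.normal(0.0, 0.1, size=(H, V))    # hidden -> output
W2 = rng.normal(0.0, 0.1, size=(d, V))   # direct input -> output (skip)

h = np.tanh(x @ W1)
y = h @ U + x @ W2   # the output sees the input both through h and directly
assert y.shape == (V,)
```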
Recurrent Neural Network (RNN)
- Machine translation requires modelling full sentences, not only n-grams
- PB-SMT splits sentences into phrases; decoding (translating) involves re-combining phrases
- We could imagine a way to model phrases with feed-forward neural networks... but
  - input and output lengths are fixed
  - context is limited, long-distance dependencies are lost
  - local information (position) needs to be encoded somehow
  - etc.
RNN – Motivation
- Encodes large contexts
- Produces a fixed-length vector for a whole sequence
- Handles variable lengths of input and output
- Allows an elegant end-to-end sequence-to-sequence model
Figure: Christopher Olah
RNN – Architecture
Figure: the German input sentence "Es nervt , wenn Landkarten nicht aktuell sind ." fed token by token through the embedding matrix into a recurrent NN, producing the encoded sentence
RNN – Hidden Cell
Figure: Christopher Olah
RNN – Hidden Cell
Figure: Christopher Olah
- From one-hot encoding to embedding: x_t = E · w_t
RNN – Hidden Cell
Figure: Christopher Olah
- From one-hot encoding to embedding: x_t = E · w_t
- From embedding to hidden representation: h_t = σ(x_t · W^x + h_{t-1} · W^h + b), with σ a non-linear function (activation), here the hyperbolic tangent
RNN – Hidden Cell
- One-liner to compute the hidden state h at time-step t: h_t = σ((E · w_t) · W^x + h_{t-1} · W^h + b)
Figure: Christopher Olah
RNN – Hidden Cell
- One-liner to compute the hidden state h at time-step t: h_t = σ((E · w_t) · W^x + h_{t-1} · W^h + b)
- Parameters of the model:
  - Embedding matrix E
  - Weight matrix from input to hidden W^x
  - Weight matrix for recurrence W^h
  - Bias vector from input to hidden b
Figure: Christopher Olah
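The recurrence can be sketched as a short NumPy loop. All sizes and the example word indices are hypothetical; tanh plays the role of σ:

```python
import numpy as np

# Hypothetical sizes: vocabulary 10, embedding 4, hidden 8.
V, d, H = 10, 4, 8
rng = np.random.default_rng(3)
E = rng.normal(0.0, 0.1, size=(V, d))    # embedding matrix
Wx = rng.normal(0.0, 0.1, size=(d, H))   # input -> hidden
Wh = rng.normal(0.0, 0.1, size=(H, H))   # recurrence
b = np.zeros(H)                          # bias

sequence = [2, 5, 1]                     # word indices of an input sentence
h = np.zeros(H)                          # initial hidden state
for w in sequence:
    x = E[w]                             # one-hot to embedding (row lookup)
    h = np.tanh(x @ Wx + h @ Wh + b)     # h_t = sigma(x_t.Wx + h_{t-1}.Wh + b)

# h now encodes the whole sequence: the encoder's final hidden state.
assert h.shape == (H,)
```

Note how the same parameters E, Wx, Wh, b are reused at every time-step; only the hidden state h changes, which is what lets the RNN accept sequences of any length.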
RNN – Encoder
- From one-hot to continuous space representation (1)
- The final hidden state (full red circle) encodes the whole input sequence
Figure: Kyunghyun Cho
RNN – Sentence Representation
Figure: Kyunghyun Cho
RNN – Decoder
- Encoder final hidden state h_n fed as input to the decoder (noted 1)
- The decoder hidden state z_i is obtained following: z_i = σ(h_n · W^hd + u_{i-1} · W^u + z_{i-1} · W^z)
Figure: Kyunghyun Cho
RNN – Decoder
- For each decoder hidden state z_i, we want to generate a target token
- We need a target vocabulary and a target embedding matrix E
- Each word w_k in the target vocabulary is compared to z_i: e(w_k) = w_k · z_i + b
RNN – Decoder
- Let's normalise these scores e:
  p(w_i = k | w_1, ..., w_{i-1}, h_n) = exp(e(w_k)) / Σ_j exp(e(w_j))
- This normalisation is called softmax
- It gives us a probability for each target word
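Scoring and normalising can be done for the whole target vocabulary in one matrix product. A minimal NumPy sketch, with hypothetical sizes and the name E_tgt chosen here for the target embedding matrix:

```python
import numpy as np

# Hypothetical sizes: target vocabulary 6, hidden size 8.
V_tgt, H = 6, 8
rng = np.random.default_rng(4)
E_tgt = rng.normal(0.0, 0.1, size=(V_tgt, H))  # one row per target word w_k
b = np.zeros(V_tgt)
z = rng.normal(size=H)                         # a decoder hidden state z_i

scores = E_tgt @ z + b        # e(w_k) = w_k . z_i + b, for every word at once
p = np.exp(scores - scores.max())
p /= p.sum()                  # softmax: one probability per target word
assert np.isclose(p.sum(), 1.0)
```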
RNN – Encoder Decoder
Figure: Kyunghyun Cho
RNN – Generating Target Sequence
Figure: Kyunghyun Cho
- For each decoder time-step:
  1. Produce a list of translation candidates
  2. Calculate the current sequence probability based on previous time-steps
  3. Prune candidates outside the beam size
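The three steps above describe beam search. A toy sketch of the pruning loop, where `step_probs` is a hypothetical stand-in for the real decoder step (it just returns a deterministic pseudo-random distribution over a tiny vocabulary):

```python
import numpy as np

def step_probs(prefix, V):
    # Stand-in for the decoder: a deterministic pseudo-random
    # probability distribution over the target vocabulary.
    rng = np.random.default_rng(sum(prefix) * 31 + len(prefix))
    p = rng.random(V)
    return p / p.sum()

V, beam_size, steps = 5, 2, 3
beam = [((), 0.0)]                      # pairs of (prefix, log-probability)
for _ in range(steps):
    candidates = []
    for prefix, logp in beam:
        p = step_probs(prefix, V)       # 1. candidates for this prefix
        for w in range(V):              # 2. current sequence probability
            candidates.append((prefix + (w,), logp + np.log(p[w])))
    candidates.sort(key=lambda c: c[1], reverse=True)
    beam = candidates[:beam_size]       # 3. prune outside the beam size

assert len(beam) == beam_size
```

Log-probabilities are summed instead of multiplying probabilities, which avoids numerical underflow on long sequences.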
RNN – Limits
- Some tokens in the input or output sequences are more important than others
- Especially in natural language processing: dependencies between nouns and verbs, adjectives, etc.
- Vanilla RNNs cannot choose what is important or not for each time-step
- We need a mechanism to decide what to encode or not for each time-step
RNN – Hidden Cell
Figure: Christopher Olah
- From embedding to hidden representation: h_t = σ(x_t · W^x + h_{t-1} · W^h + b), with σ a non-linear function (activation), here the hyperbolic tangent
RNN – LSTM Cell
Figure: Christopher Olah
RNN – LSTM Cell
Figure: Christopher Olah
- Four NN layers are added inside each hidden cell
RNN – LSTM Cell
Figure: Christopher Olah
- Four NN layers are added inside each hidden cell
- These layers (or gates) regulate the information flow
RNN – LSTM Cell
Figure: Christopher Olah
- Four NN layers are added inside each hidden cell
- These layers (or gates) regulate the information flow
- A new context vector C_t is calculated at each time-step
RNN – LSTM Input Gate
Figure: Christopher Olah
RNN – LSTM Forget Gate
Figure: Christopher Olah
RNN – LSTM Context Vector
Figure: Christopher Olah
RNN – LSTM Output Gate
Figure: Christopher Olah
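The four gate slides above can be summarised as one LSTM step. A minimal NumPy sketch of the standard formulation (sizes hypothetical, bias vectors omitted for brevity):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical sizes: hidden 4, input embedding 3; biases omitted.
H, d = 4, 3
rng = np.random.default_rng(5)
# one weight matrix per gate: forget, input, candidate, output
Wf, Wi, Wc, Wo = (rng.normal(0.0, 0.1, size=(d + H, H)) for _ in range(4))

x = rng.normal(size=d)               # current input embedding x_t
h, c = np.zeros(H), np.zeros(H)      # previous hidden state and context

z = np.concatenate([x, h])           # input and previous hidden state together
f = sigmoid(z @ Wf)                  # forget gate: what to drop from C_{t-1}
i = sigmoid(z @ Wi)                  # input gate: what to add
c_tilde = np.tanh(z @ Wc)            # candidate context values
c = f * c + i * c_tilde              # new context vector C_t
o = sigmoid(z @ Wo)                  # output gate: what to expose
h = o * np.tanh(c)                   # new hidden state h_t
assert h.shape == (H,) and c.shape == (H,)
```

The key point is that c flows from step to step modified only by element-wise operations, which is what lets gradients survive long sequences.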
RNN – Gated Cell Variants
- Several variants of gated hidden cells exist
- Adding connections between the gates and the context vector (peephole connections)
- Combining the input and forget gates into a single update gate (GRU)
→ Comparison of some variants in LSTM: A Search Space Odyssey, Greff et al. 2015
→ Evaluation of (thousands of) RNN architectures in An Empirical Exploration of Recurrent Network Architectures, Jozefowicz et al. 2015
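For comparison with the LSTM step, here is a minimal sketch of one GRU step, where a single update gate replaces the separate input and forget gates (sizes hypothetical, biases omitted):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical sizes: hidden 4, input embedding 3; biases omitted.
H, d = 4, 3
rng = np.random.default_rng(6)
Wz, Wr, Wh = (rng.normal(0.0, 0.1, size=(d + H, H)) for _ in range(3))

x, h = rng.normal(size=d), np.zeros(H)
z = sigmoid(np.concatenate([x, h]) @ Wz)            # update gate
r = sigmoid(np.concatenate([x, h]) @ Wr)            # reset gate
h_tilde = np.tanh(np.concatenate([x, r * h]) @ Wh)  # candidate state
h = (1 - z) * h + z * h_tilde   # interpolate old state and candidate
assert h.shape == (H,)
```

With no separate context vector, the GRU has fewer parameters than the LSTM while keeping the gated information flow.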
RNN – Performance
Figure: Kyunghyun Cho
RNN – Performance
Figure: Kyunghyun Cho
RNN – Limits
- RNNs fail at modelling long sequences
- They apparently forget early time-steps
- Experiments show that reversing the input sequence helps a bit
- Compressing an entire input sequence into a fixed-size vector → loss of information
- Giving a single vector to the decoder limits its access to the input data
RNN – Tricks and Variants
- RNNs fail at modelling long sequences and forget early time-steps → bi-directional RNNs
- Variants in the types of NN used as encoder or decoder (convolution, skip connections, etc.)
RNN – Limits
- Compressing an entire input sequence into a fixed-size vector → loss of information
- This problem remains
RNN – Limits
- Compressing an entire input sequence into a fixed-size vector → loss of information
- This problem remains
- We need to let the decoder access the input sequence
Encoder – Decoder with Attention
Figure: Christopher Olah
Encoder – Decoder with Attention
Figure: Dzmitry Bahdanau
Encoder – Decoder with Attention
Figure: Dzmitry Bahdanau
- Alignment scores calculated for each pair of source–target tokens:
  e_ij = a(s_{i-1}, h_j) = v_a · σ(W_a · s_{i-1} + U_a · h_j)
- α_ij = exp(e_ij) / Σ_{k=1}^{n} exp(e_ik)
- c_i = Σ_{j=1}^{n} α_ij · h_j
  where i and j are decoder and encoder time-steps respectively
- The context vector c_i is used to produce the target token y_i
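One attention step, from alignment scores to context vector, can be sketched in NumPy. All sizes are hypothetical (n = 5 encoder states of width 8, alignment layer width 6):

```python
import numpy as np

# Hypothetical sizes: n = 5 encoder states, hidden width 8, alignment width 6.
n, H, A = 5, 8, 6
rng = np.random.default_rng(7)
hs = rng.normal(size=(n, H))             # encoder hidden states h_1..h_n
s = rng.normal(size=H)                   # previous decoder state s_{i-1}
Wa = rng.normal(0.0, 0.1, size=(H, A))
Ua = rng.normal(0.0, 0.1, size=(H, A))
va = rng.normal(0.0, 0.1, size=A)

e = np.tanh(s @ Wa + hs @ Ua) @ va       # alignment score e_ij per source token
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                     # softmax over the source positions
c = alpha @ hs                           # context: weighted average of the h_j
assert np.isclose(alpha.sum(), 1.0) and c.shape == (H,)
```

The weights alpha are recomputed at every decoder time-step, so each target token gets its own focused view of the source.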
Encoder – Decoder with Attention
Figure: Dzmitry Bahdanau
- Intuition: we want to focus on what is important in the source to produce the target
- The context vector c_i for each decoder time-step encodes the whole input sequence with a specific focus
- This focus is obtained with a weighted average of the source hidden states
- Similar to a query–value operation:
  1. The decoder generates the query (previous decoder time-step)
  2. The query is compared against all possible values (source hidden states)
  3. The most relevant values are emphasized in the context vector
Encoder – Decoder with Attention
Figure: Kyunghyun Cho
Visualizing Attention
Figure: Dzmitry Bahdanau
Visualizing Attention
Figure: Kyunghyun Cho
Visualizing Attention
Figure: Kyunghyun Cho
Performances – Attention
Performances – Evolution
Performances – WMT17
Performances – Fine Grained
Performances – Google Results
Data Level
- Vocabulary size is an issue for NMT (embedding matrix, softmax computation, etc.)
- One solution: keep the n most frequent words
- A better solution: sub-word units, between character- and word-level tokens (byte pair encoding)
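One byte pair encoding merge step can be sketched on a toy corpus (real BPE additionally weights pairs by word frequency and uses end-of-word markers; this is only the core merge idea):

```python
from collections import Counter

# Toy corpus; every word starts as a sequence of characters.
words = ["low", "lower", "lowest"]
seqs = [tuple(w) for w in words]

# Count adjacent symbol pairs across the corpus.
pairs = Counter()
for s in seqs:
    for a, b in zip(s, s[1:]):
        pairs[(a, b)] += 1

best = max(pairs, key=pairs.get)   # most frequent pair, here ('l', 'o')
merged = []
for s in seqs:
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and (s[i], s[i + 1]) == best:
            out.append(s[i] + s[i + 1])   # merge the pair into one unit
            i += 2
        else:
            out.append(s[i])
            i += 1
    merged.append(tuple(out))

assert merged[0] == ("lo", "w")
```

Repeating this step a fixed number of times yields a vocabulary of sub-word units whose size is a tunable hyperparameter, in between characters and full words.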
Data Level
- The amount of data required to train NMT systems is large (dozens of millions of sentence pairs)
- Lack of domain- and genre-specific resources...
- One solution: mixing language pairs and domains
- A better solution: adding specific tokens to specify the language, domain, speaker, etc.
Data Level
- For specific genres and domains, it would be nice to force the decoder into producing particular output
- It is feasible using domain- or genre-specific target data
- Monolingual corpora for specific domains and genres are available
- Idea:
  1. Take a domain-specific monolingual corpus in the target language
  2. Take a translation system for the pair target–source
  3. Translate the monolingual data
  4. Use the produced automatic translations as source and the monolingual specific data as target
  5. You get additional training data!
Neural Network Architecture
- RNNs require a lot of computing power to train
- The attention model appears to help improve performance
- Recent work focuses on attention-only architectures
- In addition to the encoder–decoder attention, use of self-attention
- In addition to a single context vector, computation of multi-head attention