Machine Translation Summer Semester 2018
Multilingual Language Technology DFKI Saarbrücken
Outline
1. Introduction
2. Modelling the Translation Process
3. Parametric Model of Translation
4. Neural Language Model
5. Neural Translation Model
6. Try it Yourself!
Introduction – Admin
- Part II of the MT Summer Semester class
- Contact: [email protected]
- Office hours: please write an email to set up a meeting
- Evaluation: assignment and final exam
Introduction – MT
- MT is one of the oldest NLP research fields
- You know the history: rules, counts, neurons
- Recent advances show better results with Neural MT
- MT Summer Semester Part II is on NMT
Introduction – State-of-the-Art
- How is the SotA defined?
- On the limits of MT evaluation
- From research to industry: applications
- What about other NLP fields?
Modelling Translation
- Goal: find the best translation $T^*$: $T^* = \arg\max_T p(T \mid S)$
- Data: parallel texts containing pairs $(S, T)$, with $S$ a source-language sentence of length $m$, $(x_1, x_2, \ldots, x_m)$, and $T$ a target-language sentence of length $n$, $(y_1, y_2, \ldots, y_n)$
- How is it done with PB-SMT?
Modelling Translation
- Several strong assumptions are made with this model:
  - Model parameters are count-based: data limitation
  - Features are hand-crafted: human limitation
  - Interpretability: more features, more opaque model
Modelling Translation
- Taking a step back, finding the best $T^*$: $T^* = \arg\max_T p(T \mid S)$
- $p(T \mid S) = p(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_m)$
- Using the chain rule, we have: $p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$
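To make the factorisation concrete, here is a minimal Python sketch; `cond_prob` is a hypothetical stand-in for any model of $p(y_i \mid y_1, \ldots, y_{i-1}, S)$:

```python
import math

def sentence_log_prob(source, target, cond_prob):
    """Score a translation with the chain-rule factorisation:
    log p(T|S) = sum_i log p(y_i | y_1..y_{i-1}, x_1..x_m)."""
    total = 0.0
    for i in range(len(target)):
        # probability of the i-th target word given the full source
        # sentence and all previously produced target words
        total += math.log(cond_prob(target[i], target[:i], source))
    return total
```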
Modelling Translation
- Let's look carefully: $p(T \mid S) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1}, x_1, \ldots, x_m)$
- This is a language model...
- Can we build a bilingual language model without too many strong assumptions?
Modelling Translation
- Monolingual language model, non-parametric approach (see the sketch after this slide): $p(y_t \mid y_{t-n}, \ldots, y_{t-1}) = \frac{\mathrm{count}(y_{t-n}, \ldots, y_{t-1}, y_t)}{\mathrm{count}(y_{t-n}, \ldots, y_{t-1})}$
- Monolingual language model, parametric approach: $p(y_t \mid y_{t-n}, \ldots, y_{t-1}) = f_{y_t}(y_{t-n}, \ldots, y_{t-1})$
- Neural networks are parametric models
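A minimal sketch of the count-based (non-parametric) estimate above, assuming the training corpus is a list of token lists:

```python
from collections import Counter

def train_counts(corpus, n):
    """Collect n-gram and (n-1)-gram context counts."""
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        for t in range(n - 1, len(sent)):
            context = tuple(sent[t - n + 1:t])
            ngrams[context + (sent[t],)] += 1
            contexts[context] += 1
    return ngrams, contexts

def prob(word, context, ngrams, contexts):
    """p(y_t | context) = count(context, y_t) / count(context).
    Unseen contexts get probability 0.0: exactly the data
    limitation of count-based models noted earlier."""
    c = contexts[tuple(context)]
    return ngrams[tuple(context) + (word,)] / c if c else 0.0
```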
Modelling Translation
- Why use parametric models?
  - Replacing hand-crafted features by learned parameters
  - Modelling unseen events (n-grams) and large contexts
  - Flexible and adaptive models with re-usable components
- How to learn parameters?
  - From data: large amounts of parallel texts
  - Using gradient-descent training (see the sketch below)
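A generic gradient-descent sketch in PyTorch, on toy tensors rather than a real translation model; the sizes and learning rate are arbitrary illustrative choices:

```python
import torch

# toy setup: 10 training examples, 5-dimensional inputs, 3 classes
x = torch.randn(10, 5)
y = torch.randint(0, 3, (10,))

model = torch.nn.Linear(5, 3)          # the parametric model f_theta
loss_fn = torch.nn.CrossEntropyLoss()  # negative log-likelihood
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # how wrong are the current parameters?
    loss.backward()              # gradients via back-propagation
    opt.step()                   # move the parameters downhill
```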
A Bit of Background
- Neural networks are not new...
- First computational model for neural networks in the 1940s (McCulloch and Pitts, 1943)
- The Perceptron proposed by Rosenblatt in 1958
- Multi-layered networks introduced in 1965 by Ivakhnenko and Lapa
- Algorithm for back-propagation described by Werbos in 1975
- First suggestion of using neural networks for machine translation in 1997
A Bit of Background
- Since the 1990s, several attempts at neural language modelling
- Thanks to the log-linear model of PB-SMT, neural features could be added
A Bit of Background
- From 1997 to 2014, a lot of work was conducted on neural translation
- Some neural architectures presented after 2014 outperform PB-SMT
Neural Language Model as Feature
- Formalism: $p(y_t \mid y_{t-n}, \ldots, y_{t-1}) = f_{y_t}(y_{t-n}, \ldots, y_{t-1})$
- Start with a one-hot encoding of words
- Word vectors (embeddings) are learned during training
Neural Language Model as Feature
- Input: context of the n-1 previous words
- Output: probability distribution of the (next) word n
- Embedding layer with shared weights
- As many hidden layers as we want
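A minimal PyTorch sketch of such a feed-forward NLM; the layer sizes are illustrative assumptions, not values from the original papers:

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Feed-forward language model sketch: n-1 context word ids ->
    shared embedding layer -> hidden layer -> softmax over vocabulary."""
    def __init__(self, vocab_size, context_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # shared weights
        self.hidden = nn.Linear(context_size * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, context):            # context: (batch, n-1) word ids
        e = self.emb(context).flatten(1)   # concatenate context embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)

# distribution over the next word given 2 previous words, toy vocabulary
lm = FeedForwardLM(vocab_size=1000, context_size=2)
log_probs = lm(torch.tensor([[4, 7]]))     # shape: (1, 1000)
```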
Neural Language Model as Feature
- It is possible to use an NLM as an additional feature in the log-linear model, the same way a count-based LM is used
- Researchers started using feed-forward LMs (Bengio et al., 2000, 2003)
- First integration of an NLM within the PB-SMT log-linear model: Schwenk et al., 2006
- However, several problems:
  - very slow to train because of the softmax layer over a large vocabulary
  - very slow to decode; used mostly for n-best reranking
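A sketch of the n-best reranking use case; `nlm_score` (the NLM log-probability of a hypothesis) and `weight` (its log-linear weight) are hypothetical stand-ins:

```python
def rerank_nbest(nbest, nlm_score, weight=0.5):
    """Rerank an n-best list from the PB-SMT decoder by adding a
    weighted NLM log-probability to each hypothesis's model score.
    `nbest` is a list of (hypothesis, smt_score) pairs."""
    rescored = [(hyp, smt + weight * nlm_score(hyp)) for hyp, smt in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]
```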
Neural Language Model as Feature
- The learned embeddings exhibit interesting properties
Neural Language Model as Feature
- Mikolov et al., 2013, showed that using embeddings we can do (see the sketch below): queen = king + (woman - man); queens = queen + (kings - king)
- Do embeddings capture topic- or semantics-related information?
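A sketch of this embedding arithmetic using cosine similarity; `emb` is assumed to be a {word: numpy vector} dictionary from some trained model:

```python
import numpy as np

def analogy(a, b, c, emb):
    """Return the word whose embedding is closest (by cosine) to
    emb[b] - emb[a] + emb[c], e.g. king - man + woman -> queen."""
    target = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -1.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue  # exclude the query words themselves
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# with good embeddings: analogy("man", "king", "woman", emb) == "queen"
```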
Neural Language Model as Feature
- Bilingual NLM (Devlin et al., 2014)
- Using alignment information obtained with IBM models
- Still not end-to-end training, but models source and target jointly
Neural Translation Model
- The first NMT architecture that worked was the encoder-decoder model
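A minimal PyTorch sketch of the encoder-decoder idea; GRUs and the dimensions are illustrative choices, not the exact architecture of any published system:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Encoder-decoder sketch (no attention): the source sentence is
    compressed into one fixed-size vector that initialises the decoder."""
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        _, state = self.encoder(self.src_emb(src))     # encode S
        h, _ = self.decoder(self.tgt_emb(tgt), state)  # condition on it
        return torch.log_softmax(self.out(h), dim=-1)  # p(y_i | y_<i, S)

model = EncoderDecoder(src_vocab=1000, tgt_vocab=1000)
scores = model(torch.randint(0, 1000, (1, 7)),   # source word ids
               torch.randint(0, 1000, (1, 5)))   # shifted target word ids
```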
Neural Translation Model
- The current SotA in NMT is the encoder-decoder model with attention
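A sketch of dot-product attention, one common scoring variant: at each decoder step, score every encoder state against the decoder query and take the softmax-weighted sum as the context vector; shapes are illustrative:

```python
import torch

def dot_attention(query, enc_states):
    """query: (batch, dim); enc_states: (batch, src_len, dim).
    Returns the context vector and the attention weights."""
    scores = torch.bmm(enc_states, query.unsqueeze(-1)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)   # one weight per source word
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
    return context, weights

context, w = dot_attention(torch.randn(1, 128), torch.randn(1, 7, 128))
```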
Implementations
- In PyTorch: OpenNMT (also in Lua + Torch) https://github.com/OpenNMT
- In C++: Marian https://github.com/marian-nmt/marian
- etc.
Data
- Popular and free datasets: WMT http://www.statmt.org/wmt18/
- More (not necessarily clean) data: Opus http://opus.nlpl.eu/