Machine Translation - Summer Semester 2018 - Raphael Rubino

Machine Translation Summer Semester 2018

Multilingual Language Technology DFKI Saarbrücken

Outline

1. Introduction
2. Modelling the Translation Process
3. Parametric Model of Translation
4. Neural Language Model
5. Neural Translation Model
6. Try it Yourself!

Introduction – Admin

- Part II of the MT Summer Semester class
- Contact: [email protected]
- Office hours: please write an email to set up a meeting
- Evaluation: assignment and final exam

Introduction – MT

- MT is one of the oldest NLP research fields
- You know the history: rules, counts, neurons
- Recent advances show better results with Neural MT
- MT Summer Semester Part II is on NMT

Introduction – State-of-the-Art

- How is the SotA defined?
- On the limits of MT evaluation
- From research to industry: applications
- What about other NLP fields?


Modelling Translation

- Goal: find the best translation $T^\star$, $T^\star = \arg\max_T p(T|S)$
- Data: parallel texts containing pairs $(S, T)$, with $S$ a source-language sentence of length $m$, $(x_1, x_2, \dots, x_m)$, and $T$ a target-language sentence of length $n$, $(y_1, y_2, \dots, y_n)$
- How is it done with PB-SMT?

Modelling Translation

- Several strong assumptions are made with this model:
  - Model parameters are count-based: data limitation
  - Features are hand-crafted: human limitation
  - Interpretability: more features, more opaque model

Modelling Translation

- Taking a step back, finding the best $T^\star$: $T^\star = \arg\max_T p(T|S)$
- $p(T|S) = p(y_1, y_2, \dots, y_n \mid x_1, x_2, \dots, x_m)$
- Using the chain rule, we have: $p(T|S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)$
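
A minimal sketch, not from the slides, of what this factorization means computationally: the chain rule turns $\log p(T|S)$ into a sum of per-token conditional log-probabilities. The `cond_prob` callable and the toy uniform model are illustrative placeholders, not part of any real MT system.

```python
import math

def sentence_log_prob(source_tokens, target_tokens, cond_prob):
    """Chain-rule decomposition: log p(T|S) = sum_i log p(y_i | y_<i, S)."""
    total = 0.0
    for i, y in enumerate(target_tokens):
        # p(y_i | y_1..y_{i-1}, x_1..x_m), supplied by any conditional model
        total += math.log(cond_prob(y, target_tokens[:i], source_tokens))
    return total

# Toy usage with a hypothetical uniform model over a 10-word vocabulary
uniform = lambda y, prev, src: 1.0 / 10
print(sentence_log_prob(["das", "ist", "gut"], ["this", "is", "good"], uniform))
```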

Modelling Translation

- Let's look carefully: $p(T|S) = \prod_{i=1}^{n} p(y_i \mid y_1, \dots, y_{i-1}, x_1, \dots, x_m)$
- This is a language model...
- Can we build a bilingual language model without too many strong assumptions?

Modelling Translation

- Monolingual language model, non-parametric approach: $p(y_t \mid y_{t-n}, \dots, y_{t-1}) = \frac{\mathrm{count}(y_{t-n}, \dots, y_{t-1}, y_t)}{\mathrm{count}(y_{t-n}, \dots, y_{t-1})}$
- Monolingual language model, parametric approach: $p(y_t \mid y_{t-n}, \dots, y_{t-1}) = f_{y_t}(y_{t-n}, \dots, y_{t-1})$
- Neural networks are parametric models
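
The non-parametric estimate above is plain relative-frequency counting. Here is a small illustrative sketch, with made-up function names and a toy corpus; the hard zero it returns for unseen n-grams is exactly the data limitation that motivates parametric models.

```python
from collections import Counter

def train_counts(corpus, n=3):
    """Collect n-gram and (n-1)-gram counts from a tokenized corpus
    (a list of sentences, each a list of tokens)."""
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(padded) - n + 1):
            ngrams[tuple(padded[i:i + n])] += 1
            contexts[tuple(padded[i:i + n - 1])] += 1
    return ngrams, contexts

def prob(word, context, ngrams, contexts):
    """Non-parametric estimate: count(context, word) / count(context)."""
    c = contexts[tuple(context)]
    return ngrams[tuple(context) + (word,)] / c if c else 0.0

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
ngrams, contexts = train_counts(corpus, n=3)
print(prob("sat", ["the", "cat"], ngrams, contexts))  # 0.5
```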

Modelling Translation

- Why use parametric models?
  - Replacing hand-crafted features by learned parameters
  - Modelling unseen events (n-grams) and large contexts
  - Flexible and adaptive models with re-usable components
- How to learn the parameters?
  - From data: a large amount of parallel text
  - Using gradient descent training
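
To make "learn the parameters from data with gradient descent" concrete, here is a toy sketch: a parametric bigram model $p(y_t \mid y_{t-1}) = \mathrm{softmax}(W_{y_{t-1}})$ fitted by plain gradient descent on the cross-entropy loss. The vocabulary, training pairs, and learning rate are invented for illustration and have nothing to do with real MT training.

```python
import numpy as np

# Toy parametric bigram model: p(y_t | y_{t-1}) = softmax(W[y_{t-1}])
vocab = ["<s>", "the", "cat", "sat", "</s>"]
V = len(vocab)
pairs = [(0, 1), (1, 2), (2, 3), (3, 4)]  # (<s>,the), (the,cat), (cat,sat), (sat,</s>)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((V, V))    # the model parameters
lr = 0.5

for step in range(200):
    loss, grad = 0.0, np.zeros_like(W)
    for prev, nxt in pairs:
        logits = W[prev]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        loss -= np.log(probs[nxt])        # cross-entropy for this pair
        dlogits = probs.copy()
        dlogits[nxt] -= 1.0               # gradient of softmax + negative log-likelihood
        grad[prev] += dlogits
    W -= lr * grad / len(pairs)           # gradient descent update

print(round(float(loss) / len(pairs), 3))  # average loss shrinks as training proceeds
```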


A Bit of Background

- Neural networks are not new...
- First computational model for neural networks in the 1940s (McCulloch and Pitts, 1943)
- The Perceptron proposed by Rosenblatt in 1958
- Multi-layered networks introduced in 1965 by Ivakhnenko and Lapa
- Algorithm for back-propagation described by Werbos in 1974
- First suggestion of using neural networks for machine translation in 1997

A Bit of Background

- Since the 1990s, several attempts at neural language modelling
- Thanks to the log-linear model of PB-SMT, it became possible to add neural features

A Bit of Background

- From 1997 to 2014, a lot of work was conducted on neural translation
- Some neural architectures presented after 2014 outperform PB-SMT


Neural Language Model as Feature

- Formalism: $p(y_t \mid y_{t-n}, \dots, y_{t-1}) = f_{y_t}(y_{t-n}, \dots, y_{t-1})$
- Start with one-hot encoding of words
- Word vectors (embeddings) are learned during training

Neural Language Model as Feature

- Input: context of $n-1$ previous words
- Output: probability distribution over the next word (word $n$ of the n-gram)
- Embedding layer with shared weights
- As many hidden layers as we want
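
A sketch of such a feed-forward neural LM in PyTorch (the framework mentioned in the last section); the layer sizes, class name, and random inputs are illustrative assumptions, not a reference implementation of Bengio et al.'s exact model.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Feed-forward LM sketch: embed the n-1 context words, concatenate,
    pass through a hidden layer, and predict a distribution over the next word."""
    def __init__(self, vocab_size, context_size=3, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # shared embedding layer
        self.hidden = nn.Linear(context_size * emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)            # scores over the vocabulary

    def forward(self, context):                             # context: (batch, n-1) word ids
        e = self.emb(context).flatten(start_dim=1)          # (batch, (n-1) * emb_dim)
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)       # log p(y_t | context)

# Illustrative usage with random word ids and a 1000-word vocabulary
model = FeedForwardLM(vocab_size=1000)
context = torch.randint(0, 1000, (8, 3))                    # batch of 8 contexts
print(model(context).shape)                                 # torch.Size([8, 1000])
```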

Neural Language Model as Feature

- It is possible to use a NLM as an additional feature in the log-linear model, the same way a count-based LM is used
- Researchers started with feed-forward LMs (Bengio et al., 2000, 2003)
- First integration of a NLM within the PB-SMT log-linear model: Schwenk et al., 2006
- However, several problems:
  - very slow to train because of the softmax layer over a large vocabulary
  - very slow to decode, so used mostly for n-best reranking

Neural Language Model as Feature

- The learned embeddings exhibit interesting properties

Neural Language Model as Feature

- Mikolov et al., 2013, showed that with embeddings we can do arithmetic such as:
  queen = king + (woman - man)
  queens = queen + (kings - king)
- Do embeddings capture topic- or semantics-related information?
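
A toy illustration of this vector arithmetic; the four 3-dimensional vectors below are made up for the example, whereas real embeddings would come from a trained model such as word2vec.

```python
import numpy as np

# Hand-made toy vectors, for illustration only
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity with vec."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

target = emb["king"] + (emb["woman"] - emb["man"])
print(nearest(target, exclude={"king", "woman", "man"}))  # -> queen
```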

Neural Language Model as Feature

- Bilingual NLM (Devlin et al., 2014)
- Using alignment information obtained with IBM models
- Still not end-to-end training, but models source and target jointly


Neural Translation Model

- The first NMT architecture that worked was the encoder-decoder model
- The current SotA in NMT is the encoder-decoder model with attention
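
As a rough sketch of the attention step inside such an encoder-decoder, the snippet below uses simple dot-product scoring over the encoder states; this is only one possible scorer (the original attention model of Bahdanau et al., 2015, uses a small feed-forward network instead), and all shapes and names are illustrative.

```python
import torch

def dot_product_attention(decoder_state, encoder_states):
    """Score each source position, normalize with softmax, and return the
    weighted sum of encoder states (the context vector) plus the weights.
    decoder_state: (batch, d), encoder_states: (batch, m, d)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, m)
    weights = torch.softmax(scores, dim=-1)                                      # attention weights
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)         # (batch, d)
    return context, weights

# Illustrative usage: batch of 2, source length 5, hidden size 8
enc = torch.randn(2, 5, 8)
dec = torch.randn(2, 8)
context, weights = dot_product_attention(dec, enc)
print(context.shape, weights.shape)  # torch.Size([2, 8]) torch.Size([2, 5])
```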

Try it Yourself!

Implementations

- In PyTorch: OpenNMT (also in Lua + Torch) https://github.com/OpenNMT
- In C++: Marian https://github.com/marian-nmt/marian
- etc.

Data

- Popular and free datasets: WMT http://www.statmt.org/wmt18/
- More (not necessarily clean) data: Opus http://opus.nlpl.eu/