Machine Translation - Summer Semester 2018 - Raphael Rubino

Machine Translation Summer Semester 2018

Multilingual Language Technology DFKI Saarbrücken

Outline

1. A Few Definitions
2. Feed-forward Neural Network
3. Parameters Initialization


Input, output, etc.

Let's introduce a few variables and notations used in the NN literature, for a single hidden layer network:
▻ The input vector is denoted $\vec{x}$
▻ $\vec{x}$ is composed of input values denoted $x_1, x_2, \dots, x_n$
▻ The hidden layer is denoted $\vec{h}$
▻ $\vec{h}$ is composed of values denoted $h_1, h_2, \dots, h_m$
▻ The NN output is denoted $\vec{y}$
▻ $\vec{y}$ is composed of output values denoted $y_1, y_2, \dots, y_l$

Weights

Let's introduce a few variables and notations used in the NN literature, for a single hidden layer network:
▻ The weight matrix connecting the input to the hidden nodes is denoted $W$
▻ $W$ is composed of parameters $\{w_{ij}\}$
▻ The weight matrix connecting the hidden to the output nodes is denoted $U$
▻ $U$ is composed of parameters $\{u_{ij}\}$

Put together...

▻ Remember, we want to find the best target sentence (translation) given a source sentence:
  $T^\star = \arg\max_T p(T \mid S)$
▻ In other words, we are looking for a function $f$ that maps $\vec{x}$ to $\vec{y}$:
  $\vec{y} = f(\vec{x})$
▻ With a feed-forward neural network:
  $\vec{y} = U\vec{h}$, $\quad \vec{h} = W\vec{x}$

Put together...

$\vec{y} = U\vec{h}$, $\quad \vec{h} = W\vec{x}$
▻ Matrix–vector multiplication:
  $\vec{h} = W\vec{x}$, i.e. $h_j = \sum_i x_i w_{ij}$ with $i \in [0; n]$ and $j \in [0; m]$
  $\vec{y} = U\vec{h}$, i.e. $y_k = \sum_j h_j u_{kj}$ with $k \in [0; l]$
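As an illustration, these two matrix–vector products can be written in a few lines of NumPy; the dimensions and values below are arbitrary toy choices, not taken from the slides:

```python
import numpy as np

# Toy dimensions (arbitrary): n inputs, m hidden nodes, l outputs.
n, m, l = 3, 4, 2

x = np.ones(n)                # input vector x
W = np.random.randn(m, n)     # input-to-hidden weight matrix W
U = np.random.randn(l, m)     # hidden-to-output weight matrix U

h = W @ x                     # h_j = sum_i x_i * w_ij
y = U @ h                     # y_k = sum_j h_j * u_kj
print(h.shape, y.shape)       # (4,) (2,)
```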

Put together...

$y_k = \sum_j h_j u_{kj}$

$y_k = \sum_j \sum_i x_i w_{ij} u_{kj}$

▻ The target is obtained here with two linear transformations of the input
▻ Like a linear classifier, it cannot handle input that is not linearly separable
▻ More layers in this situation will not help...
▻ Solution: introduce a non-linearity as activation function

Put together...

$y_k = \sum_j h_j u_{kj}$

$y_k = \sum_j f(h_j)\, u_{kj}$

▻ The function $f$ is a non-linearity
▻ Popular non-linear functions are: sigmoid, hyperbolic tangent, etc.
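As a small sketch (not part of the original slides), the two non-linearities mentioned here can be defined as follows:

```python
import numpy as np

def sigmoid(s):
    # Squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-s))

def tanh(s):
    # Hyperbolic tangent, squashes into (-1, 1).
    return np.tanh(s)

print(sigmoid(0.0), tanh(1.0))   # 0.5 0.7615...
```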

One last parameter

▻ What happens if all input values are 0?
▻ The network will always produce 0...
▻ To avoid producing 0 all the time, we add a bias
▻ The bias parameter is denoted $\vec{b}$ with values $(b_1, b_2, \dots, b_m)$

Summary

▻ For a single hidden layer feed-forward network using a sigmoid activation function:
  $y_k = \sum_j \text{sigmoid}\Big(\Big(\sum_i x_i w_{ij}\Big) + b_j\Big)\, u_{kj}$
▻ Depending on the type of classifier, the output values are given to a function:
  ▻ sigmoid for binary classification
  ▻ softmax for categorical classification
▻ This is how the network produces its output, also called the forward pass.
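A minimal sketch of this forward pass in NumPy; the function and variable names are made up for illustration, and softmax is included only as the categorical alternative mentioned above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def softmax(s):
    e = np.exp(s - s.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(x, W, b, U, binary=True):
    # Hidden layer: h_j = sigmoid((sum_i x_i w_ij) + b_j)
    h = sigmoid(W @ x + b)
    # Output layer: y_k = sum_j h_j u_kj, then squashed for classification
    s = U @ h
    return sigmoid(s) if binary else softmax(s)
```

For binary classification the single output is read as a probability of the positive class; for categorical classification, softmax turns the output vector into a distribution over classes.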


FFNN – Forward pass

▻ Let's take a single hidden layer feed-forward network for binary classification
▻ For each hidden node (neuron):
  $h_j = \text{sigmoid}\big(b_{x_j} + \sum_i x_i w_{ij}\big)$
▻ For each output node:
  $y_k = \text{sigmoid}\big(b_{h_k} + \sum_j h_j u_{kj}\big)$

FFNN – Forward pass

▻ For a network with two inputs ($x_0$ and $x_1$), two hidden nodes ($h_0$ and $h_1$) and one output node $y_0$
▻ We set $\vec{x} = (1, 0)$
▻ $\vec{w}_0 = (3, 4)$ and $\vec{w}_1 = (2, 3)$
▻ $\vec{u}_0 = (5, -5)$
▻ $b_{x_0} = -2$, $b_{x_1} = -4$ and $b_{h_0} = -2$

FFNN – Forward pass

▻ $h_0 = \text{sigmoid}(x_0 w_{00} + x_1 w_{01} + 1 \times b_{x_0})$
▻ $h_0 = \text{sigmoid}(1 \times 3 + 0 \times 4 + 1 \times -2)$
▻ $h_0 = \text{sigmoid}(3 + 0 - 2)$
▻ $h_0 = \text{sigmoid}(1)$
▻ $h_0 = 0.731$

FFNN – Forward pass

▻ $h_1 = \text{sigmoid}(x_0 w_{10} + x_1 w_{11} + 1 \times b_{x_1})$
▻ $h_1 = \text{sigmoid}(1 \times 2 + 0 \times 3 + 1 \times -4)$
▻ $h_1 = \text{sigmoid}(2 + 0 - 4)$
▻ $h_1 = \text{sigmoid}(-2)$
▻ $h_1 = 0.119$

FFNN – Forward pass

▻ $y_0 = \text{sigmoid}(h_0 u_0 + h_1 u_1 + 1 \times b_{h_0})$
▻ $y_0 = \text{sigmoid}(0.731 \times 5 + 0.119 \times -5 + 1 \times -2)$
▻ $y_0 = \text{sigmoid}(3.655 - 0.595 - 2)$
▻ $y_0 = \text{sigmoid}(1.06)$
▻ $y_0 = 0.743 \rightarrow y_0 = 1$

FFNN – Exercise

▻ Calculate the output values for all possible inputs $x_i \in \{0, 1\}$

Input x0   Input x1   Hidden h0   Hidden h1   Output y0
0          0
0          1
1          0
1          1

FFNN – Exercise

▻ Calculate the output values for all possible inputs $x_i \in \{0, 1\}$

Input x0   Input x1   Hidden h0   Hidden h1   Output y0
0          0          0.119       0.018       0.183 → 0
0          1          0.881       0.269       0.743 → 1
1          0          0.731       0.119       0.743 → 1
1          1          0.993       0.731       0.334 → 0
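The table can be reproduced with a short script using the parameters given above ($\vec{w}_0$, $\vec{w}_1$, $\vec{u}_0$ and the biases); only the helper names are invented:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

W = np.array([[3.0, 4.0],       # weights into h0 (w_0)
              [2.0, 3.0]])      # weights into h1 (w_1)
b_x = np.array([-2.0, -4.0])    # hidden biases b_x0, b_x1
U = np.array([5.0, -5.0])       # weights into y0 (u_0)
b_h = -2.0                      # output bias b_h0

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x0, x1], dtype=float)
    h = sigmoid(W @ x + b_x)                  # hidden activations h0, h1
    y0 = sigmoid(U @ h + b_h)                 # output activation
    print(x0, x1, np.round(h, 3), round(float(y0), 3), int(y0 > 0.5))
```

The resulting 0/1 pattern is the XOR function, which a purely linear model cannot represent, illustrating why the non-linearity matters.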

FFNN – Backward pass

▻ Training a neural network means finding a set of parameters $\theta$:
  $\vec{y} = f(\vec{x}, \theta)$
▻ We have the input $\vec{x}$ and the network output $\vec{y}$, as well as the target reference $\vec{t}$
▻ The parameters $\theta = (W, U, \vec{b})$ have to be learned
▻ Based on the network output and the reference, we back-propagate the network output error
▻ An error function is necessary, for instance the squared difference between the network output and the reference:
  $E = L_2(\vec{t}, \vec{y}) = \sum_i \tfrac{1}{2}(t_i - y_i)^2$

FFNN – Backward pass

▻ The L2 norm (MSE) is popular for binary classification problems:
  $E = L_2(\vec{t}, \vec{y}) = \sum_i \tfrac{1}{2}(t_i - y_i)^2$
▻ It can be replaced by the MAE (L1 norm), cross-entropy, etc.
▻ The error $E$ is defined in terms of the output values $\vec{y}$ obtained with the parameters $\theta$
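For concreteness, here is a small sketch of the L2 error next to the L1 (MAE) alternative, evaluated on the output of the worked example; the reference value of 1 is an assumption for illustration:

```python
import numpy as np

def l2_error(t, y):
    # E = sum_i 0.5 * (t_i - y_i)^2
    return 0.5 * np.sum((t - y) ** 2)

def l1_error(t, y):
    # MAE-style alternative mentioned above
    return np.sum(np.abs(t - y))

t = np.array([1.0])      # target reference (assumed)
y = np.array([0.743])    # network output from the earlier forward pass
print(l2_error(t, y), l1_error(t, y))   # ~0.033  0.257
```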

FFNN – Backward pass – Output

▻ Remember: $y_i = \text{sigmoid}\big(b_{h_i} + \sum_j u_{i \leftarrow j} h_j\big)$
▻ Let's write $s_i = b_{h_i} + \sum_j u_{i \leftarrow j} h_j$, so that $y_i = \text{sigmoid}(s_i)$
▻ The derivative of the error at output node $y_i$ with respect to the weight $u_{i \leftarrow j}$ is written:
  $\dfrac{dE}{du_{i \leftarrow j}} = \dfrac{dE}{dy_i} \dfrac{dy_i}{ds_i} \dfrac{ds_i}{du_{i \leftarrow j}}$

FFNN – Backward pass – Output

$\dfrac{dE}{du_{i \leftarrow j}} = \dfrac{dE}{dy_i} \dfrac{dy_i}{ds_i} \dfrac{ds_i}{du_{i \leftarrow j}}$

▻ Let's decompose the three components:
▻ $\dfrac{dE}{dy_i} = \dfrac{d}{dy_i} \tfrac{1}{2}(t_i - y_i)^2 = -(t_i - y_i)$
▻ $\dfrac{dy_i}{ds_i} = \dfrac{d\,\text{sigmoid}(s_i)}{ds_i} = \text{sigmoid}(s_i)(1 - \text{sigmoid}(s_i)) = y_i(1 - y_i)$
▻ $\dfrac{ds_i}{du_{i \leftarrow j}} = \dfrac{d}{du_{i \leftarrow j}} \sum_j u_{i \leftarrow j} h_j = h_j$
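The identity $\frac{d\,\text{sigmoid}(s)}{ds} = \text{sigmoid}(s)(1 - \text{sigmoid}(s))$ can be checked numerically with a finite difference; this is only a sanity-check sketch:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

s = 1.06                     # pre-activation value from the worked example
eps = 1e-6
numeric = (sigmoid(s + eps) - sigmoid(s - eps)) / (2 * eps)
analytic = sigmoid(s) * (1 - sigmoid(s))
print(numeric, analytic)     # both approximately 0.191
```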

FFNN – Backward pass – Output

$\dfrac{dE}{du_{i \leftarrow j}} = \dfrac{dE}{dy_i} \dfrac{dy_i}{ds_i} \dfrac{ds_i}{du_{i \leftarrow j}}$

▻ Let's replace the derivative of the sigmoid function by $y_i'$ to generalize to all activation functions:
  $\dfrac{dE}{du_{i \leftarrow j}} = -(t_i - y_i)\big(y_i(1 - y_i)\big) h_j = -(t_i - y_i)\, y_i'\, h_j$
▻ Gradient descent moves against the gradient, $\Delta u_{i \leftarrow j} = -\mu \dfrac{dE}{du_{i \leftarrow j}}$, so adding a learning rate parameter $\mu$ for more flexibility we obtain the following update formula:
  $\Delta u_{i \leftarrow j} = \mu (t_i - y_i)\, y_i'\, h_j$
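A sketch of this update for the hidden-to-output weights of the worked example; the learning rate value is an arbitrary assumption:

```python
import numpy as np

mu = 0.1                        # learning rate (arbitrary)
t, y0 = 1.0, 0.743              # assumed reference and network output
h = np.array([0.731, 0.119])    # hidden activations from the forward pass

y_prime = y0 * (1 - y0)                   # y' = y(1 - y) for the sigmoid
delta_u = mu * (t - y0) * y_prime * h     # one update per weight u_{0<-j}
print(delta_u)                            # ~[0.0036, 0.0006]
```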

FFNN – Backward pass – Hidden

▻ Remember: $h_j = \text{sigmoid}\big(b_{x_j} + \sum_k w_{j \leftarrow k} x_k\big)$
▻ Let's write $z_j = b_{x_j} + \sum_k w_{j \leftarrow k} x_k$, so that $h_j = \text{sigmoid}(z_j)$
▻ The derivative of $E$ with respect to the input-to-hidden weights $W$ is written:
  $\dfrac{dE}{dw_{j \leftarrow k}} = \dfrac{dE}{dh_j} \dfrac{dh_j}{dz_j} \dfrac{dz_j}{dw_{j \leftarrow k}}$

FFNN – Backward pass – Hidden

$\dfrac{dE}{dw_{j \leftarrow k}} = \dfrac{dE}{dh_j} \dfrac{dh_j}{dz_j} \dfrac{dz_j}{dw_{j \leftarrow k}}$

▻ Let's decompose this derivative:
▻ $\dfrac{dE}{dh_j} = \sum_i \dfrac{dE}{dy_i} \dfrac{dy_i}{ds_i} \dfrac{ds_i}{dh_j} = \sum_i -(t_i - y_i)\, y_i'\, u_{i \leftarrow j}$
▻ $\dfrac{dh_j}{dz_j} = \dfrac{d\,\text{sigmoid}(z_j)}{dz_j} = \text{sigmoid}(z_j)(1 - \text{sigmoid}(z_j)) = h_j(1 - h_j) = h_j'$
▻ $\dfrac{dz_j}{dw_{j \leftarrow k}} = \dfrac{d}{dw_{j \leftarrow k}} \sum_k w_{j \leftarrow k} x_k = x_k$

FFNN – Backward pass – Hidden

▻ $\dfrac{dE}{dw_{j \leftarrow k}} = \dfrac{dE}{dh_j} \dfrac{dh_j}{dz_j} \dfrac{dz_j}{dw_{j \leftarrow k}} = \sum_i \big(-(t_i - y_i)\, y_i'\, u_{i \leftarrow j}\big)\, h_j'\, x_k$
▻ Thus, the update formula for the weights of the hidden nodes is:
  $\Delta w_{j \leftarrow k} = \mu \sum_i \big((t_i - y_i)\, y_i'\, u_{i \leftarrow j}\big)\, h_j'\, x_k$
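The corresponding sketch for the input-to-hidden weights of the same toy example; with a single output node the sum over $i$ has only one term, and the learning rate is again an arbitrary assumption:

```python
import numpy as np

mu = 0.1                        # learning rate (arbitrary)
x = np.array([1.0, 0.0])        # input from the worked example
h = np.array([0.731, 0.119])    # hidden activations
u = np.array([5.0, -5.0])       # hidden-to-output weights u_{0<-j}
t, y0 = 1.0, 0.743              # assumed reference and network output

y_prime = y0 * (1 - y0)
h_prime = h * (1 - h)                          # h_j' = h_j (1 - h_j)
delta_h = (t - y0) * y_prime * u * h_prime     # error signal reaching each hidden node
delta_w = mu * np.outer(delta_h, x)            # delta_w[j, k] = mu * delta_h[j] * x_k
print(delta_w)
```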

FFNN – Training – Summary

▻ Training of a neural network is done by processing training examples
▻ For each training example, the input is fed into the neural network
▻ The network produces an output given the input: forward pass
▻ The network output is compared to the target reference: error computation
▻ The error is back-propagated through the network: backward pass
▻ The parameters of the network are updated following gradient descent
▻ One iteration over the whole training set is called an epoch
▻ Usually a network is trained over many epochs (see the sketch after this list)
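Putting forward pass, error, backward pass and update together, here is a compact sketch of the training loop on the XOR data from the exercise; the initialization, learning rate and number of epochs are assumptions, and with only two hidden nodes the run may or may not reach the XOR solution depending on the random seed:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)            # XOR targets

W = rng.normal(size=(2, 2))                        # input -> hidden weights
b_x = np.zeros(2)                                  # hidden biases
U = rng.normal(size=2)                             # hidden -> output weights
b_h = 0.0                                          # output bias
mu = 0.5                                           # learning rate (arbitrary)

for epoch in range(5000):                          # one epoch = one pass over the data
    for x, t in zip(X, T):
        h = sigmoid(W @ x + b_x)                   # forward pass
        y = sigmoid(U @ h + b_h)
        delta_y = (t - y) * y * (1 - y)            # back-propagated error signals
        delta_h = delta_y * U * h * (1 - h)
        U += mu * delta_y * h                      # gradient-descent updates
        b_h += mu * delta_y
        W += mu * np.outer(delta_h, x)
        b_x += mu * delta_h

for x in X:                                        # outputs after training
    print(x, round(float(sigmoid(U @ sigmoid(W @ x + b_x) + b_h)), 3))
```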

NN – Summary

▻ Define the classification problem: binary, categorical
▻ Choose an error (loss) function
▻ Choose an optimizer; gradient descent variants exist


Parameters Initialization
