Machine Translation Summer Semester 2018
Multilingual Language Technology DFKI Saarbrücken
Outline

1. A Few Definitions
2. Feed-forward Neural Network
3. Parameters Initialization
Input, output, etc.

Let's introduce a few variables and notations used in the NN literature, for a single hidden layer network:
- The input vector is denoted $\vec{x}$
  - $\vec{x}$ is composed of input values denoted $x_1, x_2, \ldots, x_n$
- The hidden layer is denoted $\vec{h}$
  - $\vec{h}$ is composed of values denoted $h_1, h_2, \ldots, h_m$
- The NN output is denoted $\vec{y}$
  - $\vec{y}$ is composed of output values denoted $y_1, y_2, \ldots, y_l$
Weights

Continuing with the notation for a single hidden layer network:
- The weight matrix connecting the input to the hidden nodes is denoted $W$
  - $W$ is composed of parameters $\{w_{ij}\}$
- The weight matrix connecting the hidden to the output nodes is denoted $U$
  - $U$ contains the parameters $\{u_{ij}\}$
Put together...

- Remember, we want to find the best target sentence (translation) given a source sentence:
  $T^* = \arg\max_T p(T|S)$
- In other words, we want to find a function $f$ that maps $\vec{x}$ to $\vec{y}$: $\vec{y} = f(\vec{x})$
- With a feed-forward neural network:
  $\vec{y} = U \vec{h}$, with $\vec{h} = W \vec{x}$

Put together...

$\vec{y} = U \vec{h}$, with $\vec{h} = W \vec{x}$

- Matrix–vector multiplication:
  $\vec{h} = W \vec{x}$: $h_j = \sum_i x_i w_{ij}$, with $i \in [1; n]$ and $j \in [1; m]$
  $\vec{y} = U \vec{h}$: $y_k = \sum_j h_j u_{kj}$, with $k \in [1; l]$
Put together...

$y_k = \sum_j h_j u_{kj}$
$y_k = \sum_j \sum_i x_i w_{ij} u_{kj}$

- The target is obtained here with two linear transformations of the input
- Like a linear classifier, such a network cannot fit input that is not linearly separable
- More layers in this situation will not help: a composition of linear transformations is itself a linear transformation
- Solution: a non-linearity as activation function
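To see this concretely, here is a minimal NumPy sketch (the matrix sizes and values are arbitrary illustrations) showing that two stacked linear layers compute exactly what a single linear layer with matrix $UW$ computes:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))   # input -> hidden weights (arbitrary values)
U = rng.normal(size=(2, 4))   # hidden -> output weights (arbitrary values)
x = rng.normal(size=3)        # an arbitrary input vector

# Two stacked linear transformations...
y_two_layers = U @ (W @ x)
# ...are exactly one linear transformation with matrix U.W
y_one_layer = (U @ W) @ x

print(np.allclose(y_two_layers, y_one_layer))  # True
```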
Put together...

$y_k = \sum_j h_j u_{kj}$
$y_k = \sum_j f(h_j) u_{kj}$

- The function $f$ is a non-linearity
- Popular non-linear functions are: sigmoid, hyperbolic tangent, etc.
One last parameter

- What happens if all input values are 0?
- The network will always produce 0...
- To avoid producing 0 all the time, we add a bias
- The bias parameter is denoted $\vec{b}$, with values $(b_1, b_2, \ldots, b_m)$
Summary

- For a single layer feed-forward network using a sigmoid activation function:
  $y_k = \sum_j \mathrm{sigmoid}\left(\left(\sum_i x_i w_{ij}\right) + b_j\right) u_{kj}$
- Depending on the type of classifier, the output values are given to a function:
  - sigmoid for binary classification
  - softmax for categorical classification
- This is how the network produces its output, also called the forward pass.
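As an illustration, the forward pass above can be sketched in a few lines of NumPy; the function and parameter names (`forward`, `b_x`, `b_h`) are illustrative choices, not notation from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, U, b_x, b_h):
    """Forward pass of a single-hidden-layer network (sigmoid at both layers).

    W: (m, n) input->hidden weights, b_x: (m,) hidden biases
    U: (l, m) hidden->output weights, b_h: (l,) output biases
    """
    h = sigmoid(W @ x + b_x)   # hidden layer activations
    y = sigmoid(U @ h + b_h)   # network output
    return h, y
```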
FFNN – Forward pass

- Let's take a single layer feed-forward network for binary classification
- For each hidden node (neuron): $h_j = \mathrm{sigmoid}(b_{x_j} + \sum_i x_i w_{ij})$
- For each output node: $y_k = \mathrm{sigmoid}(b_{h_k} + \sum_j h_j u_{kj})$
FFNN – Forward pass

- Consider a network with two inputs ($x_0$ and $x_1$), two hidden nodes ($h_0$ and $h_1$) and one output node $y_0$
- We set $\vec{x} = (1, 0)$
- $\vec{w}_0 = (3, 4)$ and $\vec{w}_1 = (2, 3)$
- $\vec{u}_0 = (5, -5)$
- $b_{x_0} = -2$, $b_{x_1} = -4$ and $b_{h_0} = -2$
FFNN – Forward pass

- $h_0 = \mathrm{sigmoid}(x_0 w_{00} + x_1 w_{01} + 1 \times b_{x_0})$
- $h_0 = \mathrm{sigmoid}(1 \times 3 + 0 \times 4 + 1 \times (-2))$
- $h_0 = \mathrm{sigmoid}(3 + 0 - 2)$
- $h_0 = \mathrm{sigmoid}(1)$
- $h_0 = 0.731$
FFNN – Forward pass

- $h_1 = \mathrm{sigmoid}(x_0 w_{10} + x_1 w_{11} + 1 \times b_{x_1})$
- $h_1 = \mathrm{sigmoid}(1 \times 2 + 0 \times 3 + 1 \times (-4))$
- $h_1 = \mathrm{sigmoid}(2 + 0 - 4)$
- $h_1 = \mathrm{sigmoid}(-2)$
- $h_1 = 0.119$
FFNN – Forward pass

- $y_0 = \mathrm{sigmoid}(h_0 u_0 + h_1 u_1 + 1 \times b_{h_0})$
- $y_0 = \mathrm{sigmoid}(0.731 \times 5 + 0.119 \times (-5) + 1 \times (-2))$
- $y_0 = \mathrm{sigmoid}(3.655 - 0.595 - 2)$
- $y_0 = \mathrm{sigmoid}(1.06)$
- $y_0 = 0.743 \rightarrow y_0 = 1$
FFNN – Exercise

- Calculate the output values for all possible inputs $x_i \in \{0, 1\}$

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
  0          0
  0          1
  1          0
  1          1
FFNN – Exercise

- Calculate the output values for all possible inputs $x_i \in \{0, 1\}$

  Input x0   Input x1   Hidden h0   Hidden h1   Output y0
  0          0          0.119       0.018       0.183 → 0
  0          1          0.881       0.269       0.743 → 1
  1          0          0.731       0.119       0.743 → 1
  1          1          0.993       0.731       0.334 → 0
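The exercise can be verified with a short NumPy sketch that hard-codes the weights and biases given above; note that the network computes the XOR function of its two inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[3.0, 4.0],     # weights into h0
              [2.0, 3.0]])    # weights into h1
u = np.array([5.0, -5.0])     # weights into y0
b_x = np.array([-2.0, -4.0])  # hidden biases (bx0, bx1)
b_h = -2.0                    # output bias (bh0)

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    h = sigmoid(W @ np.array(x, dtype=float) + b_x)
    y0 = sigmoid(u @ h + b_h)
    print(x, h.round(3), round(float(y0), 3), "->", int(y0 > 0.5))
```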
FFNN – Backward pass

- Training a neural network means finding a set of parameters $\theta$: $\vec{y} = f(\vec{x}, \theta)$
- We have the input $\vec{x}$ and the network output $\vec{y}$, as well as the target reference $\vec{t}$
- The parameters $\theta = (W, U, \vec{b})$ have to be learned
- Based on the network output and the reference, we back-propagate the network output error
- An error function is necessary, for instance the squared difference between the network output and the reference:
  $E = L_2(\vec{t}, \vec{y}) = \sum_i \frac{1}{2}(t_i - y_i)^2$
FFNN – Backward pass

- The L2 norm (MSE) is popular for binary classification problems:
  $E = L_2(\vec{t}, \vec{y}) = \sum_i \frac{1}{2}(t_i - y_i)^2$
- It can be replaced by the MAE (L1 norm), cross-entropy, etc.
- The error $E$ is defined in terms of the output values $\vec{y}$ obtained with the weights $\theta$
FFNN – Backward pass – Output

- Remember: $y_i = \mathrm{sigmoid}(b_{h_i} + \sum_j u_{i \leftarrow j} h_j)$
- Let's write $s_i = b_{h_i} + \sum_j u_{i \leftarrow j} h_j$, so that $y_i = \mathrm{sigmoid}(s_i)$
- The derivative of the error with respect to an output weight $u_{i \leftarrow j}$ is written:
  $\frac{dE}{du_{i \leftarrow j}} = \frac{dE}{dy_i} \frac{dy_i}{ds_i} \frac{ds_i}{du_{i \leftarrow j}}$
FFNN – Backward pass – Output

$\frac{dE}{du_{i \leftarrow j}} = \frac{dE}{dy_i} \frac{dy_i}{ds_i} \frac{ds_i}{du_{i \leftarrow j}}$

- Let's decompose the three components:
  - $\frac{dE}{dy_i} = \frac{d}{dy_i} \frac{1}{2}(t_i - y_i)^2 = -(t_i - y_i)$
  - $\frac{dy_i}{ds_i} = \frac{d\,\mathrm{sigmoid}(s_i)}{ds_i} = \mathrm{sigmoid}(s_i)(1 - \mathrm{sigmoid}(s_i)) = y_i(1 - y_i)$
  - $\frac{ds_i}{du_{i \leftarrow j}} = \frac{d}{du_{i \leftarrow j}} \sum_j u_{i \leftarrow j} h_j = h_j$
FFNN – Backward pass – Output

$\frac{dE}{du_{i \leftarrow j}} = \frac{dE}{dy_i} \frac{dy_i}{ds_i} \frac{ds_i}{du_{i \leftarrow j}} = -(t_i - y_i)(y_i(1 - y_i))h_j = -(t_i - y_i)\, y_i'\, h_j$

- Let's abbreviate the derivative of the sigmoid function as $y_i'$, to generalize to all activation functions
- Let's add a learning rate parameter $\mu$ for more flexibility; we obtain the following update formula:
  $\Delta u_{i \leftarrow j} = \mu (t_i - y_i)\, y_i'\, h_j$
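A minimal NumPy sketch of this update, vectorized over all output weights at once (the function name and argument layout are illustrative assumptions):

```python
import numpy as np

def output_layer_update(t, y, h, mu):
    """Delta U = mu * (t - y) * y' * h, for all output weights at once.

    t, y: (l,) target and output vectors, h: (m,) hidden activations.
    y' = y * (1 - y) is the sigmoid derivative expressed via the output.
    """
    delta = (t - y) * y * (1 - y)   # (l,) per-output error signal
    return mu * np.outer(delta, h)  # (l, m) update for U
```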
FFNN – Backward pass – Hidden

- Remember: $h_j = \mathrm{sigmoid}(\sum_k w_{j \leftarrow k} x_k)$
- Let's write $z_j = \sum_k w_{j \leftarrow k} x_k$, so that $h_j = \mathrm{sigmoid}(z_j)$
- The derivative of $E$ with respect to the weights $W$ is written:
  $\frac{dE}{dw_{j \leftarrow k}} = \frac{dE}{dh_j} \frac{dh_j}{dz_j} \frac{dz_j}{dw_{j \leftarrow k}}$
FFNN – Backward pass – Hidden

$\frac{dE}{dw_{j \leftarrow k}} = \frac{dE}{dh_j} \frac{dh_j}{dz_j} \frac{dz_j}{dw_{j \leftarrow k}}$

- Let's decompose this derivative:
  - $\frac{dE}{dh_j} = \sum_i \frac{dE}{dy_i} \frac{dy_i}{ds_i} \frac{ds_i}{dh_j} = \sum_i -(t_i - y_i)\, y_i'\, u_{i \leftarrow j}$
  - $\frac{dh_j}{dz_j} = \frac{d\,\mathrm{sigmoid}(z_j)}{dz_j} = \mathrm{sigmoid}(z_j)(1 - \mathrm{sigmoid}(z_j)) = h_j(1 - h_j) = h_j'$
  - $\frac{dz_j}{dw_{j \leftarrow k}} = \frac{d}{dw_{j \leftarrow k}} \sum_k w_{j \leftarrow k} x_k = x_k$
FFNN – Backward pass – Hidden

$\frac{dE}{dw_{j \leftarrow k}} = \frac{dE}{dh_j} \frac{dh_j}{dz_j} \frac{dz_j}{dw_{j \leftarrow k}} = \sum_i (-(t_i - y_i)\, y_i'\, u_{i \leftarrow j})\, h_j'\, x_k$

- Thus, the update formula for the weights of the hidden nodes is:
  $\Delta w_{j \leftarrow k} = \mu \sum_i ((t_i - y_i)\, y_i'\, u_{i \leftarrow j})\, h_j'\, x_k$
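The corresponding NumPy sketch for the hidden-layer update, again with illustrative names; the hidden error signal is the weighted sum of the back-propagated output error signals:

```python
import numpy as np

def hidden_layer_update(t, y, h, x, U, mu):
    """Delta W: mu * (sum_i (t_i - y_i) y_i' u_{i<-j}) h_j' x_k, for all j, k.

    U: (l, m) hidden->output weights, x: (n,) input vector.
    """
    delta_out = (t - y) * y * (1 - y)            # (l,) output error signals
    delta_hid = (U.T @ delta_out) * h * (1 - h)  # (m,) hidden error signals
    return mu * np.outer(delta_hid, x)           # (m, n) update for W
```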
FFNN – Training – Summary

- Training of a neural network is done by processing training examples
- For each training example, the input is fed into the neural network
- The network produces an output given the input: forward pass
- The network output is compared to the target reference: error computation
- The error is back-propagated through the network: backward pass
- The parameters of the network are updated following gradient descent
- One iteration over the whole training set is called an epoch
- Usually a network is trained over many epochs
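Putting the pieces together, a minimal training loop might look like the following sketch, which applies the two update formulas above to learn XOR. The learning rate, epoch count, seed, and bias updates are illustrative assumptions (the bias derivations were not shown above), and training can occasionally get stuck in a local minimum with a different initialization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 2)); b_x = np.zeros(2)  # input -> hidden
U = rng.normal(size=(1, 2)); b_h = np.zeros(1)  # hidden -> output
mu = 1.0  # learning rate (illustrative choice)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

for epoch in range(5000):            # one epoch = one pass over the data
    for x, t in zip(X, T):
        h = sigmoid(W @ x + b_x)     # forward pass
        y = sigmoid(U @ h + b_h)
        d_out = (t - y) * y * (1 - y)        # backward pass: error signals
        d_hid = (U.T @ d_out) * h * (1 - h)
        U += mu * np.outer(d_out, h); b_h += mu * d_out  # gradient descent
        W += mu * np.outer(d_hid, x); b_x += mu * d_hid  # updates

# Expected: [[0 1 1 0]] once training has converged
print((sigmoid(U @ sigmoid(W @ X.T + b_x[:, None]) + b_h[:, None]) > 0.5).astype(int))
```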
NN – Summary

- Define the classification problem: binary, categorical
- Choose an error (loss) function
- Choose an optimizer; several gradient descent variants exist
Parameters Initialization