Deep neural network

Batch normalization:

Feature scaling in hidden layers: During the forward pass, a change in layer 1 changes the input that layer 2 receives. Suppose layer 2 gets feedback from layer 3 telling it to increase its output, while at the same time the update to layer 1 pushes layer 2's output down. Layer 2 then faces opposing changes, a problem known as internal covariate shift.
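As a rough illustration of that shift, the sketch below passes the same inputs through a single linear layer before and after a weight change and compares the statistics of what the next layer would receive. The shapes, the random data, and the size of the "update" are all made-up assumptions, not values from a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 1000 samples with 4 features each.
x = rng.normal(size=(1000, 4))

# Hypothetical weights of layer 1 (4 inputs -> 3 units).
W1 = rng.normal(size=(4, 3))
h_before = x @ W1            # what layer 2 sees before the update

# Pretend one training step shifted every weight by 0.5.
W1_updated = W1 + 0.5
h_after = x @ W1_updated     # what layer 2 sees after the update

print("std before:", h_before.std(axis=0))
print("std after: ", h_after.std(axis=0))
```

The inputs are identical in both cases, yet the spread of the values handed to layer 2 differs, which is the moving target the paragraph above describes.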

To keep the output of layer 2 stable, we apply feature scaling to the output of each hidden layer, so that small changes in layer 1 and layer 3 act on layer 2 in a consistent way.

However, keeping a layer's output stable is not easy: different samples affect the output differently, so we would like the scaling to reflect the whole dataset. Normalizing over the whole dataset at every layer is too expensive, so we use a subset of the data, a batch, as a substitute.
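A small sketch of why a batch can stand in for the whole dataset: the per-unit mean and standard deviation computed on a random batch are usually close to the ones computed on all samples. The "layer outputs" here are synthetic numbers, and the dataset size (10,000) and batch size (256) are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of one hidden layer over a whole dataset:
# 10,000 samples, 3 units.
outputs = rng.normal(loc=2.0, scale=3.0, size=(10_000, 3))

# Statistics over the whole dataset (expensive in a real network).
full_mean = outputs.mean(axis=0)
full_std = outputs.std(axis=0)

# Statistics over one random batch of 256 samples (cheap).
batch_idx = rng.choice(outputs.shape[0], size=256, replace=False)
batch = outputs[batch_idx]
batch_mean = batch.mean(axis=0)
batch_std = batch.std(axis=0)

print("full  mean/std:", full_mean, full_std)
print("batch mean/std:", batch_mean, batch_std)
```

The batch estimate is noisy, and the noise grows as the batch gets smaller, so very small batches give a less reliable substitute.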

Batch normalization: feed a batch of data to the model and normalize the outputs of each hidden layer using that batch's statistics. This normalization is often applied before the activation function. Some activation functions have regions where the gradient vanishes (the saturating tails of sigmoid and tanh, or the flat negative side of ReLU) and information is lost there, so normalizing first helps keep the inputs to the activation out of those regions.
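The following is a minimal NumPy sketch of that idea for the forward pass only, not a full implementation. The function name `batch_norm`, the small constant `eps`, and the scale and shift parameters `gamma` and `beta` are assumptions added here: `gamma` and `beta` are part of the commonly used formulation but are not discussed above, so they default to values that leave the normalized output unchanged. The weights and bias are hypothetical and deliberately chosen to push most pre-activations into the flat region of ReLU.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize z of shape (batch, units) per unit, using batch statistics."""
    mean = z.mean(axis=0)                    # per-unit mean over the batch
    var = z.var(axis=0)                      # per-unit variance over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * z_hat + beta              # optional scale and shift (assumed)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))        # one batch: 32 samples, 4 inputs
W = rng.normal(size=(4, 3)) * 5.0   # hypothetical, deliberately large weights
b = np.full(3, -10.0)               # hypothetical bias pushing values negative

z = x @ W + b                       # pre-activations of the hidden layer
z_norm = batch_norm(z)              # normalize BEFORE the activation

a_plain = relu(z)                   # activation without normalization
a_bn = relu(z_norm)                 # activation after normalization

print("fraction of active units without BN:", (a_plain > 0).mean())
print("fraction of active units with BN:   ", (a_bn > 0).mean())
```

At inference time, implementations typically replace the batch statistics with running averages collected during training; that part is left out of this sketch.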