What?
Proposes a new activation function allowing the robust training of very deep vanilla neural networks.
Why?
Most of the successes of Deep Learning have come from either Recurrent (RNN) or Convolutional (CNN) architectures: they are stabilized by weight sharing. Vanilla neural networks suffer from training instabilities after a few layers, mainly vanishing or exploding gradients under SGD.
Some methods have been proposed, notably batch normalization ^{1}, which brings activations to zero mean and unit variance, but they are perturbed by stochastic regularization methods (e.g. dropout ^{2}).
Normalizing activations avoids a bias in the inputs of the following layer.
How?
An activation function is designed along the following requirements:
- Negative and positive values, to control the mean
- Saturation regions (zero gradient), to control the exploding gradient problem
- A slope larger than 1, to increase the variance when necessary
- A continuous curve
Derive a mapping between layers that satisfies these requirements. Interestingly, the properties are verified through a computer-aided proof. The starting point is a mix between Exponential Linear Units (ELUs) and Leaky ReLUs.
Parameters are obtained by finding a fixed point of the mapping that satisfies the normalization requirements. They propose λ ≈ 1.0507 and α ≈ 1.6733.
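As a quick sketch, the resulting SELU can be written in a few lines of NumPy; the constants below are the fixed-point values λ ≈ 1.0507 and α ≈ 1.6733 reported in the paper:

```python
import numpy as np

# Fixed-point constants from the paper: LAMBDA scales the whole output,
# ALPHA shapes the negative saturation region.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """selu(x) = lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))

# Zero-mean, unit-variance inputs stay (approximately) normalized after the
# activation -- this is exactly the self-normalizing fixed point:
z = selu(np.random.default_rng(0).standard_normal(200_000))
print(z.mean(), z.std())  # both stay close to 0 and 1 respectively
```

The slope λ > 1 for positive inputs and the saturation at −λα for large negative inputs are what implement the four requirements listed above.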
They also derive a parametrized dropout ("alpha dropout") which does not suffer from the same issues as regular dropout, as it is designed to preserve zero mean and unit variance in the layer's activations.
Figure 1: SELU activation function
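The parametrized dropout mentioned above can be sketched as follows. This is my reading of the paper's construction, so treat the correction formulas as an assumption: dropped units are set to SELU's negative saturation value α′ = −λα (rather than 0), then an affine correction a·x + b restores zero mean and unit variance:

```python
import numpy as np

ALPHA_PRIME = -1.7580993408473766  # -lambda * alpha: SELU's saturation value

def alpha_dropout(x, rate, rng):
    """Sketch of alpha dropout: preserves zero mean / unit variance.

    Dropped units are set to ALPHA_PRIME (not 0), then an affine
    correction a*x + b undoes the resulting shift in mean and variance.
    """
    q = 1.0 - rate                   # keep probability
    keep = rng.random(x.shape) < q   # Bernoulli(q) keep mask
    x = np.where(keep, x, ALPHA_PRIME)
    # For zero-mean, unit-variance input: E[x] = (1-q)*alpha' and
    # Var[x] = q + alpha'^2 * q * (1-q); a and b invert both.
    a = (q + ALPHA_PRIME**2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    return a * x + b

rng = np.random.default_rng(0)
y = alpha_dropout(rng.standard_normal(200_000), rate=0.1, rng=rng)
# y.mean() stays near 0 and y.std() near 1, unlike standard dropout
```

Standard dropout would pull the mean toward 0 from whatever it was and shrink the variance, which is what breaks the self-normalizing property.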
Evaluation
- 121 classification tasks from the UCI Machine Learning Repository
- A drug discovery task
- An astronomy task
Figure 2: Image from Klambauer et al., https://arxiv.org/abs/1706.02515 [license: http://arxiv.org/licenses/nonexclusive-distrib/1.0/]
They compare against many baselines: Batch Norm, Layer Norm, Weight Norm, Highway Networks, and ResNets.
Comments
- Interesting use of a computer-aided proof to bound the variance when the weights do not satisfy the zero-mean, unit-variance property
- Very smooth accuracy even for very deep networks (see Fig. 2)
- Co-authored by Sepp Hochreiter ^{3}
- Makes batch norm unnecessary, which means a big training speedup
Questions
- Evaluated with SGD. How does it behave with e.g. Adam?
- Evaluated with small learning rates; can we be more aggressive?
- How does Spectral Normalization ^{4} compare to SNNs (self-normalizing networks) for avoiding mode collapse in GANs?
References

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe, Christian Szegedy. ICML 2015. ↩

Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. JMLR 15(Jun):1929–1958, 2014. ↩

Long Short-Term Memory. Sepp Hochreiter, Jürgen Schmidhuber. Neural Computation 9(8):1735–1780, 1997. ↩

Spectral Normalization for Generative Adversarial Networks. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida. ICLR 2018. ↩