Proposes a new activation function (SELU, the scaled exponential linear unit) that enables robust training of very deep vanilla neural networks.
Most of Deep Learning's successes have come from Recurrent (RNN) or Convolutional (CNN) architectures, which are stabilized through weight-sharing. Vanilla (fully-connected) neural networks suffer from training instabilities, mainly because SGD becomes unstable beyond a few layers (vanishing or exploding gradients).
Normalizing activations (zero mean, unit variance) keeps their distribution from drifting and biasing the inputs of the following layer.
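To see the instability concretely, here is a minimal NumPy sketch (not from the paper; depth, width, and the Gaussian initialization are illustrative choices): with ReLU and variance-preserving random weights, the activation variance collapses after a few dozen layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_layer_stats(activation, depth=50, width=256):
    """Push standard-normal inputs through `depth` random dense layers
    and report mean/variance of the activations at the last layer."""
    x = rng.standard_normal((1024, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)  # Var(x @ W) ~= Var(x)
        x = activation(x @ W)
    return x.mean(), x.var()

def relu(z):
    return np.maximum(z, 0.0)

print(final_layer_stats(relu))  # variance shrinks towards 0: the signal vanishes with depth
```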
The activation function is designed to meet the following requirements:
- Negative and positive values, to control the mean
- Saturation regions (derivatives approaching zero), to dampen the variance when it is too large (taming exploding gradients)
- A slope larger than one, to increase the variance when it is too small
- A continuous curve, so that a fixed point exists where variance damping and variance increasing balance out
The authors derive a mapping between layers that satisfies these requirements; interestingly, its properties are verified through a computer-aided proof. The starting point is a mix between Exponential Linear Units (ELUs) and Leaky ReLUs.
Parameters are obtained by finding a fixed point of the mapping function that satisfies the normalization requirements. They propose SELU: selu(x) = λ·x for x > 0 and λ·α·(exp(x) − 1) for x ≤ 0, with α ≈ 1.6733 and λ ≈ 1.0507 (plotted in Fig. 1).
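A minimal NumPy sketch of SELU with the paper's fixed-point constants; the loop (an illustrative check, not from the paper) empirically verifies the fixed point: with zero-mean, variance-1/n weights, activations stay near zero mean and unit variance across many layers.

```python
import numpy as np

# Fixed-point parameters derived in the paper (Klambauer et al., 2017).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled ELU: identity for x > 0, saturating exponential for x <= 0."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Empirical check of the fixed point across depth.
rng = np.random.default_rng(0)
width, depth = 256, 50
x = rng.standard_normal((1024, width))
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)  # zero mean, variance 1/n
    x = selu(x @ W)
print(x.mean(), x.var())  # both stay close to (0, 1)
```

Swapping `selu` for a ReLU in the same loop reproduces the vanishing-variance behavior shown earlier, which is exactly the instability the fixed point removes.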
They also derive a matching dropout variant ("alpha dropout") that does not suffer from the same issues as regular dropout, as it is designed to preserve the zero mean and unit variance of the layer's activations.
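A sketch of alpha dropout in training mode, following the paper's construction: dropped units are set to SELU's negative saturation value α' = −λα rather than 0, then an affine correction a·y + b restores zero mean and unit variance. The constants and the affine formulas are from the paper; the function name and the training-mode-only scope are my own choices.

```python
import numpy as np

ALPHA, LAMBDA = 1.6732632423543772, 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA  # SELU's negative saturation value, ~= -1.7581

def alpha_dropout(x, keep_prob, rng):
    """Training-mode alpha dropout: drop to ALPHA_PRIME instead of 0,
    then apply the affine correction that restores zero mean / unit variance."""
    q = keep_prob
    y = np.where(rng.random(x.shape) < q, x, ALPHA_PRIME)
    a = (q + ALPHA_PRIME**2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * ALPHA_PRIME
    return a * y + b

# Check: moments of roughly N(0, 1) activations are preserved under dropout.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = alpha_dropout(x, keep_prob=0.9, rng=rng)
print(y.mean(), y.var())  # both close to (0, 1)
```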
Figure 1: SELU activation function
- 121 tasks from the UCI Machine Learning Repository
- Drug discovery task
- Astronomy task
Figure 2: Image from Klambauer et al, https://arxiv.org/abs/1706.02515 [license http://arxiv.org/licenses/nonexclusive-distrib/1.0/]
They compare against many baselines: Batch Norm, Layer Norm, Weight Norm, Highway Networks, and ResNets.
- Interesting use of a computer-generated proof to give bounds on the variance when the weights don't satisfy the zero-mean, unit-variance property
- Very smooth accuracy even for very deep networks (see Fig. 2)
- Sepp Hochreiter is a co-author
- Makes batch norm obsolete, which means a big training speedup
- Evaluated with SGD. How does it behave with e.g. Adam?
- Evaluated with small learning rates; can we be more aggressive with the learning rate?
- Comparison between Spectral Normalization and SNNs for mode collapse in GANs.