[NOTES] Self-Normalizing Neural Networks (SELU)

by G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter

March 23, 2018 - 3 minute read
ml deeplearning notes

arXiv: https://arxiv.org/abs/1706.02515

What?

Proposes a new activation function that allows robust training of very deep vanilla (fully-connected) neural networks.

Why?

Most of the successes of Deep Learning have come from either Recurrent (RNN) or Convolutional (CNN) architectures, which are stabilized by weight sharing. Vanilla Neural Networks suffer from training instabilities, mainly because SGD becomes unstable after a few layers (vanishing or exploding gradients).

Some methods have been proposed, notably batch normalization [1], which brings activations to zero mean and unit variance, but the resulting normalization is perturbed by stochastic regularisation methods (e.g. dropout [2]).

Keeping the activations normalized avoids introducing a bias in the inputs of the following layer.

How?

The activation function is designed to meet the following requirements:

  1. Negative and positive values, to control the mean
  2. Saturation regions (zero gradient), to control the exploding-gradient problem
  3. A slope larger than one, to increase the variance when necessary
  4. A continuous curve

The authors derive a mapping between layers that satisfies these requirements. Interestingly, the properties are verified through a computer-aided proof. The starting point is a mix between Exponential Linear Units (ELUs) and Leaky ReLUs.
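
Roughly, the idea of the mapping (notation mine, following the paper's setup): if the inputs to a unit have mean μ and variance ν, and the unit's weights have sum ω and squared sum τ, then the pre-activation z = wᵀx has mean μω and variance ντ, so applying SELU defines a map between the activation statistics of consecutive layers:

```latex
g(\mu, \nu) \;=\; (\tilde{\mu}, \tilde{\nu})
            \;=\; \bigl(\,\mathbb{E}[\operatorname{selu}(z)],\ \operatorname{Var}[\operatorname{selu}(z)]\,\bigr),
\qquad z = w^{\top} x,\quad \omega = \textstyle\sum_i w_i,\quad \tau = \textstyle\sum_i w_i^2 .
```

The SELU parameters are then chosen so that, for normalized weights (ω = 0, τ = 1), the point (μ, ν) = (0, 1) is a stable, attracting fixed point of g.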

Parameters are obtained by finding a fixed point of the mapping function that satisfies the normalization requirements. They propose λ ≈ 1.0507 and α ≈ 1.6733 for the resulting scaled exponential linear unit (SELU), shown in Figure 1.
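
A minimal NumPy sketch of the activation (the function below is mine, not the authors' reference code; the constants are the fixed-point values reported in the paper):

```python
import numpy as np

# Fixed-point constants from the paper (lambda scales the whole curve,
# alpha sets the depth of the negative saturation region).
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit: lambda*x for x > 0,
    lambda*alpha*(exp(x) - 1) for x <= 0."""
    return LAMBDA * np.where(x > 0.0, x, ALPHA * np.expm1(np.minimum(x, 0.0)))
```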

They also derive a parametrised dropout variant which does not suffer from the same issues as regular dropout, as it is designed to preserve zero mean and unit variance in the layer's activations.
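
The paper calls this variant alpha dropout: dropped units are set to the SELU saturation value −λα instead of zero, and an affine correction then restores zero mean and unit variance. A sketch, assuming the correction formulas from the paper (the helper itself is mine):

```python
def alpha_dropout(x, rate=0.05, rng=None):
    """Training-time alpha dropout (sketch)."""
    rng = rng or np.random.default_rng()
    q = 1.0 - rate                        # keep probability
    alpha_prime = -LAMBDA * ALPHA         # SELU saturation value (~ -1.758)
    keep = rng.random(x.shape) < q        # Bernoulli keep mask
    dropped = np.where(keep, x, alpha_prime)
    # Affine correction so zero-mean, unit-variance inputs stay normalized.
    a = (q + alpha_prime**2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * alpha_prime
    return a * dropped + b
```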

Figure 1: SELU activation function

Evaluation

  • 121 tasks from the UCI machine learning repository
  • Drug discovery task (Tox21)
  • Astronomy task

Figure 2: Accuracy comparison. Image from Klambauer et al., https://arxiv.org/abs/1706.02515 [license http://arxiv.org/licenses/nonexclusive-distrib/1.0/]

They compare against many baselines: Batch Norm, Layer Norm, Weight Norm, Highway Networks, and ResNets.

Comments

  • Interesting use of a computer-aided proof to bound the variance when the weights do not exactly satisfy the zero-mean, unit-variance property (a quick sanity check of the self-normalizing behaviour is sketched after this list)
  • Very smooth accuracy even for very deep networks (see Fig. 2)
  • Sepp Hochreiter [3]
  • Makes batch norm obsolete, which means a big training speedup
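
As that sanity check (a toy experiment of mine, not from the paper), stacking many dense layers with LeCun-normal weights (variance 1/fan_in) and the selu sketch above keeps the activation statistics close to zero mean and unit variance:

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 256))                     # standard-normal inputs
for _ in range(32):                                      # 32 dense SELU layers
    w = rng.normal(0.0, np.sqrt(1.0 / 256), (256, 256))  # LeCun-normal init
    x = selu(x @ w)
print(x.mean(), x.std())                                 # stays close to 0 and 1
```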

Questions

  • Evaluated with SGD. How does it behave with e.g. Adam?
  • Evaluated with small learning rates; can we be more aggressive with the learning rate?
  • Comparison between Spectral Normalization [4] and SNNs for mode collapse in GANs.

Resources

  • Code
  • Discussion

References