[NOTES] Self-Normalizing Neural Networks (SELU)

by G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter

March 23, 2018 - 3 minute read -
ml deeplearning notes



Proposes a new activation function that enables robust training of very deep vanilla (fully-connected) neural networks.


Most of the successes of Deep Learning have come from either Recurrent (RNN) or Convolutional (CNN) architectures, which are stabilized by weight sharing. Vanilla (fully-connected) neural networks suffer from training instabilities, mainly because SGD becomes unstable after a few layers (vanishing or exploding gradients).

Some methods have been proposed, notably batch normalization 1, which brings activations to zero mean and unit variance, but these are perturbed by stochastic regularisation methods (e.g. dropout 2).

Normalizing activations avoids a bias in the inputs of the following layer.


The activation function is designed to satisfy the following requirements:

  1. Negative and positive values, so the mean can be shifted in both directions
  2. Saturation regions (derivatives approaching zero) to dampen the variance when it is too large
  3. A slope larger than 1 to increase the variance when it is too small
  4. A continuous curve, which guarantees a fixed point where the variance-damping and variance-increasing effects balance out

The authors derive a mapping between layers that satisfies these requirements; interestingly, the properties are verified through a computer-aided proof. The starting point is a mix between Exponential Linear Units (ELUs) and Leaky ReLUs.

Parameters are obtained by finding a fixed point of the mapping function that satisfies the normalization requirements. They propose the scaled exponential linear unit (SELU): selu(x) = λ·x for x > 0 and λ·α·(exp(x) − 1) for x ≤ 0, with α ≈ 1.6733 and λ ≈ 1.0507 at the zero-mean, unit-variance fixed point.

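A minimal NumPy sketch of the activation (the constants are the fixed-point values reported in the paper; the stacked-layer sanity check and its variable names are my own):

```python
import numpy as np

# Fixed-point parameters from Klambauer et al. (2017)
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """SELU: lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))

# Sanity check: activations of a deep random stack stay roughly zero-mean / unit-variance
rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 256))
for _ in range(16):
    # zero-mean weights with variance 1/fan_in, as the fixed-point analysis assumes
    w = rng.normal(0.0, np.sqrt(1.0 / x.shape[1]), size=(x.shape[1], 256))
    x = selu(x @ w)
print(x.mean(), x.std())  # should remain close to 0 and 1
```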
They also derive a parametrised dropout ("alpha dropout") which does not suffer from the same issues as regular dropout, as it is designed to preserve zero mean and unit variance in the layer's activations.
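A rough sketch of that dropout variant, under my reading of the paper: dropped units are set to the SELU saturation value −λα instead of zero, and an affine correction restores zero mean and unit variance (function and variable names are mine):

```python
import numpy as np

ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805
ALPHA_PRIME = -LAMBDA * ALPHA  # SELU's negative saturation value

def alpha_dropout(x, rate, rng):
    """Drop units to alpha' instead of 0, then rescale to keep zero mean / unit variance."""
    q = 1.0 - rate                   # keep probability
    keep = rng.random(x.shape) < q   # True = keep the unit
    dropped = np.where(keep, x, ALPHA_PRIME)
    # Affine correction derived in the paper to restore the normalized statistics
    a = (q + ALPHA_PRIME**2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * ALPHA_PRIME
    return a * dropped + b

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 256))
y = alpha_dropout(x, rate=0.1, rng=rng)
print(y.mean(), y.std())  # should remain close to 0 and 1
```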


Figure 1: SELU activation function


  • 121 classification tasks from the UCI repository
  • A drug discovery task (Tox21)
  • An astronomy task (HTRU2 pulsar detection)


Figure 2: Image from Klambauer et al, https://arxiv.org/abs/1706.02515 [license http://arxiv.org/licenses/nonexclusive-distrib/1.0/]

They compare against many baselines: Batch Norm, Layer Norm, Weight Norm, Highway networks, and ResNets.


  • Interesting use of a computer-assisted proof to give bounds on the variance when the weights deviate from the zero-mean, unit-variance property
  • Very smooth accuracy even for very deep networks (see Fig. 2)
  • Sepp Hochreiter 3
  • Makes batch normalization unnecessary, which means a big training speedup


  • Evaluated with SGD. How does it behave with e.g. Adam?
  • Evaluated with small learning rates; can we be more aggressive with the learning rate?
  • Comparison between Spectral Normalization 4 and SNN for mode collapse in GANs.