What?
Proposes a new activation function allowing the robust training of very deep vanilla neural networks.
Why?
Most of the successes of Deep Learning have come from either Recurrent (RNN) or Convolutional (CNN) architectures: they are stabilized by weight sharing. Vanilla neural networks suffer from training instabilities after a few layers, mainly vanishing or exploding gradients under SGD.
Some methods have been proposed, notably batch normalization ^{1}, which brings activations to zero mean and unit variance, but they are perturbed by stochastic regularization methods (e.g. dropout ^{2}).
Normalizing activations avoids a bias in the inputs of the following layer.
How?
An activation function is designed along the following requirements:
- Negative and positive values, to control the mean
- Saturation regions (zero gradient), to control the exploding gradient problem
- A slope larger than 1, to increase the variance when necessary
- A continuous curve
Derive a mapping between layers that satisfies these requirements. Interestingly, the properties are verified through a computer-aided proof. The starting point is a mix between Exponential Linear Units (ELUs) and Leaky ReLUs.
Parameters are obtained by finding a fixed point of the mapping that satisfies the normalization requirements. They propose λ ≈ 1.0507 and α ≈ 1.6733.
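As a quick sketch, the resulting SELU can be written in a few lines of NumPy; the constants below are the fixed-point values λ ≈ 1.0507 and α ≈ 1.6733 reported in the paper:

```python
import numpy as np

# Fixed-point constants from the paper: LAMBDA scales the whole output,
# ALPHA shapes the negative saturation region.
ALPHA = 1.6732632423543772
LAMBDA = 1.0507009873554805

def selu(x):
    """selu(x) = lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * np.expm1(x))

# Zero-mean, unit-variance inputs stay (approximately) normalized after the
# activation -- this is exactly the self-normalizing fixed point:
z = selu(np.random.default_rng(0).standard_normal(200_000))
print(z.mean(), z.std())  # both stay close to 0 and 1 respectively
```

The slope λ > 1 for positive inputs and the saturation at −λα for large negative inputs are what implement the four requirements listed above.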
They also derive a parametrized dropout ("alpha dropout") which does not suffer from the same issues as regular dropout, as it is designed to preserve zero mean and unit variance in the layer's activations.
Figure 1: SELU activation function
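The parametrized dropout mentioned above can be sketched as follows. This is my reading of the paper's construction, so treat the correction formulas as an assumption: dropped units are set to SELU's negative saturation value α′ = −λα (rather than 0), then an affine correction a·x + b restores zero mean and unit variance:

```python
import numpy as np

ALPHA_PRIME = -1.7580993408473766  # -lambda * alpha: SELU's saturation value

def alpha_dropout(x, rate, rng):
    """Sketch of alpha dropout: preserves zero mean / unit variance.

    Dropped units are set to ALPHA_PRIME (not 0), then an affine
    correction a*x + b undoes the resulting shift in mean and variance.
    """
    q = 1.0 - rate                   # keep probability
    keep = rng.random(x.shape) < q   # Bernoulli(q) keep mask
    x = np.where(keep, x, ALPHA_PRIME)
    # For zero-mean, unit-variance input: E[x] = (1-q)*alpha' and
    # Var[x] = q + alpha'^2 * q * (1-q); a and b invert both.
    a = (q + ALPHA_PRIME**2 * q * (1.0 - q)) ** -0.5
    b = -a * (1.0 - q) * ALPHA_PRIME
    return a * x + b

rng = np.random.default_rng(0)
y = alpha_dropout(rng.standard_normal(200_000), rate=0.1, rng=rng)
# y.mean() stays near 0 and y.std() near 1, unlike standard dropout
```

Standard dropout would pull the mean toward 0 from whatever it was and shrink the variance, which is what breaks the self-normalizing property.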
Evaluation
- 121 classification tasks from the UCI Machine Learning Repository
- A drug discovery task
- An astronomy task
Figure 2: Image from Klambauer et al., https://arxiv.org/abs/1706.02515 [license: http://arxiv.org/licenses/nonexclusive-distrib/1.0/]
They compare against many baselines: Batch Norm, Layer Norm, Weight Norm, Highway Networks, and ResNets.
Comments
- Interesting use of a computer-aided proof to bound the variance when the weights do not satisfy the zero-mean, unit-variance property
- Very smooth accuracy even for very deep networks (see Fig. 2)
- Co-authored by Sepp Hochreiter ^{3}
- Makes batch norm unnecessary, which means a big training speedup
Questions
- Evaluated with SGD. How does it behave with e.g. Adam?
- Evaluated with small learning rates; can we be more aggressive?
- How does Spectral Normalization ^{4} compare to SNNs (self-normalizing networks) for avoiding mode collapse in GANs?
References

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Sergey Ioffe, Christian Szegedy. ICML 2015. ↩

Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov. JMLR 15(Jun):1929–1958, 2014. ↩

Long Short-Term Memory. Sepp Hochreiter, Jürgen Schmidhuber. Neural Computation 9(8):1735–1780, 1997. ↩

Spectral Normalization for Generative Adversarial Networks. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida. ICLR 2018. ↩