1 Motivation and Setting
In this survey, we summarize results on the expressivity of deep neural networks from [1]. Neural network expressivity looks at how the architecture of the network (width, depth, connectivity) affects the properties of the resulting function.
Because expressivity is a fundamental step towards better understanding neural networks, there is much prior work in this area. Many of the existing results compare the functions achievable by particular network architectures ([2, 3], [4, 5, 6, 7]). While compelling, these results also highlight limitations of much of the existing work on expressivity: unrealistic assumptions are sometimes made about the architectural shape (e.g. exponentially large width), and networks are often compared via their ability to approximate one specific function, which, in isolation, cannot support a more general conclusion.
To overcome this, we start by analyzing expressiveness in a setting which is both more general than one of hardcoded functions and immediately related to practice: networks after random initialization. Not only does this mean conclusions are independent of specific weight settings, but understanding behavior at random initialization also provides a natural baseline against which to compare the effects of training and trained networks, which we summarize in Sections 3 and 4.
Companion Paper
In a companion paper [8], the propagation of Riemannian curvature through random networks is studied by developing a mean field theory approach, which quantitatively supports the conjecture that deep networks can disentangle curved manifolds in input space.
2 Random networks
The results on networks after random initialization examine how the depth and width of a network architecture affect its expressive power, via three natural measures of functional richness: number of transitions, activation patterns, and dichotomies. More precisely, fully connected networks of input dimension $m$, depth $n$ and width $k$ are studied, with weights and biases randomly initialized as $\mathcal{N}(0, \sigma_w^2/k)$ and $\mathcal{N}(0, \sigma_b^2)$ respectively.
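As a minimal sketch of this setting, the snippet below draws a random fully connected hard tanh network and pushes one input through it. The function names, the default values of `sigma_w` and `sigma_b`, and the choice to scale every layer's weight variance by the width `k` are assumptions made for illustration, not details taken from [1].

```python
import numpy as np

def init_random_network(m, k, n, sigma_w=2.0, sigma_b=1.0, seed=0):
    """Draw (W, b) for n layers of width k on inputs of dimension m.

    Assumed scaling: W ~ N(0, sigma_w^2 / k), b ~ N(0, sigma_b^2).
    """
    rng = np.random.default_rng(seed)
    shapes = [(k, m)] + [(k, k)] * (n - 1)
    return [
        (rng.normal(0.0, sigma_w / np.sqrt(k), size=shape),
         rng.normal(0.0, sigma_b, size=k))
        for shape in shapes
    ]

def hard_tanh(h):
    # Linear on [-1, 1], saturated at -1 and +1 outside that interval.
    return np.clip(h, -1.0, 1.0)

def forward(params, x):
    """Return the activations of every layer for an input x of shape (m,)."""
    activations = []
    z = x
    for W, b in params:
        z = hard_tanh(W @ z + b)
        activations.append(z)
    return activations

params = init_random_network(m=10, k=32, n=4)
layers = forward(params, np.ones(10))
print([a.shape for a in layers])
```

Keeping the activations of every layer, rather than only the output, is what makes it possible to measure quantities layer by layer.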
2.1 Measures of Expressivity
In more detail, the measures of expressivity are:
Transitions: Counting neuron transitions is introduced indirectly via linear regions in [9], and provides a tractable method to estimate the nonlinearity of the computed function.
Activation Patterns: Transitions of a single neuron can be extended to the outputs of all neurons in all layers, leading to the (global) definition of a network activation pattern, also a measure of nonlinearity. Through connections to the theory of hyperplane arrangements, network activation patterns directly show how the network partitions input space into convex polytopes (Figure 1).
Dichotomies: The heterogeneity of a generic class of functions from a particular architecture is also measured, by counting the number of dichotomies seen for a fixed set of inputs. This measure is ‘statistically dual’ to sweeping input in some cases.
The paper shows that all three measures grow exponentially with the depth of the network, but not with the width.
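To make the first two measures concrete, here is a rough sketch that counts neuron transitions and distinct activation patterns as an input sweeps along a curve. It assumes a hard tanh network, treats each neuron as being in one of three regimes (saturated low, linear, saturated high), and samples the curve at finitely many points, so both counts are only estimates. The helper names, the curve, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np

def random_hard_tanh_net(m, k, n, sigma_w, sigma_b, rng):
    """Hypothetical helper: n layers of width k, W ~ N(0, sigma_w^2/k), b ~ N(0, sigma_b^2)."""
    shapes = [(k, m)] + [(k, k)] * (n - 1)
    return [(rng.normal(0.0, sigma_w / np.sqrt(k), size=s),
             rng.normal(0.0, sigma_b, size=k)) for s in shapes]

def neuron_regimes(params, xs):
    """Regime of each neuron for each input: -1 (saturated low), 0 (linear), +1 (saturated high).

    xs has shape (num_points, m); the result has shape (num_points, n * k).
    """
    z, regimes = xs, []
    for W, b in params:
        h = z @ W.T + b
        regimes.append(np.where(h >= 1.0, 1, np.where(h <= -1.0, -1, 0)))
        z = np.clip(h, -1.0, 1.0)
    return np.concatenate(regimes, axis=1).astype(np.int8)

def count_transitions_and_patterns(params, xs):
    r = neuron_regimes(params, xs)
    # Transitions: regime changes of any neuron between consecutive sample points.
    transitions = int(np.count_nonzero(r[1:] != r[:-1]))
    # Activation patterns: distinct joint regimes of all neurons seen along the curve.
    patterns = len({row.tobytes() for row in r})
    return transitions, patterns

rng = np.random.default_rng(0)
m, k = 8, 16
x0, x1 = rng.normal(size=m), rng.normal(size=m)
t = np.linspace(0.0, 2.0 * np.pi, 2000)[:, None]
xs = np.cos(t) * x0 + np.sin(t) * x1  # closed curve through x0 and x1

results = {}
for depth in (2, 4, 8):
    net = random_hard_tanh_net(m, k, depth, sigma_w=4.0, sigma_b=1.0, rng=rng)
    results[depth] = count_transitions_and_patterns(net, xs)
    print(depth, results[depth])
```

The exact counts depend on the random seed and on the sampling resolution; a finer sampling of the curve can only reveal more transitions, never fewer.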
Connection to Trajectory Length
In fact, this is due to an underlying connection of all three measures to another quantity, trajectory length – how a 1D curve in input space changes in length as it propagates through the network. It is proved [1] that the trajectory length of an input grows exponentially in the depth of a network but not the width:
Theorem 1.
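Bound on Growth of Trajectory Length. Let $F_W$ be a hard tanh random neural network of width $k$, with weight variance $\sigma_w^2/k$ and bias variance $\sigma_b^2$, and $x(t)$ a one dimensional trajectory in input space. Define $z^{(d)}(t)$ to be the image of the trajectory in layer $d$ of $F_W$, and let $l(z^{(d)}(t))$ be the arc length of $z^{(d)}(t)$. Then
$$\mathbb{E}\left[ l(z^{(d)}(t)) \right] \geq O\left( \left( \frac{\sigma_w \sqrt{k}}{\sqrt{\sigma_w^2 + \sigma_b^2 + k\sqrt{\sigma_w^2 + \sigma_b^2}}} \right)^{d} \right) l(x(t)).$$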
This is also verified empirically (Figure 2).
Figure 2: The exponential growth of trajectory length with depth, in a random deep network with hard tanh nonlinearities. A circular trajectory is chosen between two random vectors. The image of that trajectory is taken at each layer of the network, and its length measured. (a, b) The trajectory length vs. layer, in terms of the network width $k$ and weight variance $\sigma_w^2$, both of which determine its growth rate. (c, d) The average ratio of a trajectory's length in layer $d+1$ relative to its length in layer $d$. The solid line shows simulated data, while the dashed lines show upper and lower bounds (Theorem 1). Growth rate is a function of layer width $k$ and weight variance $\sigma_w^2$.
Theoretical intuition is then provided for the direct proportionality of transitions, activation patterns and dichotomies to trajectory length, and is further confirmed through experiments [1].
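The trajectory length measurement described above can be sketched roughly as follows: sample a curve between two random vectors, propagate every sample through a random hard tanh network, and sum the distances between consecutive images at each layer. The function names, the quarter-arc parameterization of the curve, and all default parameter values are assumptions for illustration, and the polyline length is only a discretized estimate that undercounts when the image curve oscillates faster than the sampling can resolve.

```python
import numpy as np

def arc_length(points):
    """Length of the polyline through the rows of `points`."""
    return float(np.sum(np.linalg.norm(np.diff(points, axis=0), axis=1)))

def layer_lengths(m=16, k=64, n=8, sigma_w=4.0, sigma_b=1.0,
                  num_points=2000, seed=0):
    """Estimated trajectory length at the input and after each of n layers.

    Assumed initialization: W ~ N(0, sigma_w^2 / k), b ~ N(0, sigma_b^2).
    """
    rng = np.random.default_rng(seed)
    x0, x1 = rng.normal(size=m), rng.normal(size=m)
    t = np.linspace(0.0, np.pi / 2, num_points)[:, None]
    z = np.cos(t) * x0 + np.sin(t) * x1  # arc running from x0 to x1
    lengths = [arc_length(z)]
    fan_in = m
    for _ in range(n):
        W = rng.normal(0.0, sigma_w / np.sqrt(k), size=(k, fan_in))
        b = rng.normal(0.0, sigma_b, size=k)
        z = np.clip(z @ W.T + b, -1.0, 1.0)  # hard tanh layer
        lengths.append(arc_length(z))
        fan_in = k
    return lengths

lengths = layer_lengths()
for d in range(1, len(lengths)):
    print(f"layer {d}: length {lengths[d]:.2f}, ratio {lengths[d] / lengths[d - 1]:.2f}")
```

Varying `k` and `sigma_w` in this sketch corresponds to the quantities varied across the panels of Figure 2; the printed per-layer ratio is the quantity shown in panels (c, d).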
3 The Effect of Training: Trading Off Expressivity and Stability
The paper [1] then explores the effect of training on the measures of expressivity. Most importantly, an exponential depth dependence, as demonstrated at the start of training, makes the resulting function very sensitive to perturbations, which is not a desired feature in a trained network.
When weights are initialized with large $\sigma_w$, training increases stability by reducing trajectory length and transitions during the training process (Figure 4).
When the network is initialized with too small a $\sigma_w$, however, performance can suffer because the function at initialization might not offer enough expressiveness to fit the target. In this case, the training process monotonically increases the trajectory length and number of transitions (Figure 5).
In summary, the paper [1] concludes that training trades off between achieving enough expressiveness and simultaneously trying to maintain stability.
4 Trained Networks: Power of Remaining Depth
The expanding trajectory length suggests that the effect of parameter choices in earlier layers is amplified by later layers. Combined with the exponential increase in dichotomies with depth, this suggests that the expressive power of the parameters, and thus of the layers, is related to the remaining depth of the network after that layer. The paper demonstrates this in practice, with experiments on MNIST and CIFAR10 (Figure 6).
References
[1] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv e-prints, June 2016. URL https://arxiv.org/abs/1606.05336.
[2] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.
[3] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
[4] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
[5] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
[6] James Martens, Arkadev Chattopadhyay, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877–2885, 2013.
[7] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. Neural Networks and Learning Systems, IEEE Transactions on, 25(8):1553–1565, 2014.
[8] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, pages 3360–3368, 2016.
[9] Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piecewise linear activations. arXiv preprint arXiv:1312.6098, 2013.