Deep neural networks have proved astoundingly effective at a wide range of empirical tasks, from image classification (Krizhevsky et al., 2012) to identifying particles in high energy physics (Baldi et al., 2014), playing Go (Silver et al., 2016), and even modeling human student learning (Piech et al., 2015).
Despite these successes, our understanding of how and why neural network architectures achieve their empirical successes is still lacking. This includes even the fundamental question of neural network expressivity: how do the architectural properties of a neural network (depth, width, layer type) affect the resulting functions it can compute?
This is a foundational question, and there is a rich history of prior work addressing expressivity in neural networks. However, it has been challenging to derive conclusions that provide both theoretical generality with respect to choices of architecture as well as suggestions for meaningful practical consequences.
Indeed, the very first results on this question took a highly theoretical approach, from using functional analysis to show universal approximation results (Hornik et al., 1989; Cybenko, 1989), to analyzing expressivity via comparisons to boolean circuits (Maass et al., 1994) and studying network VC dimension (Bartlett et al., 1998). While these results provided theoretically general conclusions, the shallow networks they studied are very different from the deep models that have proven so successful in recent years.
In response, several recent papers have focused on understanding the benefits of depth for neural networks (Pascanu et al., 2013; Montufar et al., 2014; Eldan and Shamir, 2015; Telgarsky, 2015; Martens et al., 2013; Bianchini and Scarselli, 2014). These results are compelling and take modern architectural changes into account, but they only show that a specific choice of weights for a deeper network results in inapproximability by a shallow (typically one or two hidden layers) network.
In particular, the goal of this new line of work has been to establish lower bounds — showing separations between shallow and deep networks — and as such they are based on hand-coded constructions of specific network weights. Even if the weight values used in these constructions are robust to small perturbations (as in (Pascanu et al., 2013; Montufar et al., 2014)), the functions that arise from these constructions tend toward extremal properties by design, and there is no evidence that a network trained on data ever resembles such a function.
This has meant that a set of fundamental questions about neural network expressivity has remained largely unanswered. First, we lack a good understanding of the “typical” case rather than the worst case in these bounds for deep networks, and consequently have no way to evaluate whether the hand-coded extremal constructions provide a reflection of the complexity encountered in more standard settings. Second, we lack an understanding of upper bounds to match the lower bounds produced by this prior work; do the constructions used to date place us near the limit of the expressive power of neural networks, or are there still large gaps? Finally, if we had an understanding of these two issues, we might begin to address the implications for the behavior of trained networks in applications.
Our contributions: Measures of Expressivity and their Applications
In this paper, we address this set of challenges by defining and analyzing an interrelated set of measures of expressivity for neural networks; our framework applies to a wide range of standard architectures, independent of specific weight choices. We begin our analysis with neural networks after random initialization; as random initialization is the start of all popular optimization methods, this gives a natural baseline to compare and contrast with the expressive behavior of trained networks.
Our first measure of expressivity is based on the notion of an activation pattern: in a network where the units compute functions based on discrete thresholds, we can ask which units are above or below their thresholds (i.e. which units are “active” and which are not). For the range of standard architectures that we consider, the network is essentially computing a linear function once we fix the activation pattern; thus, counting the number of possible activation patterns provides a concrete way of measuring the complexity beyond linearity that the network provides. We give an upper bound on the number of possible activation patterns, over any setting of the weights. This bound is tight as it matches the hand-constructed lower bounds of earlier work (Pascanu et al., 2013; Montufar et al., 2014).
Key to our analysis is the notion of a transition, in which changing an input to a nearby input changes the activation pattern. We study the behavior of transitions as we pass the input along a one-dimensional parametrized trajectory x(t). Our central finding is that the trajectory length grows exponentially in the depth of the network.
Trajectory length serves as a unifying notion in our measures of expressivity, and it leads to insights into the behavior of trained networks as well. Specifically, we find that the exponential growth in trajectory length as a function of depth implies that small adjustments in parameters lower in the network induce larger changes than comparable adjustments higher in the network. We demonstrate this phenomenon through experiments on MNIST and CIFAR-10, where the network displays much less robustness to noise in the lower layers, and better performance when the lower layers are trained well.
The contributions of this paper are thus threefold:
Measures of expressivity: We propose easily computable measures of neural network expressivity that capture the expressive power inherent in different neural network architectures, independent of specific weight settings.
Exponential trajectories: We find an exponential depth dependence displayed by these measures, through a unifying analysis in which we study how the network transforms its input by measuring trajectory length.
All weights are not equal (the lower layers matter more): Finally, we show how these results on trajectory length suggest that optimizing weights in lower layers of the network is particularly important, and we demonstrate this effect in experiments with trained networks.
In prior work, we studied the propagation of Riemannian curvature through random networks by developing a mean field theory approach, which quantitatively supports the conjecture that deep networks can disentangle curved manifolds in input space. Here, we take an approach grounded in computational geometry, presenting measures with a combinatorial flavor and exploring their implications for the behavior of randomly initialized and trained networks.
2 Measures of Expressivity
Given a neural network of a certain architecture (some depth, width, layer types), we have an associated function F_W(x), where x is an input and W represents all the parameters of the network. Our goal is to understand how the behavior of F_W(x) changes as W changes, for values of W that we might encounter during training, and across inputs x.
The first major difficulty comes from the high dimensionality of the input. Precisely quantifying the properties of F_W over the entire input space is intractable. As a tractable alternative, we study simple one dimensional trajectories through input space. More formally:
Definition: Given two points x_0, x_1 ∈ R^m, we say x(t) is a trajectory (between x_0 and x_1) if x(t) is a curve parametrized by a scalar t ∈ [0, 1], with x(0) = x_0 and x(1) = x_1.
Simple examples of a trajectory would be a line (x(t) = t x_1 + (1 − t) x_0) or a circular arc (x(t) = cos(πt/2) x_0 + sin(πt/2) x_1), but in general x(t) may be more complicated, and potentially not expressible in closed form.
Armed with this notion of trajectories, we can begin to define measures of expressivity of a network F_W over trajectories x(t).
2.1 Neuron Transitions and Activation Patterns
In (Montufar et al., 2014), the notion of a “linear region” is introduced. Given a neural network with piecewise linear activations (such as ReLU or hard tanh), the function it computes is also piecewise linear, a consequence of the fact that composing piecewise linear functions results in a piecewise linear function. So one way to measure the “expressive power” of different architectures is to count the number of linear pieces (regions), which determines how nonlinear the function is.
In fact, a change in linear region is caused by a neuron transition in the output layer. More precisely:
Definition For fixed W, we say a neuron with a piecewise linear activation function transitions between inputs x and x + δ if its activation function switches linear region between x and x + δ.
So a ReLU transition would be given by a neuron switching from off to on (or vice versa), and for hard tanh by switching between saturation at −1, its linear middle region, and saturation at +1. For any generic trajectory x(t), we can thus define T(F_W; x(t)) to be the number of transitions undergone by output neurons (i.e. the number of linear regions) as we sweep the input along x(t). Instead of just concentrating on the output neurons, however, we can look at this pattern over the entire network. We call this an activation pattern:
Definition We can define AP(F_W; x) to be the activation pattern – a string of the form {0, 1}^(num neurons) (for ReLUs) or {−1, 0, 1}^(num neurons) (for hard tanh) – encoding the linear region of the activation function of every neuron in the network, for an input x and weights W.
Overloading notation slightly, we can also define A(F_W; x(t)) (similarly to transitions) as the number of distinct activation patterns as we sweep x along x(t). As each distinct activation pattern corresponds to a different linear function of the input, this combinatorial measure captures how much more expressive F_W is over a simple linear mapping.
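As an illustration (a sketch, not the paper's code), both measures can be estimated numerically for a small randomly initialized ReLU network by discretizing a line between two inputs; the widths, initialization scales, and endpoint inputs below are arbitrary choices:

```python
import numpy as np

def count_patterns(widths, x0, x1, n_points=5000, seed=0):
    """Count distinct activation patterns and transitions of a random ReLU
    network along the line x(t) = (1 - t) * x0 + t * x1."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0, 1.0 / np.sqrt(widths[i]), (widths[i], widths[i + 1]))
          for i in range(len(widths) - 1)]
    bs = [rng.normal(0, 0.1, widths[i + 1]) for i in range(len(widths) - 1)]
    patterns, transitions, prev = set(), 0, None
    for t in np.linspace(0.0, 1.0, n_points):
        h, bits = (1 - t) * x0 + t * x1, []
        for W, b in zip(Ws, bs):
            pre = h @ W + b
            bits.append(pre > 0)          # linear region of each ReLU
            h = np.maximum(pre, 0.0)
        p = tuple(np.concatenate(bits))
        patterns.add(p)
        transitions += int(prev is not None and p != prev)
        prev = p
    return len(patterns), transitions

x0, x1 = np.ones(4), -np.ones(4)
print(count_patterns([4, 16, 16, 16], x0, x1))
```

Along a straight line the sampled distinct-pattern count exceeds the transition count by exactly one, since (as shown in the Appendix) patterns are never revisited.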
Returning to Montufar et al., they provide a construction, i.e. a specific set of weights W_0, that results in an exponential increase of linear regions with the depth of the architecture. They also appeal to Zaslavsky’s theorem (Stanley, 2011) from the theory of hyperplane arrangements to show that a shallow network, i.e. one with a single hidden layer and the same number of parameters as a deep network, has a much smaller number of linear regions than the number achieved by their choice of weights for the deep network.
More formally, letting F_shallow be a fully connected network with one hidden layer, and F_deep a fully connected network with the same number of parameters but n hidden layers, they show that the maximal number of linear regions achievable by F_shallow is exponentially smaller than the number achieved by their construction for F_deep.
We derive a much more general result by considering the ‘global’ activation patterns over the entire input space, and prove that for any fully connected network, with any number of hidden layers, we can upper bound the number of linear regions it can achieve over all possible weight settings W. This upper bound is asymptotically tight, matched by the construction given in (Montufar et al., 2014). Our result can be written formally as:
(Tight) Upper Bound for Number of Activation Patterns Let F_W denote a fully connected network with n hidden layers of width k, and inputs in R^m. Then the number of activation patterns is upper bounded by O(k^(mn)) for ReLU activations, and O((2k)^(mn)) for hard tanh.
From this we can derive a chain of inequalities. Firstly, from the theorem above we find an upper bound over all weight settings W, i.e.

(*) the number of activation patterns is at most O(k^(mn)).

Next, suppose we have N neurons in total. Then we want to compare (for wlog ReLUs) quantities like O((N/n)^(mn)) for different depths n, where k = N/n is the width. But (N/x)^(mx) is increasing in x for x < N/e, and so, noting that the maximum of (N/x)^(mx) (for 0 < x ≤ N) is e^(mN/e), we get, for depths n_1 < n_2 ≤ N/e,

O((N/n_1)^(m n_1)) ≤ O((N/n_2)^(m n_2)) ≤ O(e^(mN/e))

in comparison to (*): for a fixed budget of neurons, the upper bound grows with depth.
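As a quick numeric illustration of this comparison (with an assumed budget of N = 60 neurons and input dimension m = 2; the values are arbitrary), we can evaluate the exponent of the bound in log-space for several depths:

```python
import math

# Fixed budget of N neurons split into n layers of width k = N / n;
# the upper bound scales as k^(m*n) = (N/n)^(m*n).
N, m = 60, 2   # assumed illustrative values: 60 neurons, 2 input dimensions
for n in (1, 2, 3, 5, 10, 20):
    log10_bound = m * n * math.log10(N / n)
    print(f"depth {n:2d}, width {N / n:5.1f}: bound ~ 10^{log10_bound:.1f}")
```

For these depths (all below N/e ≈ 22) the bound is strictly increasing in depth, illustrating the chain of inequalities above.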
We prove this via an inductive proof on regions in a hyperplane arrangement. The proof can be found in the Appendix. As noted in the introduction, this result differs from earlier lower-bound constructions in that it is an upper bound that applies to all possible sets of weights. Via our analysis, we also prove
Regions in Input Space Given the corresponding function F_W of a neural network with ReLU or hard tanh activations, the input space is partitioned into convex polytopes, with F_W corresponding to a different linear function on each region.
This result is of independent interest for optimization – a linear function over a convex polytope results in a well behaved loss function and an easy optimization problem. Thus better understanding the density of these regions during the training process would likely shed light on properties of the loss surface and suggest improved optimization methods.
A picture of this, computed from a real network, is depicted in Figure 1.
2.1.1 Empirically Counting Transitions
We empirically tested the growth of the number of activation patterns and transitions as we varied the input along x(t) on real networks, to understand their behavior. We found that for bounded nonlinearities, especially tanh and hard-tanh, not only do we observe exponential growth with depth (as hinted at by the upper bound) but that the scale of parameter initialization also affects the observations (Figure 2).
(Figure caption) The number of transitions over input vectors achieved by sweeping the first layer weights of a hard-tanh network along a one-dimensional great circle trajectory, shown (a) as a function of depth for several widths, and (b) as a function of width for several depths. All networks were generated with fixed weight variance and bias variance.
We also experimented with sweeping the weights of a layer through a trajectory W(t), and counting the different labellings output by the network. This ‘dichotomies’ measure is discussed further in the Appendix, and also exhibits the same exponential growth with network depth, but not with width (Figure 3).
2.2 Trajectory Length
In fact, there turns out to be a reason for the exponential growth with depth, and the sensitivity to initialization scale. Returning to our definition of trajectory, we can define an immediately related quantity, trajectory length:
Definition: Given a trajectory x(t), we define its length l(x(t)) to be the standard arc length:

l(x(t)) = ∫_t ‖dx(t)/dt‖ dt

Intuitively, the arc length breaks x(t) up into infinitesimal intervals and sums together the Euclidean length of these intervals.
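The arc length of the image of a trajectory at each layer can be estimated numerically by finite differences. The sketch below (not the paper's code) uses a random hard-tanh network with assumed width, depth, and initialization scales (σ_w = 4, σ_b = 0.5), chosen to sit in the large-weight regime:

```python
import numpy as np

def layer_trajectory_lengths(widths, sigma_w=4.0, sigma_b=0.5, n_points=5000, seed=0):
    """Estimate the arc length of the image, at each hidden layer of a random
    hard-tanh network, of a circular-arc trajectory in input space."""
    rng = np.random.default_rng(seed)
    m = widths[0]
    x0, x1 = rng.normal(size=m), rng.normal(size=m)
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    X = np.cos(0.5 * np.pi * t) * x0 + np.sin(0.5 * np.pi * t) * x1  # arc in R^m
    lengths, H = [], X
    for i in range(len(widths) - 1):
        W = rng.normal(0, sigma_w / np.sqrt(widths[i]), (widths[i], widths[i + 1]))
        b = rng.normal(0, sigma_b, widths[i + 1])
        H = np.clip(H @ W + b, -1.0, 1.0)  # hard tanh
        # arc length ~ sum of distances between consecutive points of the image
        lengths.append(float(np.sum(np.linalg.norm(np.diff(H, axis=0), axis=1))))
    return lengths

print(layer_trajectory_lengths([10, 50, 50, 50, 50]))
```

With these assumed scales the per-layer lengths grow with depth, consistent with the exponential growth discussed next.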
If we let F_W denote, as before, a fully connected network with n hidden layers each of width k, initialized with weights drawn from N(0, σ_w²/k) (accounting for input scaling, as is typical) and biases drawn from N(0, σ_b²), we find that:
Bound on Growth of Trajectory Length Let F_W be a ReLU or hard tanh random neural network and x(t) a one dimensional trajectory with x(t + δ) having a non-trivial perpendicular component to x(t) for all t, δ (i.e., not a line). Then, defining z^(d)(x(t)) = z^(d)(t) to be the image of the trajectory in layer d of the network, we have

E[l(z^(d)(t))] ≥ O(σ_w √k / √(k + 1))^d · l(x(t)) for ReLU, and

E[l(z^(d)(t))] ≥ O(σ_w √k / √(σ_w² + σ_b² + k √(σ_w² + σ_b²)))^d · l(x(t)) for hard tanh.

That is, E[l(z^(d)(t))] grows exponentially with the depth of the network, but the width k only appears as a base (of the exponent). This bound is in fact tight in the limits of large σ_w and k.
A schematic image depicting this can be seen in Figure 4 and the proof can be found in the Appendix. A rough outline is as follows: we look at the expected growth of the difference between a point z^(d)(t) on the curve and a small perturbation z^(d)(t + dt), from layer d to layer d + 1. Denoting this quantity ‖δz^(d)(t)‖, we derive a recurrence relating ‖δz^(d+1)(t)‖ and ‖δz^(d)(t)‖ which can be composed to give the desired growth rate.
The analysis is complicated by the statistical dependence on the image of the input z^(d)(t). So we instead form a recursion by looking at the component of the difference perpendicular to the image of the input in that layer, i.e. ‖δz_⊥^(d)(t)‖, which results in the condition on x(t) in the statement.
In Figures 5 and 6, we see the growth of an input trajectory for ReLU networks on CIFAR-10 and MNIST. The CIFAR-10 network is convolutional, but we observe that these layers also result in similar rates of trajectory length increase to the fully connected layers. We also see, as would be expected, that pooling layers act to reduce the trajectory length.
In Figure 7 we plot the lower and upper (see Appendix) bounds for trajectory growth with a hard tanh network.
For the hard tanh case (and more generally any bounded non-linearity), we can formally prove the relation of trajectory length and transitions under an assumption: assume that while we sweep x(t), all neurons are saturated unless transitioning between saturation endpoints, which happens very rapidly. (This is the case for e.g. large initialization scales.) Then we have:
Transitions proportional to trajectory length Let F_W be a hard tanh network with n hidden layers each of width k, and let

g(k, σ_w, σ_b, n) = O(σ_w √k / √(σ_w² + σ_b²))^n

Then T(F_W; x(t)) = O(g(k, σ_w, σ_b, n)) for W initialized with weight and bias scales σ_w, σ_b.
Note that the expression for g(k, σ_w, σ_b, n) is exactly the expression given by Theorem 3 when σ_w is very large and dominates σ_b. We can also verify this experimentally in settings where the simplifying assumption does not hold. Figure 8 shows a direct proportionality between trajectory length and transitions, for many scale choices.
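This proportionality can be probed numerically. The sketch below (an assumed small architecture and arbitrary scales, not the paper's exact experiment) counts saturation-state transitions of every neuron along a straight line and compares the count against the arc length of the final layer's image:

```python
import numpy as np

def transitions_and_length(sigma_w, k=50, depth=3, m=10, n_points=20000, seed=1):
    """For a random hard-tanh net, count neuron transitions along a line in
    input space and the arc length of the final layer's image."""
    rng = np.random.default_rng(seed)
    x0, x1 = rng.normal(size=m), rng.normal(size=m)
    t = np.linspace(0.0, 1.0, n_points)[:, None]
    H = (1 - t) * x0 + t * x1
    widths = [m] + [k] * depth
    transitions = 0
    for i in range(depth):
        W = rng.normal(0, sigma_w / np.sqrt(widths[i]), (widths[i], widths[i + 1]))
        b = rng.normal(0, 0.5, widths[i + 1])
        pre = H @ W + b
        region = (pre > 1).astype(int) - (pre < -1).astype(int)  # saturation state
        transitions += int(np.sum(region[1:] != region[:-1]))
        H = np.clip(pre, -1.0, 1.0)
    length = float(np.sum(np.linalg.norm(np.diff(H, axis=0), axis=1)))
    return transitions, length

for sw in (2.0, 4.0, 8.0):
    T, L = transitions_and_length(sw)
    print(f"sigma_w={sw}: transitions={T}, length={L:.1f}, ratio={T / L:.2f}")
```

Printing the ratio for several scale choices gives a rough empirical analogue of Figure 8.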
3 Trained Networks
The analysis so far offers a surprising takeaway for trained networks, which we summarize as all weights are not equal (initial layers matter more). In particular, we find that trained networks are most sensitive to changes in the parameters of their initial layers, as is predicted by the trajectory length growth.
From the proof of Theorem 3, we saw that a perturbation to the input would grow exponentially in the depth of the network, for sufficiently large σ_w (often well within the standard initialization scale). It is easy to see that this analysis is not limited to the input layer, but can be applied to any layer. In this form, it would say:
A perturbation at a layer grows exponentially in the remaining depth after that layer.
This means that perturbations to weights in lower layers are far more costly than perturbations in the upper layers, due to the exponentially increasing magnitude of the noise, and should result in a severe drop in accuracy. Figure 9, in which we train a network on CIFAR-10 and add noise of varying magnitudes to exactly one layer, shows exactly this.
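The same layer-dependent sensitivity can be sketched without training at all, on a random hard-tanh network (all widths, scales, and the noise level below are assumed illustrative values, not the paper's trained CIFAR-10 setup): noise of a fixed scale is added to exactly one layer's weights, and the resulting change in the output is measured.

```python
import numpy as np

def mean_output_change(noisy_layer, eps=0.05, widths=(20, 100, 100, 100, 100, 10),
                       sigma_w=4.0, n_inputs=200, seed=0):
    """Add Gaussian noise of scale eps to the weights of exactly one layer of a
    random hard-tanh network and measure the mean change in the output."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0, sigma_w / np.sqrt(widths[i]), (widths[i], widths[i + 1]))
          for i in range(len(widths) - 1)]
    bs = [rng.normal(0, 0.5, widths[i + 1]) for i in range(len(widths) - 1)]
    X = rng.normal(size=(n_inputs, widths[0]))

    def forward(weights):
        H = X
        for i, (W, b) in enumerate(zip(weights, bs)):
            H = H @ W + b
            if i < len(weights) - 1:
                H = np.clip(H, -1.0, 1.0)  # hard tanh on hidden layers
        return H

    noisy = list(Ws)
    noisy[noisy_layer] = Ws[noisy_layer] + rng.normal(0, eps, Ws[noisy_layer].shape)
    return float(np.mean(np.linalg.norm(forward(noisy) - forward(Ws), axis=1)))

changes = [mean_output_change(layer) for layer in range(5)]
print(changes)
```

Noise in the earliest layer is amplified through the remaining depth, so it should produce a larger output change than the same noise in the readout layer, mirroring the accuracy drops of Figure 9.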
Even more surprisingly, the network in Figure 9 is initialized with a weight scale that is not in an exponential growth regime for this CIFAR-10 architecture, as shown by Figure 5. It turns out that when the network is initialized with lower weights, training tends to increase the variance of the weights (Figure 10), pushing the network into the exponential growth regime. Further evidence of this is seen in Figure 11, where we measure the change in trajectory length through training for a smaller initialization, and find that trajectory length increases during training.
As the initial layers are crucial to the accuracy of the trained network, we also experimented to see how well the network could perform relying on just these layers. In particular, we tried initializing a network and then training only a single layer, at different depths in the network. For simpler tasks, e.g. MNIST, we found that both training and generalization performance increased monotonically as the trained layer was chosen closer to the input (Figure 12). This was also mostly the case for CIFAR-10 (Figure 13), but performance was impacted more here.
4 Conclusion

Characterizing the expressiveness of neural networks, and understanding how expressiveness varies with parameters of the architecture, has been a challenging problem due to the difficulty in identifying meaningful notions of expressivity and in linking their analysis to implications for these networks in practice. In this paper we have presented an interrelated set of expressivity measures; we have shown tight exponential bounds on the growth of these measures in the depth of the networks, and we have offered a unifying view of the analysis through the notion of trajectory length. Our analysis of trajectories provides insights for the performance of trained networks as well, suggesting that networks in practice may be more sensitive to small perturbations in weights at lower layers; these insights are borne out by computational experiments.
This work raises many interesting directions for further work. At a general level, it would be interesting to link measures of expressivity to further properties of neural network performance. There is also a natural connection between the effects of small perturbations to weights in lower layers, noted in the paper, and the active line of work on adversarial examples, which consists of small perturbations to values at the inputs (Goodfellow et al., 2014). Finally, our finding that trajectories tend to increase during neural network training could potentially be used as part of an analysis of the success of batch normalization as a regularizer (Ioffe and Szegedy, 2015).
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Baldi et al.  Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5, 2014.
- Silver et al.  David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
- Piech et al.  Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
- Hornik et al.  Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Cybenko  George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.
- Maass et al.  Wolfgang Maass, Georg Schnitger, and Eduardo D Sontag. A comparison of the computational power of sigmoid and Boolean threshold circuits. Springer, 1994.
- Bartlett et al.  Peter L Bartlett, Vitaly Maiorov, and Ron Meir. Almost linear vc-dimension bounds for piecewise polynomial networks. Neural computation, 10(8):2159–2173, 1998.
- Pascanu et al.  Razvan Pascanu, Guido Montufar, and Yoshua Bengio. On the number of response regions of deep feed forward networks with piece-wise linear activations. arXiv preprint arXiv:1312.6098, 2013.
- Montufar et al.  Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
- Eldan and Shamir  Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
- Telgarsky  Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.
- Martens et al.  James Martens, Arkadev Chattopadhya, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877–2885, 2013.
- Bianchini and Scarselli  Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
- Stanley  Richard Stanley. Hyperplane arrangements. Enumerative Combinatorics, 2011.
- Goodfellow et al.  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.
- Kershaw  D. Kershaw. Some extensions of W. Gautschi’s inequalities for the gamma function. Mathematics of Computation, 41(164):607–611, 1983.
- Laforgia and Natalini  Andrea Laforgia and Pierpaolo Natalini. On some inequalities for the gamma function. Advances in Dynamical Systems and Applications, 8(2):261–267, 2013.
- Sauer  Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13(1):145–147, 1972.
Appendix A Proofs and additional results from Section 2.1
Proof of Theorem 2
We show inductively that F_W partitions the input space into convex polytopes via hyperplanes. Consider the image of the input space under the first hidden layer. Each neuron defines hyperplane(s) on the input space: letting W_i^(0) be the ith row of W^(0) and b_i^(0) the bias, we have the hyperplane W_i^(0) x + b_i^(0) = 0 for a ReLU, and the hyperplanes W_i^(0) x + b_i^(0) = ±1 for a hard-tanh. Considering all such hyperplanes over the neurons in the first layer, we get a hyperplane arrangement in the input space, each polytope corresponding to a specific activation pattern in the first hidden layer.
Now, assume we have partitioned our input space into convex polytopes with hyperplanes from layers up to d − 1. Consider layer d and a specific polytope P. Then the activation pattern on layers up to d − 1 is constant on P, and so the input to layer d on P is a linear function of the inputs plus some constant term, comprising the bias and the output of saturated units. Setting this expression to zero (for ReLUs) or to ±1 (for hard-tanh) again gives a hyperplane equation, but this time the equation is only valid in P (as we get a different linear function of the inputs in a different region). So the defined hyperplane(s) either partition P (if they intersect P) or the activation pattern of layer d is also constant on P. The theorem then follows. ∎
This implies that any one dimensional trajectory x(t) that does not ‘double back’ on itself (i.e. reenter a polytope it has previously passed through) will not repeat activation patterns. In particular, after seeing a transition (crossing a hyperplane to a different region in input space) we will never return to the region we left. A simple example of such a trajectory is a straight line:
Transitions and Output Patterns in an Affine Trajectory For any affine one dimensional trajectory x(t) = x_0 + t(x_1 − x_0) input into a neural network F_W, we partition R ∋ t into intervals every time a neuron transitions. Every interval has a unique network activation pattern on F_W.
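A small numerical check of this corollary (with an arbitrary random ReLU network and an arbitrary line, chosen for illustration): collapsing consecutive duplicates in the sampled pattern sequence should leave only distinct patterns, i.e. no pattern reappears after being left.

```python
import numpy as np

def patterns_along_line(widths=(3, 10, 10), n_points=5000, seed=2):
    """List the activation patterns of a random ReLU net along a straight line,
    to check that a pattern never reappears after being left."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(0, 1.0 / np.sqrt(widths[i]), (widths[i], widths[i + 1]))
          for i in range(len(widths) - 1)]
    bs = [rng.normal(0, 0.3, widths[i + 1]) for i in range(len(widths) - 1)]
    x0, x1 = rng.normal(size=widths[0]), rng.normal(size=widths[0])
    seq = []
    for t in np.linspace(-2.0, 3.0, n_points):  # extend beyond [0, 1]
        h, bits = (1 - t) * x0 + t * x1, []
        for W, b in zip(Ws, bs):
            pre = h @ W + b
            bits.append(pre > 0)
            h = np.maximum(pre, 0.0)
        seq.append(tuple(np.concatenate(bits)))
    # collapse consecutive duplicates; remaining patterns should all be distinct
    collapsed = [p for i, p in enumerate(seq) if i == 0 or p != seq[i - 1]]
    return len(collapsed), len(set(collapsed))

print(patterns_along_line())
```

The two counts agree, as the corollary predicts: each interval of t carries its own activation pattern.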
Generalizing from a one dimensional trajectory, we can ask how many regions are achieved over the entire input space – i.e. how many distinct activation patterns are seen? We first prove a bound on the number of regions formed by n hyperplanes in R^d (in a purely elementary fashion, unlike the proof presented in (Stanley, 2011)).
Upper Bound on Regions in a Hyperplane Arrangement Suppose we have n hyperplanes in R^d – i.e. equations of the form a_i · x = b_i, for a_i ∈ R^d, b_i ∈ R. Let the number of regions (connected open sets bounded on some sides by the hyperplanes) be r(n, d). Then

r(n, d) ≤ Σ_{i=0}^{d} C(n, i)
Proof of Theorem 5
Let the hyperplane arrangement be denoted H, and let h ∈ H be one specific hyperplane. Then the number of regions in H is precisely the number of regions in H − h plus the number of regions in the restricted arrangement H ∩ h. (This follows from the fact that h subdivides into two regions exactly all of the regions in H ∩ h, and does not affect any of the other regions.)
In particular, we have the recursive formula

r(n, d) = r(n − 1, d) + r(n − 1, d − 1)

We now induct on n + d to assert the claim. The base cases n = 0 or d = 0 are trivial, and assuming the claim for n − 1 as the induction hypothesis, we have

r(n, d) ≤ Σ_{i=0}^{d} C(n − 1, i) + Σ_{i=0}^{d−1} C(n − 1, i) = C(n − 1, 0) + Σ_{i=1}^{d} (C(n − 1, i) + C(n − 1, i − 1)) = Σ_{i=0}^{d} C(n, i)

where the last equality follows by the well known identity C(n − 1, i) + C(n − 1, i − 1) = C(n, i).
This concludes the proof. ∎
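The recurrence and the closed-form bound can be checked against each other directly; a minimal sketch:

```python
from math import comb
from functools import lru_cache

@lru_cache(None)
def r(n, d):
    """Maximal number of regions of n hyperplanes in R^d, via the
    recurrence r(n, d) = r(n-1, d) + r(n-1, d-1)."""
    if n == 0 or d == 0:
        return 1
    return r(n - 1, d) + r(n - 1, d - 1)

# matches the closed form sum_{i=0}^{d} C(n, i)
for n in range(8):
    for d in range(5):
        assert r(n, d) == sum(comb(n, i) for i in range(d + 1))

print(r(10, 3))  # 10 hyperplanes in R^3 -> 176 regions
```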
With this result, we can easily prove Theorem 1 as follows:
Proof of Theorem 1
First consider the ReLU case. Each neuron has one hyperplane associated with it, and so by Theorem 5, the first hidden layer divides up the input space into r(k, m) regions, with r(k, m) ≤ O(k^m).

Now consider the second hidden layer. For every region in the first hidden layer, there is a different activation pattern in the first layer, and so (as described in the proof of Theorem 2) a different hyperplane arrangement of k hyperplanes in the m dimensional input space, contributing at most O(k^m) further regions.

In particular, the total number of regions in input space as a result of the first and second hidden layers is at most O(k^m) · O(k^m) = O(k^(2m)). Continuing in this way for each of the n hidden layers gives the O(k^(mn)) bound.

A very similar method works for hard tanh, but here each neuron produces two hyperplanes, resulting in a bound of O((2k)^(mn)).
Appendix B Proofs and additional results from Section 2.2
Proof of Theorem 3
B.1 Notation and Preliminary Results
Difference of points on trajectory Given in the trajectory, let
Parallel and Perpendicular Components: Given vectors , we can write where is the component of perpendicular to , and is the component parallel to . (Strictly speaking, these components should also have a subscript , but we suppress it as the direction with respect to which parallel and perpendicular components are being taken will be explicitly stated.)
This notation can also be used with a matrix , see Lemma 1.
Before stating and proving the main theorem, we need a few preliminary results.
Matrix Decomposition Let be fixed non-zero vectors, and let be a (full rank) matrix. Then, we can write
i.e. the row space of is decomposed to perpendicular and parallel components with respect to (subscript on right), and the column space is decomposed to perpendicular and parallel components of (superscript on left).
Let be rotations such that and . Now let , and let , with having non-zero term exactly , having non-zero entries exactly for . Finally, we let have non-zero entries exactly , with and have the remaining entries non-zero.
If we define and , then we see that
as have only one non-zero term, which does not correspond to a non-zero term in the components of in the equations.
Then, defining , and the other components analogously, we get equations of the form
Given as before, and considering , with respect to (wlog a unit vector) we can express them directly in terms of as follows: Letting be the th row of , we have
i.e. the projection of each row in the direction of . And of course
The motivation to consider such a decomposition of is for the resulting independence between different components, as shown in the following lemma.
There are two possible proof methods:
We use the rotational invariance of random Gaussian matrices, i.e. if W is a Gaussian matrix with iid entries N(0, σ²), and U is a rotation, then UW is also iid Gaussian with entries N(0, σ²). (This follows easily from affine transformation rules for multivariate Gaussians.)
Let be a rotation as in Lemma 1. Then is also iid Gaussian, and furthermore, and partition the entries of , so are evidently independent. But then and are also independent.
From the observation, note that the two components have a centered multivariate joint Gaussian distribution (both consist of linear combinations of the entries in W). So it suffices to show that the components have covariance 0. Because both are centered Gaussians, this is equivalent to showing that the expectation of their product is 0. We have that
As any two rows of are independent, we see from the observation that is a diagonal matrix, with the th diagonal entry just . But similarly, is also a diagonal matrix, with the same diagonal entries - so the claim follows.
In the following two lemmas, we use the rotational invariance of Gaussians as well as the chi distribution to prove results about the expected norm of a random Gaussian vector.
Norm of a Gaussian vector Let z ∈ R^d be a random Gaussian vector with iid entries z_i ∼ N(0, σ²). Then

E[‖z‖] = σ √2 · Γ((d + 1)/2) / Γ(d/2)

We use the fact that if z is a random Gaussian vector as above, then ‖z‖/σ follows a chi distribution with d degrees of freedom. This means that E[‖z‖/σ] = √2 · Γ((d + 1)/2) / Γ(d/2), the mean of a chi distribution with d degrees of freedom, and the result follows by multiplying through by σ. ∎
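This lemma is easy to sanity-check by Monte Carlo (the dimension and σ below are arbitrary):

```python
import numpy as np
from math import gamma, sqrt

d, sigma = 5, 2.0
rng = np.random.default_rng(0)
z = rng.normal(0, sigma, size=(200000, d))
mc = np.linalg.norm(z, axis=1).mean()                      # Monte Carlo estimate
exact = sigma * sqrt(2) * gamma((d + 1) / 2) / gamma(d / 2)  # chi-distribution mean
print(mc, exact)
```

The two values agree to within Monte Carlo error.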
We will find it useful to bound ratios of the Gamma function (as appear in Lemma 3) and so introduce the following inequality, from (Kershaw, 1983), which provides an extension of Gautschi’s Inequality.
An Extension of Gautschi’s Inequality For , we have
We now show:
Let be as in Lemma 1. As are rotations, is also iid Gaussian. Furthermore for any fixed , with , by taking inner products, and square-rooting, we see that . So in particular
But from the definition of non-zero entries of , and the form of (a zero entry in the first coordinate), it follows that has exactly non zero entries, each a centered Gaussian with variance . By Lemma 3, the expected norm is as in the statement. We then apply Theorem 6 to get the lower bound.
First note we can view . (Projecting down to a random (as is random) subspace of fixed size and then making perpendicular commutes with making perpendicular and then projecting everything down to the subspace.)
So we can view as a random by matrix, and for as in Lemma 1 (with projected down onto dimensions), we can again define as by and by rotation matrices respectively, and , with analogous properties to Lemma 1. Now we can finish as in part (a), except that may have only entries, (depending on whether is annihilated by projecting down by) each of variance .
Norm and Translation Let z be a centered multivariate Gaussian with diagonal covariance matrix, and x a constant vector. Then E[‖z − x‖] ≥ E[‖z‖].

The inequality can be seen intuitively, geometrically: the pdf of z is maximized at the origin and decreases radially, while the pdf of z − x is the same shape shifted to be centered away from the origin, and so shifting back to the origin reduces the expected norm.
A more formal proof can be seen as follows: let the pdf of z be f(z). Then we wish to show

∫ ‖z − x‖ f(z) dz ≥ ∫ ‖z‖ f(z) dz

Now we can pair points z, −z, using the fact that f(z) = f(−z) and the triangle inequality ‖z − x‖ + ‖−z − x‖ ≥ 2‖z‖ on the integrand to get the result.
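This lemma can likewise be checked by simulation (the diagonal covariance and shift x below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
cov = np.array([1.0, 0.5, 2.0, 0.25])  # diagonal covariance entries
z = rng.normal(0, np.sqrt(cov), size=(200000, d))
x = np.array([1.0, -2.0, 0.5, 3.0])    # an arbitrary constant shift
centered = np.linalg.norm(z, axis=1).mean()
shifted = np.linalg.norm(z - x, axis=1).mean()
print(centered, shifted)
```

As the lemma predicts, the expected norm of the shifted vector exceeds that of the centered one.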
B.2 Proof of Theorem 3
We use z_i^(d) to denote the ith neuron in hidden layer d. We also let x be an input, h^(d) be the hidden representation at layer d, and φ the non-linearity. The weights and bias are called W^(d) and b^(d) respectively. So we have the relations h^(d) = W^(d) z^(d−1) + b^(d), z^(d) = φ(h^(d)).
We first prove the zero bias case. To do so, it is sufficient to prove that
as integrating over t gives us the statement of the theorem.
For ease of notation, we will suppress the t in z^(d)(t).
We first write
where the division is done with respect to z^(d), as the other component annihilates (maps to 0).
We can also define i.e. the set of indices for which the hidden representation is not saturated. Letting denote the th row of matrix , we now claim that:
Indeed, by Lemma 2 we first split the expectation over into a tower of expectations over the two independent parts of to get
But conditioning on in the inner expectation gives us and , allowing us to replace the norm over with the sum in the term on the right hand side of the claim.
Till now, we have mostly focused on partitioning the matrix . But we can also set where the perpendicular and parallel are with respect to . In fact, to get the expression in (**), we derive a recurrence as below:
To get this, we first need to define ẑ^(d) – the latent vector z^(d) with all saturated units zeroed out.
We then split the column space of , where the split is with respect to . Letting be the part perpendicular to , and the set of units that are unsaturated, we have an important relation:
(where the indicator in the right hand side zeros out coordinates not in the active set.)
To see this, first note, by definition,
where the hat indicates a unit vector.
Now note that for any index , the right hand sides of (1) and (2) are identical, and so the vectors on the left hand side agree for all . In particular,
Now the claim follows easily by noting that .
Returning to (*), we split , (and analogously), and after some cancellation, we have
We would like a recurrence in terms of only perpendicular components however, so we first drop the (which can be done without decreasing the norm as they are perpendicular to the remaining terms) and using the above claim, have
But in the inner expectation, the term is just a constant, as we are conditioning on . So using Lemma 5 we have
We can then apply Lemma 4 to get
The outer expectation on the right hand side only affects the term in the expectation through the size of the active set of units. For ReLUs, and for hard tanh, we have , and noting that we get a non-zero norm only if (else we cannot project down a dimension), and for ,