1 Introduction
Deep feedforward neural networks, with multiple hidden layers, have achieved remarkable performance across many domains Krizhevsky et al. (2012); Mnih et al. (2013); Hannun et al. (2014); Piech et al. (2015). A key factor thought to underlie their success is their high expressivity. This informal notion has manifested itself primarily in two forms of intuition. The first is that deep networks can compactly express highly complex functions over input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. The second piece of intuition, which has captured the imagination of machine learning Bengio et al. (2013) and neuroscience DiCarlo and Cox (2007) alike, is that deep neural networks can disentangle highly curved manifolds in input space into flattened manifolds in hidden space, to aid the performance of simple linear readouts. These intuitions, while attractive, have been difficult to formalize mathematically, and thereby rigorously test.

†Code to reproduce all results available at: https://github.com/gangulilab/deepchaos

For the first intuition, seminal works have exhibited examples of particular functions that can be computed with a polynomial number of neurons (in the input dimension) in a deep network but require an exponential number of neurons in a shallow network Montufar et al. (2014); Delalleau and Bengio (2011); Eldan and Shamir (2015); Telgarsky (2015); Martens et al. (2013). This raises a central open question: are such functions merely rare curiosities, or is any function computed by a generic deep network not efficiently computable by a shallow network? The theoretical techniques employed in prior work both limited the applicability of theory to specific nonlinearities and dictated the particular measure of deep functional complexity involved. For example Montufar et al. (2014) focused on ReLU nonlinearities and number of linear regions as a complexity measure, while Delalleau and Bengio (2011) focused on sum-product networks and the number of monomials as complexity measure, and Bianchini and Scarselli (2014) focused on Pfaffian nonlinearities and topological measures of complexity, like the sum of Betti numbers of a decision boundary. However, see Mhaskar et al. (2016) for an interesting analysis of a general class of compositional functions. The limits of prior theoretical techniques raise another central question: is there a unifying theoretical framework for deep neural expressivity that is simultaneously applicable to arbitrary nonlinearities, generic networks, and a natural, general measure of functional complexity?

Here we attack both central problems of deep neural expressivity by combining a very different set of tools, namely Riemannian geometry Lee (2006) and dynamical mean field theory Sompolinsky et al. (1988). This novel combination enables us to show that for very broad classes of nonlinearities, even random deep neural networks can construct hidden internal representations whose global extrinsic curvature grows exponentially with depth but not width. Our geometric framework enables us to quantitatively define a notion of disentangling and verify this notion even in deep random networks. Furthermore, our methods yield insights into the emergent, deterministic nature of signal propagation through large random feedforward networks, revealing the existence of an order to chaos transition as a function of the statistics of weights and biases. We find that the transient, finite depth evolution in the chaotic regime underlies the origins of exponential expressivity in deep random networks.

In our companion paper Raghu et al. (2016), we study several related measures of expressivity in deep random neural networks with piecewise linear activations.
2 A mean field theory of deep nonlinear signal propagation
Consider a deep feedforward network with $D$ layers of weights $W^1, \dots, W^D$ and $D+1$ layers of neural activity vectors $x^0, \dots, x^D$, with $N_l$ neurons in each layer $l$, so that $x^l \in \mathbb{R}^{N_l}$ and $W^l$ is an $N_l \times N_{l-1}$ weight matrix. The feedforward dynamics elicited by an input $x^0$ is given by

$$x^l = \phi(h^l), \qquad h^l = W^l x^{l-1} + b^l \qquad \text{for } l = 1, \dots, D, \tag{1}$$

where $b^l$ is a vector of biases, $h^l$ is the pattern of inputs to neurons at layer $l$, and $\phi$ is a single neuron scalar nonlinearity that acts componentwise to transform inputs $h^l$ to activities $x^l$. We wish to understand the nature of typical functions computable by such networks, as a consequence of their depth. We therefore study ensembles of random networks in which each of the synaptic weights $W^l_{ij}$ are drawn i.i.d. from a zero mean Gaussian with variance $\sigma_w^2 / N_{l-1}$, while the biases are drawn i.i.d. from a zero mean Gaussian with variance $\sigma_b^2$. This weight scaling ensures that the input contribution to each individual neuron at layer $l$ from activities in layer $l-1$ remains $O(1)$, independent of the layer width $N_{l-1}$. This ensemble constitutes a maximum entropy distribution over deep neural networks, subject to constraints on the means and variances of weights and biases. This ensemble induces no further structure in the resulting set of deep functions, so its analysis provides an opportunity to understand the specific contribution of depth alone to the nature of typical functions computed by deep networks.

In the limit of large layer widths, $N_l \gg 1$, certain aspects of signal propagation through deep random neural networks take on an essentially deterministic character. This emergent determinism in large random neural networks enables us to understand how the Riemannian geometry of simple manifolds in the input layer is typically modified as the manifold propagates into the deep layers. For example, consider the simplest case of a single input vector $x^0$. As it propagates through the network, its length in downstream layers will change. We track this changing length by computing the normalized squared length of the input vector at each layer:

$$q^l = \frac{1}{N_l} \sum_{i=1}^{N_l} (h^l_i)^2. \tag{2}$$
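This length is the second moment of the empirical distribution of inputs $h^l_i$ across all $N_l$ neurons in layer $l$. For large $N_l$, this empirical distribution converges to a zero mean Gaussian since each $h^l_i = \sum_j W^l_{ij}\, \phi(h^{l-1}_j) + b^l_i$ is a weighted sum of a large number of uncorrelated random variables, i.e. the weights $W^l_{ij}$ and biases $b^l_i$, which are independent of the activity in previous layers. By propagating this Gaussian distribution across one layer, we obtain an iterative map for $q^l$ in (2):

$$q^l = \mathcal{V}(q^{l-1} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z \, \phi\!\left(\sqrt{q^{l-1}}\, z\right)^2 + \sigma_b^2, \qquad \text{for } l = 2, \dots, D, \tag{3}$$

where $\mathcal{D}z = \frac{dz}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard Gaussian measure, and the initial condition is $q^1 = \sigma_w^2 q^0 + \sigma_b^2$, where $q^0 = \frac{1}{N_0} x^0 \cdot x^0$ is the length in the initial activity layer. See Supplementary Material (SM) for a derivation of (3). Intuitively, the integral over $z$ in (3) replaces an average over the empirical distribution of $h^l_i$ across neurons $i$ in layer $l$ at large layer width $N_l$.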
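The length map in (3) is a one dimensional Gaussian integral, so it can be evaluated numerically. The following is a minimal sketch, assuming a tanh nonlinearity and Gauss-Hermite quadrature; the function names `length_map` and `fixed_point_length`, the number of quadrature nodes, and the iteration count are illustrative choices, not part of the original analysis.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite nodes and weights for the standard Gaussian measure Dz.
# hermegauss uses the weight exp(-z^2 / 2), so we divide by sqrt(2 * pi).
Z, W = hermegauss(101)
W = W / np.sqrt(2.0 * np.pi)


def length_map(q, sigma_w, sigma_b, phi=np.tanh):
    """One application of the length map: sigma_w^2 * E[phi(sqrt(q) z)^2] + sigma_b^2."""
    return sigma_w**2 * np.sum(W * phi(np.sqrt(q) * Z) ** 2) + sigma_b**2


def fixed_point_length(sigma_w, sigma_b, q_init=1.0, n_iter=500):
    """Iterate the length map from q_init (an assumed starting value)."""
    q = q_init
    for _ in range(n_iter):
        q = length_map(q, sigma_w, sigma_b)
    return q
```

Iterating `length_map` from any positive starting length gives a numerical estimate of the fixed point of the map for a given pair of weight and bias standard deviations.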
The function $\mathcal{V}$ in (3) is an iterative variance, or length, map that predicts how the length of an input in (2) changes as it propagates through the network. This length map is plotted in Fig. 1A for the special case of a sigmoidal nonlinearity, $\phi(h) = \tanh(h)$. For monotonic nonlinearities, this length map is a monotonically increasing, concave function whose intersections with the unity line determine its fixed points $q^*(\sigma_w, \sigma_b)$. For $\sigma_b = 0$ and $\sigma_w < 1$, the only intersection is at $q^* = 0$. In this bias-free, small weight regime, the network shrinks all inputs to the origin. For $\sigma_w > 1$ and $\sigma_b = 0$, the $q^* = 0$ fixed point becomes unstable and the length map acquires a second nonzero fixed point, which is stable. In this bias-free, large weight regime, the network expands small inputs and contracts large inputs. Also, for any nonzero bias $\sigma_b$, the length map has a single stable nonzero fixed point. In such a regime, even with small weights, the injected biases at each layer prevent signals from decaying to $0$. The dynamics of the length map leads to rapid convergence of length to its fixed point with depth (Fig. 1B,D), often within only 4 layers. The fixed points $q^*(\sigma_w, \sigma_b)$ are shown in Fig. 1C.

3 Transient chaos in deep networks
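Now consider the layerwise propagation of two inputs $x^{0,1}$ and $x^{0,2}$. The geometry of these two inputs as they propagate through the network is captured by the $2$ by $2$ matrix of inner products:

$$q^l_{ab} = \frac{1}{N_l} \sum_{i=1}^{N_l} h^l_i(x^{0,a}) \, h^l_i(x^{0,b}), \qquad a, b \in \{1, 2\}. \tag{4}$$

The dynamics of the two diagonal terms are each theoretically predicted by the length map in (3). We derive (see SM) a correlation map $\mathcal{C}$ that predicts the layerwise dynamics of $q^l_{12}$:

$$q^l_{12} = \mathcal{C}(c^{l-1}_{12}, q^{l-1}_{11}, q^{l-1}_{22} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z_1 \, \mathcal{D}z_2 \, \phi(u_1) \, \phi(u_2) + \sigma_b^2, \tag{5}$$

$$u_1 = \sqrt{q^{l-1}_{11}} \, z_1, \qquad u_2 = \sqrt{q^{l-1}_{22}} \left[ c^{l-1}_{12} z_1 + \sqrt{1 - (c^{l-1}_{12})^2} \, z_2 \right],$$

where $c^l_{12} = q^l_{12} \, (q^l_{11} q^l_{22})^{-1/2}$ is the correlation coefficient. Here $z_1$ and $z_2$ are independent standard Gaussian variables, while $u_1$ and $u_2$ are correlated Gaussian variables with covariance matrix $\langle u_a u_b \rangle = q^{l-1}_{ab}$. Together, (3) and (5) constitute a theoretical prediction for the typical evolution of the geometry of $2$ points in (4) in a fixed large network.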
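The correlation map in (5) is a two dimensional Gaussian integral over independent variables, so a tensor product quadrature rule suffices. Below is a minimal sketch under the same assumptions as before (tanh nonlinearity, Gauss-Hermite quadrature); the name `correlation_map` and the grid size are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# 1D Gauss-Hermite rule for the standard Gaussian measure, then a 2D product rule.
Z, W = hermegauss(81)
W = W / np.sqrt(2.0 * np.pi)
Z1, Z2 = np.meshgrid(Z, Z, indexing="ij")
W12 = np.outer(W, W)


def correlation_map(c12, q11, q22, sigma_w, sigma_b, phi=np.tanh):
    """Covariance q12 at the next layer, given the correlation coefficient c12
    and the squared lengths q11, q22 at the current layer."""
    u1 = np.sqrt(q11) * Z1
    u2 = np.sqrt(q22) * (c12 * Z1 + np.sqrt(max(0.0, 1.0 - c12**2)) * Z2)
    return sigma_w**2 * np.sum(W12 * phi(u1) * phi(u2)) + sigma_b**2
```

For perfectly correlated inputs of equal length the two arguments of the nonlinearity coincide, so the correlation map reduces to the length map; for an odd nonlinearity, zero bias, and uncorrelated inputs it returns zero.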
Analysis of these equations reveals an interesting order to chaos transition in the $\sigma_w$ and $\sigma_b$ plane. In particular, what happens to two nearby points as they propagate through the layers? Their relation to each other can be tracked by the correlation coefficient $c^l_{12}$ between the two points, which approaches a fixed point $c^*(\sigma_w, \sigma_b)$ at large depth. Since the length of each point rapidly converges to $q^*(\sigma_w, \sigma_b)$, as shown in Fig. 1B,D, we can compute $c^*$ by simply setting $q^l_{11} = q^l_{22} = q^*(\sigma_w, \sigma_b)$ in (5) and dividing by $q^*$ to obtain an iterative correlation coefficient map, or $\mathcal{C}$-map, for $c^l_{12}$:

$$c^l_{12} = \frac{1}{q^*} \, \mathcal{C}(c^{l-1}_{12}, q^*, q^* \mid \sigma_w, \sigma_b). \tag{6}$$

This $\mathcal{C}$-map is shown in Fig. 2A. It always has a fixed point at $c^* = 1$ as can be checked by direct calculation. However, the stability of this fixed point depends on the slope of the map at $1$, which is

$$\chi_1 \equiv \left. \frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} \right|_{c = 1} = \sigma_w^2 \int \mathcal{D}z \left[ \phi'\!\left(\sqrt{q^*}\, z\right) \right]^2. \tag{7}$$

See SM for a derivation of (7). If the slope $\chi_1$ is less than $1$, then the $\mathcal{C}$-map is above the unity line, the fixed point at $1$ under the $\mathcal{C}$-map in (6) is stable, and nearby points become more similar over time. Conversely, if $\chi_1 > 1$ then this fixed point is unstable, and nearby points separate as they propagate through the layers. Thus we can intuitively understand $\chi_1$ as a multiplicative stretch factor. This intuition can be made precise by considering the Jacobian $J^l_{ij} = W^l_{ij} \, \phi'(h^{l-1}_j)$ at a point $h^{l-1}_j$ with length $q^*$. $J^l$ is a linear approximation of the network map from layer $l-1$ to $l$ in the vicinity of $h^{l-1}$. Therefore a small random perturbation $h^{l-1} + u$ will map to $h^l + J u$. The growth of the perturbation, $\|J u\|_2^2 / \|u\|_2^2$, becomes $\chi_1(q^*)$ after averaging over the random perturbation $u$, weight matrix $W^l$, and Gaussian distribution of $h^{l-1}_i$ across $i$. Thus $\chi_1$ directly reflects the typical multiplicative growth or shrinkage of a random perturbation across one layer.
The dynamics of the iterative $\mathcal{C}$-map and its agreement with network simulations is shown in Fig. 2B. The correlation dynamics are much slower than the length dynamics because the $\mathcal{C}$-map is closer to the unity line (Fig. 2A) than the length map (Fig. 1A). Thus correlations typically take about 20 layers to approach the fixed point, while lengths need only 4. The fixed point $c^*$ and slope $\chi_1$ of the $\mathcal{C}$-map are shown in Fig. 2C,D. For any fixed, finite $\sigma_b$, as $\sigma_w$ increases three qualitative regions occur. For small $\sigma_w$, $c^* = 1$ is the only fixed point, and it is stable because $\chi_1 < 1$. In this strong bias regime, any two input points converge to each other as they propagate through the network. As $\sigma_w$ increases, $\chi_1$ increases and crosses $1$, destabilizing the $c^* = 1$ fixed point. In this intermediate regime, a new stable fixed point $c^*$ appears, which decreases as $\sigma_w$ increases. Here an equal footing competition between weights and nonlinearities (which decorrelate inputs) and the biases (which correlate them) leads to a finite $c^*$. At larger $\sigma_w$, the strong weights overwhelm the biases and maximally decorrelate inputs to make them orthogonal, leading to a stable fixed point at $c^* = 0$.

Thus the equation $\chi_1(\sigma_w, \sigma_b) = 1$ yields a phase transition boundary in the $(\sigma_w, \sigma_b)$ plane, separating it into a chaotic (or ordered) phase, in which nearby points separate (or converge). In dynamical systems theory, the logarithm of $\chi_1$ is related to the well known Lyapunov exponent which is positive (or negative) for chaotic (or ordered) dynamics. However, in a feedforward network, the dynamics is truncated at a finite depth $D$, and hence the dynamics are a form of transient chaos.
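The ordered and chaotic phases can be told apart numerically by evaluating the stretch factor in (7) at the fixed point of the length map. The sketch below assumes a tanh nonlinearity; `q_star`, `chi_1`, and `phase` are hypothetical helper names, and the fixed point is found by plain iteration from an assumed starting value of 1.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, W = hermegauss(101)
W = W / np.sqrt(2.0 * np.pi)  # weights of the standard Gaussian measure


def q_star(sigma_w, sigma_b, n_iter=1000):
    """Fixed point of the length map for tanh, found by iteration."""
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w**2 * np.sum(W * np.tanh(np.sqrt(q) * Z) ** 2) + sigma_b**2
    return q


def chi_1(sigma_w, sigma_b):
    """Stretch factor: sigma_w^2 * E[phi'(sqrt(q*) z)^2] with phi = tanh."""
    q = q_star(sigma_w, sigma_b)
    dphi = 1.0 - np.tanh(np.sqrt(q) * Z) ** 2
    return sigma_w**2 * np.sum(W * dphi**2)


def phase(sigma_w, sigma_b):
    """Label a point in the (sigma_w, sigma_b) plane by the sign of chi_1 - 1."""
    return "chaotic" if chi_1(sigma_w, sigma_b) > 1.0 else "ordered"
```

Scanning `phase` over a grid of weight and bias standard deviations traces out an estimate of the boundary where the stretch factor equals one.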
4 The propagation of manifold geometry through deep networks
Now consider a $1$ dimensional manifold $x^0(\theta)$ in input space, where $\theta$ is an intrinsic scalar coordinate on the manifold. This manifold propagates to a new manifold $h^l(\theta) = h^l(x^0(\theta))$ in the vector space of inputs to layer $l$. The typical geometry of the manifold in the $l$'th layer is summarized by $q^l(\theta_1, \theta_2)$, which for any $\theta_1$ and $\theta_2$ is defined by (4) with the choice $x^{0,a} = x^0(\theta_1)$ and $x^{0,b} = x^0(\theta_2)$. The theory for the propagation of pairs of points applies to all pairs of points on the manifold, so intuitively, we expect that in the chaotic phase of a sigmoidal network, the manifold should in some sense decorrelate, and become more complex, while in the ordered phase the manifold should contract around a central point. This theoretical prediction of equations (3) and (5) is quantitatively confirmed in simulations in Fig. 3, when the input is a simple manifold, the circle, $h^1(\theta) = \sqrt{N_1 q} \, [u^0 \cos(\theta) + u^1 \sin(\theta)]$, where $u^0$ and $u^1$ form an orthonormal basis for a $2$ dimensional subspace of $\mathbb{R}^{N_1}$ in which the circle lives. The scaling is chosen so that each neuron has input activity $O(1)$. Also, for simplicity, we choose the fixed point radius $q = q^*$ in Fig. 3.
To quantitatively understand the layerwise growth of complexity of this manifold, it is useful to turn to concepts in Riemannian geometry Lee (2006). First, at each point $\theta$, the manifold $h(\theta)$ (we temporarily suppress the layer index $l$) has a tangent, or velocity vector $v(\theta) = \partial_\theta h(\theta)$. Intuitively, curvature is related to how quickly this tangent vector rotates in the ambient space $\mathbb{R}^N$ as one moves along the manifold, or in essence the acceleration vector $a(\theta) = \partial_\theta v(\theta)$. Now at each point $\theta$, when both are nonzero, $v(\theta)$ and $a(\theta)$ span a 2 dimensional subspace of $\mathbb{R}^N$. Within this subspace, there is a unique circle of radius $R(\theta)$ that has the same position, velocity and acceleration vector as the curve $h(\theta)$ at $\theta$. This circle is known as the osculating circle (Fig. 4A), and the extrinsic curvature $\kappa(\theta)$ of the curve is defined as $\kappa(\theta) = 1 / R(\theta)$. Thus, intuitively, small radii of curvature $R(\theta)$ imply high extrinsic curvature $\kappa(\theta)$. The extrinsic curvature of a curve depends only on its image in $\mathbb{R}^N$ and is invariant with respect to the particular parameterization $\theta \to h(\theta)$. For any parameterization, an explicit expression for $\kappa(\theta)$ is given by $\kappa(\theta) = (v \cdot v)^{-3/2} \sqrt{(v \cdot v)(a \cdot a) - (v \cdot a)^2}$ Lee (2006). Note that under a unit speed parameterization of the curve, so that $v(\theta) \cdot v(\theta) = 1$, we have $v(\theta) \cdot a(\theta) = 0$, and $\kappa(\theta)$ is simply the norm of the acceleration vector.
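Another measure of the curve's complexity is the length of its image in the ambient Euclidean space. The Euclidean metric in $\mathbb{R}^N$ induces a metric $g^E(\theta) = v(\theta) \cdot v(\theta)$ on the curve, so that the distance $d\mathcal{L}^E$ moved in $\mathbb{R}^N$ as one moves from $\theta$ to $\theta + d\theta$ on the curve is $\sqrt{g^E(\theta)} \, d\theta$. The total curve length is $\mathcal{L}^E = \int \sqrt{g^E(\theta)} \, d\theta$. However, even straight line segments can have a large Euclidean length. Another interesting measure of length that takes into account curvature is the length of the image of the curve under the Gauss map. For a $K$ dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^N$, the Gauss map (Fig. 4B) maps a point $\theta \in \mathcal{M}$ to its $K$ dimensional tangent plane $T_\theta \mathcal{M} \in \mathfrak{G}_{K,N}$, where $\mathfrak{G}_{K,N}$ is the Grassmannian manifold of all $K$ dimensional subspaces in $\mathbb{R}^N$. In the special case of $K = 1$, $\mathfrak{G}_{K,N}$ is the sphere $\mathbb{S}^{N-1}$ with antipodal points identified, since a $1$ dimensional subspace can be identified with a unit vector, modulo sign. The Gauss map takes a point $\theta$ on the curve and maps it to the unit velocity vector $\hat{v}(\theta) = v(\theta) / \sqrt{v(\theta) \cdot v(\theta)}$. In particular, the natural metric on $\mathbb{S}^{N-1}$ induces a Gauss metric on the curve, given by $g^G(\theta) = (\partial_\theta \hat{v}(\theta)) \cdot (\partial_\theta \hat{v}(\theta))$, which measures how quickly the unit tangent vector $\hat{v}(\theta)$ changes as $\theta$ changes. Thus the distance $d\mathcal{L}^G$ moved in the Grassmannian $\mathfrak{G}_{K,N}$ as one moves from $\theta$ to $\theta + d\theta$ on the curve is $\sqrt{g^G(\theta)} \, d\theta$, and the length of the curve under the Gauss map is $\mathcal{L}^G = \int \sqrt{g^G(\theta)} \, d\theta$. Furthermore, the Gauss metric is related to the extrinsic curvature and the Euclidean metric via the relation $g^G(\theta) = \kappa(\theta)^2 \, g^E(\theta)$ Lee (2006).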
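To illustrate these concepts, it is useful to compute all of them for the circle $h^1(\theta)$ defined above: $g^E(\theta) = N q$, $\mathcal{L}^E = 2\pi \sqrt{N q}$, $\kappa(\theta) = 1 / \sqrt{N q}$, $g^G(\theta) = 1$, and $\mathcal{L}^G = 2\pi$. As expected, $\kappa(\theta)$ is the inverse of the radius of curvature, which is $\sqrt{N q}$. Now consider how these quantities change if the circle is scaled up so that $h(\theta) \to \chi \, h(\theta)$. The length $\mathcal{L}^E$ and radius scale up by $\chi$, but the curvature $\kappa$ scales down as $\chi^{-1}$, and so $\mathcal{L}^G$ does not change. Thus linear expansion increases length and decreases curvature, thereby maintaining constant Grassmannian length $\mathcal{L}^G$.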
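These quantities can be checked numerically for a sampled closed curve. The following is a rough sketch using periodic central finite differences for the velocity and acceleration vectors and the standard curvature formula for a parameterized curve; `curve_geometry` and `circle` are hypothetical helper names, and the grid resolution and random orthonormal basis are arbitrary choices.

```python
import numpy as np


def curve_geometry(h, dtheta):
    """h: array of shape (T, N), a closed curve sampled on a uniform periodic grid.
    Returns the Euclidean metric, extrinsic curvature, and both curve lengths."""
    v = (np.roll(h, -1, axis=0) - np.roll(h, 1, axis=0)) / (2.0 * dtheta)
    a = (np.roll(h, -1, axis=0) - 2.0 * h + np.roll(h, 1, axis=0)) / dtheta**2
    vv = np.sum(v * v, axis=1)
    aa = np.sum(a * a, axis=1)
    va = np.sum(v * a, axis=1)
    g_E = vv
    kappa = np.sqrt(np.maximum(vv * aa - va**2, 0.0)) / vv**1.5
    g_G = kappa**2 * g_E  # Gauss metric from curvature and Euclidean metric
    L_E = np.sum(np.sqrt(g_E)) * dtheta
    L_G = np.sum(np.sqrt(g_G)) * dtheta
    return g_E, kappa, L_E, L_G


def circle(N, q, T=2000, seed=0):
    """Circle of radius sqrt(N q) in a random 2D subspace of R^N."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((N, 2)))  # orthonormal columns
    theta = np.linspace(0.0, 2.0 * np.pi, T, endpoint=False)
    h = np.sqrt(N * q) * (np.outer(np.cos(theta), U[:, 0])
                          + np.outer(np.sin(theta), U[:, 1]))
    return h, theta[1] - theta[0]
```

Applying `curve_geometry` to the circle and to a rescaled copy of it illustrates the point made above: the Euclidean length scales with the expansion factor while the Grassmannian length stays at $2\pi$ up to discretization error.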
We now show that nonlinear propagation of this same circle through a deep network can behave very differently from linear expansion: in the chaotic regime, length can increase without any decrease in extrinsic curvature! To remove the scaling with $N$ in the above quantities, we will work with the renormalized quantities $\bar{\kappa} = \sqrt{N} \, \kappa$, $\bar{g}^E = \frac{1}{N} g^E$, and $\bar{\mathcal{L}}^E = \frac{1}{\sqrt{N}} \mathcal{L}^E$. Thus, $1 / (\bar{\kappa})^2$ can be thought of as a radius of curvature squared per neuron of the osculating circle, while $(\bar{\mathcal{L}}^E)^2$ is the squared Euclidean length of the curve per neuron. For the circle, these quantities are $q$ and $(2\pi)^2 q$ respectively. For simplicity, in the inputs to the first layer of neurons, we begin with a circle $h^1(\theta)$ with squared radius per neuron $q^1 = q^*$, so this radius is already at the fixed point of the length map in (3). In the SM, we derive an iterative formula for the extrinsic curvature and Euclidean metric of this manifold as it propagates through the layers of a deep network:

$$\bar{g}^{E,l} = \chi_1 \, \bar{g}^{E,l-1}, \qquad (\bar{\kappa}^l)^2 = 3 \frac{\chi_2}{\chi_1^2} + \frac{1}{\chi_1} (\bar{\kappa}^{l-1})^2, \qquad \bar{g}^{E,1} = q^*, \quad (\bar{\kappa}^1)^2 = 1/q^*, \tag{8}$$

where $\chi_1$ is the stretch factor defined in (7) and $\chi_2$ is defined analogously as

$$\chi_2 = \sigma_w^2 \int \mathcal{D}z \left[ \phi''\!\left(\sqrt{q^*}\, z\right) \right]^2. \tag{9}$$

$\chi_2$ is closely related to the second derivative of the $\mathcal{C}$-map in (6) at $c^{l-1}_{12} = 1$; this second derivative is $\chi_2 q^*$. See SM for a derivation of the evolution equations (8) for the extrinsic geometry of a curve as it propagates through a deep network.
Intriguingly for a sigmoidal neural network, these evolution equations behave very differently in the chaotic ($\chi_1 > 1$) versus ordered ($\chi_1 < 1$) phase. In the chaotic phase, the Euclidean metric $\bar{g}^E$ grows exponentially with depth due to multiplicative stretching through $\chi_1$. This stretching does multiplicatively attenuate any curvature in layer $l-1$ by a factor $1/\chi_1$ (see the update equation for $\bar{\kappa}^l$ in (8)), but new curvature is added in due to a nonzero $\chi_2$, which originates from the curvature of the single neuron nonlinearity in (9). Thus, unlike in linear expansion, extrinsic curvature is not lost, but maintained, and ultimately approaches a fixed point $\bar{\kappa}^*$. This implies that the global curvature measure $\mathcal{L}^G$ grows exponentially with depth. These highly nontrivial predictions of the metric and curvature evolution equations in (8) are quantitatively confirmed in simulations in Fig. 4C-E.
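The evolution equations can be iterated directly once the two stretch factors are known. The sketch below assumes a tanh nonlinearity, a circle input at the fixed point radius (so that the normalized metric starts at $q^*$ and the squared normalized curvature at $1/q^*$), and a multiplicative update of the metric by $\chi_1$ together with the curvature update $3\chi_2/\chi_1^2 + \bar{\kappa}^2/\chi_1$; all function names are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, W = hermegauss(101)
W = W / np.sqrt(2.0 * np.pi)  # weights of the standard Gaussian measure


def mean_field_constants(sigma_w, sigma_b, n_iter=1000):
    """Return (q*, chi_1, chi_2) for phi = tanh."""
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w**2 * np.sum(W * np.tanh(np.sqrt(q) * Z) ** 2) + sigma_b**2
    t = np.tanh(np.sqrt(q) * Z)
    dphi = 1.0 - t**2             # first derivative of tanh
    d2phi = -2.0 * t * (1.0 - t**2)  # second derivative of tanh
    chi1 = sigma_w**2 * np.sum(W * dphi**2)
    chi2 = sigma_w**2 * np.sum(W * d2phi**2)
    return q, chi1, chi2


def evolve_geometry(sigma_w, sigma_b, depth):
    """Iterate the normalized metric and squared curvature over `depth` layers."""
    q, chi1, chi2 = mean_field_constants(sigma_w, sigma_b)
    g, k2 = q, 1.0 / q  # circle at the fixed point radius
    g_list, k2_list = [g], [k2]
    for _ in range(depth - 1):
        g, k2 = chi1 * g, 3.0 * chi2 / chi1**2 + k2 / chi1
        g_list.append(g)
        k2_list.append(k2)
    g_arr, k2_arr = np.array(g_list), np.array(k2_list)
    # For a translation invariant curve, L^G = 2 * pi * kappa_bar * sqrt(g_bar).
    L_G = 2.0 * np.pi * np.sqrt(k2_arr * g_arr)
    return g_arr, k2_arr, L_G
```

In the chaotic phase the squared curvature settles at the fixed point $3\chi_2 / (\chi_1(\chi_1 - 1))$ of the update while the metric keeps growing, so the Grassmannian length grows with depth; in the ordered phase the metric decays instead.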
Intuitively, this exponential growth of global curvature $\mathcal{L}^G$ in the chaotic phase implies that the curve explores many different tangent directions in hidden representation space. This further implies that the coordinate functions of the embedding $h^l_i(\theta)$ become highly complex curved basis functions on the input manifold coordinate $\theta$, allowing a deep network to compute exponentially complex functions over simple low dimensional manifolds (Fig. 5A-C, details in SM). In our companion paper Raghu et al. (2016), we further develop the relationship between length and expressivity in terms of the number of achievable classification patterns on a set of inputs. Moreover, we explore how training a single layer at different depths from the output affects network performance.

5 Shallow networks cannot achieve exponential expressivity
Consider a shallow network with $1$ hidden layer $x^1$, one input layer $x^0$, with $x^1 = \phi(W^1 x^0 + b^1)$, and a linear readout layer. How complex can the hidden representation be as a function of its width $N_1$, relative to the results above for depth? We prove a general upper bound on $\mathcal{L}^E$ (see SM):
Theorem 1.
Suppose $\phi(h)$ is monotonically non-decreasing with bounded dynamic range $R$, i.e. $\max_h \phi(h) - \min_h \phi(h) = R$. Further suppose that $x^0(\theta)$ is a curve in input space such that no 1D projection of $\partial_\theta x^0(\theta)$ changes sign more than $s$ times over the range of $\theta$. Then for any choice of $W^1$ and $b^1$ the Euclidean length of $x^1(\theta)$ satisfies $\mathcal{L}^E \le N_1 (1 + s) R$.
For the circle input, $s$ is a small constant, and for the $\tanh$ nonlinearity, $R = 2$, so in this special case the normalized length $\bar{\mathcal{L}}^E = \mathcal{L}^E / \sqrt{N_1}$ is bounded by a constant multiple of $\sqrt{N_1}$. In contrast, for deep networks in the chaotic regime $\bar{\mathcal{L}}^E$ grows exponentially with depth in $h$ space, and so consequently also in $x$ space. Therefore the length of curves typically expands exponentially in depth even for random deep networks, but can only expand as the square root of width no matter what shallow network is chosen. Moreover, as we have seen above, it is the exponential growth of $\bar{\mathcal{L}}^E$ that fundamentally drives the exponential growth of $\mathcal{L}^G$ with depth. Indeed shallow random networks exhibit minimal growth in expressivity even at large widths (Figure 5D).
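A quick numerical sanity check of the shallow bound is sketched below. It takes the bound of Theorem 1 to read $\mathcal{L}^E \le N_1 (1+s) R$ and assumes $s = 2$ for a circle traversed once (a 1D projection of its tangent is a sinusoid, which changes sign at most twice per period) and $R = 2$ for tanh; both the helper names and the value of $s$ are our assumptions rather than values taken from the text.

```python
import numpy as np


def circle_inputs(n_in, n_points=4001, seed=0):
    """Unit circle in a random 2D subspace of R^n_in, sampled as a closed polyline."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((n_in, 2)))  # orthonormal columns
    theta = np.linspace(0.0, 2.0 * np.pi, n_points)  # endpoint included
    return np.outer(np.cos(theta), U[:, 0]) + np.outer(np.sin(theta), U[:, 1])


def shallow_curve_length(W, b, x0_samples, phi=np.tanh):
    """Polyline length of the hidden curve x1(theta) = phi(W x0(theta) + b)."""
    x1 = phi(x0_samples @ W.T + b)
    return np.sum(np.linalg.norm(np.diff(x1, axis=0), axis=1))


def shallow_bound(n_hidden, s=2, R=2.0):
    """Assumed form of the bound: N_1 * (1 + s) * R."""
    return n_hidden * (1 + s) * R
```

Because the polyline length never exceeds the true curve length, `shallow_curve_length` should stay below `shallow_bound` for any weights and biases, including very large ones that saturate the nonlinearity.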
6 Classification boundaries acquire exponential local curvature with depth
We have focused so far on how simple manifolds in input space can acquire both exponential Euclidean and Grassmannian length with depth, thereby exponentially decorrelating and filling up hidden representation space. Another natural question is how the complexity of a decision boundary grows as it is backpropagated to the input layer. Consider a linear classifier $y = \mathrm{sgn}(\beta \cdot x^D - \beta_0)$ acting on the final layer. In this layer, the $N_D - 1$ dimensional decision boundary is the hyperplane $\beta \cdot x^D - \beta_0 = 0$. However, in the input layer $x^0$, the decision boundary is a curved $N_0 - 1$ dimensional manifold $\mathcal{M}$ that arises as the solution set of the nonlinear equation $G(x^0) \equiv \beta \cdot x^D(x^0) - \beta_0 = 0$, where $x^D(x^0)$ is the nonlinear feedforward map from input to output.

At any point $x^*$ on the decision boundary in layer $l$, the gradient $\vec{\nabla} G$ is perpendicular to the $N_l - 1$ dimensional tangent plane $T_{x^*} \mathcal{M}$ (see Fig. 4F). The normal vector $\vec{\nabla} G$, along with any unit tangent vector $\hat{v} \in T_{x^*} \mathcal{M}$, spans a $2$ dimensional subspace whose intersection with $\mathcal{M}$ yields a geodesic curve in $\mathcal{M}$ passing through $x^*$ with velocity vector $\hat{v}$. This geodesic will have extrinsic curvature $\kappa(x^*, \hat{v})$. Maximizing this curvature over $\hat{v}$ yields the first principal curvature $\kappa_1(x^*)$. A sequence of successive maximizations of $\kappa(x^*, \hat{v})$, while constraining $\hat{v}$ to be perpendicular to all previous solutions, yields the sequence of principal curvatures $\kappa_1(x^*) \ge \kappa_2(x^*) \ge \dots \ge \kappa_{N_l - 1}(x^*)$. These principal curvatures arise as the eigenvalues of a normalized Hessian operator projected onto the tangent plane $T_{x^*} \mathcal{M}$: $\mathcal{H} = \|\vec{\nabla} G\|_2^{-1} \, P \, \frac{\partial^2 G}{\partial x \, \partial x^T} \, P$, where $P = I - \widehat{\nabla G} \, \widehat{\nabla G}^T$ is the projection operator onto $T_{x^*} \mathcal{M}$ and $\widehat{\nabla G}$ is the unit normal vector Lee (2006). Intuitively, near $x^*$, the decision boundary can be approximated as a paraboloid with a quadratic form $\mathcal{H}$ whose eigenvalues are the principal curvatures (Fig. 4F).

We compute these curvatures numerically as a function of depth in Fig. 4G (see SM for details). We find, remarkably, that a subset of principal curvatures grow exponentially with depth. Here the principal curvatures are signed, with positive (negative) curvature indicating that the associated geodesic curves towards (away from) the normal vector $\vec{\nabla} G$. Thus the decision boundary can become exponentially curved with depth, enabling highly complex classifications. Moreover, this exponentially curved boundary is disentangled and mapped to a flat boundary in the output layer.
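The projected Hessian construction can be illustrated on any smooth scalar function whose zero set defines a boundary. The sketch below estimates the gradient and Hessian by central finite differences and restricts the normalized Hessian to an orthonormal basis of the tangent plane, which gives the same eigenvalues as the projected operator apart from the zero eigenvalue along the normal. The helper names and step size are assumptions, this is not the procedure used for Fig. 4G, and the overall sign depends on the orientation chosen for the normal.

```python
import numpy as np


def grad_and_hessian(G, x, eps=1e-3):
    """Central finite difference gradient and Hessian of a scalar function G at x."""
    n = x.size
    I = np.eye(n)
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        g[i] = (G(x + eps * I[i]) - G(x - eps * I[i])) / (2.0 * eps)
        for j in range(i, n):
            H[i, j] = H[j, i] = (
                G(x + eps * I[i] + eps * I[j]) - G(x + eps * I[i] - eps * I[j])
                - G(x - eps * I[i] + eps * I[j]) + G(x - eps * I[i] - eps * I[j])
            ) / (4.0 * eps**2)
    return g, H


def principal_curvatures(G, x, eps=1e-3):
    """Eigenvalues of the normalized Hessian restricted to the tangent plane of
    the level set {G = 0} at x, sorted in decreasing order."""
    g, H = grad_and_hessian(G, x, eps)
    norm = np.linalg.norm(g)
    n_hat = g / norm
    # Rows 1.. of Vt form an orthonormal basis of the plane orthogonal to n_hat.
    _, _, Vt = np.linalg.svd(n_hat[None, :])
    T = Vt[1:].T
    curv = T.T @ H @ T / norm
    return np.sort(np.linalg.eigvalsh(curv))[::-1]
```

For a sphere of radius $r$ written as the zero set of $|x|^2 - r^2$, this returns $1/r$ for every tangent direction, which is a convenient check before applying it to the output of a network.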
7 Discussion
Fundamentally, neural networks compute nonlinear maps between high dimensional spaces, for example from $\mathbb{R}^{N_0} \to \mathbb{R}^{N_D}$, and it is unclear what the most appropriate mathematics is for understanding such daunting spaces of maps. Previous works have attacked this problem by restricting the nature of the nonlinearity involved (e.g. piecewise linear, sum-product, or Pfaffian) and thereby restricting the space of maps to those amenable to special theoretical analysis methods (combinatorics, polynomial relations, or topological invariants). We have begun a preliminary exploration of the expressivity of such deep functions based on Riemannian geometry and dynamical mean field theory. We demonstrate that networks in a chaotic phase compactly exhibit functions that exponentially grow the global curvature of simple one dimensional manifolds from input to output and the local curvature of simple codimension one manifolds from output to input. The former captures the notion that deep neural networks can efficiently compute highly expressive functions in ways that shallow networks cannot, while the latter quantifies and demonstrates the power of deep neural networks to disentangle curved input manifolds, an attractive idea that has eluded formal quantification.
Moreover, our analysis of a maximum entropy distribution over deep networks constitutes an important null model of deep signal propagation that can be used to assess and understand different behavior in trained networks. For example, the metrics we have adapted from Riemannian geometry, combined with an understanding of their behavior in random networks, may provide a basis for understanding what is special about trained networks. Furthermore, while we have focused on the notion of input-output chaos, the duality between inputs and synaptic weights implies a form of weight chaos, in which deep neural networks rapidly traverse function space as weights change (see SM). Indeed, just as autocorrelation lengths between outputs as a function of inputs shrink exponentially with depth, so too will autocorrelations between outputs as a function of weights.
But more generally, to understand functions, we often look to their graphs. The graph of a map from $\mathbb{R}^{N_0} \to \mathbb{R}^{N_D}$ is an $N_0$ dimensional submanifold of $\mathbb{R}^{N_0 + N_D}$, and therefore has both high dimension and codimension. We speculate that many of the secrets of deep learning may be uncovered by studying the geometry of this graph as a Riemannian manifold, and understanding how it changes with both depth and learning.
References
 Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Hannun et al. [2014] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
 Piech et al. [2015] Chris Piech, Jonathan Bassen, Jonathan Huang, Surya Ganguli, Mehran Sahami, Leonidas J Guibas, and Jascha Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, pages 505–513, 2015.
 Bengio et al. [2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828, 2013.
 DiCarlo and Cox [2007] James J DiCarlo and David D Cox. Untangling invariant object recognition. Trends in cognitive sciences, 11(8):333–341, 2007.
 Montufar et al. [2014] Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932, 2014.
 Delalleau and Bengio [2011] Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674, 2011.
 Eldan and Shamir [2015] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
 Telgarsky [2015] Matus Telgarsky. Representation benefits of deep feedforward networks. arXiv preprint arXiv:1509.08101, 2015.

 Martens et al. [2013] James Martens, Arkadev Chattopadhyay, Toni Pitassi, and Richard Zemel. On the representational efficiency of restricted boltzmann machines. In Advances in Neural Information Processing Systems, pages 2877–2885, 2013.
 Bianchini and Scarselli [2014] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. Neural Networks and Learning Systems, IEEE Transactions on, 25(8):1553–1565, 2014.
 Mhaskar et al. [2016] Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. Learning real and boolean functions: When is deep better than shallow. arXiv preprint arXiv:1603.00988, 2016.
 Lee [2006] John M Lee. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.
 Sompolinsky et al. [1988] Haim Sompolinsky, A Crisanti, and HJ Sommers. Chaos in random neural networks. Physical Review Letters, 61(3):259, 1988.
 Raghu et al. [2016] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. arXiv preprint, 2016.
Appendix A Derivation of a transient dynamical mean field theory for deep networks
We study a deep feedforward network with $D$ layers of weights $W^1, \dots, W^D$ and $D+1$ layers of neural activity vectors $x^0, \dots, x^D$, with $N_l$ neurons in each layer $l$, so that $x^l \in \mathbb{R}^{N_l}$ and $W^l$ is an $N_l \times N_{l-1}$ weight matrix. The feedforward dynamics elicited by an input $x^0$ is given by

$$x^l = \phi(h^l), \qquad h^l = W^l x^{l-1} + b^l \qquad \text{for } l = 1, \dots, D, \tag{10}$$

where $b^l$ is a vector of biases, $h^l$ is the pattern of inputs to neurons at layer $l$, and $\phi$ is a single neuron scalar nonlinearity that acts componentwise to transform inputs $h^l$ to activities $x^l$. The synaptic weights $W^l_{ij}$ are drawn i.i.d. from a zero mean Gaussian with variance $\sigma_w^2 / N_{l-1}$, while the biases are drawn i.i.d. from a zero mean Gaussian with variance $\sigma_b^2$. This weight scaling ensures that the input contribution to each individual neuron at layer $l$ from activities in layer $l-1$ remains $O(1)$, independent of the layer width $N_{l-1}$.
A.1 Derivation of the length map
As a single input point $x^0$ propagates through the network, its length in downstream layers can either grow or shrink. To track the propagation of this length, we track the normalized squared length of the input vector at each layer,

$$q^l = \frac{1}{N_l} \sum_{i=1}^{N_l} (h^l_i)^2. \tag{11}$$
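This length is the second moment of the empirical distribution of inputs $h^l_i$ across all $N_l$ neurons in layer $l$ for a fixed set of weights. This empirical distribution is expected to be Gaussian for large $N_l$, since each individual $h^l_i = w^{l,i} \cdot \phi(h^{l-1}) + b^l_i$ (with $w^{l,i}$ the $i$'th row of $W^l$) is Gaussian distributed, as a sum of a large number of independent random variables, and each $h^l_i$ is independent of $h^l_j$ for $i \neq j$ because the synaptic weight vectors and biases into each neuron are chosen independently.

While the mean of this Gaussian is $0$, its variance can be computed by considering the variance of the input to a single neuron:

$$q^l = \left\langle (h^l_i)^2 \right\rangle = \left\langle \left[ w^{l,i} \cdot \phi(h^{l-1}) + b^l_i \right]^2 \right\rangle = \sigma_w^2 \, \frac{1}{N_{l-1}} \sum_{j=1}^{N_{l-1}} \phi(h^{l-1}_j)^2 + \sigma_b^2, \tag{12}$$

where $\langle \cdot \rangle$ denotes an average over the distribution of weights and biases into neuron $i$ at layer $l$. Here we have used the identity $\langle w^{l,i}_j w^{l,i}_k \rangle = \delta_{jk} \, \sigma_w^2 / N_{l-1}$. Now the empirical distribution of inputs across layer $l-1$ is also Gaussian, with mean zero and variance $q^{l-1}$. Therefore we can replace the average over neurons in layer $l-1$ in (12) with an integral over a Gaussian random variable, obtaining

$$q^l = \mathcal{V}(q^{l-1} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z \, \phi\!\left(\sqrt{q^{l-1}}\, z\right)^2 + \sigma_b^2, \qquad \text{for } l = 2, \dots, D, \tag{13}$$

where $\mathcal{D}z = \frac{dz}{\sqrt{2\pi}} e^{-z^2/2}$ is the standard Gaussian measure, and the initial condition for the variance map is $q^1 = \sigma_w^2 q^0 + \sigma_b^2$, where $q^0 = \frac{1}{N_0} x^0 \cdot x^0$ is the length in the initial activity layer. The function $\mathcal{V}$ in (13) is an iterative variance map that predicts how the length of an input in (11) changes as it propagates through the network. Its derivation relies on the well-known self-averaging assumption in the statistical physics of disordered systems, which, in our context, means that the empirical distribution of inputs across neurons for a fixed network converges, for large width, to the distribution of inputs to a single neuron across random networks.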
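The self-averaging assumption can be probed by comparing the empirical squared length in one wide random network against the prediction of the variance map. The following is a small simulation sketch assuming a tanh nonlinearity and equal layer widths; `simulate_lengths` and `predict_lengths` are hypothetical names and the width is an arbitrary choice.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def simulate_lengths(x0, sigma_w, sigma_b, depth, width, rng):
    """Empirical normalized squared length of the inputs h^l in one random network."""
    q_emp = []
    x = x0
    for _ in range(depth):
        W = rng.standard_normal((width, x.size)) * sigma_w / np.sqrt(x.size)
        b = rng.standard_normal(width) * sigma_b
        h = W @ x + b
        q_emp.append(np.mean(h**2))
        x = np.tanh(h)
    return np.array(q_emp)


def predict_lengths(q0, sigma_w, sigma_b, depth, n_nodes=101):
    """Mean field prediction: q^1 = sigma_w^2 q^0 + sigma_b^2, then the variance map."""
    z, w = hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)
    q = [sigma_w**2 * q0 + sigma_b**2]
    for _ in range(depth - 1):
        q.append(sigma_w**2 * np.sum(w * np.tanh(np.sqrt(q[-1]) * z) ** 2) + sigma_b**2)
    return np.array(q)
```

At finite width the empirical lengths fluctuate around the prediction, with relative fluctuations that shrink roughly as the inverse square root of the width.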
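A.2 Derivation of a correlation map for the propagation of two points
Now consider the layerwise propagation of two inputs $x^{0,1}$ and $x^{0,2}$. The geometry of these two inputs as they propagate through the layers is captured by the $2$ by $2$ matrix of inner products

$$q^l_{ab} = \frac{1}{N_l} \sum_{i=1}^{N_l} h^l_i(x^{0,a}) \, h^l_i(x^{0,b}), \qquad a, b \in \{1, 2\}. \tag{14}$$

The joint empirical distribution of $h^l_i(x^{0,a})$ and $h^l_i(x^{0,b})$ across $i$ at large $N_l$ will converge to a 2 dimensional Gaussian distribution with covariance $q^l_{ab}$. Propagating this joint distribution forward one layer using ideas similar to the derivation above for $1$ input yields

$$q^l_{12} = \mathcal{C}(c^{l-1}_{12}, q^{l-1}_{11}, q^{l-1}_{22} \mid \sigma_w, \sigma_b) \equiv \sigma_w^2 \int \mathcal{D}z_1 \, \mathcal{D}z_2 \, \phi(u_1) \, \phi(u_2) + \sigma_b^2, \tag{15}$$

$$u_1 = \sqrt{q^{l-1}_{11}} \, z_1, \qquad u_2 = \sqrt{q^{l-1}_{22}} \left[ c^{l-1}_{12} z_1 + \sqrt{1 - (c^{l-1}_{12})^2} \, z_2 \right],$$

where $c^l_{12} = q^l_{12} \, (q^l_{11} q^l_{22})^{-1/2}$ is the correlation coefficient (CC). Here $z_1$ and $z_2$ are independent standard Gaussian variables, while $u_1$ and $u_2$ are correlated Gaussian variables with covariance matrix $\langle u_a u_b \rangle = q^{l-1}_{ab}$. The integration over $z_1$ and $z_2$ can be thought of as the large $N$ limit of sums over $h^{l-1}_i(x^{0,a})$ and $h^{l-1}_i(x^{0,b})$.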
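When both input points are at their fixed point length, $q^*$, the dynamics of their correlation coefficient can be obtained by simply setting $q^l_{11} = q^l_{22} = q^*$ in (15) and dividing by $q^*$ to obtain a recursion relation for $c^l_{12}$:

$$c^l_{12} = \frac{1}{q^*} \, \mathcal{C}(c^{l-1}_{12}, q^*, q^* \mid \sigma_w, \sigma_b). \tag{16}$$

Direct calculation reveals that $\mathcal{C}(1, q^*, q^* \mid \sigma_w, \sigma_b) = q^*$, so that $c^* = 1$ is a fixed point of (16), as expected. Of particular interest is the slope $\chi_1$ of this map at $c^{l-1}_{12} = 1$. A direct, if tedious, calculation shows that

$$\frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} = \sigma_w^2 \int \mathcal{D}z_1 \, \mathcal{D}z_2 \, \phi'(u_1) \, \phi'(u_2). \tag{17}$$

To obtain this result, one has to apply the chain rule and product rule from calculus, as well as employ the identity

$$\int \mathcal{D}z \, F(z) \, z = \int \mathcal{D}z \, F'(z), \tag{18}$$

which can be obtained via integration by parts. Evaluating the derivative at $c^{l-1}_{12} = 1$ yields

$$\chi_1 \equiv \left. \frac{\partial c^l_{12}}{\partial c^{l-1}_{12}} \right|_{c = 1} = \sigma_w^2 \int \mathcal{D}z \left[ \phi'\!\left(\sqrt{q^*}\, z\right) \right]^2. \tag{19}$$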
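The closed form for the slope at perfect correlation can be cross-checked against a one-sided finite difference of the correlation coefficient map. The sketch below assumes a tanh nonlinearity and Gauss-Hermite quadrature; the step size `delta` and all function names are illustrative, and the finite difference is only accurate to first order in `delta`.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

Z, W = hermegauss(101)
W = W / np.sqrt(2.0 * np.pi)
Z1, Z2 = np.meshgrid(Z, Z, indexing="ij")
W12 = np.outer(W, W)


def q_star(sigma_w, sigma_b, n_iter=1000):
    """Fixed point of the length map for tanh, found by iteration."""
    q = 1.0
    for _ in range(n_iter):
        q = sigma_w**2 * np.sum(W * np.tanh(np.sqrt(q) * Z) ** 2) + sigma_b**2
    return q


def c_map(c, q, sigma_w, sigma_b):
    """Correlation coefficient map at fixed point length q."""
    u1 = np.sqrt(q) * Z1
    u2 = np.sqrt(q) * (c * Z1 + np.sqrt(max(0.0, 1.0 - c**2)) * Z2)
    return (sigma_w**2 * np.sum(W12 * np.tanh(u1) * np.tanh(u2)) + sigma_b**2) / q


def chi_1(q, sigma_w):
    """Closed form slope: sigma_w^2 * E[phi'(sqrt(q) z)^2] for phi = tanh."""
    dphi = 1.0 - np.tanh(np.sqrt(q) * Z) ** 2
    return sigma_w**2 * np.sum(W * dphi**2)


def slope_at_one(q, sigma_w, sigma_b, delta=1e-3):
    """One-sided finite difference estimate of the slope of c_map at c = 1."""
    return (c_map(1.0, q, sigma_w, sigma_b) - c_map(1.0 - delta, q, sigma_w, sigma_b)) / delta
```

The two estimates should agree up to a correction of order `delta`, in both the ordered and the chaotic phase.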
Appendix B Derivation of evolution equations for Riemannian curvature
Here we derive recursion relations for Riemannian curvature quantitites.
b.1 Curvature and length in terms of inner products
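Consider a translation invariant manifold, or 1D curve $\mathbf{h}(\theta) \in \mathbb{R}^N$, that is on some constant radius sphere so that
(20)  $q(\theta_1, \theta_2) = Q(\theta_1 - \theta_2) = \mathbf{h}(\theta_1) \cdot \mathbf{h}(\theta_2),$
with $Q(0) = N q$. At large $N$, the inner-product structure of translation invariant manifolds remains approximately translation invariant as it propagates through the network. Therefore, at large $N$, we can express inner products of derivatives of $\mathbf{h}$ in terms of derivatives of $Q$. For example, the Euclidean metric $g^E$ is given by
(21)  $g^E(\theta) = \dot{\mathbf{h}}(\theta) \cdot \dot{\mathbf{h}}(\theta) = -\ddot{Q}(0).$
Here, each dot is a shorthand notation for a derivative w.r.t. $\theta$. Also, the extrinsic curvature
(22)  $\kappa(\theta) = \sqrt{\frac{(\mathbf{v} \cdot \mathbf{v})(\mathbf{a} \cdot \mathbf{a}) - (\mathbf{v} \cdot \mathbf{a})^2}{(\mathbf{v} \cdot \mathbf{v})^3}},$
where $\mathbf{v}(\theta) = \dot{\mathbf{h}}(\theta)$ and $\mathbf{a}(\theta) = \ddot{\mathbf{h}}(\theta)$, simplifies to
(23)  $\kappa(\theta) = \sqrt{\frac{\ddddot{Q}(0)}{\ddot{Q}(0)^2}}.$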
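The step from inner products of derivatives of the curve to derivatives of the inner-product function can be spelled out by differentiating $Q(\theta_1 - \theta_2) = \mathbf{h}(\theta_1) \cdot \mathbf{h}(\theta_2)$ with respect to both arguments and then setting $\theta_1 = \theta_2$. The following is a short sketch, assuming only that $Q$ is smooth and even, with $\mathbf{v} = \dot{\mathbf{h}}$ and $\mathbf{a} = \ddot{\mathbf{h}}$:

```latex
\begin{align*}
\dot{\mathbf{h}}(\theta_1) \cdot \dot{\mathbf{h}}(\theta_2)
  &= \partial_{\theta_1} \partial_{\theta_2} Q(\theta_1 - \theta_2)
   = -\ddot{Q}(\theta_1 - \theta_2), \\
\dot{\mathbf{h}}(\theta_1) \cdot \ddot{\mathbf{h}}(\theta_2)
  &= \partial_{\theta_1} \partial_{\theta_2}^2 Q(\theta_1 - \theta_2)
   = \dddot{Q}(\theta_1 - \theta_2), \\
\ddot{\mathbf{h}}(\theta_1) \cdot \ddot{\mathbf{h}}(\theta_2)
  &= \partial_{\theta_1}^2 \partial_{\theta_2}^2 Q(\theta_1 - \theta_2)
   = \ddddot{Q}(\theta_1 - \theta_2).
\end{align*}
% At \theta_1 = \theta_2 the odd derivative \dddot{Q}(0) vanishes because Q is even, so
\begin{align*}
\mathbf{v} \cdot \mathbf{v} = -\ddot{Q}(0), \qquad
\mathbf{v} \cdot \mathbf{a} = 0, \qquad
\mathbf{a} \cdot \mathbf{a} = \ddddot{Q}(0), \qquad
\kappa = \sqrt{\frac{(\mathbf{v} \cdot \mathbf{v})(\mathbf{a} \cdot \mathbf{a})}{(\mathbf{v} \cdot \mathbf{v})^3}}
       = \sqrt{\frac{\ddddot{Q}(0)}{\ddot{Q}(0)^2}}.
\end{align*}
```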
Now if the translation invariant manifold lives on a sphere of radius $\sqrt{N q^*}$, where $q^*$ is the fixed point radius of the length map, then its radius does not change as it propagates through the system. Then we can also express $g^E$ and $\kappa$ in terms of the correlation coefficient function $c(\theta) = Q(\theta) / (N q^*)$ (up to a factor of $N q^*$). Thus to understand the propagation of local quantities like Euclidean length and curvature, we need to understand the propagation of derivatives of $c(\theta)$ at $\theta = 0$ under the $\mathcal{C}$-map in (16). Note that $c(\theta)$ is symmetric and achieves a maximum value of $1$ at $\theta = 0$. Thus the function $\epsilon(\theta) = 1 - c(\theta)$ is symmetric with a minimum at $\theta = 0$. We consider the propagation of $\epsilon$ through the $\mathcal{C}$-map. But first we consider the propagation of derivatives under function composition in general.
b.2 Behavior of first and second derivatives under function composition
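Assume $\epsilon(\theta)$ is an even function and $\epsilon(0) = 0$, so that its Taylor expansion can be written as $\epsilon(\theta) = \frac{\ddot{\epsilon}(0)}{2} \theta^2 + \frac{\ddddot{\epsilon}(0)}{4!} \theta^4 + O(\theta^6)$. We are interested in determining how the second and fourth derivatives of $\epsilon$ propagate under composition with another function $F$, so that $\epsilon^{l+1}(\theta) = F(\epsilon^l(\theta))$. We assume $F(0) = 0$. We can use the chain rule and the product rule to derive:
(24)  $\ddot{\epsilon}^{l+1}(0) = F'(0)\, \ddot{\epsilon}^l(0),$
(25)  $\ddddot{\epsilon}^{l+1}(0) = 3 F''(0)\, \ddot{\epsilon}^l(0)^2 + F'(0)\, \ddddot{\epsilon}^l(0).$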
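The intermediate steps behind these two recursions can be written out explicitly. The sketch below only assumes that $\epsilon$ is even (so that $\dot{\epsilon}(0) = 0$) and that $\epsilon(0) = 0$. The shorthand $G(\theta) = F(\epsilon(\theta))$ is introduced here for convenience, and the layer index is suppressed.

```latex
\begin{align*}
\ddot{G} &= F''(\epsilon)\, \dot{\epsilon}^2 + F'(\epsilon)\, \ddot{\epsilon}, \\
\ddddot{G} &= F''''(\epsilon)\, \dot{\epsilon}^4
  + 6 F'''(\epsilon)\, \dot{\epsilon}^2 \ddot{\epsilon}
  + 3 F''(\epsilon)\, \ddot{\epsilon}^2
  + 4 F''(\epsilon)\, \dot{\epsilon}\, \dddot{\epsilon}
  + F'(\epsilon)\, \ddddot{\epsilon}.
\end{align*}
% Every term containing \dot{\epsilon} vanishes at \theta = 0, leaving
\begin{align*}
\ddot{G}(0) &= F'(0)\, \ddot{\epsilon}(0), \\
\ddddot{G}(0) &= 3 F''(0)\, \ddot{\epsilon}(0)^2 + F'(0)\, \ddddot{\epsilon}(0).
\end{align*}
```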
b.3 Evolution equations for curvature and length
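We now apply the above iterations with $\epsilon(\theta) = 1 - c(\theta)$ and $F(\epsilon) = 1 - \frac{1}{q^*}\, \mathcal{C}(1 - \epsilon, q^*, q^* \mid \sigma_w, \sigma_b)$. Clearly, the symmetric $\epsilon$ obeys $\epsilon(0) = 0$, and $F(0) = 0$, satisfying the assumptions of the above iterations of second and fourth derivatives. Taking into account these derivative recursions, using the expressions for $g^E$ and $\kappa$ in terms of derivatives of $c(\theta)$ at $\theta = 0$, and carefully accounting for factors of $q^*$ and $N$, we obtain the final evolution equations that have been successfully tested against experiments:
(26)  $\bar{g}^{E,l} = \chi_1\, \bar{g}^{E,l-1},$
(27)  $(\bar{\kappa}^l)^2 = 3 \frac{\chi_2}{\chi_1^2} + \frac{1}{\chi_1} (\bar{\kappa}^{l-1})^2,$
where $\bar{g}^E = g^E / N$ and $\bar{\kappa} = \sqrt{N} \kappa$ are the metric and curvature normalized by $N$, $\chi_1$ is the stretch factor defined in (19), and $\chi_2$ is defined analogously as
(28)  $\chi_2 = \sigma_w^2 \int \mathcal{D}z \left[ \phi''(\sqrt{q^*}\, z) \right]^2.$
$\chi_2$ is closely related to the second derivative of the correlation coefficient map in (16) at $c^{l-1}_{12} = 1$. Indeed this second derivative is $\chi_2 q^*$.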
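To see how these evolution equations behave with depth, the recursion can be iterated directly. The sketch below assumes the normalized form $\bar{g}^{E,l} = \chi_1 \bar{g}^{E,l-1}$ and $(\bar{\kappa}^l)^2 = 3\chi_2/\chi_1^2 + (\bar{\kappa}^{l-1})^2/\chi_1$. The function names and the numerical values of $\chi_1$, $\chi_2$ and the initial conditions are hypothetical and chosen only for illustration. The closed-form fixed point is a consequence of the recursion when $\chi_1 > 1$, not a formula quoted from the text.

```python
def propagate_geometry(g_first, kappa_sq_first, chi1, chi2, depth):
    """Iterate the metric and squared-curvature recursions for `depth` layers.

    Returns a list of (metric, squared curvature) pairs, one per layer.
    """
    g, kappa_sq = g_first, kappa_sq_first
    history = [(g, kappa_sq)]
    for _ in range(depth - 1):
        g = chi1 * g
        kappa_sq = 3.0 * chi2 / chi1 ** 2 + kappa_sq / chi1
        history.append((g, kappa_sq))
    return history

def curvature_fixed_point(chi1, chi2):
    """Fixed point of the squared-curvature recursion; only meaningful for chi1 > 1."""
    if chi1 <= 1.0:
        raise ValueError("the recursion has a finite positive fixed point only for chi1 > 1")
    return 3.0 * chi2 / (chi1 * (chi1 - 1.0))

if __name__ == "__main__":
    # Hypothetical stretch factors, chosen so that chi1 > 1.
    for layer, (g, k2) in enumerate(propagate_geometry(1.0, 1.0, 2.0, 1.0, 6), start=1):
        print(layer, g, k2)
    print("limiting squared curvature:", curvature_fixed_point(2.0, 1.0))
```

With $\chi_1 > 1$ the metric grows geometrically with depth while the squared curvature relaxes to a constant, so the length of the curve grows without its local curvature shrinking.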
Appendix C Upper bounds on the complexity of shallow neural representations
Consider a shallow network with 1 hidden layer $\mathbf{x}^1$ and one input layer $\mathbf{x}^0$, so that $\mathbf{x}^1 = \phi(\mathbf{W}^1 \mathbf{x}^0 + \mathbf{b}^1)$. The network can compute functions through a linear readout of the hidden layer $\mathbf{x}^1$. We are interested in how complex these neural representations can get, with one layer of synaptic weights and nonlinearities, as a function of the number of hidden units $N_1$. In particular, we are interested in how the length and curvature of an input manifold $\mathbf{x}^0(\theta)$ changes as it propagates to become $\mathbf{x}^1(\theta)$ in the hidden layer. We would like to upper bound the maximal achievable length and curvature over all possible choices of $\mathbf{W}^1$ and $\mathbf{b}^1$.
c.1 Upper bound on Euclidean length
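Here, we derive such an upper bound on the Euclidean length for a very general class of nonlinearities $\phi(h)$. We simply assume that (1) $\phi(h)$ is monotonically non-decreasing (so that $\phi'(h) \geq 0$ for all $h$) and (2) $\phi(h)$ has bounded dynamic range $R$, i.e. $\sup_h \phi(h) - \inf_h \phi(h) = R$. The Euclidean length in hidden space is
(29)  $\mathcal{L}^E = \int d\theta\, \sqrt{\sum_{i=1}^{N_1} \left( \partial_\theta x^1_i(\theta) \right)^2} \leq \sum_{i=1}^{N_1} \int d\theta\, \left| \partial_\theta x^1_i(\theta) \right|,$
where the inequality follows from the triangle inequality. Now suppose that for any $i$, $\partial_\theta x^1_i(\theta)$ never changes sign across $\theta$. Furthermore, assume that $\theta$ ranges from $0$ to $\Theta$. Then
(30)  $\int_0^\Theta d\theta\, \left| \partial_\theta x^1_i(\theta) \right| = \left| x^1_i(\Theta) - x^1_i(0) \right| \leq R.$
More generally, let $r_1$ denote the maximal number of times that any one neuron has a change in sign of the derivative $\partial_\theta x^1_i(\theta)$ across $\theta$. Then applying the above argument to each segment of constant sign yields
(31)  $\int_0^\Theta d\theta\, \left| \partial_\theta x^1_i(\theta) \right| \leq (1 + r_1) R.$
Now how many times can $\partial_\theta x^1_i(\theta)$ change sign? Since $\partial_\theta x^1_i = \phi'(h_i)\, \partial_\theta h_i$, where $h_i = [\mathbf{W}^1 \mathbf{x}^0 + \mathbf{b}^1]_i$, and $\phi$ is monotonically non-decreasing, the number of times $\partial_\theta x^1_i$ changes sign does not exceed the number of times the input $\partial_\theta h_i$ changes sign. In turn, suppose $s_0$ is the maximal number of times any one dimensional projection of the derivative vector $\partial_\theta \mathbf{x}^0(\theta)$ changes sign across $\theta$. Then the number of times the sign of $\partial_\theta h_i$ changes for any $i$ cannot exceed $s_0$ because $h_i$ is, up to a constant bias, a linear projection of $\mathbf{x}^0$. Together this implies $r_1 \leq s_0$. We have thus proven:
(32)  $\mathcal{L}^E \leq N_1 (1 + s_0) R.$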
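The bound can be checked numerically on a small example. The sketch below assumes a tanh hidden layer (dynamic range $R = 2$) and takes the input manifold to be the unit circle $\mathbf{x}^0(\theta) = (\cos\theta, \sin\theta)$, for which any one dimensional projection of $\partial_\theta \mathbf{x}^0$ changes sign twice over one period, so $s_0 = 2$. The function names, network size, and weight scales are hypothetical choices for this example. The polygonal length computed here never exceeds the true curve length, so it must also sit below the bound $N_1 (1 + s_0) R$.

```python
import math
import random

def hidden_curve_length(W, b, n_steps=2000):
    """Polygonal length of x^1(theta) = tanh(W x^0(theta) + b) for the unit circle x^0."""
    def x1(theta):
        c, s = math.cos(theta), math.sin(theta)
        return [math.tanh(w0 * c + w1 * s + b_i) for (w0, w1), b_i in zip(W, b)]

    length = 0.0
    prev = x1(0.0)
    for k in range(1, n_steps + 1):
        cur = x1(2.0 * math.pi * k / n_steps)
        length += math.sqrt(sum((p - q) ** 2 for p, q in zip(cur, prev)))
        prev = cur
    return length

def length_upper_bound(n_hidden, s0, dynamic_range):
    """Upper bound N_1 (1 + s_0) R on the Euclidean length in hidden space."""
    return n_hidden * (1 + s0) * dynamic_range

if __name__ == "__main__":
    random.seed(0)
    n_hidden = 20
    # Large weights push tanh toward saturation, which makes the hidden curve longer.
    W = [[random.gauss(0.0, 5.0), random.gauss(0.0, 5.0)] for _ in range(n_hidden)]
    b = [random.gauss(0.0, 1.0) for _ in range(n_hidden)]
    print("length:", hidden_curve_length(W, b))
    print("bound: ", length_upper_bound(n_hidden, s0=2, dynamic_range=2.0))
```

Increasing the weight scale makes the measured length approach, but never cross, a value linear in the number of hidden units, which is the point of the bound.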
Appendix D Simulation details
All neural network simulations were implemented in Keras and Theano. For all simulations (except Figure 5C), we used inputs and hidden layers with a width of 1,000 and tanh activations. We found that our results were mostly insensitive to width, but using larger widths decreased the fluctuations in the averaged quantities. Simulation error bars are all standard deviations, with the variance computed across the different inputs. If not mentioned, the weights in the network are initialized in the chaotic regime.
Computing the extrinsic curvature $\kappa(\theta)$ requires the computation of the velocity and acceleration vectors, corresponding to the first and second derivatives of the neural network with respect to $\theta$. As $\theta$ is always one-dimensional, we can greatly speed up these computations by using forward-mode auto-differentiation, evaluating the Jacobian and Hessian in a feedforward manner. We implemented this using the Rop in Theano.
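The idea of pushing velocity and acceleration forward through the layers can be illustrated without any autodiff library. The sketch below is not the Theano Rop implementation mentioned above. It is a hand-written second-order forward-mode rule for tanh layers in plain Python, with hypothetical function names. It assumes the layer update $\mathbf{h}^{l+1} = \mathbf{W}\tanh(\mathbf{h}^l) + \mathbf{b}$ and a one-dimensional parameter $\theta$, and it evaluates the extrinsic curvature as $\kappa = \sqrt{[(\mathbf{v}\cdot\mathbf{v})(\mathbf{a}\cdot\mathbf{a}) - (\mathbf{v}\cdot\mathbf{a})^2] / (\mathbf{v}\cdot\mathbf{v})^3}$.

```python
import math

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def propagate_jet(h, v, a, weights, biases):
    """Push (h, dh/dtheta, d2h/dtheta2) through layers h_next = W tanh(h) + b."""
    for W, b in zip(weights, biases):
        x, dx, ddx = [], [], []
        for h_i, v_i, a_i in zip(h, v, a):
            t = math.tanh(h_i)
            d1 = 1.0 - t * t        # first derivative of tanh
            d2 = -2.0 * t * d1      # second derivative of tanh
            x.append(t)
            dx.append(d1 * v_i)                      # chain rule
            ddx.append(d2 * v_i * v_i + d1 * a_i)    # chain rule plus product rule
        h = [dot(row, x) + b_i for row, b_i in zip(W, b)]
        v = [dot(row, dx) for row in W]
        a = [dot(row, ddx) for row in W]
    return h, v, a

def extrinsic_curvature(v, a):
    """Curvature of a 1D curve from its velocity v and acceleration a."""
    vv, aa, va = dot(v, v), dot(a, a), dot(v, a)
    return math.sqrt(max(0.0, vv * aa - va * va) / vv ** 3)

def circle_jet(theta, radius=1.0):
    """Position, velocity and acceleration of a circle of the given radius in 2D."""
    c, s = math.cos(theta), math.sin(theta)
    return ([radius * c, radius * s],
            [-radius * s, radius * c],
            [-radius * c, -radius * s])

if __name__ == "__main__":
    # With no layers, the curvature of a circle of radius 2 is 1/2.
    _, v, a = propagate_jet(*circle_jet(0.7, radius=2.0), [], [])
    print(extrinsic_curvature(v, a))
```

Because $\theta$ is a scalar, each layer costs only three matrix-vector products, which is the saving that forward-mode differentiation offers over building full Jacobians and Hessians.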
d.1 Details on Figure 4G: backpropagating curvature
To identify the curvature of the decision boundary, we first had to identify points that lay along the decision boundary. We randomly initialized data points and then optimized with respect to the input using Adam. This yields a set of inputs where we compute the Jacobian and Hessian of