Reference caffe implementation of LSUV initialization
Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.READ FULL TEXT VIEW PDF
In recent years, a critical initialization scheme with orthogonal
The selection of initial parameter values for gradient-based optimizatio...
An extensive body of empirical research has revealed remarkable regulari...
While a lot of progress has been made in recent years, the dynamics of
A leading hypothesis for the surprising generalization of neural network...
The ability to achieve coordinated behavior -- engineered or emergent --...
Two distinct limits for deep learning as the net width h→∞ have been
Reference caffe implementation of LSUV initialization
Torch implementation of LSUV weight init from http://arxiv.org/abs/1511.06422
Torch Linear Unit with Orthogonal Weight Initialization
We begin by analyzing learning in a three layer network (input, hidden, and output) with linear activation functions (Fig1). We let
be the number of neurons in layer. The input-output map of the network is . We wish to train the network to learn a particular input-output map from a set of training examples . Training is accomplished via gradient descent on the squared error between the desired feature output, and the network’s feature output. This gradient descent procedure yields the batch learning rule
where is a small learning rate. As long as is sufficiently small, we can take a continuous time limit to obtain the dynamics,
where is an input correlation matrix, is an input-output correlation matrix, and Here measures time in units of iterations; as varies from 0 to 1, the network has seen examples corresponding to one iteration. Despite the linearity of the network’s input-output map, the gradient descent learning dynamics given in Eqn (2) constitutes a complex set of coupled nonlinear differential equations with up to cubic interactions in the weights.
Our fundamental goal is to understand the dynamics of learning in (2) as a function of the input statistics and input-output statistics . In general, the outcome of learning will reflect an interplay between input correlations, described by , and the input-output correlations described by . To begin, though, we further simplify the analysis by focusing on the case of orthogonal input representations where . This assumption will hold exactly for whitened input data, a widely used preprocessing step.
Because we have assumed orthogonal input representations (
), the input-output correlation matrix contains all of the information about the dataset used in learning, and it plays a pivotal role in the learning dynamics. We consider its singular value decomposition (SVD)
which will be central in our analysis. Here is an orthogonal matrix whose columns contain input-analyzing
singular vectorsthat reflect independent modes of variation in the input, is an orthogonal matrix whose columns contain output-analyzing singular vectors that reflect independent modes of variation in the output, and is an matrix whose only nonzero elements are on the diagonal; these elements are the singular values ordered so that .
Now, performing the change of variables on synaptic weight space, , , the dynamics in (2) simplify to
To gain intuition for these equations, note that while the matrix elements of and connected neurons in one layer to neurons in the next layer, we can think of the matrix element as connecting input mode to hidden neuron , and the matrix element as connecting hidden neuron to output mode . Let be the column of , and let be the row of . Intuitively, is a column vector of synaptic weights presynaptic to the hidden layer coming from input mode , and is a column vector of synaptic weights postsynaptic to the hidden layer going to output mode . In terms of these variables, or connectivity modes, the learning dynamics in (4) become
Note that for . These dynamics arise from gradient descent on the energy function
and display an interesting combination of cooperative and competitive interactions. Consider the first terms in each equation. In these terms, the connectivity modes from the two layers, and associated with the same input-output mode of strength , cooperate with each other to drive each other to larger magnitudes as well as point in similar directions in the space of hidden units; in this fashion these terms drive the product of connectivity modes to reflect the input-output mode strength . The second terms describe competition between the connectivity modes in the first () and second () layers associated with different input modes and . This yields a symmetric, pairwise repulsive force between all distinct pairs of first and second layer connectivity modes, driving the network to a decoupled regime in which the different connectivity modes become orthogonal.
The fixed point structure of gradient descent learning in linear networks was worked out in . In the language of the connectivity modes, a necessary condition for a fixed point is , while and are zero whenever . To satisfy these relations for undercomplete hidden layers (), and can be nonzero for at most values of . Since there are nonzero values of , there are families of fixed points. However, all of these fixed points are unstable, except for the one in which only the first strongest modes, i.e. and for are active. Thus remarkably, the dynamics in (5) has only saddle points and no non-global local minima . In terms of the original synaptic variables and , all globally stable fixed points satisfy
Hence when learning has converged, the network will represent the closest rank approximation to the true input-output correlation matrix. In this work, we are interested in understanding the dynamical weight trajectories and learning time scales that lead to this final fixed point.
It is difficult though to exactly solve (5) starting from arbitrary initial conditions because of the competitive interactions between different input-output modes. Therefore, to gain intuition for the general dynamics, we restrict our attention to a special class of initial conditions of the form and for , where , with all other connectivity modes and set to zero (see  for solutions to a partially overlapping but distinct set of initial conditions, further discussed in Supplementary Appendix A).
Here is a fixed collection of vectors that form an orthonormal basis for synaptic connections from an input or output mode onto the set of hidden units. Thus for this set of initial conditions, and point in the same direction for each alpha and differ only in their scalar magnitudes, and are orthogonal to all other connectivity modes. Such an initialization can be obtained by computing the SVD of and taking where are diagonal, and is an arbitrary orthogonal matrix; however, as we show
in subsequent experiments, the solutions we find are also excellent approximations to trajectories from small random initial conditions. It is straightforward to verify that starting from these initial conditions, and will remain parallel to for all future time. Furthermore, because the different active modes are orthogonal to each other, they do not compete, or even interact with each other (all dot products in the second terms of (5)-(6) are ).
Thus this class of conditions defines an invariant manifold in weight space where the modes evolve independently of each other.
If we let , , and , then the dynamics of the scalar projections obeys,
Thus our ability to decouple the connectivity modes yields a dramatically simplified two dimensional nonlinear system. These equations can by solved by noting that they arise from gradient descent on the error,
This implies that the product monotonically approaches the fixed point from its initial value. Moreover, satisfies a symmetry under the one parameter family of scaling transformations , . This symmetry implies, through Noether’s theorem, the existence of a conserved quantity, namely , which is a constant of motion. Thus the dynamics simply follows hyperbolas of constant in the plane until it approaches the hyperbolic manifold of fixed points, . The origin is also a fixed point, but is unstable. Fig. 2 shows a typical phase portrait for these dynamics.
As a measure of the timescale of learning, we are interested in how long it takes for to approach from any given initial condition. The case of unequal and is treated in the Supplementary Appendix A due to space constraints. Here we pursue an explicit solution with the assumption that , a reasonable limit when starting with small random initial conditions. We can then track the dynamics of , which from (8) obeys
This equation is separable and can be integrated to yield
Here is the time it takes for to travel from to . If we assume a small initial condition , and ask when is within of the fixed point , i.e. , then the learning timescale in the limit is (with a weak logarithmic dependence on the cutoff). This yields a key result: the timescale of learning of each input-output mode of the correlation matrix is inversely proportional to the correlation strength of the mode. Thus the stronger an input-output relationship, the quicker it is learned.
We can also find the entire time course of learning by inverting (11) to obtain
This time course describes the temporal evolution of the product of the magnitudes of all weights from an input mode (with correlation strength ) into the hidden layers, and from the hidden layers to the same output mode. If this product starts at a small value , then it displays a sigmoidal rise which asymptotes to as . This sigmoid can exhibit sharp transitions from a state of no learning to full learning. This analytical sigmoid learning curve is shown in Fig. 3 to yield a reasonable approximation to learning curves in linear networks that start from random initial conditions that are not on the orthogonal, decoupled invariant manifold–and that therefore exhibit competitive dynamics between connectivity modes–as well as in nonlinear networks solving the same task. We note that though the nonlinear networks behaved similarly to the linear case for this particular task, this is likely to be problem dependent.
The network analyzed in Section 1 is the minimal example of a multilayer net, with just a single layer of hidden units. How does gradient descent act in much deeper networks? We make an initial attempt in this direction based on initial conditions that yield particularly simple gradient descent dynamics.
In a linear neural network with layers and hence weight matrices indexed by , the gradient descent dynamics can be written as
where with the special case that , the identity, if .
To describe the initial conditions, we suppose that there are orthogonal matrices that diagonalize the starting weight matrices, that is, for all , with the special case that and . This requirement essentially demands that the output singular vectors of layer be the input singular vectors of the next layer , so that a change in mode strength at any layer propagates to the output without mixing into other modes. We note that this formulation does not restrict hidden layer size; each hidden layer can be of a different size, and may be undercomplete or overcomplete. Making the change of variables along with the assumption that leads to a set of decoupled connectivity modes that evolve independently of each other. In analogy to the simplification occurring in the three layer network from (2) to (8), each connectivity mode in the layered network can be described by scalars , whose dynamics obeys gradient descent on the energy function (the analog of (9)),
This dynamics also has a set of conserved quantities arising from the energetic symmetry w.r.t. the transformation , , and hence can be solved exactly. We focus on the invariant submanifold in which for all , and track the dynamics of , the overall strength of this mode, which obeys (i.e. the generalization of (10)),
This can be integrated for any positive integer , though the expression is complicated. Once the overall strength increases sufficiently, learning explodes rapidly.
Eqn. (15) lets us study the dynamics of learning as depth limits to infinity. In particular, as we have the dynamics
which can be integrated to obtain
Remarkably this implies that, for a fixed learning rate, the learning time as measured by the number of iterations required tends to zero as
goes to infinity. This result depends on the continuous time formulation, however. Any implementation will operate in discrete time and must choose a finite learning rate that yields stable dynamics. An estimate of the optimal learning rate can be derived from the maximum eigenvalue of the Hessian over the region of interest.
For linear networks with , this optimal learning rate decays with depth as for large (see Supplementary Appendix B). Incorporating this dependence of the learning rate on depth, the learning time as depth approaches infinity still surprisingly remains finite: with the optimal learning rate, the difference between learning times for an network and an network is for small (see Supplementary Appendix B.1). We emphasize that our analysis of learning speed is based on the number of iterations required, not the amount of computation–computing one iteration of a deep network will require more time than doing so in a shallow network.
To verify these predictions, we trained deep linear networks on the MNIST classification task with depths ranging from to . We used hidden layers of size 1000, and calculated the iteration at which training error fell below a fixed threshold corresponding to nearly complete learning. We optimized the learning rate separately for each depth by training each network with twenty rates logarithmically spaced between and and picking the fastest. See Supplementary Appendix C for full experimental details. Networks were initialized with decoupled initial conditions and starting initial mode strength . Fig. 4 shows the resulting learning times, which saturate, and the empirically optimal learning rates, which scale like as predicted.
Thus learning times in deep linear networks that start with decoupled initial conditions are only a finite amount slower than a shallow network regardless of depth. Moreover, the delay incurred by depth scales inversely with the size of the initial strength of the association. Hence finding a way to initialize the mode strengths to large values is crucial for fast deep learning.
The previous subsection revealed the existence of a decoupled submanifold in weight space in which connectivity modes evolve independently of each other during learning, and learning times can be independent of depth, even for arbitrarily deep networks, as long as the initial composite, end to end mode strength, denoted by above, of every connectivity mode is . What numerical weight initilization procedures can get us close to this weight manifold, so that we can exploit its rapid learning properties?
A breakthrough in training deep neural networks started with the discovery that greedy layer-wise unsupervised pretraining could substantially speed up and improve the generalization performance of standard gradient descent [17, 18]. Unsupervised pretraining has been shown to speed the optimization of deep networks, and also to act as a special regularizer towards solutions with better generalization performance [18, 12, 13, 14]. At the same time, recent results have obtained excellent performance starting from carefully-scaled random initializations, though interestingly, pretrained initializations still exhibit faster convergence [21, 13, 22, 3, 4, 1, 23] (see Supplementary Appendix D for discussion). Here we examine analytically how unsupervised pretraining achieves an optimization advantage, at least in deep linear networks, by finding the special class of orthogonalized, decoupled initial conditions in the previous section that allow for rapid supervised deep learning, for input-output tasks with a certain precise structure. Subsequently, we analyze the properties of random initilizations.
We consider the following pretraining and finetuning procedure: First, using autoencoders as the unsupervised pretraining module[18, 12], the network is trained to produce its input as its output (). Subsequently, the network is finetuned on the ultimate input-output task of interest (e.g., a classification task). In the following we consider the case for simplicity.
During the pretraining phase, the input-output correlation matrix is simply the input correlation matrix . Hence the SVD of is PCA on the input correlation matrix, since , where
are eigenvectors ofand
is a diagonal matrix of variances. Our analysis of the learning dynamics in Section1.1 does not directly apply, because here the input correlation matrix is not white. In Supplementary Appendix E we generalize our results to handle this case. During pretraining, the weights approach but since they do not reach the fixed point in finite time, they will end at where
is a diagonal matrix that is approaching the identity matrix during learning. Hence in general,and where
is any invertible matrix. When starting from small random weights, though, each weight matrix will end up with a roughly balanced contribution to the overall map. This corresponds to havingwhere is orthogonal. Hence at the end of the pretraining phase, the input-to-hidden mapping will be where is an arbitrary orthogonal matrix.
Now consider the fine-tuning phase. Here the weights are trained on the ultimate task of interest with input-output correlations . The matrix begins from the pretrained initial condition . For the fine-tuning task, a decoupled initial condition for is one that can be written as (see Section 2). Clearly, this will be possible only if
Then the initial condition obtained from pretraining will also be a decoupled initial condition for the finetuning phase, with initial mode strengths near one. Hence we can state the underlying condition required for successful greedy pretraining in deep linear networks: the right singular vectors of the ultimate input-ouput task of interest must be similar to the principal components of the input data
. This is a quantitatively precise instantiation of the intuitive idea that unsupervised pretraining can help in a subsequent supervised learning task if (and only if) the statistical structure of the input is consistent with the structure of input-output map to be learned. Moreover, this quantitative instantiation of this intuitive idea gives a simple empirical criterion that can be evaluated on any new dataset: given the input-output correlationand input correlation , compute the right singular vectors of and check that is approximately diagonal. If the condition in Eqn. (18) holds, autoencoder pretraining will have properly set up decoupled initial conditions for , with an appreciable initial association strength near . This argument also goes through straightforwardly for layer-wise pretraining of deeper networks. Fig. 5 shows that this consistency condition empirically holds on MNIST, and that a pretrained deep linear neural network learns faster than one started from small random initial conditions, even accounting for pretraining time (see Supplementary Appendix F for experimental details). We note that this analysis is unlikely to carry over completely to nonlinear networks. Some nonlinear networks are approximately linear (e.g., tanh nonlinearities) after initialization with small random initializations, and hence our solutions may describe these dynamics well early in learning. However as the network enters its nonlinear regime, our solutions should not be expected to remain accurate.
As an alternative to greedy layerwise pre-training,  proposed choosing appropriately scaled initial conditions on weights that would preserve the norm of typical error vectors as they were backpropagated through the deep network. In our context, the appropriate norm-preserving scaling for the initial condition of an by connectivity matrix between any two layers corresponds to choosing each weight i.i.d. from a zero mean Gaussian with standard deviation . With this choice, , where
denotes an average over distribution of the random matrix. Moreover, the distribution of concentrates about its mean for large . Thus with this scaling, in linear networks, both the forward propagation of activity, and backpropagation of gradients is typically norm-preserving. However, with this initialization, the learning time with depth on linear networks trained on MNIST grows with depth (Fig. 6A, left, blue). This growth is in distinct contradiction with the theoretical prediction, made above, of depth independent learning times starting from the decoupled submanifold of weights with composite mode strength . This suggests that the scaled random initialization scheme, despite its norm-preserving nature, does not find this submanifold in weight space. In contrast, learning times with greedy layerwise pre-training do not grow with depth (Fig. 6A, left, green curve hiding under red curve), consistent with the predictions of our theory (as a technical point: note that learning times under greedy pre-training initialization in Fig. 6A are faster than those obtained in Fig. 4 by explicitly choosing a point on the decoupled submanifold, because there the initial mode strength was chosen to be small () whereas greedy pre-training finds a composite mode strength closer to ).
Is there a simple random initialization scheme that does enjoy the rapid learning properties of greedy-layerwise pre-training? We empirically show (Fig. 6A, left, red curve) that if we choose the initial weights in each layer to be a random orthogonal matrix (satisifying ), instead of a scaled random Gaussian matrix, then this orthogonal random initialization condition yields depth independent learning times just like greedy layerwise pre-training (indeed the red and green curves are indistinguishable). Theoretically, why do random orthogonal initializations yield depth independent learning times, but not scaled random Gaussian initializations, despite their norm preserving nature?
The answer lies in the eigenvalue and singular value spectra of products of Gaussian versus orthgonal random matrices. While a single random orthogonal matrix has eigenvalue spectra lying exactly on the unit circle in the complex plane (Fig. 6A right), the eigenvalue spectra of random Gaussian matrices, whose elements have variance
, form a uniform distribution on a solid disk of radius 1 the complex plane (Fig.6C left). Moreover the singular values of an orthogonal matrix are all exactly , while the squared singular values of a scaled Gaussian random matrix have the well known Marcenko-Pasteur distribution, with a nontrivial spread even as , (Fig. 6B left shows the distribution of singular values themselves). Now consider a product of these matrices across all layers, representing the total end to end propagation of activity across a deep linear network:
Due to the random choice of weights in each layer, is itself a random matrix. On average, it preserves the norm of a typical vector no matter whether the matrices in each layer are Gaussian or orthogonal. However, the singular value spectra of differ markedly in the two cases. Under random orthogonal initilization in each layer, is itself an orthogonal matrix and therefore has all singular values equal to . However, under random Gaussian initialization in each layer, there is as of yet no complete theoretical characterization of the singular value distribution of . We have computed it numerically as a function of different depths in Fig. 6B, and we find that it develops a highly kurtotic nature as the depth increases. Most of the singular values become vanishingly small, while a long tail of very large singular values remain. Thus preserves the norm of a typical, randomly chosen vector , but in a highly anisotropic manner, by strongly amplifying the projection of onto a very small subset of singular vectors and attenuating in all other directions. Intuitively , as well as the linear operator that would be closely related to backpropagation of gradients to early layers, act as amplifying projection operators at large depth . In contrast, all of the eigenvalues of in the scaled Gaussian case concentrate closer to the origin as depth increases. This discrepancy between the behavior of the eigenvalues and singular values of , a phenomenon that could occur only if the eigenvectors of are highly non-orthogonal, reflects the highly non-normal nature of products of random Gaussian matrices (a non-normal matrix is by definition a matrix whose eigenvectors are non-orthogonal).
While the combination of amplification and projection in can preserve norm, it is clear that it is not a good way to backpropagate errors; the projection of error vectors onto a high dimensional subspace corresponding to small singular values would be strongly attenuated, yielding vanishingly small gradient signals corresponding to these directions in the early layers. This effect, which is not present for random orthogonal initializations or greedy pretraining, would naturally explain the long learning times starting from scaled random Gaussian initial conditions relative to the other initilizations in Fig. 6A left. For both linear and nonlinear networks, a more likely appropriate condition on weights for generating fast learning times would be that of dynamical isometry. By this we mean that the product of Jacobians associated with error signal backpropagation should act as a near isometry, up to some overall global scaling, on a subspace of as high a dimension as possible. This is equivalent to having as many singular values of the product of Jacobians as possible within a small range around an constant, and is closely related to the notion of restricted isometry in compressed sensing and random projections. Preserving norms is a necessary but not sufficient condition for achieving dynamical isometry at large depths, as demonstrated in Fig. 6B, and we have shown that for linear networks, orthogonal initializations achieve exact dynamical isometry with all singular values at , while greedy pre-training achieves it approximately.
We note that the discrepancy in learning times between the scaled Gaussian initialization and the orthogonal or pre-training initializations is modest for the depths of around used in large scale applications, but is magnified at larger depths (Fig. 6A left). This may explain the modest improvement in learning times with greedy pre-training versus random scaled Gaussian initializations observed in applications (see discussion in Supplementary Appendix D). We predict that this modest improvement will be magnified at higher depths, even in nonlinear networks. Finally, we note that in recurrent networks, which can be thought of as infinitely deep feed-forward networks with tied weights, a very promising approach is a modification to the training objective that partially promotes dynamical isometry for the set of gradients currently being back-propagated .
We have shown above that deep random orthogonal linear networks achieve perfect dynamical isometry. Here we show that nonlinear versions of these networks can also achieve good dynamical isometry properties. Consider the nonlinear feedforward dynamics
where denotes the activity of neuron in layer , is a random orthogonal connectivity matrix from layer to , is a scalar gain factor, and is any nonlinearity that saturates as . We show in Supplementary appendix G that there exists a critical value of the gain such that if , activity will decay away to zero as it propagates through the layers, while if
, the strong linear positive gain will combat the damping due to the saturating nonlinearity, and activity will propagate indefinitely without decay, no matter how deep the network is. When the nonlinearity is odd (), so that the mean activity in each layer is approximately , these dynamical properties can be quantitatively captured by the neural population variance in layer ,
. Thus these nonlinear feedforward networks exhibit a phase-transition at the critical gain; above the critical gain, infinitely deep networks exhibit chaotic percolating activity propagation, so we call the critical gainthe edge of chaos, in analogy with terminology for recurrent networks.
Now we are interested in how errors at the final layer backpropagate back to earlier layers, and whether or not these gradients explode or decay with depth. To quantify this, for simplicity we consider the end to end Jacobian
which captures how input perturbations propagate to the output. If the singular value distribution of this Jacobian is well-behaved, with few extremely large or small singular values, then the backpropagation of gradients will also be well-behaved, and exhibit little explosion or decay. The Jacobian is evaluated at a particular point in the space of output layer activations, and this point is in turn obtained by iterating (20) starting from an initial input layer activation vector . Thus the singular value distribution of the Jacobian will depend not only on the gain , but also on the initial condition . By rotational symmetry, we expect this distribution to depend on , only through its population variance . Thus for large , the singular value distribution of the end-to-end Jacobian in (22) (the analog of in (19) in the linear case), depends on only two parameters: gain and input population variance .
We have numerically computed this singular value distribution as a function of these two parameters in Fig. 7, for a single random orthogonal nonlinear network with and . These results are typical; replotting the results for different random networks and different initial conditions (with the same input variance) yield very similar results. We see that below the edge of chaos, when , the linear dampening over many layers yields extremely small singular values. Above the edge of chaos, when , the combination of positive linear amplification, and saturating nonlinear dampening yields an anisotropic distribution of singular values. At the edge of chaos, , an fraction of the singular value distribution is concentrated in a range that remains despite layers of propagation, reflecting appoximate dynamical isometry. Moreover, this nice property at remains valid even as the input variance is increased far beyond , where the function enters its nonlinear regime. Thus the right column of Fig. 7 at near indicates that the useful dynamical isometry properties of random orthogonal linear networks described above survives in nonlinear networks, even when activity patterns enter deeply into the nonlinear regime in the input layers. Interestingly, the singular value spectrum is more robust to perturbations that increase from relative to those that decrease . Indeed, the anisotropy in the singular value distribution at is relatively mild compared to that of random linear networks with scaled Gaussian initial conditions (compare the bottom row of Fig. 7 with the right column of panel B in Fig. 6). Thus overall, these numerical results suggest that being just beyond the edge of orthogonal chaos may be a good regime for learning in deep nonlinear networks.
In summary, despite the simplicity of their input-output map, the dynamics of learning in deep linear networks reveals a surprising amount of rich mathematical structure, including nonlinear hyperbolic dynamics, plateaus and sudden performance transitions, a proliferation of saddle points, symmetries and conserved quantities, invariant submanifolds of independently evolving connectivity modes subserving rapid learning, and most importantly, a sensitive but computable dependence of learning time scales on input statistics, initial weight conditions, and network depth. With the right initial conditions, deep linear networks can be only a finite amount slower than shallow networks, and unsupervised pretraining can find these initial conditions for tasks with the right structure. Moreover, we introduce a mathematical condition for faithful backpropagation of error signals, namely dynamical isometry, and show, surprisingly that random scaled Gaussian initializations cannot achieve this condition despite their norm-preserving nature, while greedy pre-training and random orthogonal initialization can, thereby achieving depth independent learning times. Finally, we show that the property of dynamical isometry survives to good approximation even in extremely deep nonlinear random orthogonal networks operating just beyond the edge of chaos. At the cost of expressivity, deep linear networks gain theoretical tractability and may prove fertile for addressing other phenomena in deep learning, such as the impact of carefully-scaled initializations [13, 23], momentum , dropout regularization , and sparsity constraints . While a full analytical treatment of learning in deep nonlinear networks currently remains open, one cannot reasonably hope to move towards such a theory without first completely understanding the linear case. In this sense, our work fulfills an essential pre-requisite for progress towards a general, quantitative theory of deep learning.
Building high-level features using large scale unsupervised learning.In
29th International Conference on Machine Learning, 2012.
Acoustic Modeling Using Deep Belief Networks.IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14–22, January 2012.
12th International Conference on Artificial Intelligence and Statistics, volume 5, 2009.
Neural networks and principal component analysis: Learning from examples without local minima.Neural Networks, 2(1):53–58, January 1989.
On the difficulty of training recurrent neural networks.Technical report, Universite de Montreal, 2012.
In Section 1.3 of the main text we treat the dynamics of learning in three layer networks where mode strengths in each layer are equal, i.e, , a reasonable limit when starting with small random initial conditions. More generally, though, we are interested in how long it takes for to approach from any given initial condition. To access this, given the hyperbolic nature of the dynamics, it is useful to make the hyperbolic change of coordinates,
Thus parametrizes the dynamically invariant manifolds . For any and , this coordinate system covers the region , which is the basin of attraction of the upper right component of the hyperbola . A symmetric situation exists for , which is attracted to the lower left component of . We use as a coordinate to follow the dynamics of the product , and using the relations and , we obtain
This differential equation is separable in and and can be integrated to yield
Here is the amount of time it takes to travel from to along the hyperbola . The fixed point lies at , but the dynamics cannot reach the fixed point in finite time. Therefore we introduce a cutoff to mark the endpoint of learning, so that obeys (i.e. is close to by a factor ). We can then average over the initial conditions and to obtain the expected learning time of an input-output relation that has a correlation strength . Rather than doing this, it is easier to obtain a rough estimate of the timescale of learning under the assumption that the initial weights are small, so that and are close to . In this case (with a weak logarithmic dependence on the cutoff (i.e. ). This modestly generalizes the result given in the main text: the timescale of learning of each input-output mode of the correlation matrix is inversely proportional to the correlation strength of the mode even when and differ slightly, i.e., small. This is not an unreasonable limit for random initial conditions because where and are random vectors of synaptic weights into and out of the hidden units. Thus we expect the lengths of the two random vectors to be approximately equal and therefore will be small relative to the length of each vector.
These solutions are distinctly different from solutions for learning dynamics in three layer networks found in . In our notation, in , it was shown that if the initial vectors and satisfy the matrix identity then the dynamics of learning becomes equivalent to a matrix Riccatti equation. However, the hyperbolic dynamics derived here arises from a set of initial conditions that do not satisfy the restrictions of  and therefore do not arise through a solution to a matrix Ricatti equation. Moreover, in going beyond a statement of the matrix Riccatti solution, our analysis provides intuition about the time-scales over which the learning dynamics unfolds, and crucially, our methods extend beyond the three layer case to the arbitrary layer case, not studied in .
In Section 2 we state results on the optimal learning rate as a function of depth in a deep linear network, which we derive here. Starting from the decoupled initial conditions given in the main text, the dynamics arise from gradient descent on
Hence for each we have
The elements of the Hessian are thus
for , and
We now assume that we start on the symmetric manifold, such that for all . Thus we have
The Hessian is
One eigenvector is with eigenvalue , or
Now consider the second order update (Newton-Raphson) (here we use to denote a vector of ones)
Note that the basin of attraction does not include small initial conditions, because for small the Hessian is not positive definite.
To determine the optimal learning rate for first order gradient descent, we compute the maximum of over the range of mode strengths that can be visited during learning, i.e., . This occurs at the optimum, . Hence substituting this into (37) we have
The optimal learning rate is proportional to , and hence scales as
for large .
How does the optimal learning rate impact learning speeds? We compare the three layer learning time to the infinite depth limit learning time, with learning rate set inversely proportional to Eqn. (41) with proportionality constant .
This yields a three layer learning time of
and an infinite layer learning time of
Hence the difference is
where the final approximation is for , and small. Thus very deep networks incur only a finite delay relative to shallow networks.
We trained deep linear networks on the MNIST dataset with fifteen different depths . Given a 784-dimensional input example, the network tried to predict a 10-dimensional output vector containing a 1 in the index for the correct class, and zeros elsewhere. The network was trained using batch gradient descent via Eqn. (13) on the 50,000 sample MNIST training dataset. We note that Eqn. (13) makes use of the linearity of the network to speed training and reduce memory requirements. Instead of forward propagating all 50,000 training examples, we precompute and forward propagate only it. This enables experiments on very deep networks that otherwise would be computationally infeasible. Experiments were accelerated on GPU hardware using the GPUmat package. We used overcomplete hidden layers of size 1000. Here the overcompleteness is simply to demonstrate the applicability of the theory to this case; overcompleteness does not improve the representational power of the network. Networks were initialized with decoupled initial conditions and starting initial mode strength , as described in the text. The random orthogonal matrices
were selected by generating random Gaussian matrices and computing a QR decomposition to obtain an orthogonal matrix. Learning times were calculated as the iteration at which training error fell below a fixed threshold ofcorresponding to nearly complete learning. Note that this level of performance is grossly inferior to what can be obtained using nonlinear networks, which reflects the limited capacity of a linear network. We optimized the learning rate separately for each depth by training each network with twenty rates logarithmically spaced between and and picking the one that yielded the minimum learning time according to our threshold criterion. The range and was selected via preliminary experiments to ensure that the optimal learning rate always lay in the interior of the range for all depths.
Recently high performance has been demonstrated in deep networks trained from random initial conditions [21, 13, 22, 3, 4, 1, 23], suggesting that deep networks may not be as hard to train as previously thought. These results show that pretraining is not necessary to obtain state-of-the-art performance, and to achieve this they make use of a variety of techniques including carefully-scaled random initializations, more sophisticated second order or momentum-based optimization methods, and specialized convolutional architectures. It is therefore important to evaluate whether unsupervised pretraining is still useful, even if it is no longer necessary, for training deep networks. In particular, does pretraining still confer an optimization advantage and generalization advantage when used in conjunction with these new techniques? Here we review results from a variety of papers, which collectively show that unsupervised pretraining still confers an optimization advantage and a generalization advantage.
The optimization advatage of pretraining refers to faster convergence to the local optimum (i.e., faster learning speeds) when starting from pretrained initializations as compared to random initializations. Faster learning speeds starting from pretrained initial conditions have been consistently found with Hessian free optimization [21, 22]. This finding holds for two carefully-chosen random initialization schemes, the sparse connectivity scheme of , and the dense scaled scheme of  (as used by 
). Hence pretraining still confers a convergence speed advantage with second order methods. Pretrained initial conditions also result in faster convergence than carefully-chosen random initializations when optimizing with stochastic gradient descent[22, 13]. In light of this, it appears that pretrained initial conditions confer an optimization advantage beyond what can be obtained currently with carefully-scaled random initializations, regardless of optimization technique. If run to convergence, second order methods and well-chosen scalings can erase the discrepancy between the final objective value obtained on the training set for pretrained relative to random initializations [21, 22]. The optimization advantage is thus purely one of convergence speed, not of finding a better local minimum. This coincides with the situation in linear networks, where all methods will eventually attain the same global minimum, but the rate of convergence can vary. Our analysis shows why this optimization advantage due to pretraining persists over well-chosen random initializations.
Finally, we note that Sutskever et al. show that careful random initialization paired with carefully-tuned momentum can achieve excellent performance , but these experiments did not try pretrained initial conditions. Krizhevsky et al. used convolutional architectures and did not attempt pretraining . Thus the possible utility of pretraining in combination with momentum, and in combination with convolutional architectures, dropout, and large supervised datasets, remains unclear.
Pretraining can also act as a special regularizer, improving generalization error in certain instances. This generalization advantage appears to persist with new second order methods [21, 22], and in comparison to gradient descent with careful random initializations [13, 22, 25, 4]. An analysis of this effect in deep linear networks is out of the scope of this work, though promising tools have been developed for the three layer linear case .
In the main text we focused on orthogonal input correlations () for simplicity, and to draw out the main intuitions. However our analysis can be extended to input correlations with a very particular structure. Recall that we decompose the input output correlations using the SVD as . We can generalize our solutions to allow input correlations of the form . Intuitively, this condition requires the axes of variation in the input to coincide with the axes of variation in the input-output task, though the variances may differ. If we take then we recover the whitened case , and if we take , then we can treat the autoencoding case. The final fixed points of the weights are given by the best rank approximation to . Making the same change of variables as in Eqn. (4) we now obtain
which, again, is decoupled if and begin diagonal. Based on this it is straightforward to generalize our results for the learning dynamics.
We trained networks of depth 5 on the MNIST classification task with 200 hidden units per layer, starting either from small random initial conditions with each weight drawn independently from a Gaussian distribution with standard deviation 0.01, or from greedy layerwise pretrained initial conditions. For the pretrained network, each layer was trained to reconstruct the output of the next lower layer. In the finetuning stage, the network tried to predict a 10-dimensional output vector containing a 1 in the index for the correct class, and zeros elsewhere. The network was trained using batch gradient descent via Eqn. (13) on the 50,000 sample MNIST training dataset. Since the network is linear, pretraining initializes the network with principal components of the input data, and, to the extent that the consistency condition of Eqn. (18) holds, decouples these modes throughout the deep network, as described in the main text.
due to the dynamics in (20) and the orthogonality of . Now we know that by definition, the layer population has normalized variance . If we further assume that the distribution of activity across neurons in layer is well approximated by a Gaussian distribution, we can replace the sum over neurons with an integral over a zero mean unit variance Gaussian variable :
where is the standard Gaussian measure.
This map from input to output variance is numerically computed for and in Fig. 8, left (other values of yield a simple multiplicative scaling of this map). This recursion relation has a stable fixed point obtained by solving the nonlinear fixed point equation
Graphically, solving this equation corresponds to scaling the curve in Fig. 8 lef