Gradient-based optimization is a foundational part of Deep Learning (DL) in that it makes it possible to train neural networks in an efficient manner. Training then requires calculating the gradients of the loss function with respect to the weights at each layer of the network. How these gradients are calculated, therefore, is very important. Today, this is done using backpropagation
, which is a version of the chain rule from calculus. Backpropagation is highly efficient, but it is typically only used to calculate first-order derivatives; higher-order derivatives are generally considered too expensive to calculate. Essentially, backpropagation exploits the directed structure of the network to make the calculation of these gradients easier. This begs the question, of course, of whether there is more structure that could be exploited to improve algorithm performance or reduce computational cost.
Such structure does exist for feedforward networks. As we will show in Section 2, these derivatives have, on a per-sample basis, an outer product structure to them. We will demonstrate this for first-order derivatives and show how it extends to higher-order derivatives. Because of this outer product structure, the widely held assumption that the Hessian is too large to use may not actually hold.
We are not the first to note or be interested in these derivative properties. In describing the backpropagation algorithm, Rumelhart et al.  use index notation (see Section 2) in writing out their equations. The index notation clearly shows the outer product structure for first-order derivatives, but the authors do not comment on this. Mizutani and Dreyfus  and Naumov  note this structure, while Xie et al.  and Zhang et al.  explicitly take advantage of it to speed up their gradient computations in what they refer to as Sufficient Factor Broadcasting.
This structure extends to feedforward network Hessians. Surprisingly, though, this is not noted in most of the papers that look at calculating the Hessian or approximations thereto (e.g., [7, 8]). Naumov 
does note this, but the derivations presented there are incorrect by being incomplete. The Hessian calculations there only consider ‘block-diagonal’ elements of the Hessian – i.e., Hessian components for weights in the same layer – and not cross-derivatives between weights from different layers. Those terms add some complexity to the Hessian structure, as will be apparent in our calculations. Unfortunately, this omission means that Naumov’s eigenvalue calculations also do not hold, though they could be considered as a kind of block-diagonal approximation. Naumov also focuses on a sum-of-squared error loss function. We will provide a more general and complete analysis here.
2 The Outer Product Structure of Deep Network Derivatives
Second-order derivatives are not widely used in DL, and where they are
used, they are typically estimated. These derivatives can be calculated analytically, but this is not often done because of scalability constraints. If we write out the first and second derivatives, though, we can see that they have an outer product structure to them. When a matrix has low rank (or less than full rank), it means that the information contained in that matrix (or the operations performed by that matrix) can be fully represented without needing to know every entry of that matrix. An outer product structure is a special case of this, where anx matrix
can be fully represented by two vectors. We can then calculate, store, and use second-order derivatives exactly in an efficient manner by only dealing with the components needed to represent the full Hessians rather than dealing with those Hessians themselves. Doing this involves some extra calculations, but the storage costs are comparable to those of gradient calculations.
In this section, we will illustrate this structure for a feedforward network, of arbitrary depth and layer widths, consisting of ReLUs in the hidden layers. A feedforward network with arbitrary activation functions has somewhat more complicated derivative formulae, but those derivatives still exhibit an outer product structure. That structure also does not depend on the form of the objective function or whether a softmax is used, and it is present for recurrent layers as well. The complete derivations for these cases are given in AppendixA.
In our calculations, we make extensive use of index notation with the summation convention . In index notation, a scalar has no indices (), a vector has one index ( as or ), a matrix has two ( as , , or ), and so on. The summation convention holds that repeated indices in a given expression are summed over unless otherwise indicated. For example, . The pair of indices being summed over will often consist of a superscript and a subscript; this is a bookkeeping technique used in differential geometry, but in this context, the subscripting or superscripting of indices will not indicate covariance or contravariance. We have also adapted index notation slightly to suit the structure of deep networks better: indices placed in brackets (e.g. the in
) are not summed over, even if repeated, unless explicitly indicated by a summation sign. A tensor convention that wewill use, however, is the Kronecker delta: , , or
. The Kronecker delta is the identity matrix represented in index notation: it is 1 forand 0 otherwise. The summation convention can sometimes be employed to simplify expressions containing Kronecker deltas. For example, and .
2.1 Feedforward Network Derivative Calculations
Our example here is a generic feedforward network with ReLU activation functions inhidden layers, a softmax at the output layer, and categorical cross-entropy as the objective function. The softmax and categorical cross-entropy are not important for our calculations, but they are commonly used and we include them here for concreteness’ sake; we consider ReLUs because they are widely used and have some convenient properties. Table 1 provides a nomenclature for our deep network definition, and Equations 1-7 define the network.
|Number of hidden layers|
|Vector of inputs for a single sample|
|Vector output of layer|
|Matrix of weights for layer|
|Matrix of output layer weights|
|Vector of intermediate variables for the output layer|
|Vector of outputs for a single sample|
|Vector of labels for a single sample|
|Scalar objective function value for a single sample|
|Scalar objective function|
The relevant first derivatives for this deep network are
where there is no summation over in Equation 9. We now define several intermediate quantities to simplify the derivation process:
In calculating these expressions, we have deliberately left
unevaluated. This keeps the expression relatively simple, and programs like TensorFlow can easily calculate this for us. Leaving it in this form also preserves the generality of the expression – there is no outer product structure contained in , and the outer product structure of the network as a whole is therefore shown to be independent of the objective function and whether or not a softmax is used. In fact, as long as Equation 4 holds, any sufficiently smooth function of may be used in place of a softmax without disrupting the structure. The one quantity that needs to be stored here is for ; it will be needed in the second derivative calculations. Note, however, that this is roughly the same size as the gradient itself.
We can now see the outer product structure: is the outer product (or tensor product) of the vectors and , and is the outer product of (which ends up being a rank-1 tensor) and . The index notation makes the outer product structure clear. Vectorizing the weights, as is often done  makes this structure much more difficult to see.
We then start our second derivative calculations by considering some intermediate quantities:
The second derivative of the ReLU vanishes, which simplifies the second derivative calculations significantly. Technically, the second derivative is undefined at the origin, but the singularity is removable, and thus we can define the second derivative to be 0 at the origin. We can then calculate the second-order objective function derivatives:
Calculating all of the second derivatives requires the repeated use of . Evaluating that Hessian is straightforward given knowledge of the activation functions and objective used in the network, and storing it is also likely not an issue as long as the number of categories is small relative to the number of weights. For example, consider a small network with 10 categories and 1000 weights. In such a case, would only contain 100 entries – the gradient would be 10 times larger. We also have to store values in order to calculate the derivatives. In , we also end up needing for . In a network with hidden layers, we would then have of the matrices to store. For , this would be 45, for , this would be 190, and so on.
This aspect of the calculations does not seem to scale well, but in practice, it is relatively simple to work around. It is still necessary to store , , but , , only actually shows up in one place, and thus it is possible to calculate each matrix, use it, and discard it without needing to store it for future calculations. Moreover, each term has approximately the same number of entries as each matrix, which means that even if it were necessary to store each , this would only be times as much information as a full gradient. In other words, the amount of information involved would be , where is the total number of weights, rather than , as would be expected in calculating a Hessian.
The key thing to note about these second derivatives is that they retain an outer product structure – they are now tensor products (or the sums of tensor products) of matrices and vectors. For example,
It is important to note that this outer prodcut structure only exists for each sample. An average of a small number of samples (relative to the number of weights) will produce a low-rank structure; the rank will increase, generally, with the number of samples being considered. As such, manipulating the entire Hessian may not be as computationally feasible; this will depend on how large the mini-batch size is relative to the number of weights. When it comes to using the Hessian, the computational savings provided by this approach will be most salient when calculating scalar or vector quantities on a sample-by-sample basis and then taking a weighted sum of the results.
In principle, we could calculate third derivatives, but the formulae would likely become unwieldy, and they may require memory usage significantly greater than that involved in storing gradient information. If a use arose for third derivatives, calculating them would be conceivable, though. Thus far, we have not included bias terms. Including bias terms as trainable weights would increase the overall size of the gradient (by adding additional variables), but it would not change the overall structure. Using the calculations provided in Appendix A, it would not be difficult to produce the appropriate derivations.
This outer product structure also provides a way to calculate matrix-vector products with the Hessian. Pearlmutter , for example, shows how these products can be calculated efficiently but does not distinguish between weights from different layers and thus does not observe or exploit the outer product structure.
2.2 Convolutional Neural Networks
Recurrent neural network derivatives exhibit an outer product structure similar to that of feedforward networks, but convolutional network derivatives do not; see Appendix A for more details. Convolutional layers do not destroy the outer product structure of subsequent feedforward or recurrent layers’ derivatives, but they themselves do not have an outer product structure. In fact, a convolutional layer is a different way of compressing weight information, as noted by Arora et al. . A convolutional layer is equivalent to a feedforward layer where the layer weights form a doubly block circulant matrix . The information contained in that (very large) matrix can be completely expressed by a much smaller matrix – the convolution kernel, in fact. This kernel enforces structure on weights and their derivatives, but the outer product structure on feedforward and recurrent layers only applies to the derivatives; convolutional layers tend to have fewer weights as a result. Both types of structure, however, make it possible to represent derivative information in a compact fashion. Since convolutional layers tend to have far fewer weights than feedforward layers, calculating higher-order derivatives with respect to these layers’ weights should also be feasible.
3.1 Training Methods and Second-Order Information
Deep learning (DL) provides a set of problems that can be tackled with gradient-based optimization methods, but it has a number of unique features and challenges. Firstly, DL problems can be extremely large, and storing the Hessian, or even a full matrix approximation thereto, has not been considered feasible for such problems. Secondly, DL problems are often highly nonconvex. In practice, neural networks tend to have many saddle points and relatively few local minima that provide significantly suboptimal performance 
. Thirdly, training deep networks via mini-batch sampling results in a stochastic optimization problem. Even if the necessary expectations can be calculated (in an unbiased way), the variance associated with the batch sample calculations produces noise, and this noise can make it more difficult to perform the optimization. Finally, deep networks consist of the composition of analytic functions whose forms are known. As such, we can calculate derivative information analytically via back-propagation.
These special characteristics of DL have motivated researchers to develop training methods specifically designed to overcome the challenges with training a deep neural network. One such approach is layer-wise pretraining 
, where pretraining a neural network layer-by-layer encourages the weights to initialize close to a optimal minimum. Transfer learning works by a similar mechanism, relying on knowledge gained through previous tasks to encourage nice training on a novel task. Outside of pretraining, a class of optimization algorithms have been specifically designed for training deep networks. The Adam, Adagrad, and Adamax set of algorithms provide examples of using history-dependent learning rate adjustment 17]. One could possibly argue that these methods implicitly leverage second order information via their history dependence, but the stochastic nature of mini-batching prevents this from becoming explicit.
Some researchers have sought to use second-order information explicitly to improve the training process. Most of these methods have used an approximation to the Hessian. For example, the L-BFGS method can estimate the Hessian (or its inverse) in a way that is feasible with respect to memory requirements; however, the noise associated with the sampling techniques can either overwhelm the estimation or require special modifications to the L-BFGS method to prevent it from diverging . There have been two primary ways to deal with this: subsampling [18, 19] and mini-batch reuse [20, 21]. Subsampling involves updating the Hessian approximation every iterations rather than every iteration, as would normally be done. Mini-batch reuse consists of using the same mini-batch on subsequent iterations when calculating the difference in gradients between those two iterations. These approximate second-order methods typically have a computational cost that is higher than, though on the same order of, gradient descent, and that cost can be further reduced by using a smaller mini-batch for the Hessian approximation calculations than for the gradient calculation . There is also the question of bias: it is possible to produce unbiased low-rank Hessian approximations , but if the Hessian is indefinite, then quasi-Newton methods will prefer biased estimates – ones that are positive definite. Other work has foregone these kinds of Hessian approximations in favor of using finite differences .
This outer product structure could be useful for helping with these kinds of calculations. Storing the outer product components would make it possible to represent and manipulate the Hessian exactly. For example, Dauphin et al.  describe a way to use Newton’s method and avoid saddle points when doing so. However, they are forced to use an approximation: “The exact implementation of this algorithm is intractable in a high dimensional problem, because it requires the exact computation of the Hessian. Instead we use an approach similar to Krylov subspace descent[.]” [13, p. 6]. The outer product structure could obviate the need for the approximation. With the exact Hessian, it would also be possible to solve a problem like
to find the direction of maximum negative curvature and thereby escape from the saddle more quickly. For the sample network given in Section 2.1, we can calculate this as follows:
The expression in Equation 40 looks complicated, but evaluating it is actually relatively straightforward – it consists of a series of matrix multiplications where the matrices in question have a number of entries on the same order as the number of weights in a given layer. For example, let us consider the last term in Equation 40. We can show that this Hessian component of this term breaks down into three separate components, each with no more than two indices:
Multiplying this with the appropriate terms is then simpler than would otherwise be expected:
A similar decomposition can be performed for the other terms in Equation 40. Therefore, to perform this optimization, we need only calculate those components once. We can then re-evaluate repeatedly for different values of . We would not see this kind of simplification if the Hessian expression had some irreducible term – the lack of such terms is precisely what we mean by saying that the Hessian has an outer product structure.
3.2 Regularization and Robustness
Higher-order derivatives could also be useful as a regularization tool. Neural networks are known to be vulnerable to adversarial examples 
. The problem is that classifiers can have very large gradients, and as such, small pertubations in the inputs can cause large changes in the classifier. A common defensive technique is to make it difficult for an attacker to evaluate those gradients; this is known as obfuscated gradients. Athalye et al. showed that this approach may not be very robust, though. Some researchers have tried to regularize network training with respect to those gradients directly [26, 27, 28]. The gradient of this regularization then involves higher-order derivative terms, as Ross and Doshi-Velez note , which may make training infeasible. Gu and Rigazio , for example, are forced to use approximations. This may not be necessary, though. Consider a regularization with respect to the gradients of the classifier output with respect to the inputs:
Papernot et al.  claim that this kind of approach is only good when the perturbations are very small and that the calculations are too expensive to be feasible. We do want the network to be locally constant near our sample points, though , and that cost may not be so prohibitive, now. We can also extend this to a regularization on the curvature:
Both of these regularizations scale reasonably well as long as the number of classes is much smaller than the number of weights (which will generally be true). For example, if there are 1000 weights and 10 categories, will have entries – the same number of entries as the gradients used for training. The first-order regularization will require fewer multiplication operations and therefore be less computationally expensive, but there is not a significant difference in the amount of information that needs to be stored in order to calculate the gradients needed for training the weights. Moreover, even within a given sample’s calculation, the multiplications involved are highly parallelizable.
We can use this for training purposes, but we can also use this to provide estimated bounds on the size of perturbations needed for misclassification. With the first-order approximation, we have
For the example network in Section 2.1, we would have
A second-order approximation could function similarly, but there might be other ways to use that second-order information to provide an appropriate bound as well. The point is that, as Athalye et al. note, we need specific, guaranteed robustness claims . Explicitly evaluating these gradients could be a way to do this.
3.3 Compression and Generalizability
We may be able to use this outer product structure to create networks that are more compressible and generalizable. One common way to try and improve generalization is to regularize with respect to the weights (usually via some norm on the weights) . This is essentially a computable proxy for what we really want: what we really want is to avoid overfitting and ensure that small input changes produce small output changes. Moody  provides an approximate formula for relating training error to testing error. This formula includes a measure of the effective number of parameters in the network, and Moody measures this using higher-order derivative information. Reducing the magnitude of those derivatives essentially means reducing the effective number of parameters and bringing the testing and training errors closer together. Directly regularizing this may now be possible.
Network pruning  and compression , on the other hand, are both ways of reducing the actual number of parameters in a network to make the network more efficient (in some sense). Compressibility essentially entails taking a low-rank approximation to the layer-wise weight matrices. If we look at our gradient calculations, we can see that this does some interesting things. Consider the feedforward network described in Section 2.1. If we specify and (i.e., a rank-1, outer product weight structure), then
where indicates the presence of a scalar quantity.
Arora et al.  essentially say that these weight matrices become approximately low-rank near the end of training. What if we were to use a low-rank or outer product weight matrix structure from the beginning of training? The derivatives now become dense and much smaller (in terms of the number of entries), as can be seen in Equations 57-60. This could be a cost-efficient pre-training strategy.
Finally, this derivative structure may also suggest something about the loss function near critical points. If there are more weights than samples (as is often the case), then the gradient and Hessian likely have non-trivial nullspaces at critical points (saddles as well as optima). This would suggest that each critical point exists on a (locally) connected submanifold of critical points with (locally) constant loss function values. Neural network training never actually converges exactly to a critical point , but this still might tell us something about what to expect as we approach these critical points.
In this paper, we showed how neural network derivatives display outer product structure for a wide range of network architectures. This structure does not seem to have been widely appreciated or used – especially with regards to higher-order derivative calculations. However, taking advantage of this structure could be helpful for improving training methods, increasing network robustness and generalizability, and compressing layer weights.
This research was performed at the Pacific Northwest National Laboratory, a multi-program national laboratory operated by Battelle for the U.S. Department of Energy
-  D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, p. 533, 1986.
-  I. Goodfellow, Y. Bengio, and A. Courville, Deep learning, vol. 1. MIT press Cambridge, 2016.
-  E. Mizutani and S. E. Dreyfus, “Second-order stagewise backpropagation for hessian-matrix analyses and investigation of negative curvature,” Neural Networks, vol. 21, no. 2-3, pp. 193–203, 2008.
-  M. Naumov, “Feedforward and recurrent neural networks backward propagation and hessian in matrix form,” arXiv preprint arXiv:1709.06080, 2017.
-  P. Xie, J. K. Kim, Y. Zhou, Q. Ho, A. Kumar, Y. Yu, and E. Xing, “Distributed machine learning via sufficient factor broadcasting,” arXiv preprint arXiv:1511.08486, 2015.
-  H. Zhang, Z. Zheng, S. Xu, W. Dai, Q. Ho, X. Liang, Z. Hu, J. Wei, P. Xie, and E. P. Xing, “Poseidon: An efficient communication architecture for distributed deep learning on gpu clusters,” arXiv preprint arXiv:1706.03292, 2017.
C. Bishop, “Exact calculation of the hessian matrix for the multilayer perceptron,”Neural Computation, vol. 4, no. 4, pp. 494–501, 1992.
-  W. L. Buntine and A. S. Weigend, “Computing second derivatives in feed-forward networks: A review,” IEEE transactions on Neural Networks, vol. 5, no. 3, pp. 480–488, 1994.
-  V. G. Ivancevic and T. T. Ivancevic, Applied differential geometry: a modern introduction. World Scientific, 2007.
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015.Software available from tensorflow.org.
-  B. A. Pearlmutter, “Fast exact multiplication by the hessian,” Neural computation, vol. 6, no. 1, pp. 147–160, 1994.
-  S. Arora, R. Ge, B. Neyshabur, and Y. Zhang, “Stronger generalization bounds for deep nets via a compression approach,” arXiv preprint arXiv:1802.05296, 2018.
-  Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization,” in Advances in neural information processing systems, pp. 2933–2941, 2014.
-  Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layer-wise training of deep networks,” in Advances in neural information processing systems, pp. 153–160, 2007.
-  J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in neural information processing systems, pp. 3320–3328, 2014.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, pp. 1139–1147, 2013.
-  R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, “A stochastic quasi-Newton method for large-scale optimization,” SIAM Journal on Optimization, vol. 26, no. 2, pp. 1008–1031, 2016.
-  P. Moritz, R. Nishihara, and M. Jordan, “A linearly-convergent stochastic L-BFGS algorithm,” in Artificial Intelligence and Statistics, pp. 249–258, 2016.
-  N. N. Schraudolph, J. Yu, and S. Günter, “A stochastic quasi-Newton method for online convex optimization,” in Artificial Intelligence and Statistics, pp. 436–443, 2007.
-  A. Mokhtari and A. Ribeiro, “Res: Regularized stochastic BFGS algorithm,” IEEE Transactions on Signal Processing, vol. 62, no. 23, pp. 6089–6104, 2014.
-  R. H. Byrd, G. M. Chin, W. Neveitt, and J. Nocedal, “On the use of stochastic Hessian information in optimization methods for machine learning,” SIAM Journal on Optimization, vol. 21, no. 3, pp. 977–995, 2011.
-  J. Martens, I. Sutskever, and K. Swersky, “Estimating the Hessian by back-propagating curvature,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 963–970, Omnipress, 2012.
-  J. Martens, “Deep learning via hessian-free optimization,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735–742, 2010.
-  A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” arXiv preprint arXiv:1802.00420, 2018.
-  H. Drucker and Y. Le Cun, “Improving generalization performance using double backpropagation,” IEEE Transactions on Neural Networks, vol. 3, no. 6, pp. 991–997, 1992.
-  S. Gu and L. Rigazio, “Towards deep neural network architectures robust to adversarial examples,” arXiv preprint arXiv:1412.5068, 2014.
-  A. S. Ross and F. Doshi-Velez, “Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients,” arXiv preprint arXiv:1711.09404, 2017.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597, IEEE, 2016.
-  J. E. Moody, “The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems,” in Advances in neural information processing systems, pp. 847–854, 1992.
-  C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” in Advances in Neural Information Processing Systems, pp. 3288–3298, 2017.
Appendix A Outer Product Derivations for Deep Networks
a.1 Convolutional and Recurrent Layers
Convolutional and recurrent layers preserve the outer product derivative structure of the fully connected feedforward layers considered above, and we will show this in the following sections. Because we are only considering a single layer of each, we calculate the derivatives of the layer outputs with respect to the layer inputs – in a larger network, those derivatives will be necessary for calculating total derivatives via back-propagation.
a.1.1 Convolutional Layer
We can define a convolutional layer as
where is the layer input,
is the vertical stride,is the horizontal stride, is the activation function, and is the layer output. A convolutional structure can make the expressions somewhat complicated when expressed in index notation, but we can simplify matters by using the simplification . The layer definition is then
The derivatives of the convolutional layer are
with no summation over and in any of the expressions above. Using the simplification with makes it easier to see the structure of these derivatives.
The conditional form of the expressions is more complicated, but it is also possible to see how the derivatives relate to and submatrices of .
a.1.2 Sequence of Convolutional Layers
Index notation becomes unwieldy with multiple convolutional layers in sequence because of the nature of the convolution operation. We describe a sequence of convolutional layers as
where indicates the layer, is the vertical stride, is the horizontal stride, is a layer output, and are layer weights. The first derivatives are as follows: